INVESTIGATING AUDIO-VISUAL INTERACTIONS IN BINOCULAR RIVALRY: FATE OF THE SUPPRESSED PERCEPT
AND MODULATION OF VOLITIONAL CONTROL
Victor Barrès
Under the supervision of Manuel Vidal and Jacques Droulez
Laboratoire de Physiologie de la Perception et de l'Action
Collège de France
August 2010
Thesis submitted for the degree of Master in Cognitive Science of the École des Hautes Études en Sciences Sociales
EHESS – ENS – Paris V
Acknowledgements

This work would not have been possible without the support and guidance of Manuel Vidal. This project originated from one of his numerous ideas; he dedicated a tremendous amount of his time to helping me, and I could never thank him enough for his patience. When looking for my master's project I was hoping to find a mentor from whom I would learn how to carry out rigorous and innovative research projects. Manuel Vidal fulfilled this aspiration. I would also like to thank Jacques Droulez, from whom I received many of the key ideas that form the core of the present project. Each interaction with him led to a breakthrough in my understanding of my research topic. The discussion especially owes a lot to his input. Pr. Alain Berthoz offered me the opportunity to join his lab, and for this I want to thank him. Entering a new lab can at times be difficult, and I want to thank all my colleagues who welcomed me and made me feel at home. In this lab where people work on very different research topics, I thank them for the time they spent explaining their various projects to me. I also want to thank the technicians for their help. Finally, I want to express all my gratitude to the subjects for their participation.
Table of Contents

1 BACKGROUND
1.1 MULTISTABILITY
1.1.1 What is multistability and what is it good for?
1.1.2 Examples
1.1.3 Binocular rivalry: a powerful tool to study multistability
1.1.4 Characterizing bistability
1.2 BINOCULAR RIVALRY
1.2.1 Phenomenology of binocular rivalry
1.2.2 Two rivaling approaches
1.2.3 Conflicting evidence
1.2.4 Hybrid view: multilevel hypothesis
1.3 MULTISENSORY BISTABILITY
1.3.1 A few insights from the literature
1.3.2 The McGurk effect: a multisensory illusion
1.3.3 Conclusions
2 MATERIALS AND METHODS
2.1 GENERAL MATERIAL AND METHODS
2.1.1 Stimuli design
2.1.2 Stimuli presentation
2.1.3 Subjects
2.1.4 Experimental overview
2.2 EXPERIMENT 1: BINOCULAR RIVALRY TEST (CONTROL EXPERIMENT)
2.2.1 Objective
2.2.2 Method
2.2.3 Results and discussion
2.3 EXPERIMENT 2: TEST OF THE MCGURK EFFECT (BASELINE)
2.3.1 Objective
2.3.2 Method
2.3.3 Results and discussion
2.4 EXPERIMENT 3: BINOCULAR RIVALRY FOR VIDEOS (TEST AND BASELINE)
2.4.1 Objective
2.4.2 Method
2.4.3 Results and discussion
2.5 EXPERIMENT 4: AUDIO-VISUAL INTEGRATION WITH THE SUPPRESSED STIMULUS
2.5.1 Objective
2.5.2 Method
2.5.3 Results and discussion
2.6 EXPERIMENT 5: IMPACT OF SOUND ON BINOCULAR RIVALRY
2.6.1 Objective
2.6.2 Method
2.6.3 Results and discussion
3 GENERAL DISCUSSION
3.1 PROBING THE DEPTH OF SUPPRESSION
3.2 REAL MULTISENSORY CONGRUENCY ENHANCES VOLITIONAL CONTROL
3.3 PERSPECTIVES
4 REFERENCES
Summary

The present study uses an audio-visual setting to investigate the phenomenon of binocular rivalry. We designed innovative rivaling stimuli consisting of different videos of lip motions, so that visual rivalry could be combined with auditory speech material, thus achieving strong multisensory congruency. Adding sound to binocular rivalry enabled us to determine that the suppressed stimulus was still available for cross-modal integration. This result shows that suppression cannot occur early in the visual pathway, and therefore supports the interpretation of binocular rivalry as a delocalized competition process between percepts that both coexist as neural states. We also investigated the effect of audio-visual congruency on volitional control. We found that the addition of a congruent sound enhances volitional capacities only for a real congruency, and not for a congruency built through cross-modal modification of the sound.
1 Background
1.1 Multistability
A person strolls in a park, watching quacking ducks play in a pond while the spring flowers exhale a delicate fragrance. In a glimpse all is perceived and merged into a coherent multisensory picture, while the process of perception goes completely unnoticed. The sound of the duck, its shape and its color are all condensed into a coherent audio-visual percept. This reality is, however, nothing more than a construction of our senses.
Figure 1 Kanizsa illusion
Psychologists and neuroscientists have followed different approaches to unveil the functioning of perception. At least two can be distinguished.

The first approach consists in the constant reduction of the perception problem into simpler aspects. Visual perception, in this view, can be subdivided into the perception of shape, color, and motion, but also the perception of faces, of tools, etc. This method, which has proved immensely fruitful, attempts to understand the basic building blocks of perception from a constructivist perspective. There is, however, a second way in which one could tackle the problem. By deliberately creating disruptions in the system, the scientist is able to reveal what is otherwise hidden.
Figure 1 presents the classic Kanizsa illusion. Illusory contours are perceived, leading the observer to see a white triangle where there is none. Such an illusion, first used in the context of Gestalt theory, reveals the capacity of the visual system to build long-distance connections between collinear pieces of contour. Illusions challenge the first property of the sensory system, which consists in producing an accurate representation of the stimulation. Multistable stimuli, on the other hand, preclude our senses from producing only one stable output for a given stimulation. In our work, we will use this type of stimulus to derive conclusions on the mechanisms of visual perception and of cross-modal audio-visual perception.
1.1.1 What is multistability and what is it good for?
In the park, the stroller has the sensation of facing a stable representation of his environment. Our senses, however, are always dealing with messy and ambiguous signals. Decoding the flow of information arriving at the eye, for example, is far from being as simple as inverting a code. Ambiguity is the hallmark of the retinal stimulus in nearly all visual perception. A living organism nevertheless has to achieve a stable perceptual organization of its surrounding environment in order to guide its behavior and achieve proper adaptation. Faced with several potential perceptual representations coherent with the incoming information, a unique interpretation has to be selected. For this reason, perception always involves a decision process. As I mentioned in the introduction, a main characteristic of the sensory system is therefore to give rise to a unique and stable output for a given stimulation.
Multistable percepts defeat this characteristic. For an unchanging stimulus, the perceptual
system alternates spontaneously between distinct interpretations without being able to stabilize
one of them. This phenomenon has been used extensively for more than two centuries to study
visual perception. The phenomenal instability of such percepts provides an especially dramatic
and compelling example of the more general ambiguity which characterizes sensory stimulation.
This feature makes multistable percepts especially relevant stimulations to study the phenomenon
of perceptual decision.
It is, however, in the narrower context of the study of consciousness that multistability has recently generated sustained interest in visual neuroscience, as it decouples the conscious perception of the observer from the characteristics of the physical stimulation. The same stimulus can indeed evoke different conscious percepts. This provides a powerful tool to investigate the neural bases of consciousness, since changes in subjective perception can be correlated with neural responses while the stimulation remains constant. I will now present the main examples of multistability, grouped by sensory modality, so that the reader can grasp the wide range of existing stimuli.
1.1.2 Examples
a Vision
In the visual modality multistability can be achieved using various classes of stimulations.
The first class, called ambiguous figures, is illustrated in Figure 2. Looking at each of these images, the observer will experience perceptual oscillations. The Necker cube will present itself either as a cube seen from below or as one seen from above, due to the lack of depth cues. In the face/vase stimulus, the observer will oscillate between perceiving a vase and perceiving two heads facing each other, a phenomenon that is rooted in ambiguous figure/ground segregation.
Figure 2 Ambiguous figures. (a) Necker cube, Boring (1942); (b) Face/Vase, Rubin (1915/1958); (c) Duck/Rabbit, Jastrow (1899); (d) Wife/Mother-in-law, Boring (1930)
Other visual bistable stimuli rely on ambiguous interpretations of motion. A moving plaid delimited by a circular aperture can be seen either as a single coherent pattern or as two gratings sliding over each other, each moving in the direction perpendicular to its orientation. The apparent motion quartet (illustrated in Figure 3) is another famous example of a visually ambiguous stimulus.
Figure 3 Apparent Motion Quartet (Sterzer, Kleinschmidt, & Rees, 2009)
Finally a last class of visual bistable stimulations relies on the phenomenon of binocular
rivalry. Binocular rivalry is an example of multistable perception that can be initiated by showing
dissimilar images to the two eyes. The perceptual impression under such conditions is not the
spatial sum or average of the two monocular images, but rather a sequence of subjective reversals
in which each of the stimuli, in turn, dominates perception while the other entirely disappears
from sight. Figure 4 presents two classical examples of rivaling images.
Figure 4 Binocular Rivalry (Tong, Meng, & Blake, 2006)
b Audition
There exist far fewer examples of multistability in the auditory modality. Two main effects
are known to induce perceptual multistability: auditory stream segregation and verbal
transformation effect.
Due to stream segregation, two streams of tones presented in an alternating pattern repeated through time (cf. Figure 5) are perceived either as two segregated streams, each comprising only one repeating sound, or as a single stream of alternating sounds (Pressnitzer & Hupé, 2006). The listener oscillates between these two interpretations.
Figure 5 Auditory illusion due to stream segregation (Sterzer et al. 2009)
Verbal transformation effects can arise when a speech form is rapidly and continuously
repeated (Warren & Gregory, 1958). Although at first a percept matching the initial form
dominates, with time some other interpretations appear and then alternate with the original
percept. A good example is given by the rapid repetition of the word ‘life’ which gives rise to the
interpretation ‘fly’ and results in bistable alternation between the perceived words ‘life’ and ‘fly’.
c Touch
Tactile multistability has so far remained a rather marginal research topic. It is, however, worth mentioning that applying to a subject's skin a tactile stimulation that mimics the apparent motion quartet described above results in the same multistable interpretation of the direction of motion (Carter, Konkle, Wang, Hayward, & Moore, 2008).
1.1.3 Binocular rivalry: a powerful tool to study multistability
To achieve our goal, which is to propose an innovative analysis of the phenomenon of multistability by coupling it with cross-modal effects, the first step consisted in choosing an appropriate multistable stimulation among the stimuli described above. For at least two reasons, binocular rivalry appeared to be the best-fitting choice.
First, binocular rivalry is the best-documented multistable situation. In the past fifty years, binocular rivalry emerged, along with the Necker cube, as a paradigmatic case of multistability and concentrated most of the efforts deployed to understand this phenomenon. Studies ranging from psychophysics to computational neuroscience, and including imaging and animal studies, provide the largest body of background of any multistable stimulation.
Second, unlike ambiguous figures, binocular rivalry allows the scientist to manipulate the stimulus content to a great extent. For the purpose of the present experiment, it was important to be able to design a stimulus that would show strong cross-modal congruency. As described in the methods, the rules constraining the creation of bistable binocular percepts are rather strict. Presenting different images to each eye is far from sufficient to induce binocular rivalry. Part of the present work will therefore be dedicated to the presentation of a new type of bistable percept using binocular rivalry, specially designed to ensure a strong multisensory effect.
1.1.4 Characterizing bistability
All the multistable stimuli described above share common characteristics. Leopold and Logothetis (Leopold & Logothetis, 1999) established a list of three features observed in all instances of visual bistability: exclusivity, randomness and inevitability. These characteristics were later also evidenced for auditory bistability based on stream segregation (Pressnitzer & Hupé, 2006).
a Exclusivity
When the two percepts of a bistable stimulus compete, the phenomenological result is that the observer perceives either one or the other, but never both at the same time. This characteristic is consistently observed for ambiguous figures, plaids, the apparent motion quartet, and auditory bistable percepts. For binocular rivalry, patchy or piecemeal rivalry is often reported, especially at the beginning of the stimulation. However, well-designed stimuli for binocular rivalry should result in only marginal perception of piecemeal rivalry.
b Randomness
A hallmark of bistability is that the durations of alternating percepts, or phases, follow a random law that can be fitted with a gamma or lognormal distribution. A lognormal distribution suggests the multiplication of a large number of independent random processes, whereas a gamma distribution is more likely to result from a small number of consecutive Poisson processes.
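As an illustration of how the two candidate laws can be compared in practice, the sketch below fits both distributions to a set of phase durations and compares their log-likelihoods. The durations are synthetic stand-ins generated for the example (real data would come from observers' key-press reports), and all parameter values are arbitrary choices of mine, not values from the literature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic dominance-phase durations in seconds (stand-in for key-press data).
durations = rng.gamma(shape=3.5, scale=0.7, size=200)

# Fit each candidate distribution, fixing the location at 0 since durations are positive.
g_shape, _, g_scale = stats.gamma.fit(durations, floc=0)
ln_shape, _, ln_scale = stats.lognorm.fit(durations, floc=0)

# Compare the fits via total log-likelihood (higher means a better fit).
ll_gamma = stats.gamma.logpdf(durations, g_shape, 0, g_scale).sum()
ll_lognorm = stats.lognorm.logpdf(durations, ln_shape, 0, ln_scale).sum()
print(f"gamma:     shape={g_shape:.2f}, log-likelihood={ll_gamma:.1f}")
print(f"lognormal: shape={ln_shape:.2f}, log-likelihood={ll_lognorm:.1f}")
```

Since both models have the same number of free parameters here (shape and scale, with location fixed), comparing raw log-likelihoods is equivalent to comparing AIC values; on real data one would typically also check the fit with a goodness-of-fit test.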
c Inevitability
Debate over the possibility of volitional control over bistable perception can be traced back to the work of Helmholtz. Although Helmholtz strongly backed the idea that, after training, an observer could take control over the alternations, it is now widely accepted that no full control can be exerted and that alternations are inevitable. Volition can, however, bias the perception of some bistable stimuli by modulating the dominance duration of each of the possible interpretations.
1.2 Binocular Rivalry
1.2.1 Phenomenology of binocular rivalry
I will from now on focus on the phenomenon of binocular rivalry. However, before
reviewing the details of the past and present conceptual frameworks, I will present briefly the
main characteristics of the subjective experience associated with binocular rivalry.
The phenomenon of binocular rivalry is a particular form of bistability which occurs
when dissimilar images are presented to corresponding regions of the two eyes. As I already
mentioned above, rather than melding into a single coherent percept, the two images compete for
perceptual dominance. Typically, an image will dominate conscious awareness for a few seconds
before being supplanted by the previously suppressed rival image.
a Temporal dynamics
The main characteristic of the temporal dynamics of visual rivalry is its randomness. Oscillations between rivaling percepts are not regular. The successive durations of dominance periods seem to be drawn from a random distribution, as if generated by a stochastic process driven by an unstable time constant (Lehky, 1988).

In his seminal work, Levelt showed that these random dynamics can nevertheless be biased by varying the "strength" of one rival figure relative to the other. For Levelt, the concept of "stimulus strength" is related to "the amount of contour per area" but can be extended to brightness and contrast. Increasing the strength of a stimulus has no effect on its dominance durations but instead reduces its suppression durations.
b Spatial attributes
Exclusive predominance of one image over the other is not always globally achieved for binocularly presented stimuli. Perceptual dominance can take on a "patchy" or "piecemeal" appearance when the inducing figures are relatively large, as if rivalry were occurring on a local scale. Distributed zones of the visual field seem to be involved in simultaneous rivalries. This effect is also commonly reported during the first seconds of presentation of the rivaling stimuli.
Exclusivity is also blurred during perceptual transitions. When a suppressed image overthrows the currently dominant one, it does not do so instantaneously. Instead, the transition occurs in a wave-like fashion: the newly dominant image emerges in one region and spreads throughout the whole visual field.
1.2.2 Two rivaling approaches
What are the mechanisms underlying binocular rivalry? The theoretical debate on the neural bases of this phenomenon has historically been divided along a line separating two main hypotheses: a low-level or bottom-up explanation and a high-level or top-down approach.
a Historical views on multistability
The history of the concepts that served as a basis for understanding binocular rivalry is intimately linked to the evolution of the more general understanding of multistability. I will try to present the major trends in the theoretical debates on multistable perception while showing how specific works on binocular rivalry fit within those trends.

Long and Toppino organized the history of research on multistability into two main periods, each characterized by a categorical division between two conflicting theories (Long & Toppino, 2004).
The early 20th century saw a quarrel between two conceptions that Long and Toppino described as focusing respectively on peripheral and central processes. By peripheral processes they refer to factors related to the operation of the sense organs, whereas central processes refer to brain, and especially cortical, mechanisms.
Necker strongly backed the conception insisting on the importance of peripheral processes, proposing an explanation in which different points on an ambiguous figure, such as the Necker cube, are assumed to foster one or the other perceptual alternative. The interpretation of the figure therefore depends mainly on the set of features receiving primary processing. In this view, eye movements were critical, for they vary the foveated portion of the figure and hence can trigger perceptual switches. Long and Toppino note that this early interpretation "placed the locus of figural reversal in "optical" rather than "mental" processes". At the time, considerable effort was devoted to demonstrating the importance of eye movements in perceptual oscillations.
While eye movements did indeed appear to be related to perceptual switches, it was shown that they are not necessary for a switch to occur. This result fueled the other early hypotheses, which involved central or "psychological" explanations based on concepts such as will, imagination, and attention. Binocular rivalry served as an important example of a multistable phenomenon supported by central mechanisms, and especially by volition. Advocates of this position include Hermann von Helmholtz and William James, both of whom equated rivalry with voluntary attention. Sir Charles Sherrington, in his monograph Integrative Action of the Nervous System, also supported this position, writing: "Only after the sensations initiated from right and left corresponding points have been elaborated, and have reached a dignity and definiteness well amenable to introspection, does interference between the reactions of the two eye-system occur…" (Blake & Logothetis, 2002). Long and Toppino conclude that by the 1910s there was a relatively clear consensus that figural reversal was to be explained on the basis of central processes.
The main conceptual shift occurred in the 1940s with the rise of Gestalt theory. Indeed, after some rather calm decades during which multistability was left aside, Gestaltists rediscovered the phenomenon and made considerable use of it. However, they did not simply revive the early concepts but attempted to interpret multistability within their own theoretical framework. The Gestalt school of thought thereby introduced the concept of satiation into the realm of multistable perception. The Gestalt conceptualization of brain functioning rested on the constructs of flowing electrical fields and changing resistance (satiation) to the flow of these fields. Long and Toppino note that according to Köhler, "figural reversal could be attributed to a gradual build up of resistance ("electrotonus") in the brain to the field flow underlying the percept first seen." For the Gestaltists, an electrical field supports one interpretation of the multistable percept while, at the same time, a resistance to this field builds up (satiation), resulting eventually in the suppression of this percept and the emergence of a new field supporting the alternative interpretation.
Although these concepts of fields and satiation were abandoned when the first tools appeared that enabled scientists to record the activity of cortical regions directly, they were highly influential in the advent of the new theoretical framework supporting modern research. Building mainly on breakthroughs in the neuroscience of vision, the notion emerged of neural channels selectively tuned to particular characteristics of the sensory (retinal) stimulus. This conception incorporated the critical notion of neural adaptation, which originates directly in the concept of satiation. According to neural adaptation, continuous stimulation of a population of neurons eventually results in a reduction of their sensitivity and alters their ability to respond to subsequent stimuli until they have recovered from this adapted state. I will show in 1.2.2b how the recent low-level approach makes considerable use of the notion of neural adaptation as a key concept to explain perceptual reversal in binocular rivalry as well as in other forms of multistable percepts.
The period that opens with the advent of Gestalt theory constitutes the second period defined by Long and Toppino. As I mentioned earlier, each epoch is characterized by a competition between two major theoretical conceptions of the phenomenon. Gestalt theory and its continuation in modern neuroscience represent a first position, referred to as sensory explanations by Long and Toppino. Sensory explanations were challenged during this period by cognitive explanations. This other mainstream approach favors the role of more active, cognitive processes such as learning, decision making, and attention.
b Low level interocular competition versus high level patterns competition
Both dichotomies described above can be subsumed into a more general opposition between a bottom-up/low-level approach, which insists on the role of passive sensory processes, and a top-down/high-level approach, which focuses on the role of active cognitive processes. I will from now on use the terms bottom-up/top-down and low-level/high-level interchangeably.

It is crucial to bear in mind both categories of explanation in order to properly grasp the importance and relevance of most scientific works concerning perceptual bistability, for they almost always consist in an attempt to back one of the competing hypotheses. I shall now present the two main theories concerning visual rivalry that were formalized in the late 1980s and the 1990s, both of which fall directly into this general conceptual dichotomy.
• Reciprocal inhibition between feature-detecting neurons in early vision
According to this first class of explanations, binocular rivalry arises from low-level interocular competition between monocular neurons in the primary visual cortex (V1) or in the lateral geniculate nucleus (LGN) of the thalamus.
This hypothesis derives from the sensory explanations described above, based on neural adaptation and mutual inhibition. It is the association of reciprocal inhibition between competing visual neurons with inhibitory influences that adapt over time that can account for spontaneous rivalry alternations. A set of neurons maintains dominance only temporarily, until it can no longer inhibit the activity of the competing neurons, leading to a reversal in perceptual dominance. The fundamental point is that competition is supposed to take place early in the visual processing stages, between monocular neurons. For this reason, this framework can be termed interocular competition or eye rivalry.
Figure 6 XOR network (Blake, 1989)
In line with this approach, Lehky (Lehky, 1988) and Blake (Blake, 1989) proposed neural models such as the one illustrated in Figure 6. Two neurons, each corresponding to one eye, inhibit each other in direct competition. This leads to a winner-take-all mechanism in which one neuron, and therefore one eye, eventually takes over. However, due to neural adaptation, the inhibition exerted on the suppressed eye diminishes with time until a reversal occurs. The neural mechanism therefore corresponds to a logical exclusive OR (XOR): only one of the stimulations passes the first stages of visual processing. This low-level approach consequently implies a very early suppression of the dominated stimulus.
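The oscillatory behavior of such a reciprocal-inhibition network can be sketched with a minimal rate model: two units, each driven by one eye, inhibit each other while a slow adaptation variable erodes the winner's advantage. This is not Lehky's or Blake's actual model, only an illustrative simulation; the `simulate_rivalry` function and all parameter values are my own choices.

```python
import numpy as np

def simulate_rivalry(T=20.0, dt=0.002, I=1.0, w=2.0, g=2.0,
                     tau=0.02, tau_a=1.0, noise=0.03, seed=1):
    """Two mutually inhibiting rate units with slow adaptation (Euler integration).

    Returns the dominant unit (0 or 1) at each time step and the number of
    perceptual switches. All parameter values are illustrative only.
    """
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    r = np.array([0.6, 0.4])            # firing rates of the two monocular units
    a = np.zeros(2)                     # slow adaptation ("satiation") variables
    dominant = np.empty(steps, dtype=int)
    for t in range(steps):
        # Each unit is driven by its input minus cross-inhibition and adaptation.
        drive = np.maximum(I - w * r[::-1] - g * a, 0.0)
        r = r + dt / tau * (-r + drive) + noise * np.sqrt(dt) * rng.standard_normal(2)
        r = np.maximum(r, 0.0)
        a = a + dt / tau_a * (-a + r)   # adaptation slowly tracks activity
        dominant[t] = int(r[1] > r[0])
    switches = int(np.abs(np.diff(dominant)).sum())
    return dominant, switches

dominant, switches = simulate_rivalry()
print(f"{switches} perceptual switches in 20 s of simulated time")
```

The slow adaptation variable here plays the role the Gestaltists assigned to satiation: the winner's own activity gradually undermines its dominance until the suppressed unit can take over, producing spontaneous alternations without any external change in the stimulus.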
• Role of prefrontal cortex and decision making: binocular rivalry as a cognitive process
A rival framework to the one just presented is a modern version of the central and cognitive interpretations of multistability. Leopold and Logothetis (Leopold & Logothetis, 1999) first formalized this high-level hypothesis. They consider binocular rivalry a type of behavior in which perceptual switches result from a decision taking place in the prefrontal cortex.
According to this top-down interpretation, suppression is not triggered by some intrinsic competition between the eyes but is rather a consequence of a decision mechanism. Contrary to the interocular competition hypothesis, suppression would not necessarily occur at an early level of visual processing. For a decision to be taken, it might indeed be necessary for both percepts to be maintained in some way as coexisting active neural states. According to this view, binocular rivalry occurs later in visual processing and reflects competition between incompatible patterns rather than competition between the eyes. Patterns are understood here as neural representations of the stimuli, without specifying their localization in the brain but supposing that they exist beyond the simple monocular stimulus representations.
1.2.3 Conflicting evidence
Only empirical evidence can support or refute either of the two approaches. I will therefore
now present some of the empirical work on visual rivalry. The studies listed here were
chosen either for their innovative quality, bringing new data to the debate, or for their classical
status, representing common, often replicated aspects of binocular rivalry. Based on these
results, I will then be able to discuss the validity of the top-down and bottom-up frameworks.
a Indirect evidence: psychophysics studies
• Visual sensitivity
Testing a subject's sensitivity to a probe stimulus provides a simple and efficient way to
investigate the phenomenon of suppression during visual rivalry. During dominance phases of
rivalry, observers show normal visual sensitivity for the detection of probe targets briefly
superimposed on the dominant stimulus. However, when asked to respond to the same probe
target superimposed on the suppressed stimulus, sensitivity averaged 63% of that
during dominance. Interestingly, the sensitivity decrease remains the same whether the
probe is similar to the suppressed stimulus or not, which tends to favor the
interocular competition interpretation (Freeman, Nguyen, & Alais, 2005).
• Adaptation
Although suppression removes a percept from an observer's consciousness, it has no
effect on most of the well-known adaptation aftereffects. In particular, the translational motion
aftereffect (MAE) remains unaffected, suggesting that the locus of suppression does not precede
the site of MAE processing (Lehmkuhle & Fox, 1975). For the spiral aftereffect (SAE), however, it
appears that suppression prevents the build-up of the effect, which shows that rivalry suppression
occurs prior to the site of spiral motion processing (Wiesenfelder & Blake, 1990).
Although the precise site of suppression cannot be determined from adaptation studies, all the
results support the idea that the mechanisms responsible for suppression are cortical.
• Priming
Exposure to a stimulus such as a picture or a word makes subsequent processing of that
stimulus faster and more accurate: the initial presentation of the stimulation primes later
processing. Priming is a form of implicit memory, since the test instruction makes no explicit
reference to the prior presentation of the stimulation. Successful priming does not rely on
conscious recognition of the repeated stimuli. The question has therefore been raised whether
a priming effect can be observed when the priming stimulus is presented during binocular
suppression.
Picture priming (Cave, Blake, & McNamara, 1998) and semantic priming (Zimba & Blake,
1983) were both shown to be disrupted, suggesting that suppression renders normally effective
priming stimuli impotent. Blake and Logothetis (Blake & Logothetis, 2002) remark that priming
requires "relatively refined analyses of visual information, of sort conventionally attributed to
high-level visual processing outside the domain of early visual areas". They conclude that during
suppression, input to these stages is effectively blocked.
• Stimulus swap
Logothetis et al. (Logothetis, Leopold, & Sheinberg, 1996) introduced an elegant and
groundbreaking experimental paradigm in an attempt to investigate whether binocular rivalry can
be traced back to simple competition between monocular neurons within the primary visual
cortex, or whether higher cortical areas have to be taken into account. Instead of presenting one
image to each eye as in a classical binocular rivalry paradigm, they tested the effect of rapidly
alternating the rival stimuli between the two eyes. Surprisingly, under such conditions, the
perceptual alternations exhibit the same temporal dynamics as with static images. A single phase of
dominance can span multiple alternations of the stimuli.
This experiment rules out pure eye competition. The authors note
that "neural representations of the two stimuli compete for visual awareness independently of the
eye through which they reach the higher visual areas." If a competition takes place during
binocular rivalry, it is therefore more likely to occur between two perceptual interpretations at a
higher level of analysis than between monocular neurons. The distinction between percept
competition and eye competition thus appears essential to the understanding of
binocular rivalry.
• Volition
Another important indicator of the potentially cognitive nature of binocular rivalry is its
partial susceptibility to voluntary control. I already mentioned that although the alternations are
spontaneous and inevitable, the observer can still bias the dynamics by actively
trying to hold one percept. Intention therefore plays a critical role in perceptual alternation. Van
Ee et al. analyzed the influence of volition on dominance durations for a series of dichoptically
presented stimuli (van Ee, van Dam, & Brouwer, 2005). Overall, they showed that voluntarily
trying to hold one percept lengthened the dominance duration of this percept.
This result, however, depends on the type of stimulation used: when presented with grating
stimuli, observers are less able to willfully influence the temporal dynamics of the oscillations than
when presented with rivaling house and face images. The active role of volition in the perceptual
decision favors a rather high-level interpretation of visual rivalry.
• Stimuli type
Is the type of stimuli used important, or is the phenomenon of binocular rivalry
independent of the nature of the images presented? Put differently, I already mentioned that
stimulus strength plays a role in the temporal dynamics of binocular rivalry (see Levelt's laws in
1.2.1a), but can higher-level information in the stimulation modify the dynamics? Jiang et al.
analyzed the time a suppressed stimulus takes to break suppression (i.e., to become dominant)
while varying the nature of this suppressed stimulus, which could be either familiar (an upright face
or a text written in the native language of the subject) or unfamiliar (an upside-down face or a text in
an unknown foreign alphabet) (Jiang, Costello, & He, 2007). A familiar stimulus
tended to gain dominance faster than an unfamiliar one, showing that the semantic
content of the suppressed stimulation plays a role in the oscillation dynamics of rivalry.
Emotional stimuli have also been shown to be more likely to gain awareness in
binocular rivalry (Yang, Zald, & Blake, 2007). This result is consistent with the fMRI findings
on amygdala activation under rivaling stimulation (see 1.2.3c).
These results imply that the suppressed stimulus, although unconscious, is still somehow
processed by the visual system, since high-level information contributes to the stimulus strength
during its suppression phase. This conclusion goes against the low-level interpretation of
binocular rivalry. It also draws attention to the importance of the nature of the stimulation
used: the underlying processing of stimuli during rivalry could very well depend on their intrinsic
nature.
b Multistability as behavior
A major epistemological breakthrough occurred when Leopold and Logothetis proposed to
interpret binocular rivalry as a behavior, offering a fresh interpretation of the high-level
framework. In their seminal review (Leopold & Logothetis, 1999) they suggest that "spontaneous
alternations reflect responses to active, programmed events initiated by brain areas that integrate
sensory and non-sensory information to coordinate a diversity of behaviors." According to their
position, while the perception of an ambiguous stimulus ultimately depends on the activity of
sensory cortices, this activity is continually steered and modified by central brain structures
involved in planning and generating behavioral actions.
In order to back their thesis, Leopold and Logothetis review the data in favor of the high-level
approach. But this alone would not suffice to qualify binocular rivalry
as "behavior". They therefore add an interesting analysis that focuses on the temporal dynamics of
binocular rivalry. The authors highlight the close similarity between the temporal dynamics of
perceptual reversals and those of a variety of spontaneously generated visuo-motor behaviors. The
stochastic aspect of the temporal dynamics of binocular rivalry resembles the randomness of
many exploratory behaviors that emerge from the integration of a large number of sensory and
internal variables. In particular, they note that the dynamics of free viewing (the distribution of
fixation durations between saccades) are stochastic, with the characteristic, also observed in
visual rivalry, that the duration of one fixation has no significant effect on that of the next.
Perceptual reversals could therefore be linked to the more general class of exploratory
behaviors and would thus be related to high-level information processing.
c Imaging
In the past 20 years, research on the underlying mechanisms of binocular rivalry has
strongly focused on brain imaging techniques.
• Electroencephalography (EEG)
Few EEG studies have tackled the problem of binocular rivalry. By placing electrodes over the
occipital lobes, it was possible to record the average activity during visual rivalry, which enabled
some authors to conclude that suppression resulted in a reduction in amplitude of the visually
evoked responses (VER). However, since the recorded signal stemmed from both the right and
left eyes, the VER could not be linked with a specific percept.
Brown and Norcia introduced an innovative method to link the recorded
VER with the rivaling stimuli (Brown & Norcia, 1997). They used two dichoptically viewed,
orthogonally oriented gratings whose contrasts were modulated at different rates. By doing so,
they were able to tag the VER waveforms associated with the two gratings. The EEG signal was
then recorded over the occipital lobe while observers reported the perceptual reversals. The
authors could thereby show that the VERs associated with the two gratings displayed inversely
related amplitude modulations, tightly phase-locked with the perceptual reports of dominance
and suppression.
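The logic of this tagging analysis can be illustrated on synthetic data. The sketch below is not Brown and Norcia's actual pipeline; it assumes made-up tag frequencies, a toy dominance time course, and a simple sliding-window Fourier amplitude to recover the two antiphase envelopes:

```python
import numpy as np

# Synthetic "EEG": two tags at f1 and f2 whose amplitudes are modulated in
# antiphase by a slow, square-wave-like dominance signal (all values invented).
fs = 500.0                       # sampling rate (Hz)
t = np.arange(0, 20, 1 / fs)     # 20 s of signal
f1, f2 = 7.0, 9.0                # tag frequencies (integer cycles per window)
dominance = (np.sin(2 * np.pi * 0.25 * t) > 0).astype(float)
env1 = 0.2 + 0.8 * dominance           # grating 1 tag is strong when dominant
env2 = 0.2 + 0.8 * (1 - dominance)     # grating 2 tag is strong otherwise
eeg = env1 * np.sin(2 * np.pi * f1 * t) + env2 * np.sin(2 * np.pi * f2 * t)
eeg += 0.3 * np.random.default_rng(0).standard_normal(t.size)  # sensor noise

def tag_amplitude(x, f, fs, win=1.0):
    """Sliding-window Fourier amplitude of x at frequency f."""
    n = int(win * fs)
    carrier = np.exp(-2j * np.pi * f * np.arange(n) / fs)
    return np.array([2 * abs(np.dot(x[s:s + n], carrier)) / n
                     for s in range(0, x.size - n, n)])

a1 = tag_amplitude(eeg, f1, fs)
a2 = tag_amplitude(eeg, f2, fs)
r = np.corrcoef(a1, a2)[0, 1]
print("correlation between the two tagged amplitudes:", round(r, 2))
```

In the recovered amplitude series the two tags modulate in opposite directions (strongly negative correlation), mirroring the inversely related VER modulations reported in the study.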
Figure 7 Tagged VER associated with rivaling gratings (Blake & Logothetis, 2002)
EEG studies put forward a correlation between brain activity and perception during
rivalry but do not provide any information on where in the visual pathways the competition takes
place. EEG signals are averaged over the occipital pole and reflect the activity of large networks
whose precise localization remains unknown.
• Magnetoencephalography (MEG)
MEG, supposed to provide somewhat better source localization than EEG, faces the
exact same tagging problem as EEG. The solution found consisted in using
a frequency-tagged neuromagnetic response, obtained by flickering two dichoptically viewed,
orthogonally oriented gratings at different rates (Tononi, Srinivasan, Russell, & Edelman, 1998).
Their study revealed strong rivalry-related responses throughout the occipital cortex as well as at
some anterior temporal and frontal sites. Although the precise origin of the rivalry-related
responses cannot be determined, Tong suggests that their widespread nature indicates that rivalry
interactions occur at an early stage of visual processing, leading to similar rivalry effects at both
occipital and anterior sites (Tong, 2005).
• Functional Magnetic Resonance Imaging (fMRI)
fMRI provides the advantage of a more precise localization of the neural signals involved
in perceptual reversals, dominance and suppression. It is indeed possible to identify brain regions
in which blood oxygen level dependent (BOLD) signals fluctuate in synchrony with binocular
rivalry alternations. I will focus on the main experiments that together provide an accurate picture
of the diversity of the empirical data gathered.
The first significant fMRI work on binocular rivalry identified the loci whose neural
activity correlates with the occurrence of a perceptual transition, that is, brain activations
correlated with points in time when observers experienced changes in rivalry state, rather than
with the particular perceptual state being experienced (Lumer, Friston, & Rees, 1998). Lumer et al.
pinpointed the extrastriate cortex, the fusiform gyrus, and several frontal and parietal areas,
but not the striate cortex, as related to subjects' perceptual transitions. The authors concluded that
transitions might be instigated by fronto-parietal areas, although no test of causality had been
carried out. This activation of a fronto-parietal network has been interpreted more cautiously by
Leopold and Logothetis as evidence that "these areas are actively involved in binocular rivalry, and
furthermore that their participation was specific to multistable viewing, as they were not active in
a control passive-viewing condition" (Leopold & Logothetis, 1999).
In the ventral temporal cortex, the fusiform face area (FFA) tends to respond
preferentially to pictures of faces, whereas the parahippocampal place area (PPA) responds to
images of indoor and outdoor scenes. Tong et al. therefore designed a binocular rivalry experiment
using images of faces competing with images of houses (Tong, Nakayama, Vaughan, & Kanwisher,
1998). The results showed that the activity in the FFA and the PPA closely reflected the observer's
perceptual state. When the face was dominant in rivalry, activity levels were relatively high in the FFA
and low in the PPA, and vice versa. Interestingly, the activity levels in both the FFA and the PPA were
similar to those observed when images of houses and faces were externally switched in order to
mimic rivalry (see Figure 8). According to Blake and Logothetis, this result suggests that rivalry is
fully resolved by the time signals arrive at these stages of processing (Blake & Logothetis,
2002).
Figure 8 Activity in the ventral temporal cortex correlates with perceptual states (Tong et al., 1998)
Polonsky et al. investigated the neural correlates of binocular rivalry within the visual
cortex (Polonsky, Blake, Braun, & Heeger, 2000). They used a contrast difference between the two
dichoptically presented images in order to tag the BOLD signal corresponding to each image:
activity in the visual cortex increased when the subject saw the high-contrast image and decreased
for the low-contrast image. By analyzing those fluctuations and comparing them to a stimulus-flip
condition, during which the images are externally switched without rivalry, the authors observed
rivalry-related fluctuations in V1 that were roughly equal to those observed in other visual
areas (V2, V3, V3a and V4v). Two conclusions are compatible with such results: either the
neuronal events underlying rivalry are initiated in V1 and then propagated to other visual areas
(an interpretation corresponding to the bottom-up approach), or those neuronal events are
initiated at later stages and then propagated back to V1 via feedback. The authors, however, were
unable to decide between the two based on this experiment, and note that both processes could
occur, since "local interactions among V1 neurons may trigger the perceptual alternations during
rivalry, whereas interactions in later visual areas may reinforce the neuronal representations of
coherent percepts, just as they do during normal vision."
An elegant way to test whether binocular rivalry can be traced back to eye competition
was proposed by Tong and Engel (Tong & Engel, 2001). The idea consisted in presenting
rivaling gratings in the portion of the visual field that corresponds to the blind spot, a monocular
region of the primary visual cortex that greatly prefers stimulation of the ipsilateral eye over that
of the blind-spot eye. Interestingly, unlike the eye-specific columns of human V1, which are
extremely narrow, the blind-spot region is large enough for reliable functional imaging. In this
monocular region, as predicted by the bottom-up, eye-competition approach, activity correlated
with the perceptual state of the observer: the blind-spot representation was activated when the
ipsilateral grating became perceptually dominant and suppressed when the blind-spot grating
became dominant. This modulation was just as strong as that evoked by physical alternations of
the stimuli. This study brought strong empirical evidence in favor of the low-level approach, since
it led to the conclusion that rivalry can fully suppress monocular responses to an unperceived stimulus.
In a study focusing on the lateral geniculate nucleus (LGN), Haynes et al. used high-
resolution fMRI to look for evidence of eye competition (Haynes, Deichmann, & Rees, 2005);
the LGN has indeed often been thought to be a locus where interocular competition could take place.
Regions showing a strong preference for stimulation from a specific eye displayed significant
activity suppression during binocular rivalry when the stimulus presented to their preferred eye
was perceptually suppressed. This study therefore strongly backs the low-level approach by
suggesting that eye rivalry could take place as early as the LGN, and hence that suppression
occurs at the very first levels of visual processing.
The amygdala, however, which is known to process emotional stimuli, responds more
strongly to fearful faces than to neutral stimuli even when those stimuli are suppressed from
awareness by rivalry (Williams, 2004). This very important result shows that, contrary to the data
drawn from the face-selective regions, rivalry is not fully resolved outside the visual cortex. Some
information concerning the suppressed percept can persist and reach subcortical brain
areas, although this neural activity is insufficient to support visual awareness.
• Electrode recording
Electrode recording has been used to monitor the neuronal activity in various brain areas
while animals were reporting their percepts during binocular rivalry.
Contrary to the previously mentioned fMRI study on the implication of the LGN in
binocular rivalry (Haynes et al., 2005), Lehky and Maunsell (Lehky & Maunsell, 1996), using
single-unit recordings, could not find any evidence of rivalry-related inhibition in this structure.
In a series of experiments, Logothetis et al. (Leopold & Logothetis, 1996; Logothetis &
Schall, 1989; Sheinberg & Logothetis, 1997) recorded spiking activity in many cortical areas,
including the striate cortex (V1), the extrastriate areas V2 and V4, the middle temporal area
(MT), the medial superior temporal area (MST), the inferotemporal cortex (IT), and the upper
and lower banks of the superior temporal sulcus (STS). The animals had previously been trained to
pull a lever in association with seeing each pattern. The stimuli were specially tailored to the
preferences of the neuron being monitored: an excitatory (preferred) stimulus rivaled with a
non-excitatory (null) stimulus. Despite the unchanging nature of the retinal input, the activity of
subsets of neurons was shown to be modulated by the monkey's internally generated perceptual
changes. Most recorded neurons, however, responded equally to both perceptual states, as if the
unchanging retinal input were the only factor determining their firing.
Interestingly, the percentage of cells whose activity was modulated by the perceptual state
differed significantly across areas. Figure 9 shows the proportions of percept-related cells
in each area. While only a small fraction of neurons in the early stages of the visual system (V1
and V2) show activity locked to the perceptual alternations, this proportion is higher in the
extrastriate areas (V4, MT, MST). Finally, in the temporal lobe (IT, STS), almost all neurons
fire in concert with the perceptual changes. This result suggests, as did (Tong et al., 1998)
for fMRI, that the temporal cortex lies beyond the resolution of the perceptual conflict (Blake &
Logothetis, 2002). Another lesson seems to be that the emergence of neural loci whose activity
correlates with a perceptual state is more a continuous construction along the visual pathway than
a property of high- or low-level structures.
Figure 9 Proportions of percept-related cells (Leopold & Logothetis, 1999)
• Conclusions
Imaging studies represent a large share of the work done on binocular rivalry over the
past 20 years. Imaging, however, did not provide unanimous
evidence backing one theoretical framework or the other. EEG and MEG studies suffer from their
low spatial resolution and from the lack of reliable algorithms to isolate the sources of the signal. fMRI
data are somewhat more useful for investigating the neural correlates of binocular rivalry but lead to
conflicting results, with some studies advocating a complete resolution of the competition at
the monocular level (Tong & Engel, 2001), while others show persistence of the
suppressed stimulus outside the visual cortex (Williams, 2004) or insist on the potential role of the
prefrontal cortex in triggering the perceptual switches (Lumer et al., 1998). fMRI data are also in
conflict with some single-cell recording studies; in particular, concerning the LGN, which is supposed
to play a key role in the low-level approach, the two techniques provide contradictory empirical
data.
Imaging studies did, however, bring some positive results. First, it seems that
suppression is not localized but rather builds up continuously, at least across the visual areas (see
Figure 9). Second, the fact that suppression is fully achieved in the temporal cortex but not in the
amygdala suggests that there exist different processing paths for the visual input and that not all of
them are equally affected by the alternation of dominance and suppression that characterizes
binocular rivalry. Overall, these results do not back one particular approach but rather call for a
reformulation of the theoretical framework.
1.2.4 Hybrid view: multilevel hypothesis
Empirical data do not provide conclusive evidence to settle the debate between the
two main frameworks, and efforts have therefore been made to build an alternative approach.
This new conceptual approach consists in a hybrid view (Long & Toppino, 2004;
Tong et al., 2006), since it relies on hypotheses belonging to both high- and low-level theories. As I
mentioned above (see 1.2.2b), the bottom-up model insists on the importance of reciprocal
inhibition between competing visual neurons located early in the visual processing stages (most
likely monocular neurons). This inhibition, however, fluctuates over time through the
phenomenon of neural adaptation: one set of neurons maintains dominance only temporarily,
until it can no longer inhibit the activity of competing neurons, leading to a reversal in
perceptual dominance. This notion of competition through mutual inhibition remains central in
the hybrid model. The top-down model and some of the empirical data, on the other hand,
point towards a competition that is not fully resolved by the early stages of the visual
system and that occurs at different levels of the visual pathway. The hybrid view therefore
delocalizes competition along the whole visual system, and even beyond, and insists
on multilevel processing. Inhibitory interactions could take place among monocular neurons
(eye competition) as well as among pattern-selective neurons (pattern competition).
It is essential to understand that in this view, the emergence of a unique percept is not
due to a single XOR competition as in the model proposed by Blake (Blake, 1989), which epitomizes
the low-level models of binocular rivalry. The selection of a single percept results from a
complex network of competitions between different neural populations coding for patterns at
various levels of visual analysis. Figure 10 shows the simplest of such models, involving only
two levels: monocular neurons and binocular neurons. Causality in such a model is blurred
compared to a typical low-level competition: it is no longer possible to decide which
level triggers the switches. The oscillations emerge from the global network in a delocalized
fashion. A more realistic model would be far more complex, involving many levels (Freeman,
2005) as well as feedback connections from higher levels onto lower levels. Multilevel processing is
consistent with anatomical, physiological, and psychophysical evidence suggesting that the visual
system is characterized by a network of feedforward and feedback connections enabling signal
exchanges between neural levels.
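To illustrate how alternations can emerge from such a network without any single triggering level, here is a speculative two-level extension of a reciprocal-inhibition rate model: monocular units compete at level 1, pattern units compete at level 2, level 1 feeds level 2, and level 2 sends excitatory feedback down. The equations and all parameters are my own illustrative choices, not Wilson's (2003) published model:

```python
import numpy as np

def simulate_two_level(steps=20000, dt=0.001, tau=0.02, tau_a=1.0,
                       beta=2.0, phi=2.5, w_fb=0.5):
    """Monocular units m and pattern units p; each pair is mutually
    inhibitory (beta) and adapts (phi); w_fb is top-down feedback.
    All values are illustrative assumptions."""
    m = np.array([0.6, 0.4]); a_m = np.zeros(2)   # level 1 rates, adaptation
    p = np.zeros(2); a_p = np.zeros(2)            # level 2 rates, adaptation
    dominant = []
    for _ in range(steps):
        # level 1: external input 1.0, rival inhibition, feedback from level 2
        drive_m = 1.0 - beta * m[::-1] - phi * a_m + w_fb * p
        # level 2: driven by the matching monocular unit, rival inhibition
        drive_p = m - beta * p[::-1] - phi * a_p
        m += dt / tau * (-m + np.maximum(drive_m, 0.0))
        p += dt / tau * (-p + np.maximum(drive_p, 0.0))
        a_m += dt / tau_a * (-a_m + m)
        a_p += dt / tau_a * (-a_p + p)
        dominant.append(int(p[0] > p[1]))
    return np.array(dominant)

dom = simulate_two_level()
reversals = int(np.sum(np.abs(np.diff(dom))))
print("pattern-level reversals in 20 s:", reversals)
```

Reversals here arise from the joint dynamics of both levels: adaptation at either level can tip the balance, so no single site "decides" the switch, in the spirit of the delocalized competition described above.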
Figure 10 Two level hybrid model (Wilson, 2003)
An important point I would like to insist on concerns the nature of the
suppressed stimulus. According to the hybrid view, it is possible that the rival stimulation leads
only to partial suppression of the inputs from one eye at the monocular level, which is consistent
with the empirical studies that found neural activity corresponding to the suppressed stimuli
beyond the LGN and V1. If the low-level suppression is not total, a persisting neural signal is passed
on to higher stages of processing, where visual competition continues. The nature of the
stimulations could play an important role in determining the loci of competition: it is indeed
possible that, depending on the content of the visual stimulation, competition takes place earlier
or later. This could explain part of the disparity found in empirical studies.
Tong et al. (Tong et al., 2006) insist mainly on the multilevel aspects of competition
within the visual areas, but it seems that we know very little about the actual loci of competition,
which may very well involve areas outside the visual pathway, as suggested by the top-down model. In
an article attempting to understand the phenomenon of multistability in general, including
binocular rivalry, Long and Toppino propose an attractive multilevel hybrid model (see
Figure 11) that explicitly includes higher-level global processes impacting the delocalized
competition. In this view, higher-level cognitive factors are not engaged directly in the
multistage network of competition itself but send signals to all of its stages.
Figure 11 Multi-level model of multistable perception (Long & Toppino, 2004)
1.3 Multisensory bistability
The gist of the work I want to present here consists in an innovative use of multisensory
integration as a tool to investigate the mechanisms underlying binocular rivalry. The recent
conceptual switch to a hybrid model could be considered more of a drawback than a
breakthrough. By recognizing that the complexity of binocular rivalry cannot
be captured by simple top-down or bottom-up interactions alone, the hybrid model calls for fresh
investigations into the nature of the suppressed stimulus. More precisely, a necessary conclusion of
the hybrid model is that, for at least some categories of stimulation, it must be possible to find
correlates of the suppressed stimulus at intermediate levels of visual processing. Multisensory
integration, and especially audio-visual integration, will prove to be a powerful tool to probe
the loci of suppression.
1.3.1 A few insights from the literature
The literature on bistability in a multisensory context is relatively sparse. I will here
extend the field of investigation to forms of bistability other than binocular rivalry, since
experimental approaches focusing on other bistable phenomena offer relevant
indications both on the type of multisensory percept that could be used and on the main
aspects of bistability that can be challenged by adding sound to classic visual bistable percepts.
a Speech perception and multistability
The nature of the audio-visual stimulation used is crucial. As I will show in the next two
paragraphs, the strength of the audio-visual congruency might often be insufficient, due to its
artificial nature, to provide interesting results. Munhall et al. (Munhall, ten Hove, Brammer, &
Paré, 2009) present an experimental design that enabled them to associate multistability with the
best-studied case of audio-visual integration: speech perception. To achieve this goal, they relied
on the McGurk effect: in their experiment, when presented with the sound /aga/ in the auditory
modality while seeing lips uttering /aba/, the subjects reported hearing /abga/ (a more detailed
presentation of the McGurk effect is given in 1.3.2a). Using this audio-visual effect enabled
them to ensure a strong multisensory integration based on a natural association rooted in the
everyday experience of speech perception.
For the visual modality, the experiment is based on the classic face-vase illusion. A
rotating black vase against a white background can be seen either as a vase or as two talking faces.
These talking faces, whose lip movements are a consequence of the vase's rotation, utter /aba/.
In synchrony with the lip movements, the sound /aga/ is played. As in the static face-vase
illusion, the subject alternately sees the vase and the faces. Can the McGurk effect be observed
when the talking faces are not explicitly seen, i.e., when they form the background of the perceived
rotating vase? According to this study the answer is no: for the McGurk effect to occur,
the lips must be consciously seen.
This result can be turned the other way round and give information on bistability. When
not consciously seen, the percept does not reach the levels where the audio-visual integration
occurs. Is that the case for binocular rivalry? If the suppression does not occur strictly at early
levels of processing, it might be possible to detect multisensory integration with the suppressed
stimulus. This would provide compelling evidence in favor of a possible late suppression.
There are reasons to question the quality of Munhall et al.'s experiment. Mainly, the
poverty of the visual stimulus casts doubt on the actual strength of the McGurk effect reported:
no cue other than a side view of the mouth opening is available (no teeth or tongue,
which are known to play an important role in lip reading and therefore in audio-visual
speech integration). By using a more detailed stimulation, it might be possible to observe
integration with the suppressed stimulus. Moreover, it is important to note that the authors
cannot reject the null hypothesis; they can therefore only conclude that they could not find a
McGurk effect with the unconscious percept, not that this effect does not occur.
b Volition is improved by multisensory percepts
Although volitional control over alternations remains quite weak in binocular rivalry, a
certain amount of control is still available to the subject. Multisensory rivaling percepts could
provide a powerful tool to investigate the mechanism underlying this volitional effect.
Accordingly, van Ee et al. (van Ee, van Boxtel, Parker, & Alais, 2009) used two rivaling
stimuli, one congruent and one incongruent with the presented sound. The congruent visual
stimulus was a "looming pattern" (a concentric sine-wave pattern looming at 1 Hz); the
incongruent one was a propeller-like radial pattern. I want to insist on the fact that, contrary to
the experiment described earlier, the audio-visual congruency here is much more artificial: as the
authors show, congruency can be reduced to temporal synchrony.
The first important result of this paper states that a congruent sound improves the
capacity of the subjects to willfully hold one percept. The authors compared the average
dominance duration of the percepts between a purely passive condition and a volitional condition
(during which subjects are asked to hold one percept) and repeated this comparison with and
without sound. A volitional effect (increase in dominance duration) was observed in both cases
for the congruent percept but this effect was significantly stronger for the sound condition.
Contrary to what the authors seem to suggest, however, an incongruent sound had no impairing
effect on volition for the other percept.
Interestingly, the authors noted that the sound had no effect on dominance durations in
the passive viewing condition. Following these results, van Ee et al. tried to identify the
source of the improvement in volitional control and concluded that it stemmed from the
temporal synchrony between the sound and the visual stimulation. Temporal congruency is the
only aspect that matters in their experiment, which shows that the audio-visual integration involved is
rather low level. It would be of great interest to investigate whether the results remain the same
for audio-visual percepts congruent at a higher level, such as in audio-visual speech integration.
The authors also studied the importance of orienting the attention of the subject towards
the sound. They found that the volitional improvement due to congruent multisensory
stimulation required the subject to actively pay attention to both modalities. Indeed sound can
improve volitional control if attention to this additional modality is actively engaged. It does so
by increasing the dominance duration of the congruent visual percept. However, the audio-visual
congruency used here is reduced to its simplest form. Promoting multisensory interactions at a
higher level of processing could lead to potentially stronger effects and, for example, make it
possible to detect an effect of sound even in a passive viewing condition.
c A common oscillator for all perceptual decisions?
Another clever use of multisensory bistability consists in combining two bistable percepts:
an auditory and a visual bistable percept. This is what Hupé et al. did in an attempt to study
whether the perceptual decision resulting in a switch is modality dependent or whether a
supramodal oscillator governs the oscillations in both modalities (Hupé, Joffo, & Pressnitzer,
2008). The idea of a supramodal oscillator stems from the observed similarity of
dynamics across the different bistable phenomena (see 1.1.4).
As bistable percepts, they used auditory stream segregation (see 1.1.2b) in the
auditory modality; in the visual modality, they used two LEDs, one flashing at the center of
each of the two speakers playing the sound, so that one LED flashed in synchrony with the
low-pitch tone and the other in synchrony with the high-pitch tone. Consequently, in the visual modality, the
subject observed either two lights flashing separately (equivalent of the two streams percept in
the auditory modality) or an apparent motion of a light moving from one speaker to the other
(equivalent to the one stream percept in the auditory modality).
The subjects experienced switches both in the auditory and in the visual modality. The
authors analyzed whether those switches occurred in synchrony. The answer is negative: although
a switch in one modality tends to trigger a switch in the other, the switches cannot be said to
be synchronous. The hypothesis of a single supramodal oscillator triggering switches in both
the auditory and visual modalities therefore seems unlikely.
However, once again the audio-visual congruency is rather weak and one would need to
investigate whether reinforcing the multisensory aspect of the stimulation could lead to
synchronous oscillations.
1.3.2 The McGurk effect: a multisensory illusion
The series of experiments we developed and that I will present in the next section are
based on the preceding conclusions. They combine binocular rivalry with sounds in a design
using audio-visual speech integration. I will therefore present here the McGurk effect, the
phenomenon on which all our experiments will be based, and propose a very general
overview of the current understanding of audio-visual speech binding.
a The McGurk effect
It has been acknowledged for many years that watching a speaker can be beneficial for
speech understanding. In a now classic experiment carried out fifty years ago, it was
established that seeing the speaker could lead to an improvement in the comprehension of
auditory speech in noise equivalent to that produced by an increase of up to 15 dB in signal-to-
noise ratio. From these results stemmed the interpretation of audio-visual speech processing as a
phenomenon apparent only at low signal-to-noise ratios.
This interpretation changed when it was first shown that the perception of certain
speech segments could be strongly influenced by vision even when acoustic conditions were
good. The accidental discovery of the McGurk effect provided the first example of
audio-visual pairings leading to illusory percepts (McGurk & MacDonald, 1976). For
people susceptible to it, the McGurk effect appears for certain incongruent presentations
of lip movements and sound. Various versions of this effect exist; I will present here only the
two most common ones:
- When an auditory /ba/ is dubbed onto a visual /ga/, listeners report hearing a so-called
blend percept, /da/.
- When an auditory /ga/ is dubbed onto a visual /ba/, listeners report hearing a
combination percept such as /bga/.
Clearly, such an illusory perception induced by the addition of visual information to
auditory speech calls for a more complex interpretation of audio-visual binding in speech
perception. It appears that the construction of the conscious audio-visual verbal percept results
from a deep integration of the information provided by both modalities. I will therefore quickly go
through some major ideas concerning the multisensory aspect of speech perception.
b A general overview of audio-visual speech perception
Decisive for our work is understanding the level at which audio-visual speech binding
occurs. We will indeed use multisensory verbal material to probe the mechanism
underlying binocular rivalry and assess whether the suppressed stimulus is available late in the
visual pathway. The question is: can the suppressed stimulus reach the areas in charge of
crossmodal integration for speech perception? It is therefore necessary to pinpoint where this
integration takes place.
• Early sensory cortices
It is widely accepted that multisensory binding occurs at least partially in the so-called
higher association cortices that include the superior temporal sulcus, the intra-parietal sulcus and
regions in the frontal lobe. In this view, a large part of the brain is often reduced to a collection
of unisensory systems that can be studied in isolation. However, Kayser and Logothetis point out
that accumulating evidence challenges this position and suggests that areas hitherto regarded as
unisensory can be modulated by stimulation of several senses (Kayser & Logothetis, 2007). Does
this mean that audio-visual speech integration should be thought of as occurring as early as the
unisensory cortices? In his review, Campbell notes that activation of the primary auditory cortex
has been found during silent lip reading (Campbell, 2008). However, the extent to which this
activation is specifically linked to speech-like events remains to be investigated. Kayser et al. also
report an EEG study suggesting that there could exist neuronal correlates of the McGurk
illusion as early as the classical auditory areas. They nevertheless specify that “the coarse nature of
this method leaves doubts about the localization of these effects, asking for methodologies with
better spatial resolution.”
Overall, Kayser et al. conclude against the existence of early cross-modal
integration: while some cross-modal effects certainly take place within the unisensory
areas, those effects do not correspond to multisensory binding.
• The central role of the superior temporal sulcus (STS)
The posterior part of the superior temporal sulcus (pSTS), one of the aforementioned
higher association areas, has been consistently pinpointed as a primary binding site for audio-
visual speech processing (Bernstein, Auer, & Moore, 2004; Campbell, 2008). Its location at the
crossroads of the auditory and the visual streams makes it an ideal candidate for this role. Apart
from being consistently activated by audio-visual speech perception, the left pSTS has been
shown to potentially display differential activation for congruent and incongruent audio-visual
speech. However, much more investigation would be required to better understand the role of
this area.
• The motor cortex
Another interpretation of audio-visual speech processing involving the motor cortex was
proposed by Skipper et al. in an attempt to explain the McGurk effect (Skipper, van Wassenhove,
Nusbaum, & Small, 2007). In an imaging study, the authors showed that audio-visual speech
perception seems to occur in many of the same areas that are active during speech production.
The McGurk effect would result from the resolution of the mismatch between the motor plan
built by the listener upon seeing a certain lip movement and the auditory information they receive.
A precise understanding of the mechanisms underlying audio-visual speech binding still
appears to be out of reach. However, many studies converge in localizing
multisensory binding away from the early sensory processing stages. This idea
alone justifies our attempt to use audio-visual speech stimulation in a binocular rivalry
experiment. Indeed, any cross-modal effect involving the suppressed stimulation would imply its
persistence up to the levels where the binding eventually takes place.
1.3.3 Conclusions
There exist few experiments using a multisensory design to study multistability. However,
based on the ones I just mentioned I would like to draw some conclusions useful to our purpose.
First, most studies rely on low level multisensory congruency (mainly temporal or spatial
congruency) and it seems crucial to develop a new way to associate a bistable visual stimulation
with sound that could rely on a stronger, higher level multisensory congruency. We chose to use
the most natural and best-studied audio-visual integration: speech perception through the
McGurk effect. Second, if multisensory integration could be found between a conscious auditory
stimulation and a suppressed visual stimulation in the case of binocular rivalry, this would
provide compelling evidence against the low level approach and favor the hybrid model where
the suppressed stimulation can potentially persist to intermediate and higher levels of processing.
Finally, volitional control in binocular rivalry seems to be affected by the addition of a
congruent sound, but no effect is detected in passive viewing. Whether these results remain true
for highly congruent audio-visual verbal material needs to be verified.
2 Materials and methods
2.1 General material and methods
2.1.1 Stimuli design
a General content: McGurk effect
The gist of the series of experiments I will present here is the association of
binocular rivalry with the McGurk effect. To achieve this, it is necessary to induce
binocular rivalry not between static images, as is usually the case, but between videos.
The material used consisted of 1.64 s (25 fps) videos of the face of a woman uttering either the
sound /aba/ or /aga/. The McGurk effect could then be induced by presenting the lip
movement /aga/ in synchrony with the sound /aba/ (extracted from the /aba/ video). For
McGurk-sensitive subjects, this would indeed lead to the perception of an auditory /ada/.
b How to build rivaling videos?
Binocular rivalry was chosen among the variety of multistable stimuli for the large
amount of data available concerning this effect but also for its relative flexibility in terms of
rivaling percepts. Contrary to most multistable phenomena, binocular rivalry does not rely on
specific percepts but rather on a specific way of presenting images, the content of which can be
determined by the experimenter. However, some constraints need to be taken into
consideration in order to ensure that proper rivalry is achieved. Mainly, those constraints are the
following:
- The two rivaling images should overlap.
- The strength of the rivaling images should be balanced: both images should be
rather similar in contrast, brightness, density of contours, etc.
Unbalanced strengths would not prevent the rivaling images from alternating in
conscious perception but would lead to the relative domination of one of the two
perceptual states.
- To avoid piecemeal rivalry, i.e., the spatial breaking of the perceived
bistable stimulus into multiple uncorrelated rivaling pieces, images should be small
in size (typically under 6°).
- Finally, the use of contrasting colors (one image red and the other green) can help
achieve clean rivalry and also ensures that subjects are able to easily report
any piecemeal rivalry.
The key aspect of our experiment is that we want to use videos as rivaling stimulations,
which represents a departure from classical binocular rivalry experiments. The
constraints applicable to static images therefore need to be extended to video material. Mainly:
- There should be a good overlap of both videos for each video frame.
- The strength of both videos should be balanced for each video frame.
However, in order for binocular rivalry to function with video stimulations,
knowledge drawn from the use of static images does not suffice. An understanding of how
motion interferes (or not) with binocular rivalry is also required.
The bulk of the studies on binocular rivalry have focused on whether the neural mechanism
for rivalry is located at an early or late stage of processing. However, little attention has been paid to
the fact that the visual system consists of two parallel pathways, the magnocellular and the
parvocellular pathways. This division raises the question of whether both pathways contribute
equally to binocular rivalry. The magnocellular pathway involves the motion-sensitive extrastriate
areas whereas the parvocellular pathway stems from retinal cells which project more dominantly
to extrastriate areas that are sensitive to form and color. Magnocellular cells are known to be
tuned to lower spatial frequency and higher temporal frequency signals than are the parvocellular
cells which are known to be color opponent. For these reasons, these two neural pathways are
thought to support two functionally distinct channels in vision. The magnocellular pathway is
generally considered to be involved in motion processing, whereas the parvocellular pathway is
often termed the color opponent channel, thus emphasizing its important implication in color
processing.
He et al. reviewed various lines of evidence concerning the involvement of both channels in
binocular rivalry and concluded that binocular rivalry is essentially supported by the
parvocellular pathway (He, Carlson, & Chen, 2005). They noted that “the visual system does
not seem to be willing to tolerate interocular conflicts in the P [parvocellular] pathway, with its
significant role in processing information related to object identity, and resorts to alternating
views of the two stimuli. Conflicts in the M [magnocellular] pathway are less likely to lead to
rivalry, with the visual system more willing to accept an integrated version of the inputs.” Color
and form contrast is therefore a reliable source of rivalry, whereas motion incongruence seems
more likely to lead to a blending of the two percepts.
Another important aspect of motion in the context of binocular rivalry is its tendency to
reinforce a stimulus: the quantity of motion contributes to the strength of a stimulation, along
with brightness, contrast, etc. This is exemplified by the so-called continuous flash suppression
effect, in which a static stimulus is suppressed for a rather long period of time by presenting
rapidly changing Mondrian-like stimuli to the other eye (Tsuchiya & Koch, 2005). Motion can
thus dramatically enhance the dominance duration of a stimulation if it is not counterbalanced
in the two eyes. Therefore, in order to ensure relatively equivalent dominance durations between
two percepts, they should be balanced in terms of quantity of movement.
Finally, it is important to note that transient movements (i.e., a rapid movement in one
percept while the other remains static) are known to trigger switches in favor of the moving
stimulus.
Incongruent motion does not offer a strong enough basis to induce binocular rivalry,
whereas color and contour contrast lead to maximal rivalry. Moreover, the quantity of
motion displayed in each video should always be equal to avoid transient effects, which would
artificially trigger perceptual switches. This implies that for videos to rival, they should be
rather congruent in terms of motion content but incongruent in terms of color.
Studies of color perception suggest that the red/green chromatic contrast
is one of the two main contrasts on which color vision rests. This contrast will consequently be used,
along with the black/white contrast, as the basis of binocular rivalry.
Achieving the successful use of video material as the basis of binocular rivalry is one of the
main goals of this experiment (although a technical one), since it would open the door to
innovative binocular rivalry studies beyond the present one.
c Visual stimulation
All visual stimulations are presented within a circular mask (diameter 5° with soft edge).
The chosen diameter is consistent with the values used in the literature, which typically range
from 1° to about 6°. Small diameters are necessary to avoid piecemeal rivalry. Stimulations are
presented against a black background.
Two raw 1.64 s videos were recorded at 25 fps of a woman uttering either the sound
/aba/ or /aga/. Lip features were extracted and color filters were applied so that for each lip
movement (/aba/ and /aga/) two videos were generated:
- a video of type 1: Black lips/Green face
- a video of type 2: White lips/ Red face
Figure 12 Movie types
The color difference between face and lips ensures good perception of the lip
movement, which is the most important feature for the McGurk effect. Color contrast between
videos (rivaling videos will always consist of a video of type 1 against a video of type 2) stems
from the considerations presented in 2.1.1b. When applying the video filters, care was taken
not to remove other features, such as the teeth and the tongue, which are known to play a role
in speech perception.
The videos of the /aba/ and /aga/ lip movements were designed so that both
movements are maximally synchronized (similar onset and duration of all the segments
corresponding to the various lip movements involved in uttering both sounds) and differ, in
the frames corresponding to the utterance of the consonants /b/ and /g/, only by the mouth
aperture (pronouncing the consonant /b/ requires the speaker to fully close her mouth, unlike
pronouncing the consonant /g/). With standard 25 fps videos, synchronization was bound
to a 40 ms precision. A cross-fade was applied between the last and first frames of the videos to
ensure that they could be looped without any discontinuity, since such a discontinuity could
trigger artificial switches. Since brightness could not be measured in situ, luminosity was
balanced between video types on the first frame using Photoshop's mean luminosity measure. We
hypothesized that this equivalence in luma values would translate into an equivalence of
brightness in the Head Mounted Display.
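The luminosity-balancing step can be sketched as follows. This is only a minimal illustration of the principle, not the actual Photoshop procedure: the Rec. 601 luma weights and the simple multiplicative gain correction are assumptions.

```python
import numpy as np

def mean_luma(frame_rgb):
    """Mean luma of an RGB frame (floats in [0, 1]), using Rec. 601 weights (assumption)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def balance_to_reference(frame_rgb, reference_luma):
    """Scale a frame so its mean luma matches the reference, clipping to [0, 1]."""
    gain = reference_luma / mean_luma(frame_rgb)
    return np.clip(frame_rgb * gain, 0.0, 1.0)

# Toy example: a dark uniform frame balanced against a brighter reference frame.
dark = np.full((4, 4, 3), 0.2)
reference = np.full((4, 4, 3), 0.4)
balanced = balance_to_reference(dark, mean_luma(reference))
```

In the experiment only the first frames of the two video types were compared; the sketch above would be applied to those frames.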
Videos were cropped to the circular dimensions mentioned above, with the center of the
circle corresponding to the center of the closed mouth. The full mouth is visible throughout the
whole video. Other than the mouth, only the tip of the nose is visible, the other facial features
lying outside the circle. This type of presentation forces the focus of attention onto the lip movement
and therefore is thought to maximize the possibility of a McGurk effect.
Experiment 1 uses static images: gratings and static lips.
- Gratings are sinusoidal black and white gratings with maximum contrast. The
frequency used is 1 cycle/deg. One grating is tilted 45° clockwise, the other
45° counterclockwise (see Figure 13).
- Static lips images consist of the first frame of the /aba/ video, types 1 and 2. They
share common visual properties with the videos described above (see Figure 12).
Figure 13 Gratings
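Such gratings can be generated procedurally. The sketch below, assuming an arbitrary pixel density of 25.6 px/deg (so that a 5° stimulus spans 128 pixels) and a tilt measured from vertical, builds a full-contrast 1 cycle/deg sinusoidal grating at ±45°.

```python
import numpy as np

def make_grating(size_deg=5.0, px_per_deg=25.6, cyc_per_deg=1.0, tilt_deg=45.0):
    """Sinusoidal grating with values in [0, 1] (full contrast); tilt from vertical."""
    n = int(round(size_deg * px_per_deg))
    coords = (np.arange(n) - n / 2) / px_per_deg        # pixel coordinates in degrees
    x, y = np.meshgrid(coords, coords)
    theta = np.deg2rad(tilt_deg)
    ramp = x * np.cos(theta) + y * np.sin(theta)        # axis orthogonal to the bars
    return 0.5 + 0.5 * np.sin(2 * np.pi * cyc_per_deg * ramp)

left_tilted = make_grating(tilt_deg=-45.0)    # presented to the left eye
right_tilted = make_grating(tilt_deg=+45.0)   # presented to the right eye
```

In practice the gratings would also be windowed by the 5° soft-edged circular mask described in 2.1.1c.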
d Auditory stimulation
Sound was synchronized between the /aba/ and /aga/ videos (with a precision of
40 ms) to ensure that the dubbing of the /aba/ sound onto the /aga/ video could be achieved.
2.1.2 Stimuli presentation
a Visual stimulation
• Head Mounted Display (HMD)
Visual stimulations are displayed using a dual-channel Head Mounted Display (HMD), an
nVisor SX, at 1280x768 with a field of view of 44°x34°.
Figure 14 Head Mounted Display
• Stimuli
To ensure good binocular fusion, as well as proper fixation of the lips, a fixation cross was
systematically added to the videos (black and white cross, diameter 0.3°). Fusion was also
facilitated by a random distribution of white squares around the stimuli (random distribution on a
40°x30° rectangle, square size 1°). Fusion could have been more difficult with the HMD than
with the classical stereoscope since the stimuli are virtually located at an infinite distance (infinite
focal distance) and no adjustment of vergence is possible. However, the combined effects of the
fixation cross, of the similar shape and border of the circular stimuli, and of the randomly
distributed squares guaranteed easy fusion.
Figure 15 Global stimulus display
b Auditory stimulation
Auditory input was played using ear canal phones (Sennheiser).
2.1.3 Subjects
14 subjects (10 men, 4 women), all right-handed, participated in the experiment. They were
all between 18 and 35 years old (average 26). Of the 14 subjects, 13 were naïve to the
purpose of the experiment and to the McGurk effect. Eye dominance was tested using a simple
alignment task: subjects were asked to align their right index finger with a point in the room while
keeping both eyes open, then to alternately close their right and left eyes and report which one
kept the alignment. 8 subjects showed right eye dominance, 6 showed left eye dominance.

Following the results of experiment 2, subjects were divided into two groups. All subjects
participated in all the experiments except experiment 4, in which only subjects from the McGurk
group were tested.
2.1.4 Experimental overview
The full experimentation consists of 5 experiments, distributed over two sessions of
about 1h30 each. Sessions were separated by at least 3 days.
The task the subject was asked to perform consisted of either a continuous report
task for the dominance duration studies (exp 1, 3 & 5), during 130 s of exposure to rivaling
stimulations, or a single report after each audio-visual stimulus presentation. Subjects reported
their responses by pressing the right or left arrow key on a keyboard with their right hand.
- For continuous report, the subject was put in binocular rivalry conditions and
asked to continuously press the key that corresponded to the perceived stimulus.
The subject always had the possibility of reporting piecemeal rivalry or any
ambiguous perceptual state simply by not pressing any key. For the grating
stimuli, subjects were asked to report whether the consciously perceived grating
was tilted right or left. In the lips conditions (static images or videos), subjects were
asked to report the color of the lips in the consciously perceived stimulus (black
or white).
- For single report, after the presentation of an audio-visual stimulus (involving
binocular rivalry or not), the subject was asked to press the key that best
corresponded to her audio-visual perception in a forced-choice paradigm.
In the continuous report experiment, one condition consisted of a volition test. The
instruction was then for the subject to keep reporting the perceptual alternations while trying to
hold a specific stimulus (black lips or white lips). The instruction was explicitly to do so without
using voluntary blinks or eye movements away from the fixation cross.
In order to avoid any association between lip movement and lip color, these two factors were
systematically balanced.
For the continuous report condition, the following steps were taken to clean the data:
- The first 10 seconds were systematically removed since they are known to often
correspond to an ambiguous piecemeal situation.
- Successive repetitions of the same key press were treated as corresponding to the
same dominance period if the delay between presses was less than
500 ms. Otherwise, the repetitions were considered as corresponding to different
dominance periods separated by switches to an ambiguous state.
- Rather than raw dominance durations, we focused on a normalized
version (dominance duration / subject's average dominance duration). This
normalized ratio ensures that the values can be compared across
subjects.
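The cleaning steps above can be sketched as follows, with each key press represented as an (onset, offset, key) tuple; the exact encoding of the raw key-press log is an assumption.

```python
def clean_dominance_periods(presses, trial_start=0.0, discard=10.0, merge_gap=0.5):
    """presses: list of (onset, offset, key) tuples sorted by onset, times in seconds.
    Returns normalized dominance durations (duration / subject mean duration)."""
    # 1. Discard the first 10 s, which often correspond to an ambiguous piecemeal phase.
    kept = [(on, off, k) for on, off, k in presses if on >= trial_start + discard]
    # 2. Merge repeats of the same key separated by less than 500 ms;
    #    longer gaps count as switches to an ambiguous state.
    merged = []
    for on, off, k in kept:
        if merged and merged[-1][2] == k and on - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], off, k)   # extend the previous period
        else:
            merged.append((on, off, k))
    # 3. Normalize by the subject's average dominance duration.
    durations = [off - on for on, off, _ in merged]
    mean = sum(durations) / len(durations)
    return [d / mean for d in durations]

# Toy log: the first press falls in the discarded window; the next two merge.
ratios = clean_dominance_periods(
    [(2.0, 5.0, "L"), (11.0, 14.0, "L"), (14.2, 16.0, "L"), (17.0, 20.0, "R")]
)
```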
2.2 Experiment 1: Binocular rivalry test (control experiment)
2.2.1 Objective
Some people simply do not experience bistability when put in binocular rivalry
conditions. In order to screen out such people, the first part of this experiment consisted of the
most classic binocular rivalry conditions, using orthogonal gratings as rivaling stimuli. A subject
who did not experience rivalry in this condition would be considered unable to pursue the other
experiments.
A first step toward testing the new stimuli built for this experiment consisted in using a
static version of the lips as rivaling images. This served as a preliminary control, assessing
whether in the static version the stimuli displayed classic bistable behavior, mainly randomness
and exclusivity, whose respective correlates are a gamma distribution of dominance durations
and a low proportion of piecemeal percepts. Classic behavior would partially validate the choices
made when building the stimulations. Moreover, for experiment 4, it was necessary to ensure
rivalry with static images.
2.2.2 Method
The experiment consisted of 7 continuous report trials of 130 s each. The first 3 trials
used static grating images as rivaling stimuli, the following 3 trials used rivaling lips images (see
2.1.2a). Finally, in the last trial, the same lips image was presented to both eyes (no binocular
rivalry) and the image was artificially switched (stimulus flip condition). Trials were separated by
breaks whose duration was controlled by the subject.
For the gratings condition, the left-tilted grating was always presented to the left eye and
the right-tilted grating to the right eye (no balancing was necessary). For the lips image condition,
although lip movement and lip color were balanced, the black lips were always presented to the
left eye and the white lips to the right eye, in order to avoid unnecessary balancing that would
have doubled the length of the experiment. For the stimulus flip condition, the durations of the
percepts were randomly drawn from a Gaussian distribution (mean = 3 s, σ = 2 s).
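The flip durations can be drawn as sketched below. The thesis does not state how non-positive draws from such a wide Gaussian (σ = 2 s around a 3 s mean) were handled, so the redraw-until-plausible rule and the 0.2 s minimum duration are assumptions.

```python
import random

def draw_flip_durations(total=130.0, mean=3.0, sigma=2.0, min_dur=0.2, seed=0):
    """Draw successive percept durations until the 130 s trial is filled.
    Draws below min_dur are rejected and redrawn (assumption)."""
    rng = random.Random(seed)
    durations, elapsed = [], 0.0
    while elapsed < total:
        d = rng.gauss(mean, sigma)
        if d < min_dur:          # reject negative or implausibly short draws
            continue
        durations.append(d)
        elapsed += d
    return durations

flips = draw_flip_durations()
```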
The experiment started with a training session of 4 trials, each lasting 60 s. Training trials
replicated the experiment's conditions: two trials were dedicated to the gratings condition, two
to the lips condition. For each stimulus type, the subject was first asked to simply look at the
rivaling stimuli and then to practice reporting the conscious percept. All subjects received the
same amount of practice.
Figure 16 Experiment 1 task timeline
2.2.3 Results and discussion
Only the subjects who experienced binocular rivalry for the classic gratings stimuli were
tested in the four other experiments.
To analyze the characteristics of the binocular rivalry induced by the static lips
stimuli, the distribution of normalized dominance durations (ratios) was plotted. The ratios were
grouped into bins of width 0.2 starting from 0. The distribution was
first determined for each subject and then averaged over all subjects, so that each subject
contributes with equal weight to the final distribution.
Figure 17 Normalized dominance duration distribution
Figure 17 presents the normalized dominance duration distribution. The x-axis
represents the normalized dominance durations (ratios) and the y-axis gives the frequency
observed for each ratio. A gamma-shaped distribution of dominance durations characterizes
binocular rivalry and suggests that successive dominance durations are independent of each
other. The fact that a gamma distribution was found for the new lips stimuli indicates that the
stimuli we designed induce, at least in the static version, the classic random bistability. Subjects
reported a low proportion of ambiguous percepts for the lips stimuli, 6.83% (SE = 1.50), which
indicates that the rivaling percepts are mutually exclusive, a necessary condition of binocular
rivalry.
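The distribution in Figure 17 can be reconstructed as follows: the normalized durations are binned in steps of 0.2, a per-subject frequency distribution is computed, and subjects are averaged with equal weight. This sketch assumes the ratios have already been computed per subject, and the 4.0 upper bound on the ratio axis is arbitrary.

```python
import numpy as np

def average_ratio_distribution(ratios_per_subject, bin_width=0.2, max_ratio=4.0):
    """Per-subject normalized histograms (frequencies, not counts), averaged
    across subjects so that each subject contributes with equal weight."""
    edges = np.arange(0.0, max_ratio + bin_width, bin_width)
    per_subject = []
    for ratios in ratios_per_subject:
        counts, _ = np.histogram(ratios, bins=edges)
        per_subject.append(counts / counts.sum())
    return edges, np.mean(per_subject, axis=0)

# Toy data: two subjects with four normalized dominance durations each.
edges, freq = average_ratio_distribution(
    [[0.5, 0.9, 1.1, 1.5], [0.3, 0.7, 1.0, 2.0]]
)
```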
The static lips image rivalry was also used to assess the relative strength of the percepts.
This relative strength can be evaluated by analyzing the difference in average dominance
durations between the two percepts, a longer dominance duration indicating a stronger percept.
The black lips stimulus was found to be stronger than the white lips one (F(1,12) = 13.03, p < 0.004).
Interestingly, the interaction between stimulus strength and eye dominance was significant
(F(1,12) = 9.80, p < 0.01). If the stronger stimulus is presented to the non-dominant eye, the
effect of stimulus strength on dominance duration is cancelled, whereas if the stronger stimulus is
presented to the dominant eye, the effects of eye dominance and stimulus strength sum to
increase dominance durations.
Finally, the stimulus flip condition proved that the subjects correctly performed the task and
were able to report what they saw (12 subjects correctly reported 100% of the percepts and the 2
others were above 95% correct responses). Moreover, in this condition it was possible to
measure the average reaction time for each subject (425 ± 18 ms averaged over all subjects).
2.3 Experiment 2: Test of the McGurk effect (baseline)
2.3.1 Objective
The purpose of this experiment was to test the susceptibility of the subjects to the
McGurk effect. Although this effect is one of the best-known multisensory effects, not everyone
shows the same sensitivity to it.
There are various versions of the McGurk effect, all consisting in dubbing an incongruent
sound onto a lip movement. For this experiment and the following ones we chose to focus on
the most robust McGurk effect, which consists in dubbing the sound /aba/ onto a lip
movement uttering /aga/. For people sensitive to this effect, this results in the
illusory perception of the sound /ada/ (see 1.3.2a).
This experiment served as a baseline for experiment 4, in which the same McGurk
effect was studied under binocular rivalry conditions.
2.3.2 Method
The experiment consisted of 4 blocks of 20 single reports. For each single report, the audio-
visual stimulus presented consisted of the 1.64 s video of the lips uttering either /aba/ or /aga/,
onto which was dubbed the sound /aba/, which was therefore respectively the original soundtrack
or an incongruent soundtrack. Of the 20 presentations in a block, 10 used the lip motion /aba/
and the other 10 the lip motion /aga/. The video type used was chosen randomly between type
1 (black lips/green face) and type 2 (white lips/red face) at each stimulus presentation. Color and
lip motion were randomized in groups of 4 trials representing the 4 possible combinations
of color and lip motion.
After each stimulus presentation the subject performed a forced choice task,
reporting whether the sound heard was closer to /aba/ or to /ada/. Subjects were instructed
to respond quickly and automatically. Beyond the fact that the lip movement was
located right at the fixation cross, subjects were also explicitly instructed
to pay attention to the lip movement as well as to the sound in order to force audio-visual
association.
Subjects started with 16 training trials that replicated the experiment’s conditions.
Figure 18 Experiment 2 task timeline
2.3.3 Results and discussion
The sound heard by the subjects during audio-visual presentations was systematically
/aba/ (A/aba/). In the visual modality, however, the video shown could correspond to a
mouth uttering either /aba/ (V/aba/) or /aga/ (V/aga/). For each subject the proportion of /aba/
percepts selected was analyzed in both visual conditions. In the V/aba/ condition, the proportion
of /aba/ percepts selected corresponded to the proportion of correct responses, since the audio-
visual stimulation was congruent. For all subjects, the proportion of /aba/ responses in this
condition was above 90%, with 11 subjects at 100%. This proves that all subjects were able to
correctly report what they perceived.
In the V/aga/ condition, however, the proportion of /aba/ responses indicated the
proportion of incongruent audio-visual stimulations for which the subject did not experience
the McGurk illusion. Indeed, the McGurk effect consists here in perceiving the sound
/ada/. Results for this condition show that subjects could be
classified into two categories according to their sensitivity to the McGurk effect: 8 subjects
perceived the McGurk fusion (AV/ada/) in more than 90% of the trials, while 6 subjects perceived
the McGurk fusion in less than 10% of the trials. The first group (McGurk group) is composed
of subjects who are highly susceptible to the McGurk effect. Conversely, the second (No McGurk
group) is composed of subjects who are barely sensitive to the McGurk effect.
Figure 19 displays the average values for each group. While both groups
show similar results in the /aba/ seen condition (congruent audio-visual condition), they strongly
differ in the /aga/ seen condition. It is important to note that in this second condition, the lower
the proportion of selected /aba/ percepts, the higher the proportion of reported McGurk effects
(/ada/ percept selected).
Figure 19
Interestingly, the separation between the McGurk and the No McGurk subjects was
categorical (no continuum). I would like to emphasize that the McGurk effect is very robust
since, for the subjects who are susceptible to it, it leads to an almost systematic illusory audio-
visual fusion. However, our experiment shows that some subjects are simply not sensitive to the
McGurk effect, a fact often omitted in the classic multisensory literature. It was decided that these
subjects would still participate in the experiments (although not in experiment 4).
2.4 Experiment 3: Binocular rivalry for videos (test and baseline)
2.4.1 Objective
The first purpose of this experiment was to test the novel video stimulations as
rivaling percepts in the context of binocular rivalry. As I already mentioned, binocular rivalry has
so far been studied essentially with static images, with very few exceptions. Our goal was
therefore to create a new sort of rivaling stimulus using videos. In order to validate these new
stimulations, it was necessary to prove that they triggered classic binocular rivalry alternations
characterized by randomness, exclusivity, and inevitability (see 1.1.4). Moreover, the possibility
that lip motion could trigger switches was investigated.
This experiment also had a second objective, as it was designed to serve as a baseline
for experiment 5, which tested the effect of sound on binocular rivalry dynamics and on volitional
control (see 2.6).
2.4.2 Method
The experiment consisted of 9 continuous report trials of 130s each. The rivaling visual
stimuli were the videos described above, presented in a loop (see 2.1.1c). The black lips
stimulus was always presented to the left eye and the white lips stimulus to the right eye (balancing
color was not necessary). The rivaling stimuli also always differed in lip movement. No sound
was played during this experiment.
The subjects performed two different types of task depending on the trial: 6 trials
consisted of basic continuous report (of lip color) and 3 of continuous report with a volition test. In
the volition test condition, subjects were asked to try to hold the black lips, which always
corresponded to the video of lips uttering /aba/.
The 9 trials were divided into 3 blocks of 3 trials. In each block, the first 2 trials
were basic continuous report trials and the last one a volition test trial. Within each
block, for the basic continuous report trials, lip movement and color were counterbalanced (one trial
used black lips uttering /aba/, the other black lips uttering /aga/). The order of these
two trials was counterbalanced between subjects.
The subjects started with 3 training trials reproducing a typical block of 3 trials as described
in the previous paragraph.
Figure 20 Experiment 3 task timeline
2.4.3 Results and discussion
Designing rivaling video stimuli was the main technical challenge of this research project.
Experiment 3 was conducted to validate these stimuli by checking whether the novel videos
elicited classic binocular rivalry. Accordingly, in this experiment we examined whether
the three characteristics of binocular rivalry – exclusivity, randomness and inevitability – also hold
for the rivaling video stimulations we created.
On average, subjects reported perceiving a mix of both videos for only 3.96% (SE=0.78) of
the time, which is less than observed with static images (6.8%). Exclusivity between the two
rivaling videos is therefore achieved. An analysis of the distribution of dominance durations
similar to that of experiment 1 (see 2.2.3) was conducted for the basic continuous report
conditions. The gamma shape of the distribution presented in Figure 21 is characteristic of the
random dynamics of binocular rivalry. Finally, the fact that none of the subjects was able to prevent
the switches from occurring in the volition test condition proved the inevitable nature of the
perceptual alternations.
Figure 21
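The gamma-shaped distribution of dominance durations can be checked numerically. Below is a minimal illustrative sketch (not the analysis code actually used in this work) of how such a fit could be done with SciPy; the function name and the synthetic sample are assumptions.

```python
import numpy as np
from scipy import stats

def fit_dominance_gamma(durations):
    """Fit a gamma distribution to dominance durations (seconds) and
    return (shape, scale, KS p-value), with location fixed at 0
    since durations are strictly positive."""
    durations = np.asarray(durations)
    shape, loc, scale = stats.gamma.fit(durations, floc=0)
    # Kolmogorov-Smirnov goodness of fit against the fitted gamma.
    _, p_value = stats.kstest(durations, "gamma", args=(shape, loc, scale))
    return shape, scale, p_value

# Synthetic durations for illustration only (drawn from a gamma
# distribution; these are not the thesis data).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=3.0, scale=0.8, size=500)
shape, scale, p = fit_dominance_gamma(sample)
```

A large KS p-value indicates that the gamma distribution is a plausible description of the observed dominance durations.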
Motion is known to trigger perceptual switches. We therefore investigated whether some
temporal portions of the videos generate significantly more switches, which would partially disrupt
the randomness of dominance durations. We analyzed the distribution of switches with the
1.64s video as the temporal baseline. For each subject, switch times were mapped to their
position in the video. They were then pooled into time bins starting at 0 ms and increasing by
steps of 50 ms. Once averaged over all subjects, the proportion of switches that occurred in each
time bin was compared to the probability level expected for a uniform distribution (t-test against a
single value). Only a few values were significantly different from this baseline, and these outliers were
not grouped into consecutive time bins (the outliers are indicated by a star in Figure 22).
Figure 22
The novel rivaling videos present all the classic characteristics of binocular rivalry and
can therefore be said to induce proper bistable perceptual alternations. We proved that it is
possible to use videos as competing stimuli for binocular rivalry. This paves the way for an
even more flexible use of this bistable phenomenon.
Experiment 3 also investigated the effect of volitional control on binocular rivalry
dynamics. Figure 23 plots the average dominance duration corresponding to each
percept (black lips and white lips) in both the basic continuous report (no volition) and the
volition test conditions, with separate results for the McGurk subjects and the
No McGurk subjects.
Figure 23
The interaction between lip color and volition condition was significant only for
the McGurk subjects (F(1,7)=10.50, p<0.015). The effect of volition on dominance duration
observed for the McGurk subjects stemmed from longer durations for the hold stimulus (black
lips) than for the other stimulus (Tukey HSD post-hoc: p=0.015). The fact that the No McGurk
subjects did not show any significant effect of volition could be explained by the fact that,
unlike the other group, they are not able to use the lip motion as an additional cue identifying
the percept to hold.
2.5 Experiment 4: Audio-visual integration with the suppressed stimulus
2.5.1 Objective
This experiment tests whether audio-visual integration is possible with a suppressed visual
stimulus during binocular rivalry. Cross-modal effects involving the suppressed stimulation
would indeed indicate that this unconscious stimulus reached the cortical areas in charge of
multisensory processing, thereby supporting the hybrid view against the low-level interpretation
of binocular rivalry.
The McGurk effect coupled with binocular rivalry was used to assess such potential
multisensory integration with the suppressed stimulus. Each eye was presented with a different
lip motion (V/aba/ versus V/aga/) while the sound was systematically A/aba/, so that the two
rivaling videos would lead to different audio-visual percepts (V/aba/ + A/aba/ = AV/aba/ while
V/aga/ + A/aba/ = AV/ada/). Due to binocular rivalry, each video could be either consciously seen
or suppressed. The questions are therefore the following: Do the subjects always integrate the
sound with the consciously seen video, or can integration with the suppressed one also occur?
And if integration with the suppressed stimulus takes place, is it more likely to occur when the
visual stimulation and sound are physically congruent (V/aba/ + A/aba/) or not (V/aga/ + A/aba/)?
2.5.2 Method
Only McGurk subjects participated in this experiment. The experiment consisted of 4
blocks of 20 to 40 single reports. For each single report, the audio-visual stimulus
consisted of the two 1.64s rivaling videos of lips uttering /aba/ and /aga/ (black lips versus
white lips), onto which was dubbed the original /aba/ soundtrack of the /aba/ video. We chose
to systematically present the strongest percept (black lips, as demonstrated by
experiment 1) to the non-dominant eye in order to ensure that the effects of eye dominance and
stimulus strength would not add up and strongly unbalance the dominance durations. Although
each eye systematically received the same type of stimulus in terms of lip color, the lip motion
type was balanced between eyes.
At the beginning of each single report, subjects were instructed to wait for
either the white lips or the black lips. Static rivaling images similar to those of experiment 1 were
then presented, and subjects waited until they perceived the image whose lip color corresponded to
the instruction. Once stabilized on the target percept, subjects pressed a key (space bar)
that triggered the videos in both eyes. At the end of the videos, subjects reported the sound they
heard in a forced choice task between a sound closer to /aba/ or closer to /ada/. Subjects were
instructed to respond quickly and automatically. However, they were
explicitly asked not to answer if a perceptual switch had occurred while the videos were
playing, in order to ensure that they had perceived only the video whose type corresponded to the
instruction. If a switch occurred they were asked to press the space bar to repeat the
trial. Trials could only be repeated once, to limit the total number of trials to a maximum
of 40. Special attention was paid to instructing the subjects on the importance of redoing a trial if a
switch (even a partial one) occurred during the video presentation. Subjects started with 16
training trials that replicated the experiment's conditions.
In order to assess whether audio-visual fusion occurred with the stimulus presented in the
suppressed eye, the results of this experiment were compared to those of experiment 2, which
provided the baseline for the multisensory integration performance without binocular rivalry.
Figure 24 Experiment 4 task timeline
2.5.3 Results and discussion
Finding an audio-visual fusion with a stimulus presented in the suppressed eye would be a
clear indication that the so-called suppressed stimulus was still available for cross-modal
integration and therefore could not have been suppressed at early levels of visual processing.
In the auditory modality, subjects were systematically presented with the sound /aba/,
while in the visual modality, although only one percept was consciously seen, subjects were
presented with rivaling videos each uttering a different sound, /aba/ or /aga/. Using the initial
instruction on lip color, the experimenter could manipulate which lip motion type would
constitute the conscious visual stimulus. Two states were then possible: either the
conscious stimulus was V/aba/ while V/aga/ was suppressed (/aba/ seen), or the conscious
stimulus was V/aga/ while V/aba/ was suppressed (/aga/ seen). Importantly, if audio-visual
integration were to take place only with the conscious stimulation, then in the /aba/ seen
condition subjects should report hearing /aba/, while in the /aga/ seen condition subjects
should report the McGurk illusion /ada/. Therefore, in Figure 25 the proportion of /aba/
perceived is plotted for the /aba/ seen condition, and the proportion of /ada/ perceived is
plotted for the /aga/ seen condition, so that for each condition the proportion of cross-modal
integration with the conscious stimulus is represented (black bars). In this situation, if the
proportion of audio-visual integration with the conscious percept differs from 100%, then the
difference could stem either from noise or from integration with the suppressed stimulus
(gray bars).
Experiment 2, whose results are also shown in Figure 25, served as a baseline for the
strength of audio-visual fusion in a non-rivaling situation. Since only McGurk subjects performed
experiment 4, only their average results are presented for experiment 2. These results differ from
100% and this difference indicates the natural amount of noise in audio-visual integration for
these subjects.
Results from experiment 4 have been segregated according to whether the conscious
stimulus was seen with the dominant eye or with the non-dominant eye.
Figure 25
No effect of cross-modal integration was found when lips are seen in the dominant eye.
In this case, the gray bars are not significantly different from the noise level (baseline). This
situation corresponds to a suppressed stimulus presented in the non-dominant eye.
On the other hand, there is a trend toward 19.6% of audio-visual integration with the
suppressed stimulus when lips are seen with the non-dominant eye (F(1,7)=3.38, p=0.11). Cross-
modal integration with the suppressed stimulus is not equivalent between the /aba/ seen and the
/aga/ seen conditions. In the /aga/ seen condition, the suppressed stimulus is congruent with the
sound (AV congruent case, right-tilted gray bars ///), whereas in the /aba/ seen condition, the
suppressed stimulus is incongruent with the sound (McGurk case, left-tilted gray bars \\\).
Considering the corresponding data separately, audio-visual integration with the suppressed
stimulus occurs in 15.6% of the trials for the AV congruent case (F(1,7)=3.56, p=0.10) and in
23.7% of the trials for the McGurk case (F(1,7)=3.06, p=0.12). Importantly, these two values do
not differ statistically.
These results suggest that the suppressed percept is not deactivated at an early level of
visual processing but remains available for further processing, which in turn can rise to
awareness. Indeed, when the hidden stimulus is presented to the dominant eye, a non-negligible
proportion of integration with the auditory input seems to occur. The low-level framework for
binocular rivalry cannot account for such findings. As presented in 1.3.2b, audio-visual integration,
and especially audio-visual speech binding, is thought to take place in higher association cortices
and not in the early sensory cortices. Therefore, the trend outlined by experiment 4 suggests that
some information specific to the suppressed stimulus is maintained up to these multisensory
cortical areas. Moreover, the visual suppression of the stimulus does not result in its total
exclusion from consciousness: the visually suppressed stimulus does influence
the phenomenal awareness resulting from the auditory analysis.
The absence of an effect for a suppressed stimulus presented to the non-dominant eye
could be due to insufficient statistical power. For an individual, the information perceived by
the non-dominant eye could be treated as less relevant than that conveyed by the dominant
eye. Therefore, information stemming from the non-dominant eye could be
regarded as less reliable by a multisensory system performing audio-visual integration. This would be
the case if this system were governed by Bayesian principles involving intrinsic priors on the reliability
of the signal. Consequently, the effect of a suppressed stimulus presented to the non-dominant
eye could be too weak to be distinguished from noise.
The reason why we found a cross-modal effect involving the suppressed stimulus while
preceding experiments failed to detect such an effect (see 1.3.1a) could lie in some
intrinsic differences between the face/vase illusion and binocular rivalry. However, it is likely that
the use of a natural stimulation (video of lip movements) results in a stronger and more
reliable audio-visual integration, thus offering us better discrimination power.
2.6 Experiment 5: Impact of sound on binocular rivalry
2.6.1 Objective
The last question we addressed concerned the potential influence of sound on the dynamics of
binocular rivalry. The analysis of experiment 3's results characterized the dynamics
of binocular rivalry, both in the passive condition (without volition) and under volitional control,
when no sound accompanies the visual presentation. Experiment 5 replicated almost exactly the
conditions of experiment 3, with the only difference that the videos were now always played in
synchrony with the /aba/ sound. After verifying that the sound did not introduce undue
modification of the alternation process, we analyzed its effect on dominance durations in order to
answer the following questions: Does the sound modify the dynamics of binocular rivalry in
passive viewing? What is the effect of sound on volition? Does it reinforce volitional
control?
Interestingly, since the lip motions in the two rivaling videos differ (/aba/ versus /aga/),
the /aba/ sound can be either congruent with the video, for V/aba/, or incongruent, for V/aga/.
Experiment 5 investigated whether this difference in audio-visual congruency could modify the
effect of sound on binocular rivalry dynamics. Finally, both McGurk and No McGurk subjects
performed the task so as to compare the impact of the sound presentation on their perception.
2.6.2 Method
The method replicates that of experiment 3 with a few modifications. Videos were looped,
but this time they were always presented in synchrony with the /aba/ sound. Four additional
continuous report trials were added during which the subject had to hold the white lips.
There were therefore 12 continuous report trials in this experiment, divided into 3 blocks of 4
trials. In each block, the first 2 trials were basic passive continuous report trials while
the two following ones were volition test trials (in one of them the subject had to
hold the white lips and in the other the black lips; the order of these trials was balanced across
subjects). As in experiment 3, subjects started with 3 training trials: two passive ones and a volition
test (hold black lips).
Figure 26 Experiment 5 task timeline
2.6.3 Results and discussion
The first step in assessing the effect of sound on binocular rivalry dynamics consisted in
verifying that the presence of sound did not unduly disturb the dynamics by triggering switches.
We therefore analyzed the switch distribution as in experiment 3 and found very few outliers
(indicated by a star in Figure 27). Since those outliers are few and are not grouped into
consecutive time bins, it can be said that the sound did not artificially trigger switches.
Figure 27
We analyzed the effect of sound on the dynamics of binocular rivalry in the passive
viewing condition (no volitional control). To do so, we compared the average dominance
durations for each rivaling percept, V/aba/ (/aba/ seen) and V/aga/ (/aga/ seen), between the no
sound condition (baseline from experiment 3) and the sound condition (experiment 5).
Separating the visual percepts by lip motion and not by lip color was necessary since we
wanted to detect a potential effect of the congruence between lip motion and sound. Separate
analyses were carried out for McGurk and No McGurk subjects.
Figure 28
Figure 28 presents the results of this analysis. The no sound condition replicates the
results from experiment 3, which served as a baseline. The interaction between group (McGurk &
No McGurk) and sound condition on the average dominance durations was significant
(F(1,12)=8.18, p<0.015). The McGurk subjects show a dominance duration increase of 1.14s
(F(1,7)=8.43, p<0.025) while the No McGurk subjects show a marginal dominance duration decrease
of 0.88s (Tukey HSD post-hoc: p=0.22). Interestingly, there was no differential effect of
lip movement in either group.
Contrary to the findings of van Ee et al. (van Ee et al., 2009) presented in 1.3.1b, our
results show that passive viewing can in fact be influenced by the presentation of a synchronized
sound. The addition of sound results in an increase of dominance durations for the McGurk
subjects. The use of a robust and natural high-level audio-visual stimulation could be the reason
why this effect was detected in our experiment while it remained unnoticed in van Ee et al.'s
work.
Interestingly, the No McGurk subjects remain unaffected by the presentation of a sound.
The reason might be that for audio-visual congruency to influence the dynamics of binocular
rivalry, it has to be processed as such. It is possible that subjects who are not susceptible to the
McGurk effect do not properly merge lip movements with sound in an audio-visual speech
binding process. Therefore, for these subjects the audio-visual congruency is not detected as such,
or is reduced to a simpler form such as sheer synchrony of onset time and duration between
sound and movement. From this difference in the effect of sound on binocular rivalry dynamics
between McGurk and No McGurk subjects, we can conclude that for sound to impact the
dynamics it has to be congruent at a higher level (speech binding association) and not only at a
low level (synchrony).
Experiment 3 proved that, when asked to willfully control the conscious
oscillations of binocular rivalry, the McGurk subjects were able to do so through a relative
increase of the average duration of the hold percept compared to the duration of the other one.
In experiment 5 we investigated whether the addition of a congruent sound modifies volitional
control over binocular rivalry.
Figure 29
For each group (McGurk and No McGurk), Figure 29 presents the relative increase of
average dominance duration in both the No sound (baseline from experiment 3) and /aba/
sound (experiment 5) conditions. The relative increase in dominance duration corresponds to the
difference between the average dominance duration in the volition test condition and in the passive
viewing condition. It is important to note that only the condition where the stimulus V/aba/ is held
is represented here.
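Concretely, the relative increase plotted in Figure 29 is a simple difference of means. The sketch below illustrates the computation with made-up numbers (not the thesis data).

```python
from statistics import mean

def relative_increase(volition_durations, passive_durations):
    """Relative increase in dominance duration for the held percept:
    mean dominance duration under volitional control minus mean
    dominance duration in passive viewing (values in seconds)."""
    return mean(volition_durations) - mean(passive_durations)

# Illustrative per-trial average durations in seconds (hypothetical):
increase = relative_increase([5.1, 4.3, 5.9, 4.7], [3.0, 2.6, 3.4, 3.0])  # 2.0 s
```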
For the McGurk subjects, the No sound condition represents the effect of volition already
drawn from experiment 3, although here it corresponds to holding the V/aba/ percept, whereas in
Figure 23 the effect was given for holding the black lips percept. In the present figure, the values
for black lips with a motion corresponding to V/aga/ were left aside.
Without sound, volitional control resulted in an increase of average dominance durations of 1.00s
(F(1,7)=20.53, p<0.003). In the /aba/ sound condition, a significant increase of dominance
durations of 2.70s was observed (F(1,7)=8.24, p<0.025). The dominance duration increases for
sound and no sound were almost significantly different (Student test: t(7)=0.11). No effect was
detected for the No McGurk subjects.
Presented with an additional congruent sound (A/aba/) and asked to hold the percept
V/aba/, the McGurk subjects are significantly better than in the no sound condition. The effect of
volition is similar to the no sound condition, since it results in a relative increase of the average
dominance duration of the hold percept compared to the other one. However, the presence of a
consistent sound improved the subjects' volitional capacities. For the No McGurk subjects,
no volitional effect could be detected in the no sound condition and the addition of sound did
not change this finding.
This result is similar to the one described in van Ee et al. (see 1.3.1b): the presentation of
sensory information congruent with one of the rivaling stimuli increases volitional control
capacities. The absence of an effect for the No McGurk subjects can be explained by arguments
similar to those presented earlier (2.6.3). No McGurk subjects might not exploit the lip motion
difference between rivaling videos. We can hypothesize that, due to poor lip reading capacities,
they are not able to differentiate the two lip motions. Therefore, in the no sound condition No
McGurk subjects would lack the motion cue for differentiating the two rivaling videos, which would
then differ only in color. McGurk subjects, on the other hand, are certainly able to
differentiate the lip movements. We can assume that the more cues are available to hold one video
against the other, the better volitional control will be. Accordingly, the difference in volitional
capacities in the no sound condition between McGurk and No McGurk subjects could be explained
by their difference in lip reading capacities.
Experiment 5 finally tackles the question of whether the effect of sound on volitional
control differs between the condition where the sound is congruent with the hold percept
(A/aba/ + V/aba/ hold) and the condition where it is not (A/aba/ + V/aga/ hold). Figure 30
presents the average dominance durations for both the McGurk group (left) and the No
McGurk group (right) across all conditions. The first two solid gray bars represent
the baseline (passive viewing condition in experiment 5). The following bars show the
results for the hold /aba/ and hold /aga/ conditions. The light gray striped bars always represent
the dominance duration of the hold percept while the dark gray striped bars always represent the
dominance duration of the not-hold percept. Comparisons are made as follows: the dominance
duration values in each of the hold conditions are compared to the baseline values. We were
interested in potential variations of relative dominance duration (the difference between the increase
of dominance duration for the hold percept and the decrease of dominance duration for the not-
hold percept).
Figure 30
The analyses showed that for the McGurk subjects the dominance durations in the hold
/aba/ condition significantly differed from the baseline values. There was an
increase in relative dominance duration, and this increase resulted from a decrease in V/aga/
dominance duration (Tukey HSD post-hoc: p=0.05). In the hold /aga/ condition, the values
did not differ statistically from the baseline values. For the No McGurk subjects, no effect of
volition was detected in the hold /aba/ condition whereas, surprisingly, there seems to be an
effect in the hold /aga/ condition with an increase in relative dominance duration, although post-
hoc tests were not significant.
Although McGurk subjects are not aware of the audio-visual conflict when presented
with V/aga/ and A/aba/ (V/aga/ + A/aba/ = AV/ada/), our results suggest that there exists a difference
for these subjects between holding the congruent AV/aba/ percept and the physically incongruent
AV/ada/ percept. Surprisingly, our results indicate that volitional control is only improved for
real audio-visual congruency and not for a perceived one.
3 General discussion
3.1 Probing the depth of suppression
Experiment 4 provides empirical evidence against the low-level interpretation of
binocular rivalry. The information that enters the visual system through the suppressed eye is not
blocked at early stages of the visual pathway but rather is maintained at least up to the
multisensory areas. This hypothesis is necessary to explain how, in about 20% of the cases,
subjects reported an auditory percept corresponding to an audio-visual integration with the
suppressed visual stimulus. However, it remains to explain why the audio-visual integration does
not occur systematically with either the suppressed or the dominant percept. Rather, the outcome
of audio-visual integration seems to be drawn from a certain probability distribution. Indeed, in
80% of the cases we found that multisensory integration occurred with the dominant visual
stimulus, while in the remaining 20% it occurred with the suppressed one. In order to explain this
fact I will need to make two assumptions.
First, I will consider that rivalry resolution takes place continuously along the visual
pathway and corresponds to a continuous increase of the probability of the dominant percept
and, consequently, to a decrease in the probability of the suppressed stimulus. This framework follows
the conclusions drawn from the studies presented in (Leopold & Logothetis, 1999) and
reformulates the hybrid model in a Bayesian approach. The visual signal has to be considered
as a probability distribution over various possible visual states. Along the visual pathway, various
features of the visual input are analyzed, such as shape, color and motion, each represented by the
visual system as a probability distribution over multiple states. Combining the possible feature
states yields a huge number of possible visual states, each associated with a certain
probability. In our case, there are two predominant visual states, corresponding to the
information reaching each eye. Their motion components correspond to lips uttering /aba/ and
lips uttering /aga/. However, although associated with marginal probabilities, other perceptual
states such as lips uttering /ada/, /ata/, etc. also compose the visual signal. For simplicity of
presentation, the two predominant states can be considered as initially equally probable,
although stimulus strength and eye dominance certainly influence these initial probabilities. Along
the visual pathway, and through multistage competition, one of the two states becomes
increasingly “stronger”, meaning that its probability increases while the probability of the other
state decreases. If we associate the coding of probability with a certain population code this could
61
indeed explain why as we move along the visual pathway, more and more neurons are found
coding for the consciously perceived stimulus.
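The progressive sharpening of the visual distribution along the pathway can be sketched numerically. The snippet below is a minimal illustration, not the model itself: competition at each stage is approximated by raising the probabilities to a power gamma greater than 1 and renormalizing, so the initially dominant state gains probability at every level. The stage count, the gamma value, and the initial imbalance are all arbitrary assumptions.

```python
def sharpen(dist, gamma=1.5):
    """One competition stage: exponentiate and renormalize, so the
    more probable state gains probability (illustrative, not the
    actual neural mechanism)."""
    powered = {state: p ** gamma for state, p in dist.items()}
    total = sum(powered.values())
    return {state: p / total for state, p in powered.items()}

# Two predominant visual states; a slight initial imbalance
# (stimulus strength, eye dominance) seeds the competition.
dist = {"V/aba/": 0.55, "V/aga/": 0.45}
for stage in range(4):  # successive levels of the visual pathway
    dist = sharpen(dist)
    print(stage, round(dist["V/aba/"], 3))
```

After a few stages the dominant state carries most of the probability mass, consistent with more and more neurons coding for the conscious percept at higher levels.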
The second necessary hypothesis is that multisensory integration (MSI) occurs in a
probabilistic fashion. The MSI system takes as inputs a visual signal and an auditory signal,
both of which can be described, as above, as probability distributions over different states. For
the visual input, we saw that there are two predominant states, each corresponding to the
information stemming from one eye. In the auditory modality, although the auditory stimulation
is consistently /aba/, the auditory signal can be seen as a probability distribution over states
such as /aba/, /aga/, /ada/, /abga/, etc., the first associated with a high probability and the
others with much lower probabilities (there is always a possibility that the subject will hear
/aga/ even though a clear /aba/ is presented). MSI can be considered as a Bayesian system
computing the probability of all possible auditory outcomes after interaction with the visual
probability distribution (potentially also using prior knowledge about what those outcomes
should be). A conscious auditory outcome is then randomly drawn from this auditory
distribution. Importantly, this outcome is therefore not systematically the most probable one,
which in our case would always correspond to the audio-visual association of the auditory
stimulus with the dominant visual stimulus.
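This probabilistic MSI can be sketched as follows. The fusion table is purely illustrative (the numbers are assumptions, not measurements): it encodes that V/aba/ paired with the /aba/ sound tends to yield /aba/, while V/aga/ paired with /aba/ tends to yield the McGurk percept /ada/. The posterior over auditory outcomes is a mixture of these per-state distributions weighted by the visual probabilities, and the conscious outcome is a random draw, not the argmax.

```python
import random

# Illustrative probabilities of each auditory outcome given the visual
# state, for a physical /aba/ sound (assumed values, not data).
FUSION = {
    "V/aba/": {"/aba/": 0.95, "/ada/": 0.03, "/aga/": 0.02},
    "V/aga/": {"/aba/": 0.10, "/ada/": 0.85, "/aga/": 0.05},  # McGurk
}

def fuse(visual_dist):
    """Posterior over auditory outcomes: per-state fusion
    distributions weighted by the visual probabilities."""
    posterior = {}
    for v_state, p_v in visual_dist.items():
        for outcome, p_o in FUSION[v_state].items():
            posterior[outcome] = posterior.get(outcome, 0.0) + p_v * p_o
    return posterior

def draw_outcome(visual_dist, rng=random):
    """The conscious outcome is a random draw, not the most probable state."""
    posterior = fuse(visual_dist)
    outcomes, probs = zip(*posterior.items())
    return rng.choices(outcomes, weights=probs)[0]

# By the time the signal reaches MSI areas the dominant percept has a
# high, but not total, probability.
visual = {"V/aba/": 0.8, "V/aga/": 0.2}
print(fuse(visual))
```

With these assumed numbers, the outcome associated with the suppressed stimulus (/ada/) retains a probability of about 0.19, which is the order of magnitude observed in experiment 4.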
Based on the hypotheses presented above, we can now explain the findings of experiment 4.
The MSI system responsible for the auditory perceptual outcome is fed by a visual signal
composed of two predominant states, each associated with a certain probability. As mentioned
in 1.3.2b, audio-visual integration for speech is thought to occur in association cortex; the
visual signal feeding the MSI therefore does not stem from early visual areas. Multisensory
integration then constructs the probabilities of the possible auditory states through Bayesian
multiplication of the visual and auditory distributions, before a conscious auditory outcome is
randomly drawn. Our experiment shows that by the time the visual signal reaches the MSI areas,
the probability associated with the state corresponding to the suppressed stimulus is not
negligible: the probability of the auditory outcome corresponding to integration with the
suppressed stimulus averages about 20%.
Figure 31 illustrates the functional model described above. The left-hand side of the figure
presents a temporal snapshot of the system state in which one stimulus (V/aba/) is dominant and
the other suppressed; the right-hand side presents the converse situation. The progression from
the bottom to the top of the visual pathway represents the evolution of the perceptual
probability distributions at each processing level, from the eyes' inputs to higher brain areas.
The thickness of each bar symbolizes the probability of the corresponding state, with green
representing the currently dominant percept. Consistent with (Leopold & Logothetis, 1999),
activity in the early visual system is not specific to the dominant percept but reflects both
the dominant and the suppressed percepts, whereas activity in higher brain areas is correlated
with the conscious percept. A similar convention is applied to the auditory modality, although
in this case the conscious stimulus could be either the one represented in red or the one in
green, since it depends on a random draw.
Figure 31 Probabilistic MSI with the suppressed stimulus
3.2 Real multisensory congruency enhances volitional control
Adding sensory information from another modality improves volitional control over the
dynamics of binocular rivalry (van Ee et al., 2009). However, the mechanism through which this
enhancement occurs remains unclear. A deeper analysis of the results of experiment 5 can
unveil part of this mechanism.
For a McGurk subject, the condition in which the lip motion and the sound are congruent
(V/aba/ + A/aba/) and the condition in which they are incongruent (V/aga/ + A/aba/) are
equivalent in the sense that in neither case is a discrepancy between the visual and auditory
signals perceived. Indeed, in the second case, although there is a physical incongruence, the
subject simply experiences seeing and hearing a mouth pronouncing the sound /ada/. However,
this multisensory condition does not enhance volitional control the way the authentically
congruent condition does.
Volitional control is usually considered to originate in frontal cortex activity, more
specifically in the areas in charge of attention. Based upon the hybrid model hypothesis, it
seems reasonable to assume that a possible mechanism would consist of a modulating signal,
sent by the frontal cortex and projecting to the different stages of the multilevel competition,
which would modify the behavior of the network by increasing the probability of the percept to
be maintained.
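Such a modulating signal can be sketched under the same illustrative assumptions as before: attention is modeled as a multiplicative bias applied to the state the subject tries to hold, followed by renormalization. The gain value is arbitrary.

```python
def attend(dist, held_state, gain=1.3):
    """Illustrative attentional modulation: boost the probability of
    the state the subject tries to hold by a multiplicative gain,
    then renormalize the distribution."""
    biased = {s: p * (gain if s == held_state else 1.0)
              for s, p in dist.items()}
    total = sum(biased.values())
    return {s: p / total for s, p in biased.items()}

# Starting from an unbiased competition, willing V/aba/ tilts the
# distribution in its favor at the modulated stage.
dist = {"V/aba/": 0.5, "V/aga/": 0.5}
print(attend(dist, "V/aba/"))
```

Applied repeatedly at several stages of the competition, such a bias would lengthen the dominance durations of the held percept, which is the behavioral signature of volitional control.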
Figure 32 follows the same conventions as Figure 31. I would like to emphasize that
Figure 32 represents a temporal snapshot of the dynamics, in this case when V/aba/ is
dominant; during experiment 5 the subject was in fact oscillating between the two rivaling
percepts. Given that a congruent sound can enhance volitional capacities, it is necessary to
hypothesize that the signal sent by the attentional control centers can be modulated by input
from a module whose role is to compare the auditory signal with the visual signal. In our case
both V/aba/ and V/aga/ have the same basic temporal properties (both are synchronized to the
sound), and therefore a discrepancy between sound and vision could only be detected at the
level of audio-visual speech processing. The module in charge of audio-visual comparison is
termed "likelihood", as its role is to determine the likelihood that the auditory signal was
emitted by the lip motion presented in the visual signal. We hypothesize that the visual
information feeding this module originates in the MT/MST areas in charge of motion analysis,
since it is the movement of the lips that determines the possible emitted sounds. The same
brain area that is in charge of audio-visual speech binding (pSTS) has been identified as a
potential locus for the processing of audio-visual discrepancy in speech.
Figure 32 Real audio-visual congruency enhances volitional control
Consistent with the remarks made in 3.1, the visual signal is considered as a probability
distribution over two states. The comparator is therefore fed with this distribution and can be
seen as performing the comparison with its most probable state, which corresponds to the
dominant percept. The auditory signal is likewise a probability distribution that feeds the
comparator with its most probable state. However, two levels must be distinguished: the first
is the signal before MSI, the second is the signal after MSI has taken place. If the comparison
between the visual and auditory signals involved the second level, then no discrepancy would be
detected when V/aga/ is dominant, and we would observe an enhancement of volition in this
condition. Indeed, at this second level, the auditory signal has already been modified through
Bayesian MSI, and if V/aga/ is dominant then A/ada/ has already become the most probable state.
Consequently, the auditory signal that feeds the comparator must originate at the first level,
before MSI occurs.
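This argument can be made concrete with a small sketch. The plausibility mapping and the numerical distributions below are illustrative assumptions; the point is only that a comparator fed with the pre-MSI auditory signal detects the discrepancy when V/aga/ is dominant, whereas one fed with the post-MSI signal does not.

```python
# Assumed mapping of which sounds each lip motion can plausibly emit.
PLAUSIBLE = {"V/aba/": {"/aba/"}, "V/aga/": {"/aga/", "/ada/"}}

def discrepancy(visual_dist, auditory_dist):
    """Compare the most probable visual and auditory states; return
    True when the sound is implausible given the lip motion."""
    v = max(visual_dist, key=visual_dist.get)
    a = max(auditory_dist, key=auditory_dist.get)
    return a not in PLAUSIBLE[v]

visual = {"V/aba/": 0.2, "V/aga/": 0.8}        # V/aga/ dominant
audio_pre_msi = {"/aba/": 0.9, "/ada/": 0.1}   # before fusion
audio_post_msi = {"/aba/": 0.2, "/ada/": 0.8}  # after McGurk fusion

print(discrepancy(visual, audio_pre_msi))   # pre-MSI: mismatch detected
print(discrepancy(visual, audio_post_msi))  # post-MSI: looks congruent
```

Since experiment 5 shows no enhancement of volition in the incongruent condition, a discrepancy must in fact be registered, which is only possible if the comparator receives the pre-MSI signal.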
3.3 Perspectives
The present thesis reported a set of experiments in which sound was used to investigate
various aspects of binocular rivalry. Speech material was chosen so that a single sound could
induce two distinct audio-visual outcomes when merged with each of the rivaling percepts.
Figure 33A presents the general design. By using the same audio-visual material but inverting
the roles of vision and audition, we should be able to study the impact of vision on rivaling
sounds. As presented in Figure 33B, bistability is in this case induced by the presentation of
two superimposed sounds, and this auditory bistable percept is then combined with a single
visual stimulus that leads to two different auditory outcomes after multisensory integration.
A final experimental design (cf. Figure 33C) could study a possible synchronicity of switches
in the auditory and visual modalities. Both the visual and auditory stimulations would be
bistable, and each of the four possible audio-visual combinations would lead to a different
multisensory percept. The question of the synchronization of perceptual decisions across
sensory modalities has been tackled for low-level multisensory congruency by Hupé et al.
(2008), who concluded that switches occurred independently for a visual and an auditory
percept. We believe, however, that our use of strong high-level multisensory congruency could
lead to results contradicting this earlier experiment.
Figure 33 Proposed designs. A: Audio on binocular rivalry (BR). B: Vision on auditory rivalry (AR). C: Both modalities bistable.
4 References
Bernstein, Auer, & Moore. (2004). Audiovisual speech binding: Convergence or association? In Handbook of multisensory processes (pp. 203-224). Cambridge: MIT Press.
Blake, R. (1989). A neural theory of binocular rivalry. Psychological Review, 96(1), 145-167.
Blake, R., & Logothetis, N. K. (2002). Visual competition. Nature Reviews. Neuroscience, 3(1), 13-21. doi:10.1038/nrn701
Brown, R. J., & Norcia, A. M. (1997). A method for investigating binocular rivalry in real-time with the steady-state VEP. Vision Research, 37(17), 2401-2408.
Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 363(1493), 1001-1010. doi:10.1098/rstb.2007.2155
Carter, O., Konkle, T., Wang, Q., Hayward, V., & Moore, C. (2008). Tactile rivalry demonstrated with an ambiguous apparent-motion quartet. Current Biology: CB, 18(14), 1050-1054. doi:10.1016/j.cub.2008.06.027
Cave, C. B., Blake, R., & McNamara, T. P. (1998). Binocular Rivalry Disrupts Visual Priming. Psychological Science, 9(4), 299-302. doi:10.1111/1467-9280.00059
van Ee, R., van Dam, L. C. J., & Brouwer, G. J. (2005). Voluntary control and the dynamics of perceptual bi-stability. Vision Research, 45(1), 41-55. doi:10.1016/j.visres.2004.07.030
van Ee, R., van Boxtel, J. J. A., Parker, A. L., & Alais, D. (2009). Multisensory congruency as a mechanism for attentional control over perceptual selection. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 29(37), 11641-11649. doi:10.1523/JNEUROSCI.0873-09.2009
Freeman, A., Nguyen, V., & Alais, D. (2005). The nature and depth of binocular rivalry. In Binocular Rivalry (pp. 47-62). Bradford Books.
Freeman, A. W. (2005). Multistage model for binocular rivalry. Journal of Neurophysiology, 94(6), 4412-4420. doi:10.1152/jn.00557.2005
Haynes, J., Deichmann, R., & Rees, G. (2005). Eye-specific effects of binocular rivalry in the human lateral geniculate nucleus. Nature, 438(7067), 496-499. doi:10.1038/nature04169
He, Carlson, & Chen. (2005). Parallel pathways and temporal dynamics in binocular rivalry. In Binocular Rivalry (pp. 81-100). Bradford Books.
Hupé, J., Joffo, L., & Pressnitzer, D. (2008). Bistability for audiovisual stimuli: Perceptual decision is modality specific. Journal of Vision, 8(7), 1-15.
Jiang, Y., Costello, P., & He, S. (2007). Processing of invisible stimuli: advantage of upright faces and recognizable words in overcoming interocular suppression. Psychological Science: A Journal of the American Psychological Society / APS, 18(4), 349-355. doi:10.1111/j.1467-9280.2007.01902.x
Kayser, C., & Logothetis, N. K. (2007). Do early sensory cortices integrate cross-modal information? Brain Structure & Function, 212(2), 121-132. doi:10.1007/s00429-007-0154-0
Lehky, S. R. (1988). An astable multivibrator model of binocular rivalry. Perception, 17(2), 215-228.
Lehky, S. R., & Maunsell, J. H. (1996). No binocular rivalry in the LGN of alert macaque monkeys. Vision Research, 36(9), 1225-1234.
Lehmkuhle, S. W., & Fox, R. (1975). Effect of binocular rivalry suppression on the motion aftereffect. Vision Research, 15(7), 855-859.
Leopold, & Logothetis. (1999). Multistable phenomena: changing views in perception. Trends in Cognitive Sciences, 3(7), 254-264.
Leopold, D. A., & Logothetis, N. K. (1996). Activity changes in early visual cortex reflect monkeys' percepts during binocular rivalry. Nature, 379(6565), 549-553. doi:10.1038/379549a0
Logothetis, N. K., Leopold, D. A., & Sheinberg, D. L. (1996). What is rivalling during binocular rivalry? Nature, 380(6575), 621-624. doi:10.1038/380621a0
Logothetis, N. K., & Schall, J. D. (1989). Neuronal correlates of subjective visual perception. Science (New York, N.Y.), 245(4919), 761-763.
Long, G. M., & Toppino, T. C. (2004). Enduring interest in perceptual ambiguity: alternating views of reversible figures. Psychological Bulletin, 130(5), 748-768. doi:10.1037/0033-2909.130.5.748
Lumer, E. D., Friston, K. J., & Rees, G. (1998). Neural correlates of perceptual rivalry in the human brain. Science (New York, N.Y.), 280(5371), 1930-1934.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746-748.
Munhall, K. G., ten Hove, M. W., Brammer, M., & Paré, M. (2009). Audiovisual integration of speech in a bistable illusion. Current Biology: CB, 19(9), 735-739. doi:10.1016/j.cub.2009.03.019
Polonsky, A., Blake, R., Braun, J., & Heeger, D. J. (2000). Neuronal activity in human primary visual cortex correlates with perception during binocular rivalry. Nature Neuroscience, 3(11), 1153-1159. doi:10.1038/80676
Pressnitzer, D., & Hupé, J. (2006). Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Current Biology: CB, 16(13), 1351-1357. doi:10.1016/j.cub.2006.05.054
Sheinberg, D. L., & Logothetis, N. K. (1997). The role of temporal cortical areas in perceptual organization. Proceedings of the National Academy of Sciences of the United States of America, 94(7), 3408-3413.
Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex (New York, N.Y.: 1991), 17(10), 2387-2399. doi:10.1093/cercor/bhl147
Sterzer, P., Kleinschmidt, A., & Rees, G. (2009). The neural bases of multistable perception. Trends in Cognitive Sciences, 13(7), 310-318. doi:10.1016/j.tics.2009.04.006
Tong, F., & Engel, S. A. (2001). Interocular rivalry revealed in the human cortical blind-spot representation. Nature, 411(6834), 195-199. doi:10.1038/35075583
Tong, F., Nakayama, K., Vaughan, J. T., & Kanwisher, N. (1998). Binocular rivalry and visual awareness in human extrastriate cortex. Neuron, 21(4), 753-759.
Tong, F. (2005). Investigations of the neural basis of binocular rivalry. In Binocular Rivalry (pp. 63-80). Bradford Books.
Tong, F., Meng, M., & Blake, R. (2006). Neural bases of binocular rivalry. Trends in Cognitive Sciences, 10(11), 502-511. doi:10.1016/j.tics.2006.09.003
Tononi, G., Srinivasan, R., Russell, D. P., & Edelman, G. M. (1998). Investigating neural correlates of conscious perception by frequency-tagged neuromagnetic responses. Proceedings of the National Academy of Sciences of the United States of America, 95(6), 3198-3203.
Tsuchiya, N., & Koch, C. (2005). Continuous flash suppression reduces negative afterimages. Nature Neuroscience, 8(8), 1096-1101. doi:10.1038/nn1500
Warren, R. M., & Gregory, R. L. (1958). An auditory analogue of the visual reversible figure. The American Journal of Psychology, 71(3), 612-613.
Wiesenfelder, H., & Blake, R. (1990). The neural site of binocular rivalry relative to the analysis of motion in the human visual system. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 10(12), 3880-3888.
Williams, M. A. (2004). Amygdala Responses to Fearful and Happy Facial Expressions under Conditions of Binocular Suppression. Journal of Neuroscience, 24(12), 2898-2904. doi:10.1523/JNEUROSCI.4977-03.2004
Wilson, H. R. (2003). Computational evidence for a rivalry hierarchy in vision. Proceedings of the National Academy of Sciences of the United States of America, 100(24), 14499-14503. doi:10.1073/pnas.2333622100
Yang, Zald, & Blake. (2007). Fearful expressions gain preferential access to awareness during continuous flash suppression. Emotion, 7(4), 882-886.
Zimba, L. D., & Blake, R. (1983). Binocular rivalry and semantic processing: out of sight, out of mind. Journal of Experimental Psychology. Human Perception and Performance, 9(5), 807-815.