INVESTIGATING AUDIO-VISUAL INTERACTIONS IN BINOCULAR RIVALRY: FATE OF THE SUPPRESSED PERCEPT
AND MODULATION OF VOLITIONAL CONTROL
Victor Barrès
Under the supervision of Manuel Vidal and Jacques Droulez
Laboratoire de Physiologie de la Perception et de l'Action
Collège de France
August 2010
Thesis submitted for the degree of Master in Cognitive Science of the École des Hautes Études en Sciences Sociales
EHESS – ENS – Paris V
Acknowledgements

This work would not have been possible without the support and guidance of Manuel Vidal. This project originated from one of his numerous ideas; he dedicated a tremendous amount of his time to helping me, and I could never thank him enough for his patience. When looking for my master's project I was hoping to find a mentor from whom I would learn how to carry out rigorous and innovative research projects. Manuel Vidal fulfilled this aspiration. I would also like to thank Jacques Droulez, from whom I received many of the key ideas that form the core of the present project. Each interaction with him led to a breakthrough in my understanding of my research topic. The discussion especially owes a lot to his input. Pr. Alain Berthoz offered me the opportunity to join his lab, and for this I want to thank him. Entering a new lab can at times be difficult, and I want to thank all my colleagues who welcomed me and made me feel at home. In this lab where people work on very different research topics, I thank them for the time they spent explaining their various projects to me. I also want to thank the technicians for their help. Finally, I want to express all my gratitude to the subjects for their participation.
Table of Contents

1 BACKGROUND
1.1 MULTISTABILITY
1.1.1 What is multistability and what is it good for?
1.1.2 Examples
1.1.3 Binocular rivalry: a powerful tool to study multistability
1.1.4 Characterizing bistability
1.2 BINOCULAR RIVALRY
1.2.1 Phenomenology of binocular rivalry
1.2.2 Two rivaling approaches
1.2.3 Conflicting evidence
1.2.4 Hybrid view: multilevel hypothesis
1.3 MULTISENSORY BISTABILITY
1.3.1 A few insights from the literature
1.3.2 The McGurk effect: a multisensory illusion
1.3.3 Conclusions
2 MATERIALS AND METHODS
2.1 GENERAL MATERIAL AND METHODS
2.1.1 Stimuli design
2.1.2 Stimuli presentation
2.1.3 Subjects
2.1.4 Experimental overview
2.2 EXPERIMENT 1: BINOCULAR RIVALRY TEST (CONTROL EXPERIMENT)
2.2.1 Objective
2.2.2 Method
2.2.3 Results and discussion
2.3 EXPERIMENT 2: TEST OF THE MCGURK EFFECT (BASELINE)
2.3.1 Objective
2.3.2 Method
2.3.3 Results and discussion
2.4 EXPERIMENT 3: BINOCULAR RIVALRY FOR VIDEOS (TEST AND BASELINE)
2.4.1 Objective
2.4.2 Method
2.4.3 Results and discussion
2.5 EXPERIMENT 4: AUDIO-VISUAL INTEGRATION WITH THE SUPPRESSED STIMULUS
2.5.1 Objective
2.5.2 Method
2.5.3 Results and discussion
2.6 EXPERIMENT 5: IMPACT OF SOUND ON BINOCULAR RIVALRY
2.6.1 Objective
2.6.2 Method
2.6.3 Results and discussion
3 GENERAL DISCUSSION
3.1 PROBING THE DEPTH OF SUPPRESSION
3.2 REAL MULTISENSORY CONGRUENCY ENHANCES VOLITIONAL CONTROL
3.3 PERSPECTIVES
4 REFERENCES
Summary

The present study uses an audio-visual setting to investigate the phenomenon of binocular rivalry. We designed innovative rivaling stimuli consisting of different videos of lip motions, so that visual rivalry could be combined with auditory speech material, thus achieving strong multisensory congruency. Adding sound to binocular rivalry enabled us to determine that the suppressed stimulus was still available for cross-modal integration. This result shows that suppression cannot occur early in the visual pathway, and therefore supports the interpretation of binocular rivalry as a delocalized competition process between percepts that both coexist as neural states. We also investigated the effect of audio-visual congruency on volitional control. We found that the addition of a congruent sound enhances volitional capacities only for a real congruency, and not for a congruency built through cross-modal modification of the sound.
1 Background
1.1 Multistability
A person strolls in a park, watching quacking ducks play in a pond while the spring flowers exhale a delicate fragrance. In a glimpse all is perceived and merged into a coherent multisensory picture, while the process of perception goes completely unnoticed. The sound of the duck, its shape and its color are all condensed into a coherent audio-visual percept. This reality is, however, nothing more than a construction of our senses.
Figure 1 Kanizsa illusion
Psychologists and neuroscientists have followed different approaches to unveil the functioning of perception. At least two can be distinguished.

The first approach consists in the constant reduction of the perception problem into simpler aspects. Visual perception, in this view, can be subdivided into the perception of shape, color, and motion, but also the perception of faces, of tools, etc. This method, which has proved immensely fruitful, attempts to understand the basic building blocks of perception from a constructivist perspective. There is, however, a second way in which one could tackle the problem. By deliberately creating disruptions in the system, the scientist is able to reveal what is otherwise hidden.
Figure 1 presents the classic Kanizsa illusion. Illusory contours are perceived, leading the observer to see a white triangle where there is none. Such an illusion, first used in the context of Gestalt theory, reveals the capacity of the visual system to build long-distance connections between collinear pieces of contour. Illusions challenge the first property of the sensory system, which consists in producing an accurate representation of the stimulation. Multistable stimuli, on the other hand, preclude our senses from producing only one stable output for a given stimulation. In our work, we will use this type of stimulus to derive conclusions on the mechanisms of visual perception and of cross-modal audio-visual perception.
1.1.1 What is multistability and what is it good for?
In the park, the stroller has the sensation of facing a stable representation of his environment. Our senses, however, are always dealing with messy and ambiguous signals. Decoding the flow of information arriving at the eye, for example, is far from being as simple as inverting a code. Ambiguity is the hallmark of the retinal stimulus in nearly all visual perception. A living organism nevertheless has to achieve a stable perceptual organization of its surrounding environment in order to guide its behavior and achieve proper adaptation. Faced with several potential perceptual representations coherent with the incoming information, a unique interpretation has to be selected. For this reason, perception always involves a decision process. As I mentioned in the introduction, a main characteristic of the sensory system is therefore to give rise to a unique and stable output for a given stimulation.
Multistable percepts defeat this characteristic. For an unchanging stimulus, the perceptual
system alternates spontaneously between distinct interpretations without being able to stabilize
one of them. This phenomenon has been used extensively for more than two centuries to study
visual perception. The phenomenal instability of such percepts provides an especially dramatic
and compelling example of the more general ambiguity which characterizes sensory stimulation.
This feature makes multistable percepts especially relevant stimulations to study the phenomenon
of perceptual decision.
It is, however, in the narrower context of the study of consciousness that multistability has recently generated sustained interest in visual neuroscience, as it decouples the conscious perception of the observer from the characteristics of the physical stimulation. The same stimulus can indeed evoke different conscious percepts. This provides a powerful tool to investigate the neural bases of consciousness, since changes in subjective perception can be correlated with neural responses while the stimulation remains constant. I will now present the main examples of multistability, grouped by sensory modality, so that the reader can grasp the wide range of existing stimuli.
1.1.2 Examples
a Vision
In the visual modality multistability can be achieved using various classes of stimulations.
The first class, called ambiguous figures, is illustrated in Figure 2. Looking at each of these images, the observer will experience perceptual oscillations. The Necker cube will present itself either as a cube seen from below or as one seen from above, due to the lack of depth cues. In the face/vase stimulus, the observer will oscillate between perceiving a vase and perceiving two heads facing each other, a phenomenon that is rooted in ambiguous figure/ground segregation.
Figure 2 Ambiguous figures. (a) Necker cube, Boring (1942); (b) Face/Vase, Rubin (1915/1958); (c) Duck/Rabbit, Jastrow (1899); (d) Wife/Mother-in-law, Boring (1930)
Other visual bistable stimuli rely on ambiguous interpretations of motion. A moving plaid delimited by a circular aperture can be seen either as a single coherent pattern or as two gratings sliding over each other, each moving in the direction perpendicular to its orientation. The apparent motion quartet (illustrated in Figure 3) is another famous example of a visually ambiguous stimulus.
Figure 3 Apparent Motion Quartet (Sterzer, Kleinschmidt, & Rees, 2009)
Finally a last class of visual bistable stimulations relies on the phenomenon of binocular
rivalry. Binocular rivalry is an example of multistable perception that can be initiated by showing
dissimilar images to the two eyes. The perceptual impression under such conditions is not the
spatial sum or average of the two monocular images, but rather a sequence of subjective reversals
in which each of the stimuli, in turn, dominates perception while the other entirely disappears
from sight. Figure 4 presents two classical examples of rivaling images.
Figure 4 Binocular Rivalry (Tong, Meng, & Blake, 2006)
b Audition
There exist far fewer examples of multistability in the auditory modality. Two main effects
are known to induce perceptual multistability: auditory stream segregation and verbal
transformation effect.
Due to stream segregation, two streams of tones presented in an alternating pattern repeated through time (cf. Figure 5) are perceived either as two segregated streams, each comprising only one repeating sound, or as a single stream of alternating sounds (Pressnitzer & Hupé, 2006). The listener oscillates between these two interpretations.
Figure 5 Auditory illusion due to stream segregation (Sterzer et al. 2009)
Verbal transformation effects can arise when a speech form is rapidly and continuously
repeated (Warren & Gregory, 1958). Although at first a percept matching the initial form
dominates, with time some other interpretations appear and then alternate with the original
percept. A good example is given by the rapid repetition of the word ‘life’ which gives rise to the
interpretation ‘fly’ and results in bistable alternation between the perceived words ‘life’ and ‘fly’.
c Touch
Tactile multistability has so far remained a rather marginal research topic. It is, however, worth mentioning that applying to a subject's skin a tactile stimulation that mimics the apparent motion quartet described above results in the same multistable interpretation of the direction of motion (Carter, Konkle, Wang, Hayward, & Moore, 2008).
1.1.3 Binocular rivalry: a powerful tool to study multistability
To achieve our goal, which is to propose an innovative analysis of the phenomenon of multistability by coupling it with cross-modal effects, the first step consisted in choosing an appropriate multistable stimulation among the stimuli described above. For at least two reasons, binocular rivalry appeared to be the best-fitting choice.
First, binocular rivalry is the best-documented multistable situation. In the past fifty years, binocular rivalry emerged, along with the Necker cube, as a paradigmatic case of multistability and concentrated most of the efforts deployed to understand this phenomenon. Studies ranging from psychophysics to computational neuroscience, and including imaging and animal studies, provide the largest body of background of any multistable stimulation.
Second, unlike ambiguous figures, binocular rivalry allows the scientist to manipulate the stimulus content to a great extent. For the purpose of the present experiment, it was important to be able to design a stimulus that would show strong cross-modal congruency. As described in the methods, the rules constraining the creation of bistable binocular percepts are rather strict. Presenting different images to each eye is far from sufficient to induce binocular rivalry. Part of the present work will therefore be dedicated to the presentation of a new type of bistable percept using binocular rivalry, specially designed to ensure a strong multisensory effect.
1.1.4 Characterizing bistability
All the multistable stimuli described above share common characteristics. Leopold and Logothetis (Leopold & Logothetis, 1999) established a list of three features observed in all instances of visual bistability: exclusivity, randomness and inevitability. These characteristics were later also evidenced for auditory bistability based on stream segregation (Pressnitzer & Hupé, 2006).
a Exclusivity
When the two percepts of a bistable stimulus compete, the phenomenological result is that the observer perceives either one or the other, but never both at the same time. This characteristic is consistently observed for ambiguous figures, plaids, the apparent motion quartet, and auditory bistable percepts. For binocular rivalry, patchy or piecemeal rivalry is often reported, especially at the beginning of the stimulation. However, well-designed stimuli for binocular rivalry should result in only marginal perception of piecemeal rivalry.
b Randomness
A hallmark of bistability is that the durations of alternating percepts, or phases, follow a random law that can be fitted with a gamma or lognormal distribution. A lognormal distribution suggests the multiplication of a large number of independent random processes, whereas a gamma distribution is more likely to result from a small number of consecutive Poisson processes.
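As an illustration of how the two candidate laws can be compared in practice, the sketch below fits both distributions to a set of phase durations and compares their log-likelihoods. The durations are synthetic stand-ins generated for the example (real data would come from observers' key-press reports), and all parameter values are arbitrary choices of mine, not values from the literature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic dominance-phase durations in seconds (stand-in for key-press data).
durations = rng.gamma(shape=3.5, scale=0.7, size=200)

# Fit each candidate distribution, fixing the location at 0 since durations are positive.
g_shape, _, g_scale = stats.gamma.fit(durations, floc=0)
ln_shape, _, ln_scale = stats.lognorm.fit(durations, floc=0)

# Compare the fits via total log-likelihood (higher means a better fit).
ll_gamma = stats.gamma.logpdf(durations, g_shape, 0, g_scale).sum()
ll_lognorm = stats.lognorm.logpdf(durations, ln_shape, 0, ln_scale).sum()
print(f"gamma:     shape={g_shape:.2f}, log-likelihood={ll_gamma:.1f}")
print(f"lognormal: shape={ln_shape:.2f}, log-likelihood={ll_lognorm:.1f}")
```

Since both models have the same number of free parameters here (shape and scale, with location fixed), comparing raw log-likelihoods is equivalent to comparing AIC values; on real data one would typically also check the fit with a goodness-of-fit test.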
c Inevitability
Debate over the possibility of volitional control over bistable perception can be traced back to the work of Helmholtz. Although Helmholtz strongly backed the idea that, after training, an observer could take control over the alternations, it is now widely accepted that no full control can be exerted and that alternations are inevitable. Volition can, however, bias the perception of some bistable stimuli by modulating the dominance duration of each of the possible interpretations.
1.2 Binocular Rivalry
1.2.1 Phenomenology of binocular rivalry
I will from now on focus on the phenomenon of binocular rivalry. However, before
reviewing the details of the past and present conceptual frameworks, I will present briefly the
main characteristics of the subjective experience associated with binocular rivalry.
The phenomenon of binocular rivalry is a particular form of bistability which occurs
when dissimilar images are presented to corresponding regions of the two eyes. As I already
mentioned above, rather than melding into a single coherent percept, the two images compete for
perceptual dominance. Typically, an image will dominate conscious awareness for a few seconds
before being supplanted by the previously suppressed rival image.
a Temporal dynamics
The main characteristic of the temporal dynamics of visual rivalry is its randomness. Oscillations between rivaling percepts are not regular. The successive durations of dominance periods seem to be drawn from a random distribution, as if generated by a stochastic process driven by an unstable time constant (Lehky, 1988).

In his seminal work, Levelt showed that these random dynamics can nevertheless be biased by varying the "strength" of one rival figure relative to the other. For Levelt, the concept of "stimulus strength" is related to "the amount of contour per area" but can be extended to brightness and contrast. Increasing the strength of a stimulus has no effect on its dominance durations but instead reduces its suppression durations.
b Spatial attributes
Exclusive predominance of one image over the other is not always globally achieved for binocularly presented stimuli. Perceptual dominance can take on a "patchy" or "piecemeal" appearance when the inducing figures are relatively large, as if rivalry were occurring on a local scale. Distributed zones of the visual field seem to be involved in simultaneous rivalries. This effect is also commonly reported during the first seconds of presentation of the rivaling stimuli.
Exclusivity is also blurred during perceptual transitions. When a suppressed image overthrows the currently dominant one, it does not do so instantaneously. Instead, the transition occurs in a wave-like fashion: the newly dominant image emerges in one region and spreads throughout the whole visual field.
1.2.2 Two rivaling approaches
What are the mechanisms underlying binocular rivalry? The theoretical debate on the neural bases of this phenomenon has historically been divided along a line separating two main hypotheses: a low-level or bottom-up explanation and a high-level or top-down approach.
a Historical views on multistability
The history of the concepts that served as a basis for understanding binocular rivalry is intimately linked to the evolution of the more general understanding of multistability. I will try to present the major trends in the theoretical debates on multistable perception while showing how specific works on binocular rivalry fit within those trends.

Long and Toppino organized the history of research on multistability into two main periods, each characterized by a categorical division between two conflicting theories (Long & Toppino, 2004).
The early 20th century saw a quarrel between two conceptions that Long and Toppino described as focusing respectively on peripheral and central processes. By peripheral processes they refer to factors related to the operation of the sense organs, whereas central processes refer to brain, and especially cortical, mechanisms.
Necker strongly backed the conception insisting on the importance of peripheral processes, proposing an explanation in which different points on an ambiguous figure, such as the Necker cube, are assumed to foster one or the other perceptual alternative. The interpretation of the figure therefore depends mainly on the set of features receiving primary processing. In this view, eye movements were critical, for they vary the foveated portion of the figure and hence can trigger perceptual switches. Long and Toppino note that this early interpretation "placed the locus of figural reversal in "optical" rather than "mental" processes". At the time, considerable effort was devoted to demonstrating the importance of eye movements in perceptual oscillations.
While eye movements did indeed appear to be related to perceptual switches, it was shown that they are not necessary for a switch to occur. This result fueled the other early hypotheses, which involved central or "psychological" explanations based on concepts such as will, imagination, and attention. Binocular rivalry served as an important example of a multistable phenomenon supported by central mechanisms, and especially by volition. Advocates of this position include Hermann von Helmholtz and William James, both of whom equated rivalry with voluntary attention. Sir Charles Sherrington, in his monograph Integrative Action of the Nervous System, also supported this position, writing: "Only after the sensations initiated from right and left corresponding points have been elaborated, and have reached a dignity and definiteness well amenable to introspection, does interference between the reactions of the two eye-system occur…" (Blake & Logothetis, 2002). Long and Toppino conclude that by the 1910s there was a relatively clear consensus that figural reversal was to be explained on the basis of central processes.
The main conceptual shift occurred in the 1940s with the rise of Gestalt theory. Indeed, after some rather calm decades during which multistability was left aside, Gestaltists rediscovered the phenomenon and made considerable use of it. However, they did not simply revive the early concepts but attempted to interpret multistability within their own theoretical framework. The Gestalt school of thought thereby introduced the concept of satiation into the realm of multistable perception. The Gestalt conceptualization of brain functioning rested on the constructs of flowing electrical fields and changing resistance (satiation) to the flow of these fields. Long and Toppino note that according to Köhler, "figural reversal could be attributed to a gradual build up of resistance ("electrotonus") in the brain to the field flow underlying the percept first seen." For the Gestaltists, an electrical field supports one interpretation of the multistable percept while, at the same time, a resistance to this field builds up (satiation), resulting eventually in the suppression of this percept and the emergence of a new field supporting the alternative interpretation.
Although these concepts of fields and satiation were abandoned when the first tools appeared that enabled scientists to record the activity of cortical regions directly, they were highly influential in the advent of the new theoretical framework supporting modern research. Building mainly on breakthroughs in the neuroscience of vision, the notion emerged of neural channels selectively tuned to particular characteristics of the sensory (retinal) stimulus. This conception incorporated the critical notion of neural adaptation, which originates directly in the concept of satiation. According to neural adaptation, continuous stimulation of a population of neurons eventually results in a reduction of their sensitivity and alters their ability to respond to subsequent stimuli until they have recovered from this adapted state. I will show in 1.2.2b how the recent low-level approach makes considerable use of the notion of neural adaptation as a key concept to explain perceptual reversal in binocular rivalry as well as in other forms of multistable percepts.
The period that opens with the advent of Gestalt theory constitutes the second period defined by Long and Toppino. As I mentioned earlier, each epoch is characterized by a competition between two major theoretical conceptions of the phenomenon. Gestalt theory and its continuation in modern neuroscience represent a first position, referred to as sensory explanations by Long and Toppino. Sensory explanations were challenged during this period by cognitive explanations. This other mainstream approach favors the role of more active, cognitive processes such as learning, decision making, and attention.
b Low level interocular competition versus high level patterns competition
Both dichotomies described above can be subsumed into a more general opposition between a bottom-up/low-level approach, which insists on the role of passive sensory processes, and a top-down/high-level approach, which focuses on the role of active cognitive processes. I will from now on use the terms bottom-up/top-down and low-level/high-level interchangeably.

It is crucial to bear in mind both categories of explanation in order to properly grasp the importance and relevance of most scientific works concerning perceptual bistability, for they almost always consist in an attempt to back one of the competing hypotheses. I shall now present the two main theories concerning visual rivalry that were formalized in the late 1980s and the 1990s, both of which fall directly into this general conceptual dichotomy.
• Reciprocal inhibition between feature-detecting neurons in early vision
According to this first class of explanations, binocular rivalry arises from low-level interocular competition between monocular neurons in the primary visual cortex (V1) or in the lateral geniculate nucleus (LGN) of the thalamus.
This hypothesis derives from the sensory explanations described above, based on neural adaptation and mutual inhibition. It is the association of reciprocal inhibition between competing visual neurons with inhibitory influences that adapt over time that can account for spontaneous rivalry alternations. A set of neurons maintains dominance only temporarily, until it can no longer inhibit the activity of the competing neurons, leading to a reversal in perceptual dominance. The fundamental point is that competition is supposed to take place early in the visual processing stages, between monocular neurons. For this reason, this framework can be termed interocular competition or eye rivalry.
Figure 6 XOR network (Blake, 1989)
In line with this approach, Lehky (Lehky, 1988) and Blake (Blake, 1989) proposed neural models such as the one illustrated in Figure 6. Two neurons, each corresponding to one eye, inhibit each other in direct competition. This leads to a winner-take-all mechanism in which one neuron, and therefore one eye, eventually takes over. However, due to neural adaptation, the inhibition exerted on the suppressed eye diminishes with time until a reversal occurs. The neural mechanism therefore corresponds to a logical exclusive OR (XOR): only one of the stimulations passes the first stages of visual processing. This low-level approach consequently implies a very early suppression of the dominated stimulus.
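The oscillatory behavior of such a reciprocal-inhibition network can be sketched with a minimal rate model: two units, each driven by one eye, inhibit each other while a slow adaptation variable erodes the winner's advantage. This is not Lehky's or Blake's actual model, only an illustrative simulation; the `simulate_rivalry` function and all parameter values are my own choices.

```python
import numpy as np

def simulate_rivalry(T=20.0, dt=0.002, I=1.0, w=2.0, g=2.0,
                     tau=0.02, tau_a=1.0, noise=0.03, seed=1):
    """Two mutually inhibiting rate units with slow adaptation (Euler integration).

    Returns the dominant unit (0 or 1) at each time step and the number of
    perceptual switches. All parameter values are illustrative only.
    """
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    r = np.array([0.6, 0.4])            # firing rates of the two monocular units
    a = np.zeros(2)                     # slow adaptation ("satiation") variables
    dominant = np.empty(steps, dtype=int)
    for t in range(steps):
        # Each unit is driven by its input minus cross-inhibition and adaptation.
        drive = np.maximum(I - w * r[::-1] - g * a, 0.0)
        r = r + dt / tau * (-r + drive) + noise * np.sqrt(dt) * rng.standard_normal(2)
        r = np.maximum(r, 0.0)
        a = a + dt / tau_a * (-a + r)   # adaptation slowly tracks activity
        dominant[t] = int(r[1] > r[0])
    switches = int(np.abs(np.diff(dominant)).sum())
    return dominant, switches

dominant, switches = simulate_rivalry()
print(f"{switches} perceptual switches in 20 s of simulated time")
```

The slow adaptation variable here plays the role the Gestaltists assigned to satiation: the winner's own activity gradually undermines its dominance until the suppressed unit can take over, producing spontaneous alternations without any external change in the stimulus.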
• Role of prefrontal cortex and decision making: binocular rivalry as a cognitive process
A rival framework to the one just presented is a modern version of the central and cognitive interpretations of multistability. Leopold and Logothetis (Leopold & Logothetis, 1999) first formalized this high-level hypothesis. They consider binocular rivalry a type of behavior in which perceptual switches result from a decision taking place in the prefrontal cortex.
According to this top-down interpretation, suppression is not triggered by some intrinsic competition between the eyes but is rather a consequence of a decision mechanism. Contrary to the interocular competition hypothesis, suppression would not necessarily occur at an early level of visual processing. For a decision to be taken, it might indeed be necessary for both percepts to be maintained in some way as coexisting active neural states. According to this view, binocular rivalry occurs later in visual processing and reflects competition between incompatible patterns rather than competition between the eyes. Patterns are understood here as neural representations of the stimuli, without specifying their localization in the brain but supposing that they exist beyond the simple monocular stimulus representations.
1.2.3 Conflicting evidence
Only empirical evidence can support or refute either of the two approaches. I will therefore
now present some of the empirical work on visual rivalry. The studies listed here were
chosen either for their innovative quality, bringing new data to the debate, or for their classical
status, representing common, often replicated aspects of binocular rivalry. Based on these
results, I will then be able to discuss the validity of the top-down and bottom-up frameworks.
a Indirect evidence: psychophysics studies
• Visual sensitivity
Testing a subject's sensitivity to a probe stimulus provides a simple and efficient way to
investigate the phenomenon of suppression during visual rivalry. During dominance phases of
rivalry, observers show normal visual sensitivity for the detection of probe targets briefly
superimposed on the dominant stimulus. However, when asked to respond to the same probe
target superimposed on the suppressed stimulus, sensitivity averaged 63% of that
during dominance. Interestingly, the sensitivity decrease remains the same whether the
probe is similar to the suppressed stimulus or not, which tends to favor the
interocular competition interpretation (Freeman, Nguyen, & Alais, 2005).
• Adaptation
Although suppression removes a percept from an observer's consciousness, it has no
effect on most of the well-known adaptation aftereffects. In particular, the translational motion
aftereffect (MAE) remains unaffected, suggesting that the locus of suppression does not precede
the site of MAE processing (Lehmkuhle & Fox, 1975). For the spiral aftereffect (SAE), however, it
appears that suppression prevents the build-up of the effect, which shows that rivalry suppression
occurs prior to the site of spiral motion processing (Wiesenfelder & Blake, 1990).
Although the precise site of suppression cannot be determined from adaptation studies, all the
results support the idea that the mechanisms responsible for suppression are cortical.
• Priming
Exposure to a stimulus such as a picture or a word makes subsequent processing of that
stimulus faster and more accurate: the initial presentation of the stimulation primes later
processing. Priming is a form of implicit memory, since the test instruction makes no explicit
reference to the prior presentation of the stimulation. Successful priming does not rely on
conscious recognition of the repeated stimuli. The question has therefore been raised whether
a priming effect can be observed when the priming stimulus is presented during binocular
suppression.
Picture priming (Cave, Blake, & McNamara, 1998) and semantic priming (Zimba & Blake,
1983) were both shown to be disrupted, suggesting that suppression renders normally effective
priming stimuli impotent. Blake and Logothetis (Blake & Logothetis, 2002) remark that priming
requires "relatively refined analyses of visual information, of sort conventionally attributed to
high-level visual processing outside the domain of early visual areas". They conclude that during
suppression, input to these stages is effectively blocked.
• Stimulus swap
Logothetis et al. (Logothetis, Leopold, & Sheinberg, 1996) introduced an elegant and
groundbreaking experimental paradigm in an attempt to investigate whether binocular rivalry can
be traced back to simple competition between monocular neurons within the primary visual
cortex, or whether higher cortical areas have to be taken into account. Instead of presenting one
image to each eye as in a classical binocular rivalry paradigm, they tested the effect of rapidly
alternating the rival stimuli between the two eyes. Surprisingly, under such conditions, the
perceptual alternations exhibit the same temporal dynamics as with static images. A single phase of
dominance can span multiple alternations of the stimuli.
This experiment rules out pure eye competition. The authors note
that "neural representations of the two stimuli compete for visual awareness independently of the
eye through which they reach the higher visual areas." If a competition takes place during
binocular rivalry, it is therefore more likely to occur between two perceptual interpretations at a
higher level of analysis than between monocular neurons. The distinction between percept
competition and eye competition thus appears essential to the understanding of
binocular rivalry.
• Volition
Another important indicator of the potentially cognitive nature of binocular rivalry is its
partial susceptibility to voluntary control. I already mentioned that although the alternations are
spontaneous and inevitable, the observer can still bias the dynamics by actively
trying to hold one percept. Intention therefore plays a critical role in perceptual alternation. Van
Ee et al. analyzed the influence of volition on dominance durations for a series of dichoptically
presented stimuli (van Ee, van Dam, & Brouwer, 2005). Overall, they showed that voluntarily
trying to hold one percept lengthened the dominance duration of this percept.
This result, however, depends on the type of stimulation used: when presented with grating
stimuli, observers are less able to willfully influence the temporal dynamics of the oscillations than
when presented with rivaling house and face images. The active role of volition in the perceptual
decision favors a rather high-level interpretation of visual rivalry.
• Stimuli type
Is the type of stimuli used important, or is the phenomenon of binocular rivalry
independent of the nature of the images presented? Put differently, I already mentioned that
stimulus strength plays a role in the temporal dynamics of binocular rivalry (see Levelt's laws in
1.2.1a), but can higher-level information in the stimulation modify the dynamics? Jiang et al.
analyzed the time a suppressed stimulus takes to break suppression (i.e., to become dominant)
while varying the nature of this suppressed stimulus, which could be either familiar (an upright face
or a text written in the native language of the subject) or unfamiliar (an upside-down face or a text in
an unknown foreign alphabet) (Jiang, Costello, & He, 2007). A familiar stimulus
tended to gain dominance faster than an unfamiliar one, showing that the semantic
content of the suppressed stimulation plays a role in the oscillation dynamics of rivalry.
Emotional stimuli have also been shown to be more likely to gain awareness in
binocular rivalry (Yang, Zald, & Blake, 2007). This result is consistent with the fMRI findings
on amygdala activation under rivaling stimulation (see 1.2.3c).
These results imply that the suppressed stimulus, although unconscious, is still somehow
processed by the visual system, since high-level information contributes to the stimulus strength
during its suppression phase. This conclusion goes against the low-level interpretation of
binocular rivalry. It also draws attention to the importance of the nature of the stimulation
used: the underlying processing of stimuli during rivalry could very well depend on their intrinsic
nature.
b Multistability as behavior
A major epistemological breakthrough occurred when Leopold and Logothetis proposed to
interpret binocular rivalry as a behavior, offering a fresh interpretation of the high-level
framework. In their seminal review (Leopold & Logothetis, 1999) they suggest that "spontaneous
alternations reflect responses to active, programmed events initiated by brain areas that integrate
sensory and non-sensory information to coordinate a diversity of behaviors." According to their
position, while the perception of an ambiguous stimulus ultimately depends on the activity of
sensory cortices, this activity is continually steered and modified by central brain structures
involved in planning and generating behavioral actions.
In order to back their thesis, Leopold and Logothetis review the data in favor of the high-level
approach. But this alone would not suffice to qualify binocular rivalry
as "behavior". They therefore add an interesting analysis that focuses on the temporal dynamics of
binocular rivalry. The authors highlight the close similarity between the temporal dynamics of
perceptual reversals and those of a variety of spontaneously generated visuo-motor behaviors. The
stochastic aspect of the temporal dynamics of binocular rivalry resembles the randomness of
many exploratory behaviors that emerge from the integration of a large number of sensory and
internal variables. In particular, they note that the dynamics of free viewing (the distribution of
fixation durations between saccades) are stochastic, with the characteristic, also observed in
visual rivalry, that the duration of one fixation has no significant effect on that of the next.
Perceptual reversals could therefore be linked to the more general class of exploratory
behaviors and would thus be related to high-level information processing.
c Imaging
In the past 20 years, research on the underlying mechanisms of binocular rivalry has
strongly focused on brain imaging techniques.
• Electroencephalography (EEG)
Few EEG studies have tackled the problem of binocular rivalry. By placing electrodes over the
occipital lobes, it was possible to record the average activity during visual rivalry, which enabled
some authors to conclude that suppression resulted in a reduction in amplitude of the visually
evoked responses (VER). However, since the recorded signal stemmed from both the right and
left eyes, the VER could not be linked with a specific percept.
Brown and Norcia introduced an innovative method to link the recorded
VER with the rivaling stimuli (Brown & Norcia, 1997). They used two dichoptically viewed,
orthogonally oriented gratings whose contrasts were modulated at different rates. By doing so,
they were able to tag the VER waveforms associated with the two gratings. The EEG signal was
then recorded over the occipital lobe while observers reported the perceptual reversals. The
authors could thereby show that the VERs associated with the two gratings displayed inversely
related amplitude modulations, tightly phase-locked with the perceptual reports of dominance
and suppression.
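The logic of this tagging analysis can be illustrated on synthetic data. The sketch below is not Brown and Norcia's actual pipeline; it assumes made-up tag frequencies, a toy dominance time course, and a simple sliding-window Fourier amplitude to recover the two antiphase envelopes:

```python
import numpy as np

# Synthetic "EEG": two tags at f1 and f2 whose amplitudes are modulated in
# antiphase by a slow, square-wave-like dominance signal (all values invented).
fs = 500.0                       # sampling rate (Hz)
t = np.arange(0, 20, 1 / fs)     # 20 s of signal
f1, f2 = 7.0, 9.0                # tag frequencies (integer cycles per window)
dominance = (np.sin(2 * np.pi * 0.25 * t) > 0).astype(float)
env1 = 0.2 + 0.8 * dominance           # grating 1 tag is strong when dominant
env2 = 0.2 + 0.8 * (1 - dominance)     # grating 2 tag is strong otherwise
eeg = env1 * np.sin(2 * np.pi * f1 * t) + env2 * np.sin(2 * np.pi * f2 * t)
eeg += 0.3 * np.random.default_rng(0).standard_normal(t.size)  # sensor noise

def tag_amplitude(x, f, fs, win=1.0):
    """Sliding-window Fourier amplitude of x at frequency f."""
    n = int(win * fs)
    carrier = np.exp(-2j * np.pi * f * np.arange(n) / fs)
    return np.array([2 * abs(np.dot(x[s:s + n], carrier)) / n
                     for s in range(0, x.size - n, n)])

a1 = tag_amplitude(eeg, f1, fs)
a2 = tag_amplitude(eeg, f2, fs)
r = np.corrcoef(a1, a2)[0, 1]
print("correlation between the two tagged amplitudes:", round(r, 2))
```

In the recovered amplitude series the two tags modulate in opposite directions (strongly negative correlation), mirroring the inversely related VER modulations reported in the study.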
Figure 7 Tagged VER associated with rivaling gratings (Blake & Logothetis, 2002)
EEG studies put forward a correlation between brain activity and perception during
rivalry but do not provide any information on where in the visual pathways the competition takes
place. EEG signals are averaged over the occipital pole and reflect the activity of large networks
whose precise localization remains unknown.
• Magnetoencephalography (MEG)
MEG, supposed to provide somewhat better source localization than EEG, faces the
exact same tagging problem as EEG. The solution found consisted in using
a frequency-tagged neuromagnetic response, obtained by flickering two dichoptically viewed,
orthogonally oriented gratings at different rates (Tononi, Srinivasan, Russell, & Edelman, 1998).
Their study revealed strong rivalry-related responses throughout the occipital cortex as well as at
some anterior temporal and frontal sites. Although the precise origin of the rivalry-related
responses cannot be determined, Tong suggests that their widespread nature indicates that rivalry
interactions occur at an early stage of visual processing, leading to similar rivalry effects at both
occipital and anterior sites (Tong, 2005).
• Functional Magnetic Resonance Imaging (fMRI)
fMRI provides the advantage of a more precise localization of the neural signals involved
in perceptual reversals, dominance and suppression. It is indeed possible to identify brain regions
in which blood oxygen level dependent (BOLD) signals fluctuate in synchrony with binocular
rivalry alternations. I will focus on the main experiments that together provide an accurate picture
of the diversity of the empirical data gathered.
The first significant fMRI work on binocular rivalry identified the loci whose neural
activity correlates with the occurrence of a perceptual transition, that is, brain activations
correlated with points in time when observers experienced changes in rivalry state, rather than
with the particular perceptual state being experienced (Lumer, Friston, & Rees, 1998). Lumer et al.
pinpointed the extrastriate cortex, the fusiform gyrus, and several frontal and parietal areas,
but not the striate cortex, as related to subjects' perceptual transitions. The authors concluded that
transitions might be instigated by fronto-parietal areas, although no test of causality had been
carried out. This activation of a fronto-parietal network has been interpreted more cautiously by
Leopold and Logothetis as evidence that "these areas are actively involved in binocular rivalry, and
furthermore that their participation was specific to multistable viewing, as they were not active in
a control passive-viewing condition" (Leopold & Logothetis, 1999).
In the ventral temporal cortex, the fusiform face area (FFA) tends to respond
preferentially to pictures of faces, whereas the parahippocampal place area (PPA) responds to
images of indoor and outdoor scenes. Tong et al. therefore designed a binocular rivalry experiment
using images of faces competing with images of houses (Tong, Nakayama, Vaughan, & Kanwisher,
1998). The results showed that the activity in the FFA and the PPA closely reflected the observer's
perceptual state. When the face was dominant in rivalry, activity levels were relatively high in the FFA
and low in the PPA, and vice versa. Interestingly, the activity levels in both the FFA and the PPA were
similar to those observed when images of houses and faces were externally switched in order to
mimic rivalry (see Figure 8). According to Blake and Logothetis, this result suggests that rivalry is
fully resolved by the time signals arrive at these stages of processing (Blake & Logothetis,
2002).
Figure 8 Activity in the ventral temporal cortex correlates with perceptual states (Tong et al., 1998)
Polonsky et al. investigated the neural correlates of binocular rivalry within the visual
cortex (Polonsky, Blake, Braun, & Heeger, 2000). They used a contrast difference between the two
dichoptically presented images in order to tag the BOLD signal corresponding to each image:
activity in the visual cortex increased when the subject saw the high-contrast image and decreased
for the low-contrast image. By analyzing those fluctuations and comparing them to a stimulus-flip
condition, during which the images are externally switched without rivalry, the authors observed
rivalry-related fluctuations in V1 that were roughly equal to those observed in other visual
areas (V2, V3, V3a and V4v). Two conclusions are compatible with such results: either the
neuronal events underlying rivalry are initiated in V1 and then propagated to other visual areas
(an interpretation corresponding to the bottom-up approach), or those neuronal events are
initiated at later stages and then propagated back to V1 via feedback. The authors, however, were
unable to decide between the two based on this experiment, and note that both processes could
occur, since "local interactions among V1 neurons may trigger the perceptual alternations during
rivalry, whereas interactions in later visual areas may reinforce the neuronal representations of
coherent percepts, just as they do during normal vision."
An elegant way to test whether binocular rivalry can be traced back to eye competition
was proposed by Tong and Engel (Tong & Engel, 2001). The idea consisted in presenting
rivaling gratings in the portion of the visual field that corresponds to the blind spot, a monocular
region of the primary visual cortex that greatly prefers stimulation of the ipsilateral eye over that
of the blind-spot eye. Interestingly, unlike the eye-specific columns of human V1, which are
extremely narrow, the blind-spot region is large enough for reliable functional imaging. In this
monocular region, as predicted by the bottom-up, eye-competition approach, activity correlated
with the perceptual state of the observer: the blind-spot representation was activated when the
ipsilateral grating became perceptually dominant and suppressed when the blind-spot grating
became dominant. This modulation was just as strong as that evoked by physical alternations of
the stimuli. This study brought strong empirical evidence in favor of the low-level approach, since
it led to the conclusion that rivalry can fully suppress monocular responses to an unperceived stimulus.
In a study focusing on the lateral geniculate nucleus (LGN), Haynes et al. used high-
resolution fMRI to look for evidence of eye competition (Haynes, Deichmann, & Rees, 2005);
the LGN has indeed often been thought to be a locus where interocular competition could take place.
Regions showing a strong preference for stimulation from a specific eye displayed significant
activity suppression during binocular rivalry when the stimulus presented to their preferred eye
was perceptually suppressed. This study therefore strongly backs the low-level approach by
suggesting that eye rivalry could take place as early as the LGN, and hence that suppression
occurs at the very first levels of visual processing.
The amygdala, however, which is known to process emotional stimuli, responds more
strongly to fearful faces than to neutral stimuli even when those stimuli are suppressed from
awareness by rivalry (Williams, 2004). This very important result shows that, contrary to the data
drawn from the face-selective regions, rivalry is not fully resolved outside the visual cortex. Some
information concerning the suppressed percept can persist and reach subcortical brain
areas, although this neural activity is insufficient to support visual awareness.
• Electrode recording
Electrode recording has been used to monitor the neuronal activity in various brain areas
while animals were reporting their percepts during binocular rivalry.
Contrary to the previously mentioned fMRI study on the implication of the LGN in
binocular rivalry (Haynes et al., 2005), Lehky and Maunsell (Lehky & Maunsell, 1996), using
single-unit recordings, could not find any evidence of rivalry-related inhibition in this structure.
In a series of experiments, Logothetis et al. (Leopold & Logothetis, 1996; Logothetis &
Schall, 1989; Sheinberg & Logothetis, 1997) recorded spiking activity in many cortical areas,
including the striate cortex (V1), the extrastriate areas V2 and V4, the middle temporal area
(MT), the medial superior temporal area (MST), the inferotemporal cortex (IT), and the upper
and lower banks of the superior temporal sulcus (STS). The animals had previously been trained to
pull a lever in association with seeing each pattern. The stimuli were specially tailored to the
preferences of the neuron being monitored: an excitatory (preferred) stimulus rivaled with a
non-excitatory (null) stimulus. Despite the unchanging nature of the retinal input, the activity of
subsets of neurons was shown to be modulated by the monkey's internally generated perceptual
changes. Most recorded neurons, however, responded equally to both perceptual states, as if the
unchanging retinal input were the only factor determining their firing.
Interestingly, the percentage of cells whose activity was modulated by the perceptual state
differed significantly across areas. Figure 9 shows the proportions of percept-related cells
in each area. While only a small fraction of neurons in the early stages of the visual system (V1
and V2) show activity locked to the perceptual alternations, this proportion is higher in the
extrastriate areas (V4, MT, MST). Finally, in the temporal lobe (IT, STS), almost all neurons
fire in concert with the perceptual changes. This result suggests, as did (Tong et al., 1998)
for fMRI, that the temporal cortex lies beyond the resolution of the perceptual conflict (Blake &
Logothetis, 2002). Another lesson seems to be that the emergence of neural loci whose activity
correlates with a perceptual state is more a continuous construction along the visual pathway than
a property of high- or low-level structures.
Figure 9 Proportions of percept-related cells (Leopold & Logothetis, 1999)
• Conclusions
Imaging studies represent a large share of the work done on binocular rivalry over the
past 20 years. Imaging, however, did not provide unanimous
evidence backing one theoretical framework or the other. EEG and MEG studies suffer from their
low spatial resolution and from the lack of reliable algorithms to isolate the sources of the signal. fMRI
data are somewhat more useful for investigating the neural correlates of binocular rivalry but lead to
conflicting results, with some studies advocating a complete resolution of the competition at
the monocular level (Tong & Engel, 2001), while others show persistence of the
suppressed stimulus outside the visual cortex (Williams, 2004) or insist on the potential role of the
prefrontal cortex in triggering the perceptual switches (Lumer et al., 1998). fMRI data are also in
conflict with some single-cell recording studies; in particular, concerning the LGN, which is supposed
to play a key role in the low-level approach, the two techniques provide contradictory empirical
data.
Imaging studies did, however, bring some positive results. First, it seems that
suppression is not localized but rather builds up continuously, at least across the visual areas (see
Figure 9). Second, the fact that suppression is fully achieved in the temporal cortex but not in the
amygdala suggests that there exist different processing paths for the visual input and that not all of
them are equally affected by the alternation of dominance and suppression that characterizes
binocular rivalry. Overall, these results do not back one particular approach but rather call for a
reformulation of the theoretical framework.
1.2.4 Hybrid view: multilevel hypothesis
Empirical data do not provide conclusive evidence to settle the debate between the
two main frameworks, and efforts have therefore been made to build an alternative approach.
This new conceptual approach consists in a hybrid view (Long & Toppino, 2004;
Tong et al., 2006), since it relies on hypotheses belonging to both high- and low-level theories. As I
mentioned above (see 1.2.2b), the bottom-up model insists on the importance of reciprocal
inhibition between competing visual neurons located early in the visual processing stages (most
likely monocular neurons). This inhibition, however, fluctuates over time through the
phenomenon of neural adaptation: one set of neurons maintains dominance only temporarily,
until it can no longer inhibit the activity of competing neurons, leading to a reversal in
perceptual dominance. This notion of competition through mutual inhibition remains central in
the hybrid model. The top-down model and some of the empirical data, on the other hand,
point towards a competition that is not fully resolved by the early stages of the visual
system and that occurs at different levels of the visual pathway. The hybrid view therefore
delocalizes competition along the whole visual system, and even beyond, and insists
on multilevel processing. Inhibitory interactions could take place among monocular neurons
(eye competition) as well as among pattern-selective neurons (pattern competition).
It is essential to understand that in this view, the emergence of a unique percept is not
due to a single XOR competition as in the model proposed by Blake (Blake, 1989), which epitomizes
the low-level models of binocular rivalry. The selection of a single percept results from a
complex network of competitions between different neural populations coding for patterns at
various levels of visual analysis. Figure 10 shows the simplest of such models, involving only
two levels: monocular neurons and binocular neurons. Causality in such a model is blurred
compared to a typical low-level competition: it is no longer possible to decide which
level triggers the switches. The oscillations emerge from the global network in a delocalized
fashion. A more realistic model would be far more complex, involving many levels (Freeman,
2005) as well as feedback connections from higher levels onto lower levels. Multilevel processing is
consistent with anatomical, physiological, and psychophysical evidence suggesting that the visual
system is characterized by a network of feedforward and feedback connections enabling signal
exchanges between neural levels.
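To illustrate how alternations can emerge from such a network without any single triggering level, here is a speculative two-level extension of a reciprocal-inhibition rate model: monocular units compete at level 1, pattern units compete at level 2, level 1 feeds level 2, and level 2 sends excitatory feedback down. The equations and all parameters are my own illustrative choices, not Wilson's (2003) published model:

```python
import numpy as np

def simulate_two_level(steps=20000, dt=0.001, tau=0.02, tau_a=1.0,
                       beta=2.0, phi=2.5, w_fb=0.5):
    """Monocular units m and pattern units p; each pair is mutually
    inhibitory (beta) and adapts (phi); w_fb is top-down feedback.
    All values are illustrative assumptions."""
    m = np.array([0.6, 0.4]); a_m = np.zeros(2)   # level 1 rates, adaptation
    p = np.zeros(2); a_p = np.zeros(2)            # level 2 rates, adaptation
    dominant = []
    for _ in range(steps):
        # level 1: external input 1.0, rival inhibition, feedback from level 2
        drive_m = 1.0 - beta * m[::-1] - phi * a_m + w_fb * p
        # level 2: driven by the matching monocular unit, rival inhibition
        drive_p = m - beta * p[::-1] - phi * a_p
        m += dt / tau * (-m + np.maximum(drive_m, 0.0))
        p += dt / tau * (-p + np.maximum(drive_p, 0.0))
        a_m += dt / tau_a * (-a_m + m)
        a_p += dt / tau_a * (-a_p + p)
        dominant.append(int(p[0] > p[1]))
    return np.array(dominant)

dom = simulate_two_level()
reversals = int(np.sum(np.abs(np.diff(dom))))
print("pattern-level reversals in 20 s:", reversals)
```

Reversals here arise from the joint dynamics of both levels: adaptation at either level can tip the balance, so no single site "decides" the switch, in the spirit of the delocalized competition described above.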
Figure 10 Two level hybrid model (Wilson, 2003)
An important point I would like to insist on concerns the nature of the
suppressed stimulus. According to the hybrid view, it is possible that the rival stimulation leads
only to partial suppression of the inputs from one eye at the monocular level, which is consistent
with the empirical studies that found neural activity corresponding to the suppressed stimuli
beyond the LGN and V1. If the low-level suppression is not total, a persisting neural signal is passed
on to higher stages of processing, where visual competition continues. The nature of the
stimulations could play an important role in determining the loci of competition: it is indeed
possible that, depending on the content of the visual stimulation, competition takes place earlier
or later. This could explain part of the disparity found in empirical studies.
Tong et al. (Tong et al., 2006) insist mainly on the multilevel aspects of competition
within the visual areas, but it seems that we know very little about the actual loci of competition,
which may very well involve areas outside the visual pathway, as suggested by the top-down model. In
an article attempting to understand the phenomenon of multistability in general, including
binocular rivalry, Long and Toppino propose an attractive multilevel hybrid model (see
Figure 11) that explicitly includes higher-level global processes impacting the delocalized
competition. In this view, higher-level cognitive factors are not engaged directly in the
multistage network of competition itself but send signals to all of its stages.
Figure 11 Multi-level model of multistable perception (Long & Toppino, 2004)
1.3 Multisensory bistability
The gist of the work I want to present here consists in an innovative use of multisensory
integration as a tool to investigate the mechanisms underlying binocular rivalry. The recent
conceptual switch to a hybrid model could be considered more of a drawback than a
breakthrough. By recognizing that the complexity of binocular rivalry cannot
be captured by simple top-down or bottom-up interactions alone, the hybrid model calls for fresh
investigations into the nature of the suppressed stimulus. More precisely, a necessary conclusion of
the hybrid model is that, for at least some categories of stimulation, it must be possible to find
correlates of the suppressed stimulus at intermediate levels of visual processing. Multisensory
integration, and especially audio-visual integration, will prove to be a powerful tool to probe
the loci of suppression.
1.3.1 A few insights from the literature
The literature on bistability in a multisensory context is relatively sparse. I will here
extend the field of investigation to forms of bistability other than binocular rivalry, since
experimental approaches focusing on other bistable phenomena offer relevant
indications both on the type of multisensory percept that could be used and on the main
aspects of bistability that can be challenged by adding sound to classic visual bistable percepts.
a Speech perception and multistability
The nature of the audio-visual stimulation used is crucial. As I will show in the next two
paragraphs, the strength of the audio-visual congruency might often be insufficient, due to its
artificial nature, to provide interesting results. Munhall et al. (Munhall, ten Hove, Brammer, &
Paré, 2009) present an experimental design that enabled them to associate multistability with the
best-studied case of audio-visual integration: speech perception. To achieve this goal, they relied
on the McGurk effect: in their experiment, when presented with the sound /aga/ in the auditory
modality while seeing lips uttering /aba/, the subjects reported hearing /abga/ (a more detailed
presentation of the McGurk effect is given in 1.3.2a). Using this audio-visual effect enabled
them to ensure a strong multisensory integration based on a natural association rooted in the
everyday experience of speech perception.
For the visual modality, the experiment is based on the classic face-vase illusion. A
rotating black vase against a white background can be seen either as a vase or as two talking faces.
These talking faces, whose lip movements are a consequence of the vase's rotation, utter /aba/.
In synchrony with the lip movements, the sound /aga/ is played. As in the static face-vase
illusion, the subject alternately sees the vase and the faces. Can the McGurk effect be observed
when the talking faces are not explicitly seen, i.e., when they form the background of the perceived
rotating vase? According to this study the answer is no: for the McGurk effect to occur,
the lips must be consciously seen.
This result can be turned the other way round and give information on bistability. When
not consciously seen, the percept does not reach the levels where the audio-visual integration
occurs. Is that the case for binocular rivalry? If the suppression does not occur strictly at early
levels of processing, it might be possible to detect multisensory integration with the suppressed
stimulus. This would provide compelling evidence in favor of a possible late suppression.
There are reasons to question the quality of Munhall et al.'s experiment. Mainly, the
poverty of the visual stimulus casts doubt on the actual strength of the McGurk effect reported:
no cue other than a side view of the mouth opening is available (no teeth or tongue,
which are known to play an important role in lip reading and therefore in audio-visual
speech integration). By using a more detailed stimulation, it might be possible to observe
integration with the suppressed stimulus. Moreover, it is important to note that the authors
cannot reject the null hypothesis; they can therefore only conclude that they could not find a
McGurk effect with the unconscious percept, not that this effect does not occur.
b Volition is improved by multisensory percepts
Although volitional control over alternations remains quite weak in binocular rivalry, a
certain amount of control is still available to the subject. Multisensory rivaling percepts could
provide a powerful tool to investigate the mechanism underlying this volitional effect.
Accordingly, van Ee et al. (van Ee, van Boxtel, Parker, & Alais, 2009) used two rivaling
stimuli, one congruent and one incongruent with the presented sound. The congruent visual
stimulus was a "looming pattern" (a concentric sine-wave pattern looming at 1 Hz); the
incongruent one was a propeller-like radial pattern. I want to insist on the fact that, contrary to
the experiment described earlier, the audio-visual congruency here is much more artificial: as the
authors show, congruency can be reduced to temporal synchrony.
The first important result of this paper states that a congruent sound improves the
capacity of the subjects to willfully hold one percept. The authors compared the average
dominance duration of the percepts between a purely passive condition and a volitional condition
(during which subjects are asked to hold one percept) and repeated this comparison with and
without sound. A volitional effect (increase in dominance duration) was observed in both cases
for the congruent percept but this effect was significantly stronger for the sound condition.
Contrary to what the authors seem to suggest, however, an incongruent sound had no impairing
effect on volition for the other percept.
Interestingly, the authors noted that the sound had no effect on dominance durations in
the passive viewing condition. Following these results, van Ee et al. tried to identify the
source of the improvement in volitional control and concluded that it stemmed from the
temporal synchrony between the sound and the visual stimulation. Temporal congruency is the
only aspect that matters in their experiment, which shows that the audio-visual integration involved is
rather low level. It would be of great interest to investigate whether the results remain the same
for audio-visual percepts congruent at a higher level, such as in audio-visual speech integration.
The authors also studied the importance of orienting the attention of the subject towards
the sound. They found that the volitional improvement due to congruent multisensory
stimulation required the subject to actively pay attention to both modalities. Indeed sound can
improve volitional control if attention to this additional modality is actively engaged. It does so
by increasing the dominance duration of the congruent visual percept. However, the audio-visual
congruency used here is reduced to its simplest form. Promoting multisensory interactions at a
higher level of processing could lead to potentially stronger effects and, for example, make it
possible to detect an effect of sound even in a passive viewing condition.
c A common oscillator for all perceptual decisions?
Another clever use of multisensory bistability consists in combining two bistable percepts:
an auditory and a visual bistable percept. This is what Hupé et al. did in an attempt to study
whether the perceptual decision resulting in a switch is modality dependent or whether a
supramodal oscillator governs the oscillations in both modalities (Hupé, Joffo, & Pressnitzer,
2008). The idea of a supramodal oscillator stems from the observed similarity of
dynamics across the different bistable phenomena (see 1.1.4).
As bistable percepts, they used auditory stream segregation (see 1.1.2b) in the
auditory modality; in the visual modality, they used two LEDs, one flashing at the center of
each of the two speakers playing the sound, so that one LED flashed in synchrony with the
low-pitch tone and the other in synchrony with the high-pitch tone. Consequently, in the visual modality, the
subject observed either two lights flashing separately (equivalent of the two streams percept in
the auditory modality) or an apparent motion of a light moving from one speaker to the other
(equivalent to the one stream percept in the auditory modality).
The subjects experienced switches both in the auditory and in the visual modality. The
authors analyzed whether those switches occurred in synchrony. The answer is negative: although
a switch in one modality tends to trigger a switch in the other, the switches cannot be said to
be synchronous. The hypothesis of a single supramodal oscillator triggering switches in both
the auditory and visual modalities therefore seems unlikely.
However, once again the audio-visual congruency is rather weak and one would need to
investigate whether reinforcing the multisensory aspect of the stimulation could lead to
synchronous oscillations.
1.3.2 The McGurk effect: a multisensory illusion
The series of experiments we developed and that I will present in the next section are
based on the preceding conclusions. They combine binocular rivalry with sounds in a design
using audio-visual speech integration. I will therefore present here the McGurk effect, the
phenomenon on which all our experiments will be based, and propose a very general
overview of the current understanding of audio-visual speech binding.
a The McGurk effect
It has been acknowledged for many years that watching a speaker can be beneficial for
speech understanding. In a now classic experiment carried out fifty years ago, it was
established that seeing the speaker could lead to an improvement in the comprehension of
auditory speech in noise equivalent to that produced by an increase of up to 15 dB in signal-to-
noise ratio. From these results stemmed the interpretation of audio-visual speech processing as a
phenomenon apparent only at low signal-to-noise ratios.
This interpretation changed when it was first shown that the perception of certain
speech segments could be strongly influenced by vision even when acoustic conditions were
good. The accidental discovery of the McGurk effect provided the first example of
audio-visual pairings leading to illusory percepts (McGurk & MacDonald, 1976). For
people susceptible to it, the McGurk effect appears for certain incongruent presentations
of lip movements and sound. Various versions of this effect exist; I will present here only the
two most common ones:
- When an auditory /ba/ is dubbed onto a visual /ga/, listeners report hearing a so-called
blend percept, /da/.
- When an auditory /ga/ is dubbed onto a visual /ba/, listeners report hearing a
combination percept such as /bga/.
Clearly, such an illusory perception induced by the addition of visual information to
auditory speech calls for a more complex interpretation of audio-visual binding in speech
perception. It appears that the construction of the conscious audio-visual verbal percept results
from a deep integration of the information provided by both modalities. I will therefore quickly go
through some major ideas concerning the multisensory aspect of speech perception.
b A general overview of audio-visual speech perception
Decisive for our work is understanding the level at which audio-visual speech binding
occurs. We will indeed use multisensory verbal material to probe the mechanism
underlying binocular rivalry and assess whether the suppressed stimulus is available late in the
visual pathway. The question is: can the suppressed stimulus reach the areas in charge of
crossmodal integration for speech perception? It is therefore necessary to pinpoint where this
integration takes place.
• Early sensory cortices
It is widely accepted that multisensory binding occurs at least partially in the so-called
higher association cortices that include the superior temporal sulcus, the intra-parietal sulcus and
regions in the frontal lobe. In this view, a large part of the brain is often reduced to a collection
of unisensory systems that can be studied in isolation. However, Kayser and Logothetis point out
that accumulating evidence challenges this position and suggests that areas hitherto regarded as
unisensory can be modulated by stimulation of several senses (Kayser & Logothetis, 2007). Does
this mean that audio-visual speech integration should be thought of as occurring as early as the
unisensory cortices? In his review, Campbell notes that activation of the primary auditory cortex
has been found during silent lip reading (Campbell, 2008). However, the extent to which this
activation is specifically linked to speech-like events remains to be investigated. Kayser et al. also
report an EEG study suggesting that there could exist neuronal correlates of the McGurk
illusion as early as the classical auditory areas. They nevertheless specify that “the coarse nature of
this method leaves doubts about the localization of these effects, asking for methodologies with
better spatial resolution.”
Overall, Kayser et al. conclude against the existence of early cross-modal
integration: while some cross-modal effects certainly take place within the unisensory
areas, those effects do not correspond to multisensory binding.
• The central role of the superior temporal sulcus (STS)
The posterior part of the superior temporal sulcus (pSTS), one of the aforementioned
higher association areas, has been consistently pinpointed as a primary binding site for audio-
visual speech processing (Bernstein, Auer, & Moore, 2004; Campbell, 2008). Its location at the
crossroads of the auditory and the visual streams makes it an ideal candidate for this role. Apart
from being consistently activated by audio-visual speech perception, the left pSTS has been
shown to potentially display differential activation for congruent and incongruent audio-visual
speech. However, much more investigation would be required to better understand the role of
this area.
• The motor cortex
Another interpretation of audio-visual speech processing involving the motor cortex was
proposed by Skipper et al. in an attempt to explain the McGurk effect (Skipper, van Wassenhove,
Nusbaum, & Small, 2007). In an imaging study, the authors showed that audio-visual speech
perception seems to occur in many of the same areas that are active during speech production.
The McGurk effect would result from the resolution of the mismatch between the motor plan
built by the listener upon seeing a certain lip movement and the auditory information they receive.
A precise understanding of the mechanisms underlying audio-visual speech binding still
appears to be out of reach. However, many studies converge in localizing
multisensory binding away from the early sensory processing stages. This idea
alone justifies our attempt to use audio-visual speech stimulation in a binocular rivalry
experiment. Indeed, any cross-modal effect involving the suppressed stimulation would imply its
persistence up to the levels where the binding eventually takes place.
1.3.3 Conclusions
There exist few experiments using a multisensory design to study multistability. However,
based on the ones I just mentioned I would like to draw some conclusions useful to our purpose.
First, most studies rely on low level multisensory congruency (mainly temporal or spatial
congruency) and it seems crucial to develop a new way to associate a bistable visual stimulation
with sound that could rely on a stronger, higher level multisensory congruency. We chose to use
the most natural and best-studied audio-visual integration: speech perception through the
McGurk effect. Second, if multisensory integration could be found between a conscious auditory
stimulation and a suppressed visual stimulation in the case of binocular rivalry, this would
provide compelling evidence against the low level approach and favor the hybrid model where
the suppressed stimulation can potentially persist to intermediate and higher levels of processing.
Finally, volitional control in binocular rivalry seems to be affected by the addition of a
congruent sound, but no effect is detected in passive viewing. Whether these results remain true
for highly congruent audio-visual verbal material needs to be verified.
2 Materials and methods
2.1 General material and methods
2.1.1 Stimuli design
a General content: McGurk effect
The gist of the series of experiments I will present here is the association of
binocular rivalry with the McGurk effect. To achieve this, it is necessary to induce
binocular rivalry not between static images, as is usually the case, but between videos.
The material used consisted of 1.64 s (25 fps) videos of the face of a woman uttering either the
sound /aba/ or /aga/. The McGurk effect could then be induced by presenting the lip
movement /aga/ in synchrony with the sound /aba/ (extracted from the /aba/ video). For
McGurk-sensitive subjects, this would indeed lead to the perception of an auditory /ada/.
b How to build rivaling videos?
Binocular rivalry was chosen among the variety of multistable stimuli for the large
amount of data available concerning this effect but also for its relative flexibility in terms of
rivaling percepts. Contrary to most multistable phenomena, binocular rivalry does not rely on
specific percepts but rather on a specific way of presenting images, the content of which can be
determined by the experimenter. However, some constraints need to be taken into
consideration in order to ensure that proper rivalry is achieved. Mainly, those constraints are the
following:
- The two rivaling images should overlap.
- The strength of the rivaling images should be balanced: both images should be
rather similar in contrast, brightness, density of contours, etc.
Unbalanced strengths would not prevent the rivaling images from alternating in
conscious perception but would lead to the relative domination of one of the two
perceptual states.
- To avoid piecemeal rivalry, i.e., the spatial breaking of the perceived
bistable stimulus into multiple uncorrelated rivaling pieces, images should be small
in size (typically under 6°).
- Finally, the use of contrasting colors (one image red and the other green) can help
achieve clean rivalry and also ensures that subjects are able to easily report
any piecemeal rivalry.
The key aspect of our experiment is that we want to use videos as rivaling stimulations,
which represents a departure from classical binocular rivalry experiments. The
constraints applicable to static images therefore need to be extended to video material. Mainly:
- There should be a good overlap of both videos for each video frame.
- The strength of both videos should be balanced for each video frame.
However, in order for binocular rivalry to function with video stimulations,
knowledge drawn from the use of static images does not suffice. An understanding of how
motion interferes (or not) with binocular rivalry is also required.
The bulk of the studies on binocular rivalry have focused on whether the neural mechanism
for rivalry is located at an early or late stage of processing. However, little attention has been paid to
the fact that the visual system consists of two parallel pathways, the magnocellular and the
parvocellular pathways. This division raises the question of whether both pathways contribute
equally to binocular rivalry. The magnocellular pathway involves the motion-sensitive extrastriate
areas whereas the parvocellular pathway stems from retinal cells which project more dominantly
to extrastriate areas that are sensitive to form and color. Magnocellular cells are known to be
tuned to lower spatial frequency and higher temporal frequency signals than are the parvocellular
cells which are known to be color opponent. For these reasons, these two neural pathways are
thought to support two functionally distinct channels in vision. The magnocellular pathway is
generally considered to be involved in motion processing, whereas the parvocellular pathway is
often termed the color opponent channel, thus emphasizing its important implication in color
processing.
He et al. reviewed various lines of evidence concerning the involvement of both channels in
binocular rivalry and concluded that binocular rivalry is essentially supported by the
parvocellular pathway (He, Carlson, & Chen, 2005). They noted that “the visual system does
not seem to be willing to tolerate interocular conflicts in the P [parvocellular] pathway, with its
significant role in processing information related to object identity, and resorts to alternating
views of the two stimuli. Conflicts in the M [magnocellular] pathway are less likely to lead to
rivalry, with the visual system more willing to accept an integrated version of the inputs.” Color
and form contrast is therefore a reliable source of rivalry, whereas motion incongruence seems
more likely to lead to a blending of the two percepts.
Another important aspect of motion in the context of binocular rivalry is its tendency to
reinforce a stimulus: the quantity of motion contributes to the strength of a stimulation, along
with brightness, contrast, etc. This is exemplified by the so-called continuous flash suppression
effect, in which a static stimulus is suppressed for a rather long period of time by presenting
rapidly changing Mondrian-like stimuli to the other eye (Tsuchiya & Koch, 2005). Motion can
thus dramatically enhance the dominance duration of a stimulation if it is not counterbalanced
in the two eyes. Therefore, in order to ensure relatively equivalent dominance durations between
two percepts, they should be balanced in terms of quantity of movement.
Finally, it is important to note that transient movements (i.e., a rapid movement in one
percept while the other remains static) are known to trigger switches in favor of the moving
stimulus.
Incongruent motion does not offer a strong enough basis to induce binocular rivalry,
whereas color and contour contrast lead to maximal rivalry. Moreover, the quantity of
motion displayed in each video should always be equal to avoid transient effects, which would
artificially trigger perceptual switches. This implies that for videos to rival, they should be
rather congruent in terms of motion content but incongruent in terms of color.
Studies of color perception suggest that the red/green chromatic contrast
is one of the two main contrasts on which color vision rests. This contrast will consequently be used,
along with the black/white contrast, as the basis of binocular rivalry.
Achieving the successful use of video material as the basis of binocular rivalry is one of the
main goals of this experiment (although a technical one), since it would open the door to
innovative binocular rivalry studies beyond the present one.
c Visual stimulation
All visual stimulations are presented within a circular mask (diameter 5° with soft edge).
The chosen diameter is consistent with the values used in the literature, which typically range
from 1° to about 6°. Small diameters are necessary to avoid piecemeal rivalry. Stimulations are
presented against a black background.
Two raw 1.64 s videos were recorded at 25 fps of a woman uttering either the sound
/aba/ or /aga/. Lip features were extracted and color filters were applied so that for each lip
movement (/aba/ and /aga/) two videos were generated:
- a video of type 1: Black lips/Green face
- a video of type 2: White lips/ Red face
Figure 12 Movie types
The color difference between face and lips ensures good perception of the lip
movement, which is the most important feature for the McGurk effect. Color contrast between
videos (rivaling videos will always consist of a video of type 1 against a video of type 2) stems
from the considerations presented in 2.1.1b. When applying the video filters, care was taken
not to remove other features, such as the teeth and the tongue, which are known to play a role
in speech perception.
The videos of the /aba/ and /aga/ lip movements were designed so that both
movements are maximally synchronized (similar onset and duration of all the segments
corresponding to the various lip movements involved in uttering both sounds) and differ, in
the frames corresponding to the utterance of the consonants /b/ and /g/, only by the mouth
aperture (pronouncing the consonant /b/ requires the speaker to fully close her mouth, unlike
pronouncing the consonant /g/). With standard 25 fps videos, synchronization was bound
to a 40 ms precision. A cross-fade was applied between the last and first frames of the videos to
ensure that they could be looped without any discontinuity, since such a discontinuity could
trigger artificial switches. Since brightness could not be measured in situ, luminosity was
balanced between video types on the first frame using Photoshop's mean luminosity measure. We
hypothesized that this equivalence in luma values would translate into an equivalence of
brightness in the Head Mounted Display.
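The luminosity-balancing step can be sketched as follows. This is only a minimal illustration of the principle, not the actual Photoshop procedure: the Rec. 601 luma weights and the simple multiplicative gain correction are assumptions.

```python
import numpy as np

def mean_luma(frame_rgb):
    """Mean luma of an RGB frame (floats in [0, 1]), using Rec. 601 weights (assumption)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def balance_to_reference(frame_rgb, reference_luma):
    """Scale a frame so its mean luma matches the reference, clipping to [0, 1]."""
    gain = reference_luma / mean_luma(frame_rgb)
    return np.clip(frame_rgb * gain, 0.0, 1.0)

# Toy example: a dark uniform frame balanced against a brighter reference frame.
dark = np.full((4, 4, 3), 0.2)
reference = np.full((4, 4, 3), 0.4)
balanced = balance_to_reference(dark, mean_luma(reference))
```

In the experiment only the first frames of the two video types were compared; the sketch above would be applied to those frames.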
Videos were cropped to the circular dimensions mentioned above, with the center of the
circle corresponding to the center of the closed mouth. The full mouth is visible throughout the
whole video. Other than the mouth, only the tip of the nose is visible, the other facial features
lying outside the circle. This type of presentation forces the focus of attention onto the lip movement
and therefore is thought to maximize the possibility of a McGurk effect.
Experiment 1 uses static images: gratings and static lips.
- Gratings are sinusoidal black and white gratings with maximum contrast. The
frequency used is 1 cycle/deg. One grating is tilted 45° clockwise, the other
45° counterclockwise (see Figure 13).
- Static lips images consist of the first frame of the /aba/ video, types 1 and 2. They
share common visual properties with the videos described above (see Figure 12).
Figure 13 Gratings
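Such gratings can be generated procedurally. The sketch below, assuming an arbitrary pixel density of 25.6 px/deg (so that a 5° stimulus spans 128 pixels) and a tilt measured from vertical, builds a full-contrast 1 cycle/deg sinusoidal grating at ±45°.

```python
import numpy as np

def make_grating(size_deg=5.0, px_per_deg=25.6, cyc_per_deg=1.0, tilt_deg=45.0):
    """Sinusoidal grating with values in [0, 1] (full contrast); tilt from vertical."""
    n = int(round(size_deg * px_per_deg))
    coords = (np.arange(n) - n / 2) / px_per_deg        # pixel coordinates in degrees
    x, y = np.meshgrid(coords, coords)
    theta = np.deg2rad(tilt_deg)
    ramp = x * np.cos(theta) + y * np.sin(theta)        # axis orthogonal to the bars
    return 0.5 + 0.5 * np.sin(2 * np.pi * cyc_per_deg * ramp)

left_tilted = make_grating(tilt_deg=-45.0)    # presented to the left eye
right_tilted = make_grating(tilt_deg=+45.0)   # presented to the right eye
```

In practice the gratings would also be windowed by the 5° soft-edged circular mask described in 2.1.1c.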
d Auditory stimulation
Sound was synchronized between the /aba/ and /aga/ videos (with a precision of
40 ms) to ensure that the dubbing of the /aba/ sound onto the /aga/ video could be achieved.
2.1.2 Stimuli presentation
a Visual stimulation
• Head Mounted Display (HMD)
Visual stimulations are displayed using a dual-channel Head Mounted Display (HMD), an
nVisor SX, at 1280x768 with a field of view of 44°x34°.
Figure 14 Head Mounted Display
• Stimuli
To ensure good binocular fusion, as well as proper fixation of the lips, a fixation cross was
systematically added to the videos (black and white cross, diameter 0.3°). Fusion was also
facilitated by a random distribution of white squares around the stimuli (random distribution on a
40°x30° rectangle, square size 1°). Fusion could have been more difficult with the HMD than
with the classical stereoscope since the stimuli are virtually located at an infinite distance (infinite
focal distance) and no adjustment of vergence is possible. However, the combined effects of the
fixation cross, of the similar shape and border of the circular stimuli, and of the randomly
distributed squares guaranteed easy fusion.
Figure 15 Global stimulus display
b Auditory stimulation
Auditory input was played using ear canal phones (Sennheiser).
2.1.3 Subjects
14 subjects (10 men, 4 women), all right-handed, participated in the experiment. They were
all between 18 and 35 years old (average 26). Of the 14 subjects, 13 were naïve to the
purpose of the experiment and to the McGurk effect. Eye dominance was tested using a simple
alignment task: subjects were asked to align their right index finger with a point in the room while
keeping both eyes open, then to alternately close their right and left eyes and report which one
kept the alignment. 8 subjects showed right eye dominance, 6 showed left eye dominance.

Following the results of experiment 2, subjects were divided into two groups. All subjects
participated in all the experiments except experiment 4, in which only subjects from the McGurk
group were tested.
2.1.4 Experimental overview
The full experimentation consists of 5 experiments, distributed over two sessions of
about 1h30 each. Sessions were separated by at least 3 days.
The task the subject was asked to perform consisted of either a continuous report
task for the dominance duration studies (exp 1, 3 & 5), during 130 s of exposure to rivaling
stimulations, or a single report after each audio-visual stimulus presentation. Subjects reported
their responses by pressing the right or left arrow key on a keyboard with their right hand.
- For continuous report, the subject was put in binocular rivalry conditions and
asked to continuously press the key that corresponded to the perceived stimulus.
The subject always had the possibility of reporting piecemeal rivalry or any
ambiguous perceptual state simply by not pressing any key. For the grating
stimuli, subjects were asked to report whether the consciously perceived grating
was tilted right or left. In the lips conditions (static images or videos), subjects were
asked to report the color of the lips in the consciously perceived stimulus (black
or white).
- For single report, after the presentation of an audio-visual stimulus (involving
binocular rivalry or not), the subject was asked to press the key that best
corresponded to her audio-visual perception in a forced-choice paradigm.
In the continuous report experiment, one condition consisted of a volition test. The
instruction was then for the subject to keep reporting the perceptual alternations while trying to
hold a specific stimulus (black lips or white lips). The instruction was explicitly to do so without
using voluntary blinks or eye movements away from the fixation cross.
In order to avoid any association between lip movement and lip color, these two factors were
systematically balanced.
For the continuous report condition, the following steps were taken to clean the data:
- The first 10 seconds were systematically removed since they are known to often
correspond to an ambiguous piecemeal situation.
- Successive repetitions of the same key press were treated as corresponding to the
same dominance period if the delay between presses was less than
500 ms. Otherwise, the repetitions were considered as corresponding to different
dominance periods separated by switches to an ambiguous state.
- Rather than raw dominance durations, we focused on a normalized
version (dominance duration / subject's average dominance duration). This
normalized ratio ensures that the values can be compared across
subjects.
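The cleaning steps above can be sketched as follows, with each key press represented as an (onset, offset, key) tuple; the exact encoding of the raw key-press log is an assumption.

```python
def clean_dominance_periods(presses, trial_start=0.0, discard=10.0, merge_gap=0.5):
    """presses: list of (onset, offset, key) tuples sorted by onset, times in seconds.
    Returns normalized dominance durations (duration / subject mean duration)."""
    # 1. Discard the first 10 s, which often correspond to an ambiguous piecemeal phase.
    kept = [(on, off, k) for on, off, k in presses if on >= trial_start + discard]
    # 2. Merge repeats of the same key separated by less than 500 ms;
    #    longer gaps count as switches to an ambiguous state.
    merged = []
    for on, off, k in kept:
        if merged and merged[-1][2] == k and on - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], off, k)   # extend the previous period
        else:
            merged.append((on, off, k))
    # 3. Normalize by the subject's average dominance duration.
    durations = [off - on for on, off, _ in merged]
    mean = sum(durations) / len(durations)
    return [d / mean for d in durations]

# Toy log: the first press falls in the discarded window; the next two merge.
ratios = clean_dominance_periods(
    [(2.0, 5.0, "L"), (11.0, 14.0, "L"), (14.2, 16.0, "L"), (17.0, 20.0, "R")]
)
```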
2.2 Experiment 1: Binocular rivalry test (control experiment)
2.2.1 Objective
Some people simply do not experience bistability when put in binocular rivalry
conditions. In order to screen out such people, the first part of this experiment consisted of the
most classic binocular rivalry conditions, using orthogonal gratings as rivaling stimuli. A subject
who did not experience rivalry in this condition would be considered unable to pursue the other
experiments.
A first step toward testing the new stimuli built for this experiment consisted in using a
static version of the lips as rivaling images. This served as a preliminary control, assessing
whether in the static version the stimuli displayed classic bistable behavior, mainly randomness
and exclusivity, whose respective correlates are a gamma distribution of dominance durations
and a low proportion of piecemeal percepts. Classic behavior would partially validate the choices
made when building the stimulations. Moreover, for experiment 4, it was necessary to ensure
rivalry with static images.
2.2.2 Method
The experiment consisted of 7 continuous report trials of 130 s each. The first 3 trials
used static grating images as rivaling stimuli, the following 3 trials used rivaling lips images (see
2.1.2a). Finally, in the last trial, the same lips image was presented to both eyes (no binocular
rivalry) and the image was artificially switched (stimulus flip condition). Trials were separated by
breaks whose duration was controlled by the subject.
For the gratings condition, the left-tilted grating was always presented to the left eye and
the right-tilted grating to the right eye (no balancing was necessary). For the lips image condition,
although lip movement and lip color were balanced, the black lips were always presented to the
left eye and the white lips to the right eye, in order to avoid unnecessary balancing that would
have doubled the length of the experiment. For the stimulus flip condition, the durations of the
percepts were randomly drawn from a Gaussian distribution (mean = 3 s, σ = 2 s).
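The flip durations can be drawn as sketched below. The thesis does not state how non-positive draws from such a wide Gaussian (σ = 2 s around a 3 s mean) were handled, so the redraw-until-plausible rule and the 0.2 s minimum duration are assumptions.

```python
import random

def draw_flip_durations(total=130.0, mean=3.0, sigma=2.0, min_dur=0.2, seed=0):
    """Draw successive percept durations until the 130 s trial is filled.
    Draws below min_dur are rejected and redrawn (assumption)."""
    rng = random.Random(seed)
    durations, elapsed = [], 0.0
    while elapsed < total:
        d = rng.gauss(mean, sigma)
        if d < min_dur:          # reject negative or implausibly short draws
            continue
        durations.append(d)
        elapsed += d
    return durations

flips = draw_flip_durations()
```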
The experiment started with a training session of 4 trials, each lasting 60 s. Training trials
replicated the experiment's conditions: two trials were dedicated to the gratings condition, two
to the lips condition. For each stimulus type, the subject was first asked to simply look at the
rivaling stimuli and then to practice reporting the conscious percept. All subjects received the
same amount of practice.
Figure 16 Experiment 1 task timeline
2.2.3 Results and discussion
Only the subjects who experienced binocular rivalry for the classic gratings stimuli were
tested in the four other experiments.
To analyze the characteristics of the binocular rivalry induced by the static lips
stimuli, the distribution of normalized dominance durations (ratios) was plotted. The ratios were
grouped into bins of width 0.2 starting from 0. The distribution was
first determined for each subject and then averaged over all subjects, so that each subject
contributes with equal weight to the final distribution.
Figure 17 Normalized dominance duration distribution
Figure 17 presents the normalized dominance duration distribution. The x-axis
represents the normalized dominance durations (ratios) and the y-axis gives the frequency
observed for each ratio. A gamma-shaped distribution of dominance durations characterizes
binocular rivalry and suggests that successive dominance durations are independent of each
other. The fact that a gamma distribution was found for the new lips stimuli indicates that the
stimuli we designed induce, at least in the static version, the classic random bistability. Subjects
reported a low proportion of ambiguous percepts for the lips stimuli, 6.83% (SE = 1.50), which
indicates that the rivaling percepts are mutually exclusive, a necessary condition of binocular
rivalry.
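The distribution in Figure 17 can be reconstructed as follows: the normalized durations are binned in steps of 0.2, a per-subject frequency distribution is computed, and subjects are averaged with equal weight. This sketch assumes the ratios have already been computed per subject, and the 4.0 upper bound on the ratio axis is arbitrary.

```python
import numpy as np

def average_ratio_distribution(ratios_per_subject, bin_width=0.2, max_ratio=4.0):
    """Per-subject normalized histograms (frequencies, not counts), averaged
    across subjects so that each subject contributes with equal weight."""
    edges = np.arange(0.0, max_ratio + bin_width, bin_width)
    per_subject = []
    for ratios in ratios_per_subject:
        counts, _ = np.histogram(ratios, bins=edges)
        per_subject.append(counts / counts.sum())
    return edges, np.mean(per_subject, axis=0)

# Toy data: two subjects with four normalized dominance durations each.
edges, freq = average_ratio_distribution(
    [[0.5, 0.9, 1.1, 1.5], [0.3, 0.7, 1.0, 2.0]]
)
```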
The static lips image rivalry was also used to assess the relative strength of the percepts.
This relative strength can be evaluated by analyzing the difference in average dominance
durations between the two percepts, a longer dominance duration indicating a stronger percept.
The black lips stimulus was found to be stronger than the white lips one (F(1,12) = 13.03, p < 0.004).
Interestingly, the interaction between stimulus strength and eye dominance was significant
(F(1,12) = 9.80, p < 0.01). If the stronger stimulus is presented to the non-dominant eye, the
effect of stimulus strength on dominance duration is cancelled, whereas if the stronger stimulus is
presented to the dominant eye, the effects of eye dominance and stimulus strength sum to
increase dominance durations.
Finally, the stimulus flip condition proved that the subjects correctly performed the task and
were able to report what they saw (12 subjects correctly reported 100% of the percepts and the 2
others were above 95% correct responses). Moreover, in this condition it was possible to
measure the average reaction time for each subject (425 ± 18 ms averaged over all subjects).
2.3 Experiment 2: Test of the McGurk effect (baseline)
2.3.1 Objective
The purpose of this experiment was to test the susceptibility of the subjects to the
McGurk effect. Although this effect is one of the best-known multisensory effects, not everyone
shows the same sensitivity to it.
There are various versions of the McGurk effect, all consisting in dubbing an incongruent
sound onto a lip movement. For this experiment and the following ones we chose to focus on
the most robust McGurk effect, which consists in dubbing the sound /aba/ onto a lip
movement uttering /aga/. For people sensitive to this effect, this results in the
illusory perception of the sound /ada/ (see 1.3.2a).
This experiment served as a baseline for experiment 4, in which the same McGurk
effect was studied under binocular rivalry conditions.
2.3.2 Method
The experiment consisted of 4 blocks of 20 single reports. For each single report, the audio-
visual stimulus presented consisted of the 1.64 s video of the lips uttering either /aba/ or /aga/,
onto which was dubbed the sound /aba/, which was therefore respectively the original soundtrack
or an incongruent soundtrack. Of the 20 presentations in a block, 10 used the lip motion /aba/
and the other 10 the lip motion /aga/. The video type used was chosen randomly between type
1 (black lips/green face) and type 2 (white lips/red face) at each stimulus presentation. Color and
lip motion were randomized in groups of 4 trials representing the 4 possible combinations
of color and lip motion.
After each stimulus presentation the subject performed a forced choice task,
reporting whether the sound heard was closer to /aba/ or to /ada/. Subjects were instructed
to respond quickly and automatically. Beyond the fact that the lip movement was
located right at the fixation cross, subjects were also explicitly instructed
to pay attention to the lip movement as well as to the sound in order to force audio-visual
association.
Subjects started with 16 training trials that replicated the experiment’s conditions.
Figure 18 Experiment 2 task timeline
2.3.3 Results and discussion
The sound heard by the subjects during audio-visual presentations was systematically
/aba/ (A/aba/). In the visual modality, however, the video shown could correspond to a
mouth uttering either /aba/ (V/aba/) or /aga/ (V/aga/). For each subject the proportion of /aba/
percepts selected was analyzed in both visual conditions. In the V/aba/ condition, the proportion
of /aba/ percepts selected corresponded to the proportion of correct responses, since the audio-
visual stimulation was congruent. For all subjects, the proportion of /aba/ responses in this
condition was above 90%, with 11 subjects at 100%. This proves that all subjects were able to
correctly report what they perceived.
In the V/aga/ condition, however, the proportion of /aba/ responses indicated the
proportion of incongruent audio-visual stimulations for which the subject did not experience
the McGurk illusion. Indeed, the McGurk effect consists here in perceiving the sound
/ada/. Results for this condition show that subjects could be
classified into two categories according to their sensitivity to the McGurk effect: 8 subjects
perceived the McGurk fusion (AV/ada/) in more than 90% of the trials, while 6 subjects perceived
the McGurk fusion in less than 10% of the trials. The first group (McGurk group) is composed
of subjects who are highly susceptible to the McGurk effect. Conversely, the second (No McGurk
group) is composed of subjects who are barely sensitive to the McGurk effect.
Figure 19 displays the average values for each group. While both groups
show similar results in the /aba/ seen condition (congruent audio-visual condition), they strongly
differ in the /aga/ seen condition. It is important to note that in this second condition, the lower
the proportion of selected /aba/ percepts, the higher the proportion of reported McGurk effects
(/ada/ percept selected).
Figure 19
Interestingly, the separation between the McGurk and the No McGurk subjects was
categorical (no continuum). I would like to emphasize that the McGurk effect is very robust
since, for the subjects who are susceptible to it, it leads to an almost systematic illusory audio-
visual fusion. However, our experiment shows that some subjects are simply not sensitive to the
McGurk effect, a fact often omitted in the classic multisensory literature. It was decided that these
subjects would still participate in the experiments (although not in experiment 4).
2.4 Experiment 3: Binocular rivalry for videos (test and baseline)
2.4.1 Objective
The first purpose of this experiment was to test the novel video stimulations as
rivaling percepts in the context of binocular rivalry. As I already mentioned, binocular rivalry has
so far been studied essentially with static images, with very few exceptions. Our goal was
therefore to create a new sort of rivaling stimulus using videos. In order to validate these new
stimulations, it was necessary to prove that they triggered classic binocular rivalry alternations
characterized by randomness, exclusivity, and inevitability (see 1.1.4). Moreover, the possibility
that lip motion could trigger switches was investigated.
This experiment also had a second objective, as it was designed to serve as a baseline
for experiment 5, which tested the effect of sound on binocular rivalry dynamics and on volitional
control (see 2.6).
2.4.2 Method
The experiment consisted of 9 continuous report trials of 130s each. The rivaling visual
stimuli were the videos described above, presented in a loop (see 2.1.1c). The black lips
stimulus was always presented to the left eye and the white lips stimulus to the right eye (balancing
color was not necessary). The rivaling stimuli also always differed in lip movement. No sound
was played during this experiment.
The subjects performed two different types of task depending on the trial: 6 trials
consisted of basic continuous report (of lip color) and 3 of continuous report with a volition test. In
the volition test condition, subjects were asked to try to hold the black lips, which always
corresponded to the video of lips uttering /aba/.
The 9 trials were divided into 3 blocks of 3 trials. In each block, the first 2 trials
were basic continuous report trials and the last one a volition test trial. Within each
block, for the basic continuous report trials, lip movement and color were counterbalanced (one trial
used black lips uttering /aba/, the other black lips uttering /aga/). The order of these
two trials was counterbalanced between subjects.
The subjects started with 3 training trials reproducing a typical block of 3 trials as described
in the previous paragraph.
Figure 20 Experiment 3 task timeline
2.4.3 Results and discussion
Designing rivaling video stimuli was the main technical challenge of this research project.
Experiment 3 was conducted to validate these stimuli by checking whether the novel videos
elicited classic binocular rivalry. Accordingly, in this experiment we examined whether
the three characteristics of binocular rivalry – exclusivity, randomness and inevitability – also hold
for the rivaling video stimulations we created.
On average, subjects reported perceiving a mix of both videos for only 3.96% (SE=0.78) of
the time, which is less than observed with static images (6.8%). Exclusivity between the two
rivaling videos is therefore achieved. An analysis of the distribution of dominance durations
similar to that of experiment 1 (see 2.2.3) was conducted for the basic continuous report
conditions. The gamma shape of the distribution presented in Figure 21 is characteristic of the
random dynamics of binocular rivalry. Finally, the fact that none of the subjects was able to prevent
the switches from occurring in the volition test condition proved the inevitable nature of the
perceptual alternations.
Figure 21
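The gamma-shaped distribution of dominance durations can be checked numerically. Below is a minimal illustrative sketch (not the analysis code actually used in this work) of how such a fit could be done with SciPy; the function name and the synthetic sample are assumptions.

```python
import numpy as np
from scipy import stats

def fit_dominance_gamma(durations):
    """Fit a gamma distribution to dominance durations (seconds) and
    return (shape, scale, KS p-value), with location fixed at 0
    since durations are strictly positive."""
    durations = np.asarray(durations)
    shape, loc, scale = stats.gamma.fit(durations, floc=0)
    # Kolmogorov-Smirnov goodness of fit against the fitted gamma.
    _, p_value = stats.kstest(durations, "gamma", args=(shape, loc, scale))
    return shape, scale, p_value

# Synthetic durations for illustration only (drawn from a gamma
# distribution; these are not the thesis data).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=3.0, scale=0.8, size=500)
shape, scale, p = fit_dominance_gamma(sample)
```

A large KS p-value indicates that the gamma distribution is a plausible description of the observed dominance durations.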
Motion is known to trigger perceptual switches. We therefore investigated whether some
temporal portions of the videos generate significantly more switches, which would partially disrupt
the randomness of dominance durations. We analyzed the distribution of switches with the
1.64s video as the temporal baseline. For each subject, switch times were mapped to their
position in the video. They were then pooled into time bins starting at 0 ms and increasing by
steps of 50 ms. Once averaged over all subjects, the proportion of switches that occurred in each
time bin was compared to the probability level expected for a uniform distribution (t-test against a
single value). Only a few values were significantly different from this baseline, and these outliers were
not grouped into consecutive time bins (the outliers are indicated by a star in Figure 22).
Figure 22
The novel rivaling videos present all the classic characteristics of binocular rivalry and
can therefore be said to induce proper bistable perceptual alternations. We proved that it is
possible to use videos as competing stimuli for binocular rivalry. This paves the way for an
even more flexible use of this bistable phenomenon.
Experiment 3 also investigated the effect of volitional control on binocular rivalry
dynamics. Figure 23 plots the average dominance duration corresponding to each
percept (black lips and white lips) in both the basic continuous report (no volition) and the
volition test conditions, with separate results for the McGurk subjects and the
No McGurk subjects.
Figure 23
The interaction between lip color and volition condition was significant only for
the McGurk subjects (F(1,7)=10.50, p<0.015). The effect of volition on dominance duration
observed for the McGurk subjects stemmed from longer durations for the hold stimulus (black
lips) than for the other stimulus (Tukey HSD post-hoc: p=0.015). The fact that the No McGurk
subjects did not show any significant effect of volition could be explained by the fact that,
unlike the other group, they are not able to use the lip motion as an additional cue identifying
the percept to hold.
2.5 Experiment 4: Audio-visual integration with the suppressed stimulus
2.5.1 Objective
This experiment tests whether audio-visual integration is possible with a suppressed visual
stimulus during binocular rivalry. Cross-modal effects involving the suppressed stimulation
would indeed indicate that this unconscious stimulus reached the cortical areas in charge of
multisensory processing, thereby supporting the hybrid view against the low-level interpretation
of binocular rivalry.
The McGurk effect coupled with binocular rivalry was used to assess such potential
multisensory integration with the suppressed stimulus. Each eye was presented with a different
lip motion (V/aba/ versus V/aga/) while the sound was systematically A/aba/, so that the two
rivaling videos would lead to different audio-visual percepts (V/aba/ + A/aba/ = AV/aba/ while
V/aga/ + A/aba/ = AV/ada/). Due to binocular rivalry, each video could be either consciously seen
or suppressed. The questions are therefore the following: Do the subjects always integrate the
sound with the consciously seen video, or can integration with the suppressed one also occur?
And if integration with the suppressed stimulus takes place, is it more likely to occur when the
visual stimulation and sound are physically congruent (V/aba/ + A/aba/) or not (V/aga/ + A/aba/)?
2.5.2 Method
Only McGurk subjects participated in this experiment. The experiment consisted of 4
blocks of 20 to 40 single reports. For each single report, the audio-visual stimulus
consisted of the two 1.64s rivaling videos of lips uttering /aba/ and /aga/ (black lips versus
white lips), onto which was dubbed the original /aba/ soundtrack of the /aba/ video. We chose
to systematically present the strongest percept (black lips, as demonstrated by
experiment 1) to the non-dominant eye in order to ensure that the effects of eye dominance and
stimulus strength would not add up and strongly unbalance the dominance durations. Although
each eye systematically received the same type of stimulus in terms of lip color, the lip motion
type was balanced between eyes.
At the beginning of each single report, subjects were instructed to wait for
either the white lips or the black lips. Static rivaling images similar to those of experiment 1 were
then presented, and subjects waited until they perceived the image whose lip color corresponded to
the instruction. Once stabilized on the target percept, subjects pressed a key (space bar)
that triggered the videos in both eyes. At the end of the videos, subjects reported the sound they
heard in a forced choice task between a sound closer to /aba/ or closer to /ada/. Subjects were
instructed to respond quickly and automatically. However, they were
explicitly asked not to answer if a perceptual switch had occurred while the videos were
playing, in order to ensure that they had perceived only the video whose type corresponded to the
instruction. If a switch occurred they were asked to press the space bar to repeat the
trial. Trials could only be repeated once, to limit the total number of trials to a maximum
of 40. Special attention was paid to instructing the subjects on the importance of redoing a trial if a
switch (even a partial one) occurred during the video presentation. Subjects started with 16
training trials that replicated the experiment's conditions.
In order to assess whether audio-visual fusion occurred with the stimulus presented in the
suppressed eye, the results of this experiment were compared to those of experiment 2, which
provided the baseline for the multisensory integration performance without binocular rivalry.
Figure 24 Experiment 4 task timeline
2.5.3 Results and discussion
Finding an audio-visual fusion with a stimulus presented in the suppressed eye would be a
clear indication that the so-called suppressed stimulus was still available for cross-modal
integration and therefore could not have been suppressed at early levels of visual processing.
In the auditory modality, subjects were systematically presented with the sound /aba/,
while in the visual modality, although only one percept was consciously seen, subjects were
presented with rivaling videos each uttering a different sound, /aba/ or /aga/. Using the initial
instruction on lip color, the experimenter could manipulate which lip motion type would
constitute the conscious visual stimulus. Two states were then possible: either the
conscious stimulus was V/aba/ while V/aga/ was suppressed (/aba/ seen), or the conscious
stimulus was V/aga/ while V/aba/ was suppressed (/aga/ seen). Importantly, if audio-visual
integration were to take place only with the conscious stimulation, then in the /aba/ seen
condition subjects should report hearing /aba/, while in the /aga/ seen condition subjects
should report the McGurk illusion /ada/. Therefore, in Figure 25 the proportion of /aba/
perceived is plotted for the /aba/ seen condition, and the proportion of /ada/ perceived is
plotted for the /aga/ seen condition, so that for each condition the proportion of cross-modal
integration with the conscious stimulus is represented (black bars). In this situation, if the
proportion of audio-visual integration with the conscious percept differs from 100%, then the
difference could stem either from noise or from integration with the suppressed stimulus
(gray bars).
Experiment 2, whose results are also shown in Figure 25, served as a baseline for the
strength of audio-visual fusion in a non-rivaling situation. Since only McGurk subjects performed
experiment 4, only their average results are presented for experiment 2. These results differ from
100% and this difference indicates the natural amount of noise in audio-visual integration for
these subjects.
Results from experiment 4 have been segregated according to whether the conscious
stimulus was seen with the dominant eye or with the non-dominant eye.
Figure 25
No effect of cross-modal integration was found when lips are seen in the dominant eye.
In this case, the gray bars are not significantly different from the noise level (baseline). This
situation corresponds to a suppressed stimulus presented in the non-dominant eye.
On the other hand, there is a trend toward 19.6% of audio-visual integration with the
suppressed stimulus when lips are seen with the non-dominant eye (F(1,7)=3.38, p=0.11). Cross-
modal integration with the suppressed stimulus is not equivalent between the /aba/ seen and the
/aga/ seen conditions. In the /aga/ seen condition, the suppressed stimulus is congruent with the
sound (AV congruent case, right-tilted gray bars ///), whereas in the /aba/ seen condition, the
suppressed stimulus is incongruent with the sound (McGurk case, left-tilted gray bars \\\).
Considering the corresponding data separately, audio-visual integration with the suppressed
stimulus occurs in 15.6% of the trials for the AV congruent case (F(1,7)=3.56, p=0.10) and in
23.7% of the trials for the McGurk case (F(1,7)=3.06, p=0.12). Importantly, these two values do
not differ statistically.
These results suggest that the suppressed percept is not deactivated at an early level of
visual processing but remains available for further processing, which in turn can rise to
awareness. Indeed, when the hidden stimulus is presented to the dominant eye, a non-negligible
proportion of integration with the auditory input seems to occur. The low-level framework for
binocular rivalry cannot account for such findings. As presented in 1.3.2b, audio-visual integration,
and especially audio-visual speech binding, is thought to take place in higher association cortices
and not in the early sensory cortices. Therefore, the trend outlined by experiment 4 suggests that
some information specific to the suppressed stimulus is maintained up to these multisensory
cortical areas. Moreover, the visual suppression of the stimulus does not result in its total
exclusion from consciousness: the visually suppressed stimulus does influence
the phenomenal awareness resulting from the auditory analysis.
The absence of an effect for a suppressed stimulus presented to the non-dominant eye
could be due to insufficient statistical power. For an individual, the information perceived by
the non-dominant eye could be treated as less relevant than that conveyed by the dominant
eye. Therefore, information stemming from the non-dominant eye could be
regarded as less reliable by a multisensory system performing audio-visual integration. This would be
the case if this system were governed by Bayesian principles involving intrinsic priors on the reliability
of the signal. Consequently, the effect of a suppressed stimulus presented to the non-dominant
eye could be too weak to be distinguished from noise.
The reason why we found a cross-modal effect involving the suppressed stimulus while
preceding experiments failed to detect such an effect (see 1.3.1a) could lie in some
intrinsic differences between the face/vase illusion and binocular rivalry. However, it is likely that
the use of a natural stimulation (video of lip movements) results in a stronger and more
reliable audio-visual integration, thus offering us better discrimination power.
2.6 Experiment 5: Impact of sound on binocular rivalry
2.6.1 Objective
The last question we addressed concerned the potential influence of sound on the dynamics of
binocular rivalry. The analysis of experiment 3's results characterized the dynamics
of binocular rivalry, both in the passive condition (without volition) and under volitional control,
when no sound accompanies the visual presentation. Experiment 5 replicated almost exactly the
conditions of experiment 3, with the only difference that the videos were now always played in
synchrony with the /aba/ sound. After verifying that the sound did not introduce undue
modification of the alternation process, we analyzed its effect on dominance durations in order to
answer the following questions: Does the sound modify the dynamics of binocular rivalry in
passive viewing? What is the effect of sound on volition? Does it reinforce volitional
control?
Interestingly, since the lip motions in the two rivaling videos differ (/aba/ versus /aga/),
the /aba/ sound can be either congruent with the video, for V/aba/, or incongruent, for V/aga/.
Experiment 5 investigated whether this difference in audio-visual congruency could modify the
effect of sound on binocular rivalry dynamics. Finally, both McGurk and No McGurk subjects
performed the task so as to compare the impact of the sound presentation on their perception.
2.6.2 Method
The method replicates that of experiment 3 with a few modifications. Videos were looped,
but this time they were always presented in synchrony with the /aba/ sound. Four additional
continuous report trials were added during which the subject had to hold the white lips.
There were therefore 12 continuous report trials in this experiment, divided into 3 blocks of 4
trials. In each block, the first 2 trials were basic passive continuous report trials while
the two following ones were volition test trials (in one of them the subject had to
hold the white lips and in the other the black lips; the order of these trials was balanced across
subjects). As in experiment 3, subjects started with 3 training trials: two passive ones and a volition
test (hold black lips).
Figure 26 Experiment 5 task timeline
2.6.3 Results and discussion
The first step in assessing the effect of sound on binocular rivalry dynamics consisted in
verifying that the presence of sound did not unduly disturb the dynamics by triggering switches.
We therefore analyzed the switch distribution as in experiment 3 and found very few outliers
(indicated by a star in Figure 27). Since those outliers are few and are not grouped into
consecutive time bins, it can be said that the sound did not artificially trigger switches.
Figure 27
We analyzed the effect of sound on the dynamics of binocular rivalry in the passive
viewing condition (no volitional control). To do so, we compared the average dominance
durations for each rivaling percept, V/aba/ (/aba/ seen) and V/aga/ (/aga/ seen), between the no
sound condition (baseline from experiment 3) and the sound condition (experiment 5).
Separating the visual percepts by lip motion and not by lip color was necessary since we
wanted to detect a potential effect of the congruence between lip motion and sound. Separate
analyses were carried out for McGurk and No McGurk subjects.
Figure 28
Figure 28 presents the results of this analysis. The no sound condition replicates the
results from experiment 3, which served as a baseline. The interaction between group (McGurk &
No McGurk) and sound condition on the average dominance durations was significant
(F(1,12)=8.18, p<0.015). The McGurk subjects show a dominance duration increase of 1.14s
(F(1,7)=8.43, p<0.025) while the No McGurk subjects show a marginal dominance duration decrease
of 0.88s (Tukey HSD post-hoc: p=0.22). Interestingly, there was no differential effect of
lip movement in either group.
Contrary to the findings of van Ee et al. (van Ee et al., 2009) presented in 1.3.1b, our
results show that passive viewing can in fact be influenced by the presentation of a synchronized
sound. The addition of sound results in an increase of dominance durations for the McGurk
subjects. The use of a robust and natural high-level audio-visual stimulation could be the reason
why this effect was detected in our experiment while it remained unnoticed in van Ee et al.'s
work.
Interestingly, the No McGurk subjects remain unaffected by the presentation of a sound.
The reason might be that for audio-visual congruency to influence the dynamics of binocular
rivalry, it has to be processed as such. It is possible that subjects who are not susceptible to the
McGurk effect do not properly merge lip movements with sound in an audio-visual speech
binding process. Therefore, for these subjects the audio-visual congruency is not detected as such,
or is reduced to a simpler form such as sheer synchrony of onset time and duration between
sound and movement. From this difference in the effect of sound on binocular rivalry dynamics
between McGurk and No McGurk subjects, we can conclude that for sound to impact the
dynamics it has to be congruent at a higher level (speech binding association) and not only at a
low level (synchrony).
Experiment 3 proved that, when asked to willfully control the conscious
oscillations of binocular rivalry, the McGurk subjects were able to do so through a relative
increase of the average duration of the hold percept compared to the duration of the other one.
In experiment 5 we investigated whether the addition of a congruent sound modifies volitional
control over binocular rivalry.
Figure 29
For each group (McGurk and No McGurk), Figure 29 presents the relative increase of
average dominance duration in both the No sound (baseline from experiment 3) and /aba/
sound (experiment 5) conditions. The relative increase in dominance duration corresponds to the
difference between the average dominance duration in the volition test condition and in the passive
viewing condition. It is important to note that only the condition where the stimulus V/aba/ is held
is represented here.
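Concretely, the relative increase plotted in Figure 29 is a simple difference of means. The sketch below illustrates the computation with made-up numbers (not the thesis data).

```python
from statistics import mean

def relative_increase(volition_durations, passive_durations):
    """Relative increase in dominance duration for the held percept:
    mean dominance duration under volitional control minus mean
    dominance duration in passive viewing (values in seconds)."""
    return mean(volition_durations) - mean(passive_durations)

# Illustrative per-trial average durations in seconds (hypothetical):
increase = relative_increase([5.1, 4.3, 5.9, 4.7], [3.0, 2.6, 3.4, 3.0])  # 2.0 s
```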
For the McGurk subjects, the No sound condition represents the effect of volition already
drawn from experiment 3, although here it corresponds to holding the V/aba/ percept, whereas in
Figure 23 the effect was given for holding the black lips percept. In the present figure, the values
for black lips with a motion corresponding to V/aga/ were left aside.
Without sound, volitional control resulted in an increase of average dominance durations of 1.00s
(F(1,7)=20.53, p<0.003). In the /aba/ sound condition, a significant increase of dominance
durations of 2.70s was observed (F(1,7)=8.24, p<0.025). The dominance duration increases for
sound and no sound were almost significantly different (Student test: t(7)=0.11). No effect was
detected for the No McGurk subjects.
Presented with an additional congruent sound (A/aba/) and asked to hold the percept
V/aba/, the McGurk subjects are significantly better than in the no sound condition. The effect of
volition is similar to the no sound condition, since it results in a relative increase of the average
dominance duration of the hold percept compared to the other one. However, the presence of a
consistent sound improved the subjects' volitional capacities. For the No McGurk subjects,
no volitional effect could be detected in the no sound condition and the addition of sound did
not change this finding.
This result is similar to the one described in van Ee et al. (see 1.3.1b): the presentation of
sensory information congruent with one of the rivaling stimuli increases volitional control
capacities. The absence of an effect for the No McGurk subjects can be explained by arguments
similar to those presented earlier (2.6.3). No McGurk subjects might not exploit the lip motion
difference between rivaling videos. We can hypothesize that, due to poor lip reading capacities,
they are not able to differentiate the two lip motions. Therefore, in the no sound condition No
McGurk subjects would lack the motion cue for differentiating the two rivaling videos, which would
then differ only in color. McGurk subjects, on the other hand, are certainly able to
differentiate the lip movements. We can assume that the more cues are available to hold one video
against the other, the better volitional control will be. Accordingly, the difference in volitional
capacities in the no sound condition between McGurk and No McGurk subjects could be explained
by their difference in lip reading capacities.
Experiment 5 finally tackles the question of whether the effect of sound on volitional
control differs between the condition where the sound is congruent with the hold percept
(A/aba/ + V/aba/ hold) and the condition where it is not (A/aba/ + V/aga/ hold). Figure 30
presents the average dominance durations for both the McGurk group (left) and the No
McGurk group (right) across all conditions. The first two solid gray bars represent
the baseline (passive viewing condition in experiment 5). The following bars show the
results for the hold /aba/ and hold /aga/ conditions. The light gray striped bars always represent
the dominance duration of the hold percept while the dark gray striped bars always represent the
dominance duration of the not-hold percept. Comparisons are made as follows: the dominance
duration values in each of the hold conditions are compared to the baseline values. We were
interested in potential variations of relative dominance duration (the difference between the increase
of dominance duration for the hold percept and the decrease of dominance duration for the not-
hold percept).
Figure 30
The analyses showed that for the McGurk subjects the dominance durations in the hold
/aba/ condition significantly differed from the baseline values. There was an
increase in relative dominance duration, and this increase resulted from a decrease in V/aga/
dominance duration (Tukey HSD post-hoc: p=0.05). In the hold /aga/ condition, the values
did not differ statistically from the baseline values. For the No McGurk subjects, no effect of
volition was detected in the hold /aba/ condition whereas, surprisingly, there seems to be an
effect in the hold /aga/ condition with an increase in relative dominance duration, although post-
hoc tests were not significant.
Although McGurk subjects are not aware of the audio-visual conflict when presented
with V/aga/ and A/aba/ (V/aga/ + A/aba/ = AV/ada/), our results suggest that there exists a difference
for these subjects between holding the congruent AV/aba/ percept and the physically incongruent
AV/ada/ percept. Surprisingly, our results indicate that volitional control is only improved for
real audio-visual congruency and not for a perceived one.
3 General discussion
3.1 Probing the depth of suppression
Experiment 4 provides empirical evidence against the low-level interpretation of
binocular rivalry. The information that enters the visual system through the suppressed eye is not
blocked at early stages of the visual pathway but rather is maintained at least up to the
multisensory areas. This hypothesis is necessary to explain how, in about 20% of the cases,
subjects reported an auditory percept corresponding to an audio-visual integration with the
suppressed visual stimulus. However, it remains to explain why the audio-visual integration does
not occur systematically with either the suppressed or the dominant percept. Rather, the outcome
of audio-visual integration seems to be drawn from a certain probability distribution. Indeed, in
80% of the cases we found that multisensory integration occurred with the dominant visual
stimulus, while in the remaining 20% it occurred with the suppressed one. In order to explain this
fact I will need to make two assumptions.
First, I will consider that rivalry resolution takes place continuously along the visual
pathway and corresponds to a continuous increase of the probability of the dominant percept
and, consequently, to a decrease in the probability of the suppressed stimulus. This framework follows
the conclusions drawn from the studies presented in (Leopold & Logothetis, 1999) and
reformulates the hybrid model in a Bayesian approach. The visual signal has to be considered
as a probability distribution over various possible visual states. Along the visual pathway, various
features of the visual input are analyzed, such as shape, color and motion, each represented by the
visual system as a probability distribution over multiple states. Combining the possible feature
states yields a huge number of possible visual states, each associated with a certain
probability. In our case, there are two predominant visual states, corresponding to the
information reaching each eye. Their motion components correspond to lips uttering /aba/ and
lips uttering /aga/. However, although associated with marginal probabilities, other perceptual
states such as lips uttering /ada/, /ata/, etc. also compose the visual signal. For simplicity of
presentation, the two predominant states can be considered as initially equally probable,
although stimulus strength and eye dominance certainly influence these initial probabilities. Along
the visual pathway, and through multistage competition, one of the two states becomes
increasingly “stronger”, meaning that its probability increases while the probability of the other
state decreases. If we associate the coding of probability with a certain population code this could
61
indeed explain why as we move along the visual pathway, more and more neurons are found
coding for the consciously perceived stimulus.
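The progressive sharpening of the visual distribution along the pathway can be sketched numerically. The snippet below is a minimal illustration, not the model itself: competition at each stage is approximated by raising the probabilities to a power gamma greater than 1 and renormalizing, so the initially dominant state gains probability at every level. The stage count, the gamma value, and the initial imbalance are all arbitrary assumptions.

```python
def sharpen(dist, gamma=1.5):
    """One competition stage: exponentiate and renormalize, so the
    more probable state gains probability (illustrative, not the
    actual neural mechanism)."""
    powered = {state: p ** gamma for state, p in dist.items()}
    total = sum(powered.values())
    return {state: p / total for state, p in powered.items()}

# Two predominant visual states; a slight initial imbalance
# (stimulus strength, eye dominance) seeds the competition.
dist = {"V/aba/": 0.55, "V/aga/": 0.45}
for stage in range(4):  # successive levels of the visual pathway
    dist = sharpen(dist)
    print(stage, round(dist["V/aba/"], 3))
```

After a few stages the dominant state carries most of the probability mass, consistent with more and more neurons coding for the conscious percept at higher levels.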
The second necessary hypothesis is that multisensory integration (MSI) occurs in a
probabilistic fashion. The MSI system takes as inputs a visual signal and an auditory signal,
both of which can be described, as above, as probability distributions over different states. For
the visual input, we saw that there are two predominant states, each corresponding to the
information stemming from one eye. In the auditory modality, although the auditory stimulation
is consistently /aba/, the auditory signal can be seen as a probability distribution over states
such as /aba/, /aga/, /ada/, /abga/, etc., the first associated with a high probability and the
others with much lower probabilities (there is always a possibility that the subject will hear
/aga/ even though a clear /aba/ is presented). MSI can be considered as a Bayesian system
computing the probability of all possible auditory outcomes after interaction with the visual
probability distribution (potentially also using prior knowledge about what those outcomes
should be). A conscious auditory outcome is then randomly drawn from this auditory
distribution. Importantly, this outcome is therefore not systematically the most probable one,
which in our case would always correspond to the audio-visual association of the auditory
stimulus with the dominant visual stimulus.
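This probabilistic MSI can be sketched as follows. The fusion table is purely illustrative (the numbers are assumptions, not measurements): it encodes that V/aba/ paired with the /aba/ sound tends to yield /aba/, while V/aga/ paired with /aba/ tends to yield the McGurk percept /ada/. The posterior over auditory outcomes is a mixture of these per-state distributions weighted by the visual probabilities, and the conscious outcome is a random draw, not the argmax.

```python
import random

# Illustrative probabilities of each auditory outcome given the visual
# state, for a physical /aba/ sound (assumed values, not data).
FUSION = {
    "V/aba/": {"/aba/": 0.95, "/ada/": 0.03, "/aga/": 0.02},
    "V/aga/": {"/aba/": 0.10, "/ada/": 0.85, "/aga/": 0.05},  # McGurk
}

def fuse(visual_dist):
    """Posterior over auditory outcomes: per-state fusion
    distributions weighted by the visual probabilities."""
    posterior = {}
    for v_state, p_v in visual_dist.items():
        for outcome, p_o in FUSION[v_state].items():
            posterior[outcome] = posterior.get(outcome, 0.0) + p_v * p_o
    return posterior

def draw_outcome(visual_dist, rng=random):
    """The conscious outcome is a random draw, not the most probable state."""
    posterior = fuse(visual_dist)
    outcomes, probs = zip(*posterior.items())
    return rng.choices(outcomes, weights=probs)[0]

# By the time the signal reaches MSI areas the dominant percept has a
# high, but not total, probability.
visual = {"V/aba/": 0.8, "V/aga/": 0.2}
print(fuse(visual))
```

With these assumed numbers, the outcome associated with the suppressed stimulus (/ada/) retains a probability of about 0.19, which is the order of magnitude observed in experiment 4.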
Based on the hypotheses presented above, we can now explain the findings of experiment 4.
The MSI system responsible for the auditory perceptual outcome is fed by a visual signal
composed of two predominant states, each associated with a certain probability. As mentioned
in 1.3.2b, audio-visual integration for speech is thought to occur in association cortex; the
visual signal feeding the MSI therefore does not stem from early visual areas. Multisensory
integration then constructs the probabilities of the possible auditory states through Bayesian
multiplication of the visual and auditory distributions, before a conscious auditory outcome is
randomly drawn. Our experiment shows that by the time the visual signal reaches the MSI areas,
the probability associated with the state corresponding to the suppressed stimulus is not
negligible: the probability of the auditory outcome corresponding to integration with the
suppressed stimulus averages about 20%.
Figure 31 illustrates the functional model described above. The left-hand side of the figure
presents a temporal snapshot of the system state in which one stimulus (V/aba/) is dominant and
the other suppressed; the right-hand side presents the converse situation. The progression from
the bottom to the top of the visual pathway represents the evolution of the perceptual
probability distributions at each processing level, from the eyes' inputs to higher brain areas.
The thickness of each bar symbolizes the probability of the corresponding state, with green
representing the currently dominant percept. Consistent with (Leopold & Logothetis, 1999),
activity in the early visual system is not specific to the dominant percept but reflects both
the dominant and the suppressed percepts, whereas activity in higher brain areas is correlated
with the conscious percept. A similar convention is applied to the auditory modality, although
in this case the conscious stimulus could be either the one represented in red or the one in
green, since it depends on a random draw.
Figure 31 Probabilistic MSI with the suppressed stimulus
3.2 Real multisensory congruency enhances volitional control
Adding sensory information from another modality improves volitional control over the
dynamics of binocular rivalry (van Ee et al., 2009). However, the mechanism through which this
enhancement occurs remains unclear. A deeper analysis of the results of experiment 5 can
unveil part of this mechanism.
For a McGurk subject, the condition in which the lip motion and the sound are congruent
(V/aba/ + A/aba/) and the condition in which they are incongruent (V/aga/ + A/aba/) are
equivalent in the sense that in neither case is a discrepancy between the visual and auditory
signals perceived. Indeed, in the second case, although there is a physical incongruence, the
subject simply experiences seeing and hearing a mouth pronouncing the sound /ada/. However,
this multisensory condition does not enhance volitional control the way the authentically
congruent condition does.
Volitional control is usually considered to originate in frontal cortex activity, more
specifically in the areas in charge of attention. Based upon the hybrid model hypothesis, it
seems reasonable to assume that a possible mechanism would consist of a modulating signal,
sent by the frontal cortex and projecting to the different stages of the multilevel competition,
which would modify the behavior of the network by increasing the probability of the percept to
be maintained.
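Such a modulating signal can be sketched under the same illustrative assumptions as before: attention is modeled as a multiplicative bias applied to the state the subject tries to hold, followed by renormalization. The gain value is arbitrary.

```python
def attend(dist, held_state, gain=1.3):
    """Illustrative attentional modulation: boost the probability of
    the state the subject tries to hold by a multiplicative gain,
    then renormalize the distribution."""
    biased = {s: p * (gain if s == held_state else 1.0)
              for s, p in dist.items()}
    total = sum(biased.values())
    return {s: p / total for s, p in biased.items()}

# Starting from an unbiased competition, willing V/aba/ tilts the
# distribution in its favor at the modulated stage.
dist = {"V/aba/": 0.5, "V/aga/": 0.5}
print(attend(dist, "V/aba/"))
```

Applied repeatedly at several stages of the competition, such a bias would lengthen the dominance durations of the held percept, which is the behavioral signature of volitional control.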
Figure 32 follows the same conventions as Figure 31. I would like to emphasize that
Figure 32 represents a temporal snapshot of the dynamics, in this case when V/aba/ is
dominant; during experiment 5 the subject was in fact oscillating between the two rivaling
percepts. Given that a congruent sound can enhance volitional capacities, it is necessary to
hypothesize that the signal sent by the attentional control centers can be modulated by input
from a module whose role is to compare the auditory signal with the visual signal. In our case
both V/aba/ and V/aga/ have the same basic temporal properties (both are synchronized to the
sound), and therefore a discrepancy between sound and vision could only be detected at the
level of audio-visual speech processing. The module in charge of audio-visual comparison is
termed "likelihood", as its role is to determine the likelihood that the auditory signal was
emitted by the lip motion presented in the visual signal. We hypothesize that the visual
information feeding this module originates in the MT/MST areas in charge of motion analysis,
since it is the movement of the lips that determines the possible emitted sounds. The same
brain area that is in charge of audio-visual speech binding (pSTS) has been identified as a
potential locus for the processing of audio-visual discrepancy in speech.
Figure 32 Real audio-visual congruency enhances volitional control
Consistent with the remarks made in 3.1, the visual signal is considered as a probability
distribution over two states. The comparator is therefore fed with this distribution and can be
seen as performing the comparison with its most probable state, which corresponds to the
dominant percept. The auditory signal is likewise a probability distribution that feeds the
comparator with its most probable state. However, two levels must be distinguished: the first
is the signal before MSI, the second is the signal after MSI has taken place. If the comparison
between the visual and auditory signals involved the second level, then no discrepancy would be
detected when V/aga/ is dominant, and we would observe an enhancement of volition in this
condition. Indeed, at this second level, the auditory signal has already been modified through
Bayesian MSI, and if V/aga/ is dominant then A/ada/ has already become the most probable state.
Consequently, the auditory signal that feeds the comparator must originate at the first level,
before MSI occurs.
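This argument can be made concrete with a small sketch. The plausibility mapping and the numerical distributions below are illustrative assumptions; the point is only that a comparator fed with the pre-MSI auditory signal detects the discrepancy when V/aga/ is dominant, whereas one fed with the post-MSI signal does not.

```python
# Assumed mapping of which sounds each lip motion can plausibly emit.
PLAUSIBLE = {"V/aba/": {"/aba/"}, "V/aga/": {"/aga/", "/ada/"}}

def discrepancy(visual_dist, auditory_dist):
    """Compare the most probable visual and auditory states; return
    True when the sound is implausible given the lip motion."""
    v = max(visual_dist, key=visual_dist.get)
    a = max(auditory_dist, key=auditory_dist.get)
    return a not in PLAUSIBLE[v]

visual = {"V/aba/": 0.2, "V/aga/": 0.8}        # V/aga/ dominant
audio_pre_msi = {"/aba/": 0.9, "/ada/": 0.1}   # before fusion
audio_post_msi = {"/aba/": 0.2, "/ada/": 0.8}  # after McGurk fusion

print(discrepancy(visual, audio_pre_msi))   # pre-MSI: mismatch detected
print(discrepancy(visual, audio_post_msi))  # post-MSI: looks congruent
```

Since experiment 5 shows no enhancement of volition in the incongruent condition, a discrepancy must in fact be registered, which is only possible if the comparator receives the pre-MSI signal.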
3.3 Perspectives
The present thesis reported a set of experiments in which sound was used to investigate
various aspects of binocular rivalry. Speech material was chosen so that a single sound could
induce two distinct audio-visual outcomes when merged with each of the rivaling percepts.
Figure 33A presents the general design. By using the same audio-visual material but inverting
the roles of vision and audition, we should be able to study the impact of vision on rivaling
sounds. As presented in Figure 33B, bistability is in this case induced by the presentation of
two superimposed sounds, and this auditory bistable percept is then combined with a single
visual stimulus that leads to two different auditory outcomes after multisensory integration.
A final experimental design (cf. Figure 33C) could study a possible synchronicity of switches
in the auditory and visual modalities. Both the visual and auditory stimulations would be
bistable, and each of the four possible audio-visual combinations would lead to a different
multisensory percept. The question of the synchronization of perceptual decisions across
sensory modalities has been tackled for low-level multisensory congruency by Hupé et al.
(2008), who concluded that switches occurred independently for a visual and an auditory
percept. We believe, however, that our use of strong high-level multisensory congruency could
lead to results contradicting this earlier experiment.
Figure 33 Proposed designs. A: Audio on binocular rivalry (BR). B: Vision on auditory rivalry (AR). C: Both modalities bistable.
4 References
Bernstein, Auer, & Moore. (2004). Audiovisual speech binding: Convergence or association? In Handbook of multisensory processes (pp. 203-224). Cambridge: MIT Press.
Blake, R. (1989). A neural theory of binocular rivalry. Psychological Review, 96(1), 145-167.
Blake, R., & Logothetis, N. K. (2002). Visual competition. Nature Reviews. Neuroscience, 3(1), 13-21. doi:10.1038/nrn701
Brown, R. J., & Norcia, A. M. (1997). A method for investigating binocular rivalry in real-time with the steady-state VEP. Vision Research, 37(17), 2401-2408.
Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 363(1493), 1001-1010. doi:10.1098/rstb.2007.2155
Carter, O., Konkle, T., Wang, Q., Hayward, V., & Moore, C. (2008). Tactile rivalry demonstrated with an ambiguous apparent-motion quartet. Current Biology: CB, 18(14), 1050-1054. doi:10.1016/j.cub.2008.06.027
Cave, C. B., Blake, R., & McNamara, T. P. (1998). Binocular Rivalry Disrupts Visual Priming. Psychological Science, 9(4), 299-302. doi:10.1111/1467-9280.00059
van Ee, R., van Dam, L. C. J., & Brouwer, G. J. (2005). Voluntary control and the dynamics of perceptual bi-stability. Vision Research, 45(1), 41-55. doi:10.1016/j.visres.2004.07.030
van Ee, R., van Boxtel, J. J. A., Parker, A. L., & Alais, D. (2009). Multisensory congruency as a mechanism for attentional control over perceptual selection. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 29(37), 11641-11649. doi:10.1523/JNEUROSCI.0873-09.2009
Freeman, A., Nguyen, V., & Alais, D. (2005). The nature and depth of binocular rivalry. In Binocular Rivalry (pp. 47-62). Bradford Books.
Freeman, A. W. (2005). Multistage model for binocular rivalry. Journal of Neurophysiology, 94(6), 4412-4420. doi:10.1152/jn.00557.2005
Haynes, J., Deichmann, R., & Rees, G. (2005). Eye-specific effects of binocular rivalry in the human lateral geniculate nucleus. Nature, 438(7067), 496-499. doi:10.1038/nature04169
He, Carlson, & Chen. (2005). Parallel pathways and temporal dynamics in binocular rivalry. In Binocular Rivalry (pp. 81-100). Bradford Books.
Hupé, J., Joffo, L., & Pressnitzer, D. (2008). Bistability for audiovisual stimuli: Perceptual decision is modality specific. Journal of Vision, 8(7), 1-15.
Jiang, Y., Costello, P., & He, S. (2007). Processing of invisible stimuli: advantage of upright faces and recognizable words in overcoming interocular suppression. Psychological Science: A Journal of the American Psychological Society / APS, 18(4), 349-355. doi:10.1111/j.1467-9280.2007.01902.x
Kayser, C., & Logothetis, N. K. (2007). Do early sensory cortices integrate cross-modal information? Brain Structure & Function, 212(2), 121-132. doi:10.1007/s00429-007-0154-0
Lehky, S. R. (1988). An astable multivibrator model of binocular rivalry. Perception, 17(2), 215-228.
Lehky, S. R., & Maunsell, J. H. (1996). No binocular rivalry in the LGN of alert macaque monkeys. Vision Research, 36(9), 1225-1234.
Lehmkuhle, S. W., & Fox, R. (1975). Effect of binocular rivalry suppression on the motion aftereffect. Vision Research, 15(7), 855-859.
Leopold, & Logothetis. (1999). Multistable phenomena: changing views in perception. Trends in Cognitive Sciences, 3(7), 254-264.
Leopold, D. A., & Logothetis, N. K. (1996). Activity changes in early visual cortex reflect monkeys' percepts during binocular rivalry. Nature, 379(6565), 549-553. doi:10.1038/379549a0
Logothetis, N. K., Leopold, D. A., & Sheinberg, D. L. (1996). What is rivalling during binocular rivalry? Nature, 380(6575), 621-624. doi:10.1038/380621a0
Logothetis, N. K., & Schall, J. D. (1989). Neuronal correlates of subjective visual perception. Science (New York, N.Y.), 245(4919), 761-763.
Long, G. M., & Toppino, T. C. (2004). Enduring interest in perceptual ambiguity: alternating views of reversible figures. Psychological Bulletin, 130(5), 748-768. doi:10.1037/0033-2909.130.5.748
Lumer, E. D., Friston, K. J., & Rees, G. (1998). Neural correlates of perceptual rivalry in the human brain. Science (New York, N.Y.), 280(5371), 1930-1934.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746-748.
Munhall, K. G., ten Hove, M. W., Brammer, M., & Paré, M. (2009). Audiovisual integration of speech in a bistable illusion. Current Biology: CB, 19(9), 735-739. doi:10.1016/j.cub.2009.03.019
Polonsky, A., Blake, R., Braun, J., & Heeger, D. J. (2000). Neuronal activity in human primary visual cortex correlates with perception during binocular rivalry. Nature Neuroscience, 3(11), 1153-1159. doi:10.1038/80676
Pressnitzer, D., & Hupé, J. (2006). Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Current Biology: CB, 16(13), 1351-1357. doi:10.1016/j.cub.2006.05.054
Sheinberg, D. L., & Logothetis, N. K. (1997). The role of temporal cortical areas in perceptual organization. Proceedings of the National Academy of Sciences of the United States of America, 94(7), 3408-3413.
Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex (New York, N.Y.: 1991), 17(10), 2387-2399. doi:10.1093/cercor/bhl147
Sterzer, P., Kleinschmidt, A., & Rees, G. (2009). The neural bases of multistable perception. Trends in Cognitive Sciences, 13(7), 310-318. doi:10.1016/j.tics.2009.04.006
Tong, F., & Engel, S. A. (2001). Interocular rivalry revealed in the human cortical blind-spot representation. Nature, 411(6834), 195-199. doi:10.1038/35075583
Tong, F., Nakayama, K., Vaughan, J. T., & Kanwisher, N. (1998). Binocular rivalry and visual awareness in human extrastriate cortex. Neuron, 21(4), 753-759.
Tong, F. (2005). Investigations of the neural basis of binocular rivalry. In Binocular Rivalry (pp. 63-80). Bradford Books.
Tong, F., Meng, M., & Blake, R. (2006). Neural bases of binocular rivalry. Trends in Cognitive Sciences, 10(11), 502-511. doi:10.1016/j.tics.2006.09.003
Tononi, G., Srinivasan, R., Russell, D. P., & Edelman, G. M. (1998). Investigating neural correlates of conscious perception by frequency-tagged neuromagnetic responses. Proceedings of the National Academy of Sciences of the United States of America, 95(6), 3198-3203.
Tsuchiya, N., & Koch, C. (2005). Continuous flash suppression reduces negative afterimages. Nature Neuroscience, 8(8), 1096-1101. doi:10.1038/nn1500
Warren, R. M., & Gregory, R. L. (1958). An auditory analogue of the visual reversible figure. The American Journal of Psychology, 71(3), 612-613.
Wiesenfelder, H., & Blake, R. (1990). The neural site of binocular rivalry relative to the analysis of motion in the human visual system. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 10(12), 3880-3888.
Williams, M. A. (2004). Amygdala Responses to Fearful and Happy Facial Expressions under Conditions of Binocular Suppression. Journal of Neuroscience, 24(12), 2898-2904. doi:10.1523/JNEUROSCI.4977-03.2004
Wilson, H. R. (2003). Computational evidence for a rivalry hierarchy in vision. Proceedings of the National Academy of Sciences of the United States of America, 100(24), 14499-14503. doi:10.1073/pnas.2333622100
Yang, Zald, & Blake. (2007). Fearful expressions gain preferential access to awareness during continuous flash suppression. Emotion, 7(4), 882-886.
Zimba, L. D., & Blake, R. (1983). Binocular rivalry and semantic processing: out of sight, out of mind. Journal of Experimental Psychology. Human Perception and Performance, 9(5), 807-815.