the auditory and the visual percept evoked by the same audiovisual stimuli
Post on 22-Mar-2016
The auditory and the visual percept evoked by the same audiovisual stimuli
Hartmut Traunmüller, Niklas Öhrström
Dept. of Linguistics, University of Stockholm
Theoretical background
It is fairly obvious that acoustic speech stimuli evoke an auditory percept, while optic speech stimuli evoke a visual percept. In phonetic terms, these percepts agree with each other in congruent AV stimuli. In incongruent AV stimuli, this is not necessarily so.
[Figure 1: block diagram with the boxes "Acoustic signal", "Auditory signal analysis", "An auditory percept", "Optic signal", "Visual signal analysis" and "A visual percept"; the full version adds "Audiovisual integration" and "A common percept".]
According to the Motor Theory and the Direct Realist theory of speech perception, the ‘object’ of speech perception is gestural in nature. These theories know of only one percept of speech, which may be identified with the common AV-percept in Figure 1.
Another theory, the Modulation Theory, considers speech primarily as modulated voice. The ‘object’ of normal speech perception is vocal in nature and consists in the modulation of a voice. The theory allows for a different percept in lip reading. This is gestural and consists in the modulation of a face.
In order to clarify the situation, it is necessary to investigate not only the effects an optic speech signal has on auditory perception, but also those an acoustic speech signal has on visual perception of speech – and to compare these effects with each other.
Earlier studies
In an earlier experiment, we presented congruent and incongruent AV stimuli to subjects. The stimuli consisted of different front vowels presented within a [g_g] frame. The vowels were incongruent with respect to openness (height) or roundedness or both. The subjects had to report which vowel they had heard. The response alternatives consisted of the nine letters that represent the long vowel phonemes of Swedish.
Typical result
A      V          Percept
ɡyɡ    ɡeɡ   →   ɡiɡ
ɡeɡ    ɡyɡ   →   ɡøɡ
ɡiɡ    ɡyɡ   →   ɡyɡ
ɡeɡ    ɡiɡ   →   ɡeɡ
The percept combines the visually presented roundedness with the auditorily presented openness.
Explanation
Acoustic cues to openness (F1 etc.) are salient and reliable.
Optic cues to openness are less reliable because of variation due to individual habits, attitude and emotion.
Optic cues to roundedness are more reliable; rounded lips are easy to distinguish from unrounded in most conditions.
Acoustic cues to roundedness (higher formants) lack salience and are less reliable.
That experiment was designed to investigate perception in terms of phonemic categories. However, subjects informally reported having heard vowels whose quality differed from that of ordinary Swedish vowels. Auditorily rounded vowels appeared to be shifted backwards in the front-back dimension when presented together with optically unrounded vowels.
The present study
The present experiment aims to explore cross-modal perceptual effects on the finer phonetic, sub-categorical perception of vowels. It additionally aims to compare the auditory and the visual perception of the same AV stimuli.
We reused a subset of the stimuli from the previous experiment.
A      V
ɡyɡ    ɡiɡ
ɡyɡ    ɡeɡ
ɡyɡ ---- ɡyɡ

A      V
ɡeɡ    ɡiɡ
ɡeɡ    ɡyɡ
ɡeɡ ---- ɡeɡ

A      V
ɡiɡ    ɡyɡ
ɡiɡ    ɡeɡ
ɡiɡ ---- ɡiɡ
There were 4 speakers: 2 male, 2 female.
There were 8 perceivers. They were selected from a previous experiment in which they had shown sensitivity to the optic signal in incongruent audiovisual stimuli.
The 8 subjects were all phonetically skilled and familiar with the IPA-chart for vowels.
The stimuli were presented to the subjects over headphones and on a computer screen.
The stimuli were presented in quasi-random order.
Responses were given on electronic response sheets.
The subjects were instructed to rate these dimensions of the vowels:
• Lip rounding (6 degrees), 1st: unrounded; 5th: rounded
• Lip spreading (3 degrees)
• Openness (18 degrees), 2nd: close vowels, 6th: close-mid vowels
• Backness (11 degrees auditorily; 7 degrees visually), 2nd: front vowels, 6th (auditorily): central vowels
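The anchor degrees in the list above suggest one way in which such ratings could be placed on a common numeric scale. The following sketch is an assumption made for illustration, not a procedure stated on the slides: it maps the lower anchor degree of each dimension to 0 and the upper anchor degree to 1 and interpolates linearly. The names `ANCHORS` and `normalize_rating` are hypothetical, and lip spreading is left out because no anchor degrees are given for it.

```python
# Anchor degrees copied from the rating instructions:
# (degree mapped to 0.0, degree mapped to 1.0).
# ASSUMPTION: the linear mapping between anchors is illustrative only.
ANCHORS = {
    "rounding": (1, 5),   # 1st: unrounded, 5th: rounded
    "openness": (2, 6),   # 2nd: close vowels, 6th: close-mid vowels
    "backness": (2, 6),   # 2nd: front vowels, 6th (auditorily): central vowels
}


def normalize_rating(dimension: str, degree: int) -> float:
    """Map a scale degree linearly so that the two anchor degrees give 0 and 1."""
    low, high = ANCHORS[dimension]
    return (degree - low) / (high - low)


if __name__ == "__main__":
    for dim, deg in [("rounding", 1), ("rounding", 5), ("openness", 4), ("backness", 1)]:
        print(dim, deg, normalize_rating(dim, deg))
```

Under this assumed mapping, degrees outside the anchor range give values below 0 or above 1; the 1st backness degree, for instance, comes out as -0.25.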
In a first experiment, the subjects were instructed to rate the dimensions of vowels they heard.
In a second experiment, the same subjects were instructed to rate the dimensions of vowels they saw.
The incongruent stimuli were the same in the two experiments.
Results
[Figure: Openness (opn) vs. roundedness (rnd); acoustic stimuli (listening only). Axes: rnd A (x), opn A (y). Symbols represent speakers.]
[Figure: Openness (opn) vs. roundedness (rnd); optic stimuli (lipreading only). Axes: rnd V (x), opn V (y). Symbols represent speakers.]
[Figure: Heard openness (opn H) of incongruent AV-stimuli vs. openness of A-stimuli (opn A); ρ = .80*. Symbols represent acoustically presented vowels.]
[Figure: Heard roundedness (rnd H) of incongruent AV-stimuli vs. roundedness of A-stimuli (rnd A); ρ = -.05. Symbols represent acoustically presented vowels.]
[Figure: Heard spreadness (spr H) of incongruent AV-stimuli vs. spreadness of A-stimuli (spr A); ρ = .07. Symbols represent acoustically presented vowels.]
[Figure: Heard backness (bac H) of incongruent AV-stimuli vs. roundedness of A-stimuli (rnd A); ρ = .71*. Symbols represent acoustically presented vowels.]
[Figure, two panels: Heard openness (opn H) of incongruent AV-stimuli plotted against openness of A-stimuli (opn A; left, ρ = .71*) and of V-stimuli (opn V; right, ρ = .03). Symbols represent acoustically presented vowels.]
[Figure, two panels: Heard roundedness (rnd H) of incongruent AV-stimuli plotted against roundedness of A-stimuli (rnd A; left, ρ = -.05) and of V-stimuli (rnd V; right, ρ = .79*). Symbols represent acoustically presented vowels.]
[Figure, two panels: Heard spreadness (spr H) of incongruent AV-stimuli plotted against spreadness of A-stimuli (spr A; left, ρ = .07) and of V-stimuli (spr V; right, ρ = .90*). Symbols represent acoustically presented vowels.]
[Figure, two panels: Heard backness (bac H) of incongruent AV-stimuli plotted against roundedness of A-stimuli (rnd A; left, ρ = .71*) and of V-stimuli (rnd V; right, ρ = -.59*). Symbols represent acoustically presented vowels.]
The results were subjected to linear regression analyses in which the average ratings obtained in each unimodal presentation were taken as candidate independent variables together with the interaction terms.
A comparison of the regression equations that describe the results of the listening task and the viewing task shows that the two percepts need to be distinguished from each other.
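As a rough illustration of the kind of analysis described here, the sketch below fits a linear model with an intercept, two unimodal predictors and their interaction term by ordinary least squares. It is not the authors' analysis: the slides do not say how candidate variables were selected, so all terms are simply fitted at once, and the data are synthetic, noise-free values generated from the made-up coefficients in `TRUE_COEFFS`. All function and variable names are hypothetical.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def design_row(a, v):
    """Intercept, unimodal A rating, unimodal V rating, and their interaction."""
    return [1.0, a, v, a * v]


# Synthetic illustration data (NOT the ratings from the study).
TRUE_COEFFS = [0.1, 0.6, 0.3, 0.0]
levels = [0.0, 0.5, 1.0, 1.5]
rows = [design_row(a, v) for a, v in product(levels, levels)]
targets = [sum(c * x for c, x in zip(TRUE_COEFFS, row)) for row in rows]

coeffs = fit_least_squares(rows, targets)

if __name__ == "__main__":
    print([round(c, 6) for c in coeffs])
```

Because the synthetic targets contain no noise, the fitted coefficients reproduce `TRUE_COEFFS` up to rounding error; with real ratings one would also inspect the residuals and r².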
The difference is particularly clear in the dimension of openness:
opn_heard = 0.05 + 1.00 opn_A + 0.00 opn_V (r² = 0.97)
opn_seen = 0.05 + 0.59 opn_A + 0.42 opn_V (r² = 0.81)
In the listening task, the estimates were based on the acoustic cues alone. In the viewing task, they were based on a weighted sum of the acoustic and the optic cues.
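To make the contrast between the two openness equations concrete, the following sketch evaluates them for one incongruent stimulus. The coefficients are those given above; the input values (an acoustically presented vowel rated 1.0 and an optically presented vowel rated 0.0 in unimodal presentation) are hypothetical and chosen only for illustration.

```python
def opn_heard(opn_a: float, opn_v: float) -> float:
    """Heard openness: regression equation from the listening task."""
    return 0.05 + 1.00 * opn_a + 0.00 * opn_v


def opn_seen(opn_a: float, opn_v: float) -> float:
    """Seen openness: regression equation from the viewing task."""
    return 0.05 + 0.59 * opn_a + 0.42 * opn_v


# Hypothetical unimodal ratings for one incongruent AV stimulus.
opn_a, opn_v = 1.0, 0.0

if __name__ == "__main__":
    print(round(opn_heard(opn_a, opn_v), 2))  # 1.05
    print(round(opn_seen(opn_a, opn_v), 2))   # 0.64
```

With these inputs the heard openness follows the acoustic value almost exactly, while the seen openness is pulled towards it only in part, which is the asymmetry the two equations express.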
In perception of roundedness and spreadness, there were only some minor differences between the results of the two tasks.
In these dimensions, our subjects relied almost totally on optic cues not only when asked what they saw, but also when asked what they heard.
There was, however, an interesting difference in perceived backness.
bac_heard = 0.06 + 0.25 rnd_A - 0.20 rnd_AV (r² = 0.74)
bac_seen = 0.09 + 0.42 bac_V (r² = 0.22)
Note that bac_heard is given by cues reflecting roundedness rather than backness.
Discussion
There are two hypothetical explanations for an effect of roundedness on perceived backness:
1. The distance from the lips to the dorso-palatal ‘place of articulation’ is increased by lip rounding as well as by tongue retraction. This would provide an articulatory (gestural) explanation.
2. The upper formants (F2’) are lowered by lip rounding as well as by tongue retraction. This would provide an auditory explanation.
Both explanations would be consistent with the placement of the rounded vowels to the right of their unrounded counterparts in IPA charts.
Analysis of perceived backness

Stimulus                     Prediction                              Observation
A (acoustic)   V (optic)     Expl. 1 (gestural)  Expl. 2 (auditory)
rounded        unrounded     fronted             retracted           retracted
unrounded      rounded       retracted           fronted             fronted
Conclusion: The effect is due to auditory (F2’) rather than articulatory (gestural) associations.
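The comparison behind this conclusion can be written out as a small check. The sketch below encodes the predictions and observations exactly as tabulated above; the function names and the boolean encoding of rounding are hypothetical and serve only to illustrate the reasoning.

```python
def predict_gestural(a_rounded: bool, v_rounded: bool) -> str:
    """Explanation 1 as tabulated: backness follows the optically presented rounding."""
    return "retracted" if v_rounded else "fronted"


def predict_auditory(a_rounded: bool, v_rounded: bool) -> str:
    """Explanation 2 as tabulated: backness follows the acoustically presented rounding."""
    return "retracted" if a_rounded else "fronted"


# Observed shifts in heard backness, keyed by (A rounded?, V rounded?).
OBSERVED = {
    (True, False): "retracted",   # A rounded, V unrounded
    (False, True): "fronted",     # A unrounded, V rounded
}


def agrees_with_observation(predict) -> bool:
    """True if the explanation predicts the observed shift for every stimulus type."""
    return all(predict(a, v) == shift for (a, v), shift in OBSERVED.items())


if __name__ == "__main__":
    print("gestural:", agrees_with_observation(predict_gestural))
    print("auditory:", agrees_with_observation(predict_auditory))
```

Only the auditory explanation agrees with both observed shifts, which is the basis of the conclusion stated above.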
The observed effect of lip rounding on perceived backness cannot be explained on the basis of a late-integration hypothesis. Swedish lacks non-front unrounded vowel phonemes and phones, whose existence would be required in order to apply such a hypothesis. This is clear and direct evidence for early, pre-categorical integration. The result also shows that this integration takes place in an auditory space in which roundedness and backness have an essential component in common.
[Figure 1, repeated: block diagram with the boxes "Acoustic signal", "Auditory signal analysis", "An auditory percept", "Optic signal", "Visual signal analysis", "A visual percept", "Audiovisual integration" and "A common percept".]
[Figure: revised block diagram with the boxes "Acoustic signal", "Auditory analysis (demodulation)", "Modulation of voice", "Integration of gestural information", "Vocal percept", "Optic signal", "Visual analysis (demodulation)", "Modulation of face", "Integration of vocal information" and "Gestural percept".]
Summary
Some earlier findings:
1) In clear AV vowel stimuli, Swedes hear roundedness predominantly by eye, but openness only by ear. (The strength of the influence of a modality reflects the reliability of the information.)
2) A predominantly male minority is less sensitive to vision. (There is a significant sex difference.)
3) Presence of visible lip rounding (a ‘marked’ feature) is more influential than its absence.
Ref: H. Traunmüller and N. Öhrström (2007) "Audiovisual perception of openness and lip rounding in front vowels" Journal of Phonetics 35: 244-258.
Recent findings:
4) In addition to the auditory (vocal) percept that may be influenced by vision, there is a visual (gestural) percept that may be influenced by audition. (There are two AV percepts!)
5) The auditory perception of frontness/backness is based on AV integration at the level of phonetically informative properties prior to categorization. (This is likely to hold more generally for AV integration.)
6) In AV vocal perception, only a minority comes close to optimal (Bayesian) integration.
7) In AV gesture perception (by normal-hearing subjects), integration is less optimal.
Ref: H. Traunmüller (2006) "Cross-modal interactions in visual as opposed to auditory perception of vowels" Working Papers 52: 137-140 (Lund University, Dept. of Linguistics).
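Findings 6 and 7 refer to optimal (Bayesian) integration without spelling out a model. A common textbook formulation, assumed here purely for illustration and not taken from the slides, treats the auditory and the visual cue as independent Gaussian estimates and weights each in proportion to its inverse variance, so that the more reliable modality dominates. The function names and the example variances are hypothetical.

```python
def optimal_weights(var_a: float, var_v: float) -> tuple:
    """Inverse-variance weights for two independent Gaussian cues (assumed model)."""
    inv_a, inv_v = 1.0 / var_a, 1.0 / var_v
    total = inv_a + inv_v
    return inv_a / total, inv_v / total


def integrate(x_a: float, var_a: float, x_v: float, var_v: float) -> tuple:
    """Return the reliability-weighted estimate and its variance."""
    w_a, w_v = optimal_weights(var_a, var_v)
    estimate = w_a * x_a + w_v * x_v
    combined_var = 1.0 / (1.0 / var_a + 1.0 / var_v)
    return estimate, combined_var


if __name__ == "__main__":
    # Hypothetical case: a reliable acoustic cue and an unreliable optic cue.
    w_a, w_v = optimal_weights(var_a=0.04, var_v=0.36)
    print(round(w_a, 3), round(w_v, 3))
    print(integrate(x_a=1.0, var_a=0.04, x_v=0.0, var_v=0.36))
```

Under this assumed model, a perceiver whose fitted weights match the inverse-variance weights would count as integrating optimally; weights that depart from them would indicate less optimal integration.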
Conclusions
The results clash irreconcilably with gestural-only theories of speech perception, such as the Motor Theory and the Direct Realist Theory.
Models of auditory-visual integration need to be extended in order to capture the two percepts.
The Modulation Theory, according to which speech is primarily modulated voice, but also modulated face, provides a possible foundation for such an extension.
Acknowledgement
Research supported by the Swedish Research Council
Thank you for your attention!
[Figure, two panels. Left: seen spreadness (spr S) plotted against seen roundedness (rnd S). Right: heard spreadness (spr H) plotted against heard roundedness (rnd H). Symbols represent acoustically presented vowels.]
The Modulation Theory
• Speech is modulated voice and face.
• What is said is conveyed by the modulation.
• Perceptual recovery requires 'demodulation'.
• Users associate modulations with corresponding somatosensations.
Ref: H. Traunmüller "Speech considered as modulated voice".