HCS 7367 Speech Perception - University of Texas …assmann/hcs6367/lec4.pdf
HCS 7367 Speech Perception
Dr. Peter Assmann
Fall 2014
Motor theory of speech perception
Liberman, A. M.; Cooper, F. S.; Shankweiler, D. P.; Studdert‐Kennedy, M. (1967). “Perception of the speech code,” Psych. Rev. 74 (6): 431–461.
Speech involves a specialized phoneme decoder that processes the incoming speech stream by comparing it to the (invariant) neural motor commands to the articulatory muscles.
Speech involves a special type of efficient code; not an alphabet or cipher. Unlike letters in written language, speech involves substantial re‐structuring of the phonemic “message.”
Acoustic cues for successive phonemes are blended, such that sound segments do not correspond to linguistic segments (phonemes).
This blending or encoding process makes speech communication especially efficient for information transmission, but it is impossible to recover phonemes from the speech stream directly.
Speech can be understood at rates of up to 300 or 400 words per minute, approximately 30 phonemes per second.
Studies from auditory psychophysics indicate this is faster than the resolving power of the ear.
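As a rough check on these figures, the conversion from words per minute to phonemes per second can be worked through. The average of 4.5 phonemes per word is an assumption introduced here for illustration, not a value given in the lecture.

```python
def phonemes_per_second(words_per_minute, phonemes_per_word=4.5):
    """Convert a speaking rate to phonemes per second.

    phonemes_per_word is an assumed average, not a measured value.
    """
    return words_per_minute / 60.0 * phonemes_per_word


def ms_per_phoneme(rate_per_second):
    """Average time available per phoneme, in milliseconds."""
    return 1000.0 / rate_per_second


rate = phonemes_per_second(400)
print(round(rate, 1), "phonemes/s;", round(ms_per_phoneme(rate), 1), "ms each")
```

Under that assumption, 400 words per minute works out to about 30 phonemes per second, which leaves roughly 33 ms per phoneme.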
Rising and falling F2 transitions are both perceived as /d/.
Formant transitions in isolation do not sound anything like the syllables they are extracted from.
Transition + vowel without burst
Invariance problem for stop consonants
Syllables with initial /d/ followed by different vowels share a common “locus”: their F2 transitions point to a frequency around 1800 Hz. This corresponds to the resonant frequency of the vocal tract at the point of closure, but no sound is radiated during the closure.
Extrapolating the transition to the “locus” does not preserve the consonant identity.
Locus Equations
Delattre et al. (1954) described the locus as the frequency location of F2 extrapolated back in time to the consonant release.
[Figure: schematic F2 transitions for /i/ and /u/; labels: locus, onset of voiced formant transition, onset of vowel target, F2 of /i/, F2 of /u/.]
Problems with the locus concept
No physical energy exists at the time/frequency position of the locus; actual extrapolation of the formant transition may lead to a change in consonant identity.
[Figure: F2 transition extrapolated back to the locus; labels: locus, onset of voiced formant transition, onset of vowel target.]
Locus equations
Sussman et al. (1991, 1993) proposed that locus equations provide invariant relational cues for the perception of place of articulation in stop consonants.
Locus equations describe the relationship between the frequency of F2 at burst onset and F2 in the vowel.
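A locus equation of this kind is usually expressed as a straight-line fit of F2 at onset against F2 in the vowel, for one consonant across several vowel contexts. The sketch below shows such a fit with NumPy. The function name and the F2 values are hypothetical, made up for illustration; they are not measurements from Sussman's studies.

```python
import numpy as np


def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares line: F2_onset = slope * F2_vowel + intercept."""
    slope, intercept = np.polyfit(f2_vowel_hz, f2_onset_hz, deg=1)
    return slope, intercept


# Hypothetical (made-up) F2 values in Hz for one consonant in six vowel contexts.
f2_vowel = np.array([2300.0, 2000.0, 1700.0, 1300.0, 1000.0, 900.0])
f2_onset = np.array([2000.0, 1880.0, 1760.0, 1600.0, 1480.0, 1440.0])

slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

The fitted slope and intercept are the quantities that would serve as the relational cue. A fuller analysis would fit one line per place of articulation and compare them.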
[Figure: spectrograms of /ba/, /da/, and /ga/; axes: Time (ms), Frequency (kHz).]
[Figure: locus equations, from Sussman (1991).]
Motor theory of speech perception
Conclusion: there is no invariant acoustic cue that uniquely identifies the stop consonants /b,d,g/.
Invariance can only be found in the motor commands that underlie the vocal tract configurations used to produce different consonants.
Other cues?
Burst cue
F3 transition
Spectral shape (static)
Spectral shape (dynamic)
Invariance problem for stop consonants
[Figure: burst + transitionless vowels. Panels A-C: noise bursts and vowels on a 1-4 kHz frequency scale. Cooper, Delattre, Liberman, Borst & Gerstman (1952), JASA 24, 597-606.]
[Figure: majority responses (/p/, /t/, /k/) as a function of burst frequency (kHz, 0.5 to 3.5) and vowel (i e a ɔ o u). Liberman, Delattre, and Cooper (1952).]
Motor theory of speech perception
Other consonants?
Formant transitions are present (hard to see in fricatives)
Continuum of “encodedness” from stops to fricatives to semivowels and liquids to vowels
Categorical Perception
Equal acoustic changes produce unequal auditory percepts
place of articulation of stops: /b/ vs /d/ vs /g/
Liberman, Harris, Hoffman, and Griffith (1957). Journal of Experimental Psychology 54, 358-368.
[Figure: stimuli heard as /b/, /d/, and /g/.]
[Figure: identification (labeling) function and discrimination function; the phoneme boundary is at the 50% point.]
Categorical Perception
1. Identification function shows steep slope (cross‐over) between categories
2. Good between‐category discrimination (near perfect) when the members of the pair belong to different categories
3. Poor within‐category discrimination (near chance) when they are perceived to belong to the same category
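The three properties above can be illustrated with a small simulation. This is a sketch under stated assumptions: the identification function is modeled as a logistic curve with a hypothetical boundary and slope, and discrimination is predicted from labeling alone with the rule P = 0.5 + 0.5 (p1 - p2)^2, a label-based prediction for ABX tasks that is assumed here rather than taken from the lecture.

```python
import math


def identification(step, boundary=7.0, slope=4.0):
    """Probability of labeling a stimulus as category A (logistic curve).

    boundary and slope are hypothetical values chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))


def predicted_discrimination(step_a, step_b):
    """Predicted proportion correct if listeners rely only on category labels."""
    p1 = identification(step_a)
    p2 = identification(step_b)
    return 0.5 + 0.5 * (p1 - p2) ** 2


# Compare pairs two steps apart along a 13-step continuum.
for s in range(1, 12):
    print(s, s + 2, round(predicted_discrimination(s, s + 2), 2))
```

Pairs that straddle the boundary (steps 6 and 8) come out near perfect, while pairs inside one category stay near chance (0.5), even though every pair is the same number of steps apart.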
Categorical perception
“identification and discrimination functions for nonspeech stimuli do not differ from those for speech stimuli, when obtained under comparable conditions.”
Lane, H. (1965). The motor theory of speech perception: A critical review. Psych. Rev. 72(4), 275‐309.
Specialization for speech?
Chinchillas and humans display similar patterns of categorical perception for speech syllables (Kuhl and Miller, 1975)
Not limited to speech sounds; also includes musical intervals, faces, expressions
Right‐ear advantage in dichotic listening
Dichotic presentation of syllables (e.g. /da/ and /ba/) to different ears leads to a right‐ear bias (the right‐ear syllable is reported more often), corresponding to left‐hemisphere dominance (Kimura, 1961; Shankweiler & Studdert‐Kennedy, 1967).
Duplex perception
Duplex perception occurs when the same signal is heard both as a speech sound and as a non‐speech sound.
One ear hears a formant transition (F3).
The other hears a “base” (all the remaining formants).
Listeners hear both a syllable and a (non‐speech) chirp.
Evidence for specialized processing?
Primacy of speech
Motor theory of speech perception
Invariance problem
Segmentation problem
Coarticulation
Categorical perception
Mapping from continuous to discrete
Encodedness of speech
Units of perception (segments, syllables, ...)
Linguistic units, acoustics and articulation
Motor theory of speech perception
Speech perception involves specialized innate mechanisms that are unique to humans and different from other forms of auditory processing (the “speech is special” hypothesis).
These mechanisms involve the recovery of information related to articulation (motor commands) rather than acoustic segments.
Problem: articulation disorders are not always associated with impaired perception.
Articulation studies indicate comparable complexity and lack of invariance in the articulatory domain.
Revised motor theory: rather than recovering invariant motor commands, the brain is assumed to recover the underlying gestures.
[Figure: diagram linking perception, acoustics, and articulation.]
K.N. Stevens
Quantal theory
Certain sounds are favored in the languages of the world because their acoustic properties can be produced with a wide range of articulations.
[Figure: acoustic consequences plotted against articulatory variations.]
K.N. Stevens
Distinctive feature theory
syllables: /bi/, /du/, …
segments: /b/, /d/, …
features: [+voiced], [-rounded], …
K.N. Stevens JASA 2002
Acoustic Landmarks
An early stage in the processing of speech identifies acoustic landmarks in the signal, such as syllable nuclei and acoustic discontinuities corresponding to consonantal closures and releases.
Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.
Lexical access from features
The acoustic signal in the vicinity of these landmarks is processed by a set of modules, each of which identifies a phonetic feature.
From these landmarks and features, lexical (word) hypotheses are evaluated.
This is done using analysis-by-synthesis, in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.
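The hypothesize, synthesize, and match loop can be sketched with a toy example. Everything here is a simplification for illustration: the lexicon, the feature labels, and the function names are hypothetical, single words stand in for word sequences, and the match score is a plain count of agreeing features rather than anything proposed by Stevens.

```python
# Toy lexicon: each word maps to the feature pattern it would be expected to
# produce. Words and feature labels are hypothetical, for illustration only.
LEXICON = {
    "bee": ("+voiced", "labial", "-rounded"),
    "pea": ("-voiced", "labial", "-rounded"),
    "do": ("+voiced", "alveolar", "+rounded"),
}


def synthesize(word):
    """Internally 'synthesize' the feature pattern predicted for a word."""
    return LEXICON[word]


def match_score(predicted, observed):
    """Fraction of feature slots on which the two patterns agree."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


def best_hypothesis(observed):
    """Test every word hypothesis and return the best-matching one."""
    return max(LEXICON, key=lambda w: match_score(synthesize(w), observed))


observed = ("+voiced", "labial", "-rounded")  # acoustically derived pattern
print(best_hypothesis(observed))
```

Each candidate word is turned into a predicted pattern first and only then compared with the observed pattern, which is the sense in which recognition proceeds by synthesis.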
Burst spectrum
Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)
The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.
Static spectral shape model
Step 1: Find consonant burst onset
Temporal window
Step 2: Position a 25.6 ms half-Hamming window
[Figure: spectrum; axes: Frequency (kHz), Amplitude in dB.]
Acoustic landmarks
Step 3: Compute spectrum of windowed segment
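Steps 2 and 3 can be sketched with NumPy as follows. Several details are assumptions made for illustration: the 10 kHz sampling rate, the orientation of the half-window (largest weight at the burst onset), the use of a plain FFT with no LPC smoothing, and the function name. Step 1 is not implemented; the burst onset is supplied by the caller.

```python
import numpy as np


def burst_spectrum_db(signal, onset_index, fs=10000, window_ms=25.6):
    """Spectrum (dB) of a half-Hamming-windowed segment starting at burst onset.

    fs is an assumed sampling rate; onset_index comes from Step 1
    (for example, from hand labeling).
    """
    n = int(round(fs * window_ms / 1000.0))  # window length in samples
    # Step 2: falling half of a Hamming window, so the weight is largest
    # at the burst onset and decays toward the end of the segment.
    window = np.hamming(2 * n)[n:]
    segment = signal[onset_index:onset_index + n] * window
    # Step 3: FFT magnitude spectrum, in dB relative to its own peak.
    magnitude = np.abs(np.fft.rfft(segment))
    spectrum_db = 20.0 * np.log10(magnitude / magnitude.max() + 1e-12)
    freqs_hz = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs_hz, spectrum_db


# Synthetic test signal: silence, then a 2 kHz tone starting at sample 300.
fs = 10000
samples = np.arange(1000)
signal = np.where(samples >= 300, np.sin(2 * np.pi * 2000 * samples / fs), 0.0)
freqs, spec = burst_spectrum_db(signal, onset_index=300, fs=fs)
print(freqs[np.argmax(spec)])
```

With real speech, the returned spectrum would then be judged for its gross shape (rising, falling, or compact) rather than for a single peak.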
[Figure: LPC spectrum and FFT spectrum of the windowed segment.]
[Figure: burst spectra of syllable-initial stop consonants; axes: Frequency (kHz), Amplitude in dB. /ba/: diffuse falling; /da/: diffuse rising; /ga/: compact.]
Static spectral shape model
Predictions:
1) spectral shape for a given place of articulation is the same for different talkers and in different phonetic contexts.
2) acoustic modifications that distort other aspects of the syllable but preserve the spectral shape near the burst do not affect consonant identity.
Dynamic properties
Kewley‐Port (1983) hypothesized that the identification of stop consonants is based on time‐varying changes in the spectrum from the onset of the burst into the transition region.
Smits, ten Bosch, and Collier (1996)
Gross cues
spectral features that are distributed across frequency or time
overall spectral shape or its relative change over time
Detailed cues
features that are narrowly localized in frequency or time
center frequency of prominent spectral peaks and their dynamic transitions.
Cues for voicing
• VOT: voice onset time
Short lag (0-20 ms) = voiced sounds /b,d,g/
Long lag (80-100 ms) = unvoiced sounds /p,t,k/
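The lag categories can be expressed as a simple labeling rule. This is a minimal sketch: the 30 ms cutoff between short and long lag is a hypothetical value chosen for illustration (the ranges above leave the region between 20 and 80 ms unspecified), and negative VOT is treated as voicing that begins before the release.

```python
def vot_category(vot_ms, boundary_ms=30.0):
    """Label a voice onset time by lag category.

    boundary_ms is a hypothetical cutoff between short and long lag,
    chosen for illustration; negative VOT means voicing starts before
    the release (prevoicing).
    """
    if vot_ms < 0:
        return "prevoiced"
    if vot_ms <= boundary_ms:
        return "short lag"
    return "long lag"


for vot in (-90, 10, 85):
    print(vot, vot_category(vot))
```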
[Figure: spectrograms of /bi/, /di/, /gi/ (short lag) and /pi/, /ti/, /ki/ (long lag); axes: Time (ms), Frequency (kHz).]
Voicing
• Lisker and Abramson (1970)
Studied voice onset time (VOT) in several languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g., Thai, Hindi, Korean).
• Prevoiced
• Voiced
• Voiceless aspirated
• Voiceless unaspirated
VOT in Dutch
[Figure: Dutch /bɛn/ and /pɛn/, showing prevoiced and unaspirated stops.]
VOT in an English-Dutch bilingual child
E. Simon (2010). Child L2 development: A longitudinal case study on Voice Onset Times in word-initial stops. J. Child Lang. 37, 159–173.
Voicing
• Identification functions for English VOT continuum.
• Frequency histograms of measured VOT in English stops.
Voicing
• Identification functions for Thai VOT continuum.
• Frequency histograms of measured VOT in Thai stops.
Identification and discrimination of English VOT continuum
[Figure: identification function and discrimination function (hypothetical data); axes: Voice onset time (ms), -150 to 150; Percent /ba/ responses, 0 to 100.]
Categorical Perception
• Relationship between VOT and identification functions is nonlinear, with steep transitions and a clearly defined boundary.
• Discrimination functions show a peak near the phoneme boundary; near chance levels elsewhere.
Voicing cues in stop consonants
[Figure: spectrograms of /ada/ and /ata/; axes: Time (ms), Frequency (kHz). Labeled cues:]
1. Voice onset time
2. F1 cutback
3. Burst amplitude
4. Fundamental frequency
5. Vowel duration
6. Prevoicing