phonetics hcs 7367 speech perceptionassmann/hcs6367/lec3.pdfthe standard phonetic classification...
Post on 10-Jul-2020
14 Views
Preview:
TRANSCRIPT
1
HCS 7367Speech Perception
Dr. Peter AssmannFall 2014
Phonetics
Anatomical, acoustic, neural properties
Production and perception
Focuses on continuous/physical properties
Basic unit: phone (=speech sound)
Narrow transcription: [ ph ]
Phonology
Abstract representations mediate between the speech stream and its interpretation as meaningful words
Discrete entities (phonemes, syllables, ...)
Basic unit: phoneme (smallest contrastive unit of a language)
Broad transcription: / p /
Phonology
Phonemes: minimum meaning-differentiating units in a language
Phonemic inventory: list of the phonemes of a given language
American English has ~12 vowel phonemes, 3 diphthongs, ~24 consonants
Phonology
Minimal pairs test: criterion for deciding whether two sounds belong to the same phonemic category. If two words or phrases in a given language
differ only in one phonological segment (e.g. English “pin” and “bin”) but are perceived as distinct words, then that sound contrast contains two distinct phonemes (/p/ and /b/).
Phonology
Allophones: variants (pronunciations) of a given phoneme In English, the “p” sound in “pot” is aspirated
(pronounced [ phɔt ]) while the “p” in “spot” is unaspirated (pronounced [ spɔt ]).
[ph] and [p] are allophones of the phoneme /p/
2
Phonology
Jakobson, Fant and Halle (1961). Preliminaries to Speech Analysis. Distinctive feature theory: universal set of (binary)
features, combining articulation and perception Chomsky and Halle (1968). The Sound Pattern of
English Phonological representations as sequences of segments
(phonemes) composed of distinctive features. Two levels: underlying abstract representation and
surface phonetic representation
Consonant Perception
What is a consonant?
Consonants are sounds produced with a major constriction somewhere in the vocal tract.
English Consonants
Ladefoged, P. (1993). A Course in Phonetics. Harcourt Brace: Fort Worth. 3rd Ed.
Man
ner of articulation
Place of articulation BilabialLabio‐dental
Dental Alveolar Palatal Velar Glottal
Voicing – + – + – + – + – + – + –
Stop p b t d k g
Nasal m n ŋFricative f v θ ð s z ∫ ʒ h
Affricate t∫ ʤApproximant (w) r j w
Lateral l
Consonant production
American English has 24 consonant phonemes
The standard phonetic classification system uses three main features to classify the consonants:
Place of articulation
Manner of articulation
Voicing
Consonant classification
Place of articulation
bilabial, labiodental, dental, alveolar, palatal,
velar, glottal
Manner of articulation
stops, nasal, fricatives, affricates, approximant,
lateral
Voicing
voiced, voiceless
Stop consonants in English
/ba/ /da/ /ga/
/pa/ /ta/ /ka/
bilabial alveolar velar
Place of articulation
Vo
icin
g voic
edvo
icel
ess
Manner of articulation = stop consonants
3
Consonant production
Speech production theory
Source‐filter concept still applies
Relatively narrow constriction or closure
Secondary sound source in the upper vocal tract
Source‐filter theory for consonants
Consonants: sounds produced with a major
constriction somewhere in the vocal tract.
Source‐filter theory still applies, but there are
complications:
1. secondary sound source within the vocal tract
2. secondary source is aperiodic, turbulent noise (frication) rather
than quasi‐periodic like the glottal source
3. secondary source may combine with the glottal source to produce
mixed excitation (voiced fricatives)
Acoustic tube perturbation and vowel formantsSource: Stevens (1999, p. 286)
Filter properties of stop consonants
2.0 2.5 3.01.0
1.5
2.0
F3 frequency (kHz)
F2
fre
qu
en
cy (
kHz)
After Stevens (1997)
alveolar
labial
velar
glottislips
Area functions: cylindrical tube approximations of the vocal tract
Source‐filter theory for consonants
Obstruents (fricatives, affricates, stops)
Sonorants (nasals, liquids, glides)
In obstruents, the constriction can divide the resonating vocal tract into two separate cavities.
coupled vs. uncoupled
interactions with the source
4
Schematic model for fricative /s/
After Kent, Dembowski & Lass (1996)
Glottis
Back Cavity Front Cavity
Obstruction(maxillary incisors)
Lips
Anteriorconstriction
Jet
Eddies
Time (ms)
Voiceless unaspirated stop consonant
[pɑ]
Place of articulation
What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?
Place of articulation
What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?
1. Frequency of burst
2. Formant transitions (F2 and F3) from the burst to the vowel
Acoustic cues for place of articulation in stops
Burst cues
Release burst, frication noise, aspiration noise
Spectral shape
Static vs. dynamic
Formant transition cues
F2, F3 onset vs. target (vowel) frequency
Formant transitions
During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.
5
Formant transitions
The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.
Burst cues
The burst is a brief acoustic transient that occurs at the moment of consonant release.
Burst intensity is higher for voiceless stops than for voiced stops
Burst frequency varies as a function of place
burst voicing onset
Burst cues
Is consonant place information available when the burst is removed?
Stop consonants(from Kent and Read, 1992)
Syllable-initialprevocalic stop
Closure(stop gap)
Release(noise burst)
Transition(formant transitions)
Aspirated
Unaspirated
/ pɑ /
ata
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 400 5000
1
2
3
4
Alveolar stop
Closure(stop gap)
Release(noise burst)
Formant transitions
F1
F2
F3
/ ada /
• high F2• high F3
aka
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 400 5000
1
2
3
4
Velar stop
Closure(stop gap)
Release(noise burst)
Formant transitions
F1
F2
F3
/aga/
• high F2• low F3
6
apa
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 400 5000
1
2
3
4
Bilabial stop
Closure(stop gap)
Release(noise burst)
Formant transitions
F1
F2
F3
/aba/
• low F2• low F3
Stop consonants(from Kent and Read, 1992)
Syllable-finalpostvocalic stop
Transition(formant transitions)
Closure(stop gap)
Release (noise burst)or
No release (no burst)
/ /
Alveolar stop
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 4000
1
2
3
4
/ad/
• high F2• high F3
F2
F3
ap
Time (ms)
Fre
qu
ency
(kH
z)
0 50 100 150 200 2500
1
2
3
4
F2
F3
/ab/
• low F2• low F3
Bilabial stop
Burst cues: /ba/ and /pa/
Time (ms)
Fre
quen
cy (
kHz)
0 50 100 150 200 250 300 350 4000
1
2
3
4
/ba/
Time (ms)0 50 100 150 200 250 300 350 400
0
1
2
3
4
/pa/
• Bilabials: Burst energy is concentrated at low frequencies
Burst cues: /da/ and /ta/
• Alveolars: Burst energy is concentrated at high frequencies
Time (ms)
Fre
quen
cy (
kHz)
0 50 100 150 200 250 3000
1
2
3
4
/da/
Time (ms)0 50 100 150 200 250 300 350
0
1
2
3
4
/ta/
7
Burst cues: /ga/ and /ka/
• Velars: Burst energy is concentrated in the mid range
Time (ms)
Fre
quen
cy (
kHz)
0 50 100 150 200 250 300 3500
1
2
3
4
/ka/
Time (ms)
Fre
quen
cy (
kHz)
0 50 100 150 200 250 3000
1
2
3
4
/ga/
Place of articulation
What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?
Place of articulation
What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?
1. Frequency of burst
2. Formant transitions (F2 and F3) from the burst to the vowel
Acoustic cues for place of articulation in stops
Burst cues
Release burst, frication noise, aspiration noise
Spectral shape
Static vs. dynamic
Formant transition cues
F2, F3 onset vs. target (vowel) frequency
Formant transitions
During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.
Formant transitions
The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.
8
Burst cues
The burst is a brief acoustic transient that occurs at the moment of consonant release.
Burst intensity is higher for voiceless stops than for voiced stops
Burst frequency varies as a function of place
burst voicing onset
Burst cues
Is consonant place information available when the burst is removed?
Invariance problem for stop consonants
A
NOISEBURSTS VOWELS
i
1K
2K
3K
4K
e ucB C
Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606
burst + transitionless vowels0.5
1.0
1.5
2.0
2.5
3.0
3.5
i e a c o u
Majority responses
Bur
st f
req
uenc
y (k
Hz)
t
p k p
p
Liberman, Delattre, and Cooper (1952)
t
k pp
p
Invariance problem for stop consonants
Invariance problem for stop consonants
Transition + vowel without burst
Invariance problem for stop consonants
9
Categorical Perceptionof stop consonantsEqual acoustic changes unequal auditory percepts
place of articulation of stops: /b/ vs /d/ vs /g/
Liberman, Harris, Hoffman, and Griffith (1957)Journal of Experimental Psychology 54, 358-368
b d g
Categorical PerceptionPhoneme boundary (50%)
Identification(labeling) function
Discrimination function
Categorical Perception
1. Identification function shows steep slope (cross‐over) between categories
2. Good between‐category discrimination (near perfect) when the members of the pair belong to different categories
3. Poor within‐category discrimination (near chance) when they are perceived to belong to the same category
time
"bag"
Where are the segments?
F1
F2
F3
/bæg/
Segmentation problem for stops
Same noise ‐ different consonant
pea ka
1400 Hz
F 2
F 1
Different transition ‐ same consonant
dee da
1400 Hz
10
Motor Theory
Invariance problem
Segmentation problem
Categorical perception
Mapping from continuous to discrete
Encodedness of speech
Units of perception (segments, syllables..)
Linguistic units, acoustics and articulation
K.N. Stevens
Quantal theory Certain sounds are favored in the
languages of the world because their acoustic properties can be produced with a wide range of articulations.
Articulatoryvariations
Acousticconsequences
K.N. Stevens
Distinctive feature theory syllables /bi/, /du/, … segments /b/, /d/, … features [+voiced], [-rounded], …
K.N. Stevens JASA 2002
Acoustic Landmarks An early stage in the processing of speech
identifies acoustic landmarks in the signal such as syllable nuclei and acoustic discontinuities corresponding to consonantal closures and releases.
Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.
K.N. Stevens JASA 2002
Lexical access from features The acoustic signal in the vicinity of these
landmarks is processed by a set of modules, each of which identifies a phonetic feature.
K.N. Stevens JASA 2002
Lexical access from features From these landmarks and features, lexical
(word) hypotheses are evaluated.
This is done using analysis-by-synthesis, in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.
11
Burst spectrum
Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)
The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.
Static spectral shape model
Step 1: Find consonant burst onset
Temporal window
Step 2: Position a 25.6 ms half-Hamming window
0 0.5 1 1.5 2 2.5 3 3.5-50
-40
-30
-20
-10
0
10
Frequency (kHz)
Am
plit
ud
e i
n d
B
Acoustic landmarks
Step 3: Compute spectrum of windowed segment
LPC spectrum
FFT spectrum
Static spectral shape model
0 0.5 1 1.5 2 2.5 3 3.5-50
-40
-30
-20
-10
0
10
Frequency (kHz)
Am
plitu
de i
n dB
0 0.5 1 1.5 2 2.5 3 3.5-50
-40
-30
-20
-10
0
10
Frequency (kHz)
Am
plitu
de i
n dB
0 0.5 1 1.5 2 2.5 3 3.5-50
-40
-30
-20
-10
0
10
Frequency (kHz)
Am
plitu
de i
n dB
Diffuse falling Diffuse rising Compact
/ ba / / da / / ga /
Burst spectra of syllable-initial stop consonants
Static spectral shape model
Predictions:
1) spectral shape for a given place of articulation is the same for different talkers and in different phonetic contexts.
2) acoustic modifications that distort other aspects of the syllable but preserve the spectral shape near the burst do not affect consonant identity.
12
Dynamic properties
Kewley‐Port (1983) hypothesized that the
identification of stop consonants is based on
time‐varying changes in the spectrum from the
onset of the burst into the transition region.
Locus Equations
Delattre et al. (1954) described the locus as the frequency location of F2 extrapolated back in time to the consonant release.
locus
Onset of voiced formant transition
Onset of vowel target
F2 of /i/
F2 of /u/
Problems with the locus concept
No physical energy exists at the time/frequency position of the locus; actual extrapolation of the formant transition may lead to a change in consonant identity.
locus
Onset of voiced formant transition
Onset of vowel target
Locus equations
Sussman et al. (1991, 1993) proposed that locus equations provide invariantrelational cues for the perception of place of articulation in stop consonants.
Locus equations describe the relationship between the frequency of F2 at burst onset and F2 in the vowel.
ba
Time (ms)
Fre
que
ncy
(kH
z)
0 100 200 300 4000
1
2
3
4
Locus equationsda
Time (ms)
Fre
que
ncy
(kH
z)
0 50 100 150 200 250 3000
1
2
3
4
Locus equations
13
ga
Time (ms)
Fre
que
ncy
(kH
z)
0 50 100 150 200 250 3000
1
2
3
4
Locus equations Locus equations
Sussman (1991)
Smits, ten Bosch, and Collier (1996)
Gross cues
spectral features that are distributed across frequency or time
overall spectral shape or its relative change over time
Detailed cues
features that are narrowly localized in frequency or time
center frequency of prominent spectral peaks and their dynamic transitions.
Cues for voicing
• VOT – voice onset timeShort lag (0-20 ms) = voiced sounds /b,d,g/
Long lag (80-100 ms) = unvoiced sounds /p,t,k/
bi
Time (ms)
Fre
qu
ency
(kH
z)
0 50 100 150 200 250 3000
1
2
3
4
short lagpi
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 4000
1
2
3
4
long lag
14
di
Time (ms)
Fre
qu
ency
(kH
z)
0 50 100 150 200 250 3000
1
2
3
4
short lagti
Time (ms)
Fre
qu
ency
(kH
z)
0 50 100 150 200 250 300 3500
1
2
3
4
long lag
gi
Time (ms)
Fre
qu
ency
(kH
z)
0 50 100 150 200 250 300 3500
1
2
3
4
short lagki
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 4000
1
2
3
4
long lag
aga
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 400 5000
1
2
3
4
aka
Time (ms)
Fre
qu
ency
(kH
z)
0 100 200 300 400 5000
1
2
3
4
15
Voicing• Lisker and Abramson (1970)
Studied voice onset time (VOT) in several languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g. Thai, Hindi, Korean)
• Prevoiced
• Voiced
• Voiceless aspirated
• Voiceless unaspirated
VOT in Dutch
/bɛn/ /pɛn/
Prevoiced and unaspirated stops
VOT in an English-Dutch bilingual child
E. Simon (2010).Child L2 development: A longitudinal case study on Voice Onset Times in word-initial stops.J. Child Lang. 37 159–173.
Voicing
• Identification functions for English VOT continuum.
• Frequency histograms of measured VOT in English stops.
Voicing
• Identification functions for Thai VOT continuum.
• Frequency histograms of measured VOT in Thai stops.
Identification and discrimination of English VOT continuum
-150 -100 -50 0 50 100 1500
20
40
60
80
100
Voice onset time (ms)
Per
cent
/b
a/ r
espo
nses
Identification function
Discriminationfunction
Hypothetical data
16
Categorical Perception
• Relationship between VOT and identification functions is nonlinear, with steep transitions and a clearly defined boundary.
• Discrimination functions show a peak near the phoneme boundary; near chance levels elsewhere.
/ a d a /
Time (ms)
Fre
que
ncy
(kH
z)
1. VOICE ONSET TIME
3. BURST AMPLITUDE
2. F1 CUTBACK
6. PREVOICING5. VOWEL
DURATION
4. FUNDAMENTALFREQUENCY
0 100 200 300 400 5000
1
2
3
4
Voicing cues in stop consonants
Voicing cues in stop consonants/ a t a /
Time (ms)
Fre
que
ncy
(kH
z)
1. VOICE ONSET TIME
3. BURST AMPLITUDE
2. F1 CUTBACK
6. PREVOICING5. VOWEL
DURATION
4. FUNDAMENTALFREQUENCY
0 100 200 300 400 5000
1
2
3
4
top related