Adaptive Design of Speech Sound Systems
Randy Diehl
In collaboration with Björn Lindblom, Carl Creeger, Lori Holt, and Andrew Lotto
Two observations:
Sound systems of natural languages underexploit the sound producing capabilities of humans.
The sounds that are used in natural languages vary in frequency of occurrence. /a/, /i/, /u/ are common; /š/, /õ/ are rare. /p/, /t/ are common; />/, /q/ are rare.
Why are certain speech sounds favored?
Possibilities:
They are easy to hear (i.e., to distinguish from other sounds).
They are easy to produce.
They are easy to learn.
The role of auditory distinctiveness in the design of vowel inventories
Liljencrants & Lindblom (1972)
Diehl, Lindblom, & Creeger (2002, 2003)
Liljencrants & Lindblom (1972)
Possible vowel sound: Any vowel-like output of a computational model of the human vocal tract (Lindblom & Sundberg, 1971).
Auditory distance: Euclidean distance between any two vowel sounds i and j in a space defined by the frequencies of the first several formants:
$D_{ij} = \left( (\Delta M_1)^2 + (\Delta M'_2)^2 \right)^{1/2}$.
Selection criterion: For any given inventory size, select those vowels whose pairwise distances, Dij, are maximal.
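As a rough illustration of the distance measure and selection criterion above, the following Python sketch picks, from a small candidate set, the inventory whose closest pair of vowels is as far apart as possible. The candidate coordinates, the function names, and the "maximize the smallest pairwise distance" reading of the criterion are assumptions made for illustration; they are not the values or the exact optimization used by Liljencrants & Lindblom (1972).

```python
import math
from itertools import combinations

# Hypothetical candidate vowels as (M1, M2') coordinates.
# These numbers are illustrative only, not taken from the original study.
CANDIDATES = {
    "i": (300.0, 1700.0),
    "e": (500.0, 1550.0),
    "a": (850.0, 1150.0),
    "o": (550.0, 850.0),
    "u": (320.0, 700.0),
}


def distance(v1, v2):
    """Euclidean distance D_ij = sqrt((dM1)^2 + (dM2')^2)."""
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])


def min_pairwise_distance(names):
    """Smallest distance between any two vowels in a candidate inventory."""
    return min(
        distance(CANDIDATES[a], CANDIDATES[b])
        for a, b in combinations(names, 2)
    )


def best_inventory(size):
    """Assumed criterion: keep the inventory whose closest pair is farthest apart."""
    return max(combinations(sorted(CANDIDATES), size), key=min_pairwise_distance)


print(best_inventory(3))
```

With these toy coordinates the search is exhaustive, which is only practical for small candidate sets; a realistic simulation would need a denser vowel space and a proper optimization procedure.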
Predicted vowel systems (Liljencrants & Lindblom, 1972)
A problem: Too many high vowels
These simulations were unrealistic in at least two ways:
Acoustic distance (based on formant frequencies) is probably not a good proxy for auditory distance.
Vowel sounds do not naturally occur in conditions of total quiet.
Improving the realism of the simulations (Diehl, Lindblom, and Creeger, 2002)
Define a notion of ‘auditory distance’ based on plausible auditory representations of vowel sounds.
Model vowel systems as they would have emerged under natural conditions of background noise.
From acoustic to auditory representations
[Figure: panels labeled Input (dB against frequency, 0 to 3000 Hz), Hz to Bark (dB against Bark, 0 to 16 Bark), Auditory filtering, and Output (Phons/Bark against Bark).]
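The Hz-to-Bark step can be sketched as follows. The slides do not say which conversion formula was used, so this example assumes Traunmüller's (1990) approximation; the function name is likewise only illustrative. The auditory filtering stage is not modeled here.

```python
def hz_to_bark(f_hz):
    """Convert frequency in Hz to Bark.

    Assumption: Traunmueller's (1990) approximation, used here as a
    stand-in because the source does not name the formula.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


# Frequencies spanning the range shown on the input axis (0 to 3000 Hz).
for f in (500, 1000, 2000, 3000):
    print(f, round(hz_to_bark(f), 2))
```

Under this approximation 3000 Hz falls a little below 16 Bark, which is consistent with the 0 to 16 Bark axis in the figure.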
Computing distances among auditory spectra
[Figure: auditory spectra of vowels, Phons/Bark (0 to 100) against Bark (0 to 16).]
1. At each point along the Bark dimension, calculate the difference in Phons/Bark between any vowel pair.
2. Square these differences.
3. Sum the squares.
4. Take the square root of the sum. This is a measure of the Euclidean auditory distance between two vowels.
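The four steps above translate directly into a few lines of Python. In this sketch the two spectra are hypothetical lists of Phons/Bark values sampled at the same Bark points; the function name and the sample values are assumptions for illustration.

```python
import math


def auditory_distance(spectrum_a, spectrum_b):
    """Euclidean distance between two auditory spectra (Phons/Bark)."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must be sampled at the same Bark points")
    # Steps 1 and 2: point-by-point differences, squared.
    squared = [(a - b) ** 2 for a, b in zip(spectrum_a, spectrum_b)]
    # Steps 3 and 4: sum the squares, then take the square root.
    return math.sqrt(sum(squared))


# Hypothetical spectra for two vowels (illustrative values only).
vowel_1 = [60.0, 72.0, 55.0, 40.0, 48.0]
vowel_2 = [58.0, 65.0, 61.0, 52.0, 45.0]
print(auditory_distance(vowel_1, vowel_2))
```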
Effects of the auditory transform
[Figure: predicted 7-vowel systems, auditory-based vs. formant-based, plotted as F2 (Hz) against F1 (Hz).]
[Figure: predicted 9-vowel systems, auditory-based vs. formant-based, plotted as F2 (Hz) against F1 (Hz).]
The problem of excessive high vowels is reduced—but not eliminated.
Effects of adding background noise
Hypothesis: Vowel systems have evolved to be perceptually robust even at unfavorable signal/noise ratios.
Method
We used noise whose spectral shape mimicked the long-term average for speech (-6 dB/octave).
We computed auditory distances among vowels at 8 different S/N ratios, ranging from 10 dB to -7.5 dB.
We then averaged these distances to determine the optimal vowel systems.
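The method can be sketched in Python as follows. The source gives only the number of S/N ratios and their end points, so the even 2.5 dB spacing is an assumption, as are the function names, the 100 Hz reference frequency, and the toy distance values. The auditory model that would produce real distances at each S/N ratio is not reproduced here.

```python
import math

# Eight S/N ratios from 10 dB down to -7.5 dB.
# Assumption: evenly spaced in 2.5 dB steps (not stated in the source).
SNRS_DB = [10.0 - 2.5 * k for k in range(8)]


def noise_level_db(freq_hz, ref_hz=100.0, ref_level_db=0.0):
    """Level of a -6 dB/octave noise spectrum relative to its level at ref_hz."""
    return ref_level_db - 6.0 * math.log2(freq_hz / ref_hz)


def mean_distance(distances_by_snr):
    """Average one vowel pair's auditory distance over all S/N ratios."""
    return sum(distances_by_snr[snr] for snr in SNRS_DB) / len(SNRS_DB)


# Hypothetical distances for one vowel pair, shrinking as noise increases.
toy_distances = {snr: 20.0 + snr for snr in SNRS_DB}
print(SNRS_DB)
print(mean_distance(toy_distances))
```

The averaged distances would then feed the same selection procedure used in quiet, so that the chosen inventory is the one that stays distinctive across the whole range of noise levels.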
Effects of adding background noise (3 vowel system)
[Figure: predicted 3-vowel systems in Quiet and in Noise, plotted as F2 (Hz) against F1 (Hz).]
Effects of adding background noise (5 vowel system)
[Figure: predicted 5-vowel systems in Quiet and in Noise, plotted as F2 (Hz) against F1 (Hz).]
Effects of adding background noise (7 vowel system)
[Figure: predicted 7-vowel systems in Quiet and in Noise, plotted as F2 (Hz) against F1 (Hz).]
Effects of adding background noise (9 vowel system)
[Figure: predicted 9-vowel systems in Quiet and in Noise, plotted as F2 (Hz) against F1 (Hz).]
Comparisons with actual vowel inventories
The reduction in the number of high vowels (relative to the Liljencrants & Lindblom simulations) yields a much better fit with actual vowel systems.
Some fronting/unrounding of the high, back vowel /u/ also appears to be common among the world’s languages (e.g., Japanese and many other 5-vowel systems, American English).
Why does background noise reduce the number of high vowels?
[Figure: two panels showing spectra, dB per Bark (0 to 90) against Frequency (Hz), 0 to 3500 Hz.]
First formant information tends to be more noise-resistant than higher formant information.
This warps the auditory-distance space for vowels: the front-back dimension contracts relative to the open-close dimension.
This, in turn, leaves less room for high vowels.
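A toy calculation makes the "less room" argument concrete. Suppose neighboring high vowels must be separated by some minimum auditory distance along the front-back dimension. If noise contracts that dimension, fewer vowels fit. The extent, spacing, and contraction factor below are hypothetical numbers chosen only to illustrate the reasoning; they are not estimates from the simulations.

```python
def vowels_that_fit(extent, min_spacing):
    """Number of points that fit on a line of length `extent` when
    neighbours must be at least `min_spacing` apart."""
    return int(extent // min_spacing) + 1


# Hypothetical values in arbitrary auditory-distance units.
FRONT_BACK_EXTENT = 10.0
MIN_SPACING = 3.0

# A scale of 1.0 stands for quiet; 0.5 stands for a front-back dimension
# contracted by noise (the factor itself is an assumption).
for scale in (1.0, 0.5):
    print(scale, vowels_that_fit(FRONT_BACK_EXTENT * scale, MIN_SPACING))
```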
More recent modeling (Diehl, Lindblom, and Creeger, 2003)
When the auditory model is made more realistic by incorporating temporal (phase-locking) information as well as spectral (excitation-pattern) information, the predicted vowel systems match observed systems fairly closely, even without background noise.
Preferred vowel inventories are reasonably well predicted on the basis of a principle of maximal auditory contrast.
What about preferred consonant inventories?
Voice distinctions
Many languages distinguish certain consonants (e.g., /b/ vs /p/, /d/ vs /t/) based on the differences in voice onset time (VOT). This is the interval between the opening of the vocal tract and the onset of vocal fold vibration (voicing).
Voice categories across languages (Lisker & Abramson, 1964)
Why do languages select from these three categories of VOT?
One possibility: aerodynamic and biomechanical factors.
Another possibility: enhanced discriminability at -20 ms VOT and +20 ms VOT yields robust perceptual distinctions between the three categories.
Evidence: human infants, chinchillas, nonspeech analogs of VOT
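To make the VOT definition and the three-way split concrete, here is a minimal Python sketch. It computes VOT as voicing onset minus release time and sorts the result into three categories using the -20 ms and +20 ms values mentioned above as boundaries. The category labels ("lead", "short lag", "long lag"), the function names, and the treatment of values lying exactly on a boundary are assumptions, not details given in the slides.

```python
def voice_onset_time(release_ms, voicing_onset_ms):
    """VOT in ms: negative when voicing starts before the release."""
    return voicing_onset_ms - release_ms


def vot_category(vot_ms, low_boundary_ms=-20.0, high_boundary_ms=20.0):
    """Assign a VOT value to one of three categories.

    Values exactly on a boundary are placed in the middle category
    (an arbitrary choice for this sketch).
    """
    if vot_ms < low_boundary_ms:
        return "lead"
    if vot_ms <= high_boundary_ms:
        return "short lag"
    return "long lag"


# Hypothetical release and voicing-onset times (ms) for three stops.
for release, voicing in ((100.0, 30.0), (100.0, 110.0), (100.0, 170.0)):
    vot = voice_onset_time(release, voicing)
    print(vot, vot_category(vot))
```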
Voice onset time and tone onset time
[Figure: two panels, A and B, plotting Frequency against Time (ms); labels: -50 ms and +50 ms.]
Discriminability of TOT stimuli
[Figure: Percent Correct Discrimination (0 to 100) for each TOT Stimulus Pair (ms): -50 vs. -20, -40 vs. -10, -30 vs. 0, -20 vs. 10, -10 vs. 20, 0 vs. 30, 10 vs. 40, 20 vs. 50.]
Are TOT categories that are consistent with the natural boundaries more learnable? (Holt, Lotto, and Diehl, JASA, 2004)
[Figure: two sets of panels. In each set, Relative Frequency (0 to 8) is plotted against TOT (ms), from -60 to 120 ms, for the Consistent and Inconsistent category distributions, alongside a bar chart of Average Blocks to Criterion (0 to 20) by Condition (Consistent vs. Inconsistent). Values shown: 3.36 and 8.21 in the first set; 6.57 and 16.86 in the second.]
Summary of VOT results:
Preferred voice categories are more discriminable than other possible voice categories.
The results of Holt, Lotto, and Diehl (2004) suggest that they are also more learnable.
Conclusion
Cross-language preferences in speech sound systems appear to reflect performance constraints on talkers, listeners, and language learners.
Unsolved problems
Measuring articulatory energy costs
Weighting contributions of auditory distinctiveness, least effort, and learnability
Predicting variability