adaptive design of speech sound systems randy diehl in collaboration with bjőrn lindblom, carl...

Adaptive Design of Speech Sound Systems

Randy Diehl

In collaboration with Björn Lindblom, Carl Creeger, Lori Holt, and Andrew Lotto

Two observations:

Sound systems of natural languages underexploit the sound producing capabilities of humans.

The sounds that are used in natural languages vary in frequency of occurrence. /a/, /i/, /u/ are common; /š/, /õ/ are rare. /p/, /t/ are common; />/, /q/ are rare.

Why are certain speech sounds favored? Possibilities:

They are easy to hear (i.e., to distinguish from other sounds).

They are easy to produce.

They are easy to learn.

The role of auditory distinctiveness in the design of vowel inventories

Liljencrants & Lindblom (1972)

Diehl, Lindblom, & Creeger (2002, 2003)

Liljencrants & Lindblom (1972)

Possible vowel sound: Any vowel-like output of a computational model of the human vocal tract (Lindblom & Sundberg, 1971).

Auditory distance: Euclidean distance between any two vowel sounds i and j in a space defined by the frequencies of the first several formants:

$D_{ij} = \left( (\Delta M_1)^2 + (\Delta M'_2)^2 \right)^{1/2}$.

Selection criterion: For any given inventory size, select those vowels whose pairwise distances, Dij, are maximal.
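As a rough illustration of the selection criterion, the sketch below searches a small candidate set exhaustively for the best-dispersed inventory of a given size. Several things here are assumptions rather than details from the slides: the candidate coordinates are invented, "maximal pairwise distances" is operationalized as minimizing the sum of inverse squared distances, and the names `distance`, `crowding`, and `best_inventory` are hypothetical.

```python
from itertools import combinations
from math import hypot

def distance(v, w):
    """Euclidean distance D_ij between two vowels given as (M1, M2') pairs."""
    return hypot(v[0] - w[0], v[1] - w[1])

def crowding(inventory):
    """Sum of inverse squared pairwise distances; smaller means better dispersed.
    (An assumed way of scoring 'maximal pairwise distances'.)"""
    return sum(1.0 / distance(v, w) ** 2 for v, w in combinations(inventory, 2))

def best_inventory(candidates, size):
    """Exhaustively pick the `size` candidates with the least crowding."""
    return min(combinations(candidates, size), key=crowding)

# Hypothetical candidate vowels as (M1, M2') coordinates; values are illustrative only.
candidates = [(300, 2200), (300, 1500), (300, 800),
              (500, 1900), (500, 1000), (750, 1300)]

print(best_inventory(candidates, 3))
```

Exhaustive search is only practical for small candidate sets; a realistic simulation over a dense vowel space would need an iterative optimizer instead.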

Predicted vowel systems (Liljencrants & Lindblom, 1972)

A problem: Too many high vowels

These simulations were unrealistic in at least two ways:

Acoustic distance (based on formant frequencies) is probably not a good proxy for auditory distance.

Vowel sounds do not naturally occur in conditions of total quiet.

Improving the realism of the simulations (Diehl, Lindblom, and Creeger, 2002)

Define a notion of ‘auditory distance’ based on plausible auditory representations of vowel sounds.

Model vowel systems as they would have emerged under natural conditions of background noise.

From acoustic to auditory representations

[Figure: an input spectrum (dB, 0 to 80, against frequency, 0 to 3000 Hz) is converted from Hz to Bark (0 to 16 Bark) and then passed through auditory filtering, giving an output in Phons/Bark (0 to 100) against Bark.]
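The Hz-to-Bark step named on this slide can be sketched as follows. The slides do not say which conversion formula was used; Traunmüller's closed-form approximation is assumed here purely for illustration, and the function name `hz_to_bark` is hypothetical.

```python
def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz.

    Uses Traunmüller's approximation (an assumption; the slides do not
    specify the formula behind their Hz-to-Bark conversion).
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Print a few frequencies across the 0 to 3000 Hz range with their Bark values.
for f in (500, 1000, 2000, 3000):
    print(f, round(hz_to_bark(f), 2))
```

Under this approximation, 3000 Hz falls a little below 16 Bark, which is in line with the two axis ranges given for the figure.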

Computing distances among auditory spectra

[Figure: auditory spectrum, Phons/Bark (0 to 100) against Bark (0 to 16).]

1. At each point along the Bark dimension, calculate the difference in Phons/Bark between any vowel pair.
2. Square these differences.
3. Sum the squares.
4. Take the square root of the sum.

This is a measure of the Euclidean auditory distance between two vowels.
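The four steps above translate directly into a few lines of code. This is a minimal sketch that assumes each vowel's auditory spectrum is already available as a list of Phons/Bark values sampled at the same Bark points; the function name and the sample values are hypothetical.

```python
from math import sqrt

def auditory_distance(spectrum_a, spectrum_b):
    """Euclidean distance between two auditory spectra sampled at the same Bark points."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must be sampled at the same Bark points")
    # Steps 1 and 2: pointwise differences in Phons/Bark, squared.
    squared = [(a - b) ** 2 for a, b in zip(spectrum_a, spectrum_b)]
    # Steps 3 and 4: sum the squares and take the square root.
    return sqrt(sum(squared))

# Illustrative (made-up) spectra in Phons/Bark at four Bark points.
vowel_1 = [60.0, 72.0, 55.0, 40.0]
vowel_2 = [57.0, 68.0, 55.0, 40.0]
print(auditory_distance(vowel_1, vowel_2))  # 5.0
```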

Effects of the auditory transform

[Figure: predicted 7-vowel and 9-vowel systems plotted as F1 (Hz, 200 to 800) against F2 (Hz, 0 to 2500), comparing the formant-based system with the auditory-based system.]

The problem of excessive high vowels is reduced—but not eliminated.

Effects of adding background noise

Hypothesis: Vowel systems have evolved to be perceptually robust even at unfavorable signal/noise ratios.

Method

We used noise whose spectral shape mimicked the long-term average for speech (-6 dB/octave).

We computed auditory distances among vowels at 8 different S/N ratios, ranging from 10 dB to -7.5 dB.

We then averaged these distances to determine the optimal vowel systems.
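The averaging step can be sketched as below. The slides give only the number of S/N ratios (8) and the endpoints (10 dB and -7.5 dB); equal spacing, which works out to 2.5 dB steps, is an assumption. `toy_distance` is a made-up stand-in for the real auditory distance of one vowel pair at a given S/N ratio, and all names are hypothetical.

```python
def snr_levels(start_db=10.0, stop_db=-7.5, count=8):
    """Evenly spaced S/N ratios from start_db to stop_db (even spacing is assumed)."""
    step = (stop_db - start_db) / (count - 1)
    return [start_db + i * step for i in range(count)]

def mean_distance(distance_at_snr, levels):
    """Average a vowel pair's auditory distance over the given S/N ratios."""
    return sum(distance_at_snr(snr) for snr in levels) / len(levels)

def toy_distance(snr_db):
    """Stand-in for a real distance computation; distance shrinks as noise rises."""
    return 20.0 + snr_db

levels = snr_levels()
print(levels)  # [10.0, 7.5, 5.0, 2.5, 0.0, -2.5, -5.0, -7.5]
print(mean_distance(toy_distance, levels))  # 21.25
```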

[Figures: Effects of adding background noise. Predicted 3-, 5-, 7-, and 9-vowel systems in quiet and in noise, each plotted as F1 (Hz, 200 to 800) against F2 (Hz, 0 to 2500).]

Comparisons with actual vowel inventories

The reduction in the number of high vowels (relative to the Liljencrants & Lindblom simulations) yields a much better fit with actual vowel systems.

Some fronting/unrounding of the high, back vowel /u/ also appears to be common among the world’s languages (e.g., Japanese and many other 5-vowel systems, American English).

Why does background noise reduce the number of high vowels?

[Figure: two panels showing level (dB per Bark, 0 to 90) against frequency (0 to 3500 Hz).]

First formant information tends to be more noise-resistant than higher formant information.

This warps the auditory-distance space for vowels: the front-back dimension contracts relative to the open-close dimension.

This, in turn, leaves less room for high vowels.

More recent modeling (Diehl, Lindblom, and Creeger, 2003)

When the auditory model is made more realistic by incorporating temporal (phase-locking) information as well as spectral (excitation-pattern) information, the predicted vowel systems match observed systems fairly closely even without background noise.

Preferred vowel inventories are reasonably well predicted on the basis of a principle of maximal auditory contrast.

What about preferred consonant inventories?

Voice distinctions

Many languages distinguish certain consonants (e.g., /b/ vs. /p/, /d/ vs. /t/) by differences in voice onset time (VOT): the interval between the opening of the vocal tract and the onset of vocal fold vibration (voicing).

Voice categories across languages (Lisker & Abramson 1964)

Why do languages select from these three categories of VOT?

One possibility: aerodynamic and biomechanical factors.

Another possibility: enhanced discriminability at -20 ms VOT and +20 ms VOT yields robust perceptual distinctions between the three categories.

Evidence: human infants, chinchillas, nonspeech analogs of VOT
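A minimal sketch of how two boundaries partition the VOT continuum into three categories. The boundary values of -20 ms and +20 ms come from the slide; the category labels ("lead", "short lag", "long lag") follow common usage but do not appear in the slides, and the treatment of values falling exactly on a boundary is an arbitrary choice made here.

```python
def voice_category(vot_ms, low_boundary=-20.0, high_boundary=20.0):
    """Assign a VOT value (ms) to one of three voice categories.

    Labels and boundary handling are illustrative assumptions.
    """
    if vot_ms < low_boundary:
        return "lead"        # voicing begins well before the release
    if vot_ms <= high_boundary:
        return "short lag"   # voicing begins near the release
    return "long lag"        # voicing begins well after the release

# Classify a few sample VOT values on either side of the two boundaries.
for vot in (-50, 5, 60):
    print(vot, voice_category(vot))
```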

Voice onset time and tone onset time

[Figure: panels A and B, frequency against time (ms), with onset times of -50 ms and +50 ms labeled.]

Discriminability of TOT stimuli

[Figure: percent correct discrimination (0 to 100) for each TOT stimulus pair (ms): -50 vs. -20, -40 vs. -10, -30 vs. 0, -20 vs. 10, -10 vs. 20, 0 vs. 30, 10 vs. 40, 20 vs. 50.]

Are TOT categories that are consistent with the natural boundaries more learnable? (Holt, Lotto, and Diehl, JASA, 2004)

[Figures: two sets of panels. In each, the upper panel shows relative frequency (0 to 8) against TOT (ms, -60 to 120) for the Consistent and Inconsistent conditions, and the lower panel shows average blocks to criterion (0 to 20) by condition (Consistent, Inconsistent). Values shown: 3.36 and 8.21 in the first set; 6.57 and 16.86 in the second.]

Summary of VOT results:

Preferred voice categories are more discriminable than other possible voice categories.

The results of Holt, Lotto, and Diehl (2004) suggest that they are also more learnable.

Conclusion

Cross-language preferences in speech sound systems appear to reflect performance constraints on talkers, listeners, and language learners.

Unsolved problems

Measuring articulatory energy costs

Weighting contributions of auditory distinctiveness, least effort, and learnability

Predicting variability
