hcs 7367 speech perception - university of texas …assmann/hcs6367/lec4.pdfie a c ou majority...


HCS 7367 Speech Perception

Dr. Peter Assmann

Fall 2014

Motor theory of speech perception

Liberman, A. M.; Cooper, F. S.; Shankweiler, D. P.; Studdert‐Kennedy, M. (1967). “Perception of the speech code,” Psych. Rev. 74 (6): 431–461.

Speech involves a specialized phoneme decoder that processes the incoming speech stream by comparing it to the (invariant) neural motor commands to the articulatory muscles.


Speech involves a special type of efficient code; not an alphabet or cipher. Unlike letters in written language, speech involves substantial re‐structuring of the phonemic “message.”

Acoustic cues for successive phonemes are blended, such that sound segments do not correspond to linguistic segments (phonemes).


This blending or encoding process makes speech communication especially efficient for information transmission, but it is impossible to recover phonemes from the speech stream directly.


Speech can be understood at rates of up to 300 or 400 words per minute, approximately 30 phonemes per second.

Studies from auditory psychophysics indicate this is faster than the resolving power of the ear.


Rising and falling F2 transitions ‐‐ both are perceived as /d/.



Formant transitions in isolation do not sound anything like the syllables they are extracted from.

Transition + vowel without burst

Invariance problem for stop consonants

Motor theory of speech perception

Syllables with initial /d/ followed by different vowels share a common “locus” ‐ F2 points to around 1800 Hz. This corresponds to the resonant frequency of the vocal tract at the point of closure, but no sound is radiated during the closure.

Extrapolating the transition to the “locus” does not preserve the consonant identity.

Locus Equations

Delattre et al. (1954) described the locus as the frequency location of F2 extrapolated back in time to the consonant release.

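[Figure: schematic F2 transitions for /i/ and /u/ contexts extrapolated back to the locus; labels: locus, onset of voiced formant transition, onset of vowel target, F2 of /i/, F2 of /u/]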

Problems with the locus concept

No physical energy exists at the time/frequency position of the locus; actual extrapolation of the formant transition may lead to a change in consonant identity.

[Figure: schematic formant transition; labels: locus, onset of voiced formant transition, onset of vowel target]


Locus equations

Sussman et al. (1991, 1993) proposed that locus equations provide invariant relational cues for the perception of place of articulation in stop consonants.

Locus equations describe the relationship between the frequency of F2 at burst onset and F2 in the vowel.
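As a rough illustration of the relationship described above, a locus equation can be sketched as a straight-line fit of F2 at onset against F2 in the vowel, F2onset = k * F2vowel + c, for one consonant across vowel contexts. The F2 values below are hypothetical, chosen only for illustration, and the variable names are assumptions. Reading a "locus" off the fit as the frequency where onset and vowel F2 coincide is one possible interpretation of the line, not a claim about the cited method.

```python
import numpy as np

# Hypothetical F2 measurements (Hz) for one stop consonant in six vowel contexts.
# These numbers are invented for illustration only.
f2_vowel = np.array([2300.0, 2000.0, 1700.0, 1300.0, 1000.0, 900.0])
f2_onset = np.array([2000.0, 1900.0, 1750.0, 1600.0, 1500.0, 1450.0])

# Least-squares line: f2_onset = slope * f2_vowel + intercept
slope, intercept = np.polyfit(f2_vowel, f2_onset, 1)

# Frequency at which the fitted line predicts f2_onset == f2_vowel
# (solve f = slope * f + intercept for f).
locus_hz = intercept / (1.0 - slope)

print(f"slope k = {slope:.2f}, intercept c = {intercept:.0f} Hz")
print(f"fixed point of the fit = {locus_hz:.0f} Hz")
```

In this sketch the slope and intercept together characterize the consonant; a slope between 0 and 1 means F2 onset moves with the vowel but less than the vowel does.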

[Figure: spectrograms of /ba/, /da/, and /ga/; axes: Time (ms) vs. Frequency (kHz, 0 to 4)]

Sussman (1991)


Conclusion: there is no invariant acoustic cue that uniquely identifies the stop consonants /b,d,g/.

Invariance can only be found in the motor commands that underlie the vocal tract configurations used to produce different consonants.



Other cues?

Burst cue

F3 transition

Spectral shape (static)

Spectral shape (dynamic)

Invariance problem for stop consonants

[Figure: schematic stimulus patterns, panels A, B, and C: noise bursts and vowels on a frequency scale of 1 to 4 kHz]

Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

burst + transitionless vowels

[Figure: majority responses (/p/, /t/, /k/) as a function of burst frequency (0.5 to 3.5 kHz) and following vowel (i, e, a, ɔ, o, u)]

Liberman, Delattre, and Cooper (1952)



Other consonants?

Formant transitions are present (hard to see in fricatives)

Continuum of “encodedness” from stops to fricatives to semivowels and liquids to vowels

Categorical Perception

Equal acoustic changes produce unequal auditory percepts

place of articulation of stops: /b/ vs /d/ vs /g/

Liberman, Harris, Hoffman, and Griffith (1957). Journal of Experimental Psychology 54, 358-368

Categorical Perception

[Figure: identification (labeling) function and discrimination function along a /b/ to /d/ to /g/ continuum; the phoneme boundary is the 50% point of the identification function]


1. Identification function shows steep slope (cross‐over) between categories

2. Good between‐category discrimination (near perfect) when the members of the pair belong to different categories

3. Poor within‐category discrimination (near chance) when they are perceived to belong to the same category

Categorical perception

“identification and discrimination functions for nonspeech stimuli do not differ from those for speech stimuli, when obtained under comparable conditions.”

Lane, H. (1965). The motor theory of speech perception: A critical review. Psych. Rev. 72(4), 275‐309.


Specialization for speech?

Chinchillas and humans display similar patterns of categorical perception for speech syllables (Kuhl and Miller, 1975)

Not limited to speech sounds; also includes musical intervals, faces, expressions

Right‐ear advantage in dichotic listening

Dichotic presentation of syllables (e.g. /da/ and /ba/) to different ears leads to a right ear bias (the right ear syllable is reported more often), corresponding to left hemisphere dominance (Kimura, 1961; Shankweiler & Studdert‐Kennedy, 1967).

Duplex perception

Duplex perception occurs when the same signal is heard both as a speech sound and as a non‐speech sound.

One ear hears a formant transition (F3).

The other hears a “base” (all the remaining formants)

Listeners hear both a syllable and a (non‐speech) chirp.


Duplex perception

Evidence for specialized processing?

Primacy of speech


Invariance problem

Segmentation problem

Coarticulation


Mapping from continuous to discrete

Encodedness of speech

Units of perception (segments, syllables..)

Linguistic units, acoustics and articulation


Speech perception involves specialized innate mechanisms that are unique to humans and different from other forms of auditory processing (the “speech is special” hypothesis).

These mechanisms involve the recovery of information related to articulation (motor commands) rather than acoustic segments.


Problem: articulation disorders are not always associated with impaired perception.

Articulation studies indicate comparable complexity and lack of invariance in the articulatory domain.

Revised motor theory: rather than recovering invariant motor commands, the brain is assumed to recover the underlying gestures.


[Diagram: Perception, Acoustics, Articulation]

K.N. Stevens

Quantal theory

Certain sounds are favored in the languages of the world because their acoustic properties can be produced with a wide range of articulations.

[Figure: articulatory variations vs. acoustic consequences]


K.N. Stevens

Distinctive feature theory

syllables: /bi/, /du/, …

segments: /b/, /d/, …

features: [+voiced], [-rounded], …

K.N. Stevens JASA 2002

Acoustic Landmarks

An early stage in the processing of speech identifies acoustic landmarks in the signal such as syllable nuclei and acoustic discontinuities corresponding to consonantal closures and releases.

Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.


Lexical access from features

The acoustic signal in the vicinity of these landmarks is processed by a set of modules, each of which identifies a phonetic feature.


Lexical access from features

From these landmarks and features, lexical (word) hypotheses are evaluated.

This is done using analysis-by-synthesis, in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.
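The hypothesize, synthesize, and compare loop described above can be sketched with a toy example. Everything here is hypothetical: the three-word lexicon, the feature names, and the scoring rule (fraction of predicted features that match) are assumptions made for illustration and are not taken from Stevens's model.

```python
# Hypothetical toy lexicon: each word maps to the feature bundles that
# would be internally "synthesized" for its segments.
LEXICON = {
    "bee": [{"voiced": True, "place": "labial"}, {"vowel": "i"}],
    "pea": [{"voiced": False, "place": "labial"}, {"vowel": "i"}],
    "dee": [{"voiced": True, "place": "alveolar"}, {"vowel": "i"}],
}

def synthesize(word):
    """Return the feature pattern predicted for a hypothesized word."""
    return LEXICON[word]

def match_score(predicted, observed):
    """Fraction of predicted feature values found in the observed pattern."""
    if len(predicted) != len(observed):
        return 0.0
    total = 0
    matched = 0
    for pred_seg, obs_seg in zip(predicted, observed):
        for name, value in pred_seg.items():
            total += 1
            if obs_seg.get(name) == value:
                matched += 1
    return matched / total if total else 0.0

def best_hypothesis(observed):
    """Test every word hypothesis and keep the best-matching one."""
    return max(LEXICON, key=lambda word: match_score(synthesize(word), observed))

# Feature pattern assumed to have been derived from the acoustic signal.
observed = [{"voiced": True, "place": "alveolar"}, {"vowel": "i"}]
best = best_hypothesis(observed)
print(best, match_score(synthesize(best), observed))
```

The point of the sketch is the direction of the comparison: candidate words generate predicted feature patterns, and those predictions are tested against the acoustically derived pattern rather than the signal being decoded bottom-up into a word.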

Burst spectrum

Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)

The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.

Static spectral shape model

Step 1: Find consonant burst onset


Temporal window

Step 2: Position a 25.6 ms half-Hamming window

[Figure: spectrum of the windowed segment; axes: Frequency (kHz, 0 to 3.5) vs. Amplitude in dB (-50 to 10)]

Step 3: Compute spectrum of windowed segment
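Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the burst onset detector is a simple frame-energy threshold, the half-Hamming window is taken to be the decaying half of a Hamming window (so the onset is weighted most heavily), the spectrum is a plain FFT rather than LPC, and the input is a synthetic decaying 3 kHz tone standing in for a burst. At the assumed 10 kHz sampling rate, 25.6 ms is exactly 256 samples.

```python
import numpy as np

def find_burst_onset(signal, frame_len=32, threshold_ratio=0.1):
    """Step 1 (assumed method): first frame whose RMS exceeds a fraction of the peak RMS."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    first = int(np.argmax(rms > threshold_ratio * rms.max()))
    return first * frame_len

def burst_spectrum(signal, fs, onset, window_ms=25.6):
    """Steps 2 and 3: apply a half-Hamming window at the onset and take the FFT."""
    n = int(round(window_ms * 1e-3 * fs))
    window = np.hamming(2 * n)[n:]          # decaying half of a Hamming window
    segment = signal[onset : onset + n] * window
    spectrum = np.fft.rfft(segment)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    level_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    return freqs, level_db - level_db.max()  # normalize so the peak is 0 dB

# Synthetic test signal: 50 ms of silence, then a decaying 3 kHz "burst".
fs = 10_000
rng = np.random.default_rng(0)
t = np.arange(600) / fs
burst = np.sin(2 * np.pi * 3000 * t) * np.exp(-t / 0.01)
burst = burst + 0.01 * rng.standard_normal(t.size)
signal = np.concatenate([np.zeros(500), burst])

onset = find_burst_onset(signal)
freqs, level_db = burst_spectrum(signal, fs, onset)
peak_hz = freqs[np.argmax(level_db)]
print("onset sample:", onset, "spectral peak (Hz):", round(float(peak_hz)))
```

With real speech the overall shape of `level_db` across frequency, rather than a single peak, would be the quantity of interest for the spectral shape model.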

[Figure: LPC and FFT spectra of the windowed burst for three syllables; axes: Frequency (kHz, 0 to 3.5) vs. Amplitude in dB (-50 to 10)]

Burst spectra of syllable-initial stop consonants:

/ba/: diffuse falling

/da/: diffuse rising

/ga/: compact


Predictions:

1) spectral shape for a given place of articulation is the same for different talkers and in different phonetic contexts.

2) acoustic modifications that distort other aspects of the syllable but preserve the spectral shape near the burst do not affect consonant identity.

Dynamic properties

Kewley‐Port (1983) hypothesized that the

identification of stop consonants is based on

time‐varying changes in the spectrum from the

onset of the burst into the transition region.

Smits, ten Bosch, and Collier (1996)

Gross cues

spectral features that are distributed across frequency or time

overall spectral shape or its relative change over time

Detailed cues

features that are narrowly localized in frequency or time

center frequency of prominent spectral peaks and their dynamic transitions.


Cues for voicing

• VOT – voice onset time

Short lag (0-20 ms) = voiced sounds /b,d,g/

Long lag (80-100 ms) = unvoiced sounds /p,t,k/
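A small sketch of how these ranges might be applied: VOT is taken as the time of voicing onset minus the time of the burst release, so voicing that starts before the release gives a negative value. The short-lag and long-lag ranges come from the lines above; labeling values between them as "intermediate", treating values above 100 ms as long lag, and the example times are assumptions made for illustration.

```python
def vot_ms(burst_release_ms, voicing_onset_ms):
    """Voice onset time: voicing onset relative to the burst release."""
    return voicing_onset_ms - burst_release_ms

def vot_category(vot):
    """Coarse label for a VOT value in ms (category edges partly assumed)."""
    if vot < 0:
        return "prevoiced"      # voicing begins before the release
    if vot <= 20:
        return "short lag"      # 0-20 ms
    if vot >= 80:
        return "long lag"       # 80 ms and above (assumed open-ended)
    return "intermediate"       # between the two ranges given above

# Hypothetical event times (ms) for two syllables.
examples = {
    "bi": vot_ms(burst_release_ms=100.0, voicing_onset_ms=110.0),
    "pi": vot_ms(burst_release_ms=100.0, voicing_onset_ms=190.0),
}
labels = {syllable: vot_category(v) for syllable, v in examples.items()}
print(labels)
```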

[Figure: spectrograms of /bi/, /di/, /gi/ (short lag) and /pi/, /ti/, /ki/ (long lag); axes: Time (ms) vs. Frequency (kHz, 0 to 4)]

Voicing

• Lisker and Abramson (1970)

Studied voice onset time (VOT) in several languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g. Thai, Hindi, Korean)

• Prevoiced

• Voiced

• Voiceless aspirated

• Voiceless unaspirated

VOT in Dutch

/bɛn/ /pɛn/

Prevoiced and unaspirated stops

VOT in an English-Dutch bilingual child

E. Simon (2010). Child L2 development: A longitudinal case study on Voice Onset Times in word-initial stops. J. Child Lang. 37, 159–173.

Voicing

• Identification functions for English VOT continuum.

• Frequency histograms of measured VOT in English stops.

Voicing

• Identification functions for Thai VOT continuum.

• Frequency histograms of measured VOT in Thai stops.


Identification and discrimination of English VOT continuum

[Figure: identification function (percent /ba/ responses, 0 to 100) and discrimination function plotted against voice onset time (ms, -150 to 150); hypothetical data]


• Relationship between VOT and identification functions is nonlinear, with steep transitions and a clearly defined boundary.

• Discrimination functions show a peak near the phoneme boundary; near chance levels elsewhere.

Voicing cues in stop consonants

[Figure: spectrograms of /ada/ and /ata/; axes: Time (ms, 0 to 500) vs. Frequency (kHz, 0 to 4); annotated with six voicing cues]

1. Voice onset time

2. F1 cutback

3. Burst amplitude

4. Fundamental frequency

5. Vowel duration

6. Prevoicing
