phonetics hcs 7367 speech perceptionassmann/hcs6367/lec3.pdfthe standard phonetic classification...

HCS 7367Speech Perception

Dr. Peter AssmannFall 2014

Phonetics

Anatomical, acoustic, neural properties

Production and perception

Focuses on continuous/physical properties

Basic unit: phone (=speech sound)

Narrow transcription: [ ph ]

Phonology

Abstract representations mediate between the speech stream and its interpretation as meaningful words

Discrete entities (phonemes, syllables, ...)

Basic unit: phoneme (smallest contrastive unit of a language)

Broad transcription: / p /

Phonology

Phonemes: minimum meaning-differentiating units in a language

Phonemic inventory: list of the phonemes of a given language

American English has ~12 vowel phonemes, 3 diphthongs, ~24 consonants

Phonology

Minimal pairs test: criterion for deciding whether two sounds belong to the same phonemic category. If two words or phrases in a given language

differ only in one phonological segment (e.g. English “pin” and “bin”) but are perceived as distinct words, then that sound contrast contains two distinct phonemes (/p/ and /b/).

Phonology

Allophones: variants (pronunciations) of a given phoneme In English, the “p” sound in “pot” is aspirated

(pronounced [ phɔt ]) while the “p” in “spot” is unaspirated (pronounced [ spɔt ]).

[ph] and [p] are allophones of the phoneme /p/

Phonology

Jakobson, Fant and Halle (1961). Preliminaries to Speech Analysis. Distinctive feature theory: universal set of (binary)

features, combining articulation and perception Chomsky and Halle (1968). The Sound Pattern of

English Phonological representations as sequences of segments

(phonemes) composed of distinctive features. Two levels: underlying abstract representation and

surface phonetic representation

Consonant Perception

What is a consonant?

Consonants are sounds produced with a major constriction somewhere in the vocal tract.

English Consonants

Ladefoged, P. (1993). A Course in Phonetics. Harcourt Brace: Fort Worth. 3rd Ed.

ner of articulation

Place of articulation BilabialLabio‐dental

Dental Alveolar Palatal Velar Glottal

Voicing – + – + – + – + – + – + –

Stop p b t d k g

Nasal m n ŋFricative f v θ ð s z ∫ ʒ h

Affricate t∫ ʤApproximant (w) r j w

Lateral l

Consonant production

American English has 24 consonant phonemes

The standard phonetic classification system uses three main features to classify the consonants:

Place of articulation

Manner of articulation

Voicing

Consonant classification

bilabial, labiodental, dental, alveolar, palatal,

velar, glottal

Manner of articulation

stops, nasal, fricatives, affricates, approximant,

lateral

Voicing

voiced, voiceless

Stop consonants in English

/ba/ /da/ /ga/

/pa/ /ta/ /ka/

bilabial alveolar velar

g voic

Manner of articulation = stop consonants

Consonant production

Speech production theory

Source‐filter concept still applies

Relatively narrow constriction or closure

Secondary sound source in the upper vocal tract

Source‐filter theory for consonants

Consonants: sounds produced with a major

constriction somewhere in the vocal tract.

Source‐filter theory still applies, but there are

complications:

1. secondary sound source within the vocal tract

2. secondary source is aperiodic, turbulent noise (frication) rather

than quasi‐periodic like the glottal source

3. secondary source may combine with the glottal source to produce

mixed excitation (voiced fricatives)

Acoustic tube perturbation and vowel formantsSource: Stevens (1999, p. 286)

Filter properties of stop consonants

2.0 2.5 3.01.0

F3 frequency (kHz)

After Stevens (1997)

alveolar

labial

glottislips

Area functions: cylindrical tube approximations of the vocal tract

Source‐filter theory for consonants

Obstruents (fricatives, affricates, stops)

Sonorants (nasals, liquids, glides)

In obstruents, the constriction can divide the resonating vocal tract into two separate cavities.

coupled vs. uncoupled

interactions with the source

Schematic model for fricative /s/

After Kent, Dembowski & Lass (1996)

Glottis

Back Cavity Front Cavity

Obstruction(maxillary incisors)

Anteriorconstriction

Eddies

Time (ms)

Voiceless unaspirated stop consonant

What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?

1. Frequency of burst

2. Formant transitions (F2 and F3) from the burst to the vowel

Acoustic cues for place of articulation in stops

Burst cues

Release burst, frication noise, aspiration noise

Spectral shape

Static vs. dynamic

Formant transition cues

F2, F3 onset vs. target (vowel) frequency

Formant transitions

During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.

Formant transitions

The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.

Burst cues

The burst is a brief acoustic transient that occurs at the moment of consonant release.

Burst intensity is higher for voiceless stops than for voiced stops

Burst frequency varies as a function of place

burst voicing onset

Burst cues

Is consonant place information available when the burst is removed?

Stop consonants(from Kent and Read, 1992)

Syllable-initialprevocalic stop

Closure(stop gap)

Release(noise burst)

Transition(formant transitions)

Aspirated

Unaspirated

/ pɑ /

Time (ms)

0 100 200 300 400 5000

Alveolar stop

Closure(stop gap)

Formant transitions

/ ada /

• high F2• high F3

Time (ms)

0 100 200 300 400 5000

Velar stop

Closure(stop gap)

Formant transitions

• high F2• low F3

Time (ms)

0 100 200 300 400 5000

Bilabial stop

Closure(stop gap)

Formant transitions

• low F2• low F3

Stop consonants(from Kent and Read, 1992)

Syllable-finalpostvocalic stop

Transition(formant transitions)

Closure(stop gap)

Release (noise burst)or

No release (no burst)

Alveolar stop

Time (ms)

0 100 200 300 4000

• high F2• high F3

Time (ms)

0 50 100 150 200 2500

• low F2• low F3

Bilabial stop

Burst cues: /ba/ and /pa/

Time (ms)

0 50 100 150 200 250 300 350 4000

Time (ms)0 50 100 150 200 250 300 350 400

• Bilabials: Burst energy is concentrated at low frequencies

Burst cues: /da/ and /ta/

• Alveolars: Burst energy is concentrated at high frequencies

Time (ms)

0 50 100 150 200 250 3000

Time (ms)0 50 100 150 200 250 300 350

Burst cues: /ga/ and /ka/

• Velars: Burst energy is concentrated in the mid range

Time (ms)

0 50 100 150 200 250 300 3500

Time (ms)

0 50 100 150 200 250 3000

1. Frequency of burst

2. Formant transitions (F2 and F3) from the burst to the vowel

Acoustic cues for place of articulation in stops

Burst cues

Release burst, frication noise, aspiration noise

Spectral shape

Static vs. dynamic

Formant transition cues

F2, F3 onset vs. target (vowel) frequency

Formant transitions

During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.

Formant transitions

The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.

Burst cues

The burst is a brief acoustic transient that occurs at the moment of consonant release.

Burst intensity is higher for voiceless stops than for voiced stops

Burst frequency varies as a function of place

burst voicing onset

Burst cues

Is consonant place information available when the burst is removed?

Invariance problem for stop consonants

NOISEBURSTS VOWELS

e ucB C

Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

burst + transitionless vowels0.5

i e a c o u

Majority responses

Liberman, Delattre, and Cooper (1952)

Transition + vowel without burst

Categorical Perceptionof stop consonantsEqual acoustic changes unequal auditory percepts

place of articulation of stops: /b/ vs /d/ vs /g/

Liberman, Harris, Hoffman, and Griffith (1957)Journal of Experimental Psychology 54, 358-368

Categorical PerceptionPhoneme boundary (50%)

Identification(labeling) function

Discrimination function

Categorical Perception

1. Identification function shows steep slope (cross‐over) between categories

2. Good between‐category discrimination (near perfect) when the members of the pair belong to different categories

3. Poor within‐category discrimination (near chance) when they are perceived to belong to the same category

Where are the segments?

/bæg/

Segmentation problem for stops

Same noise ‐ different consonant

pea ka

1400 Hz

Different transition ‐ same consonant

dee da

1400 Hz

Motor Theory

Invariance problem

Segmentation problem

Categorical perception

Mapping from continuous to discrete

Encodedness of speech

Units of perception (segments, syllables..)

Linguistic units, acoustics and articulation

K.N. Stevens

Quantal theory Certain sounds are favored in the

languages of the world because their acoustic properties can be produced with a wide range of articulations.

Articulatoryvariations

Acousticconsequences

K.N. Stevens

Distinctive feature theory syllables /bi/, /du/, … segments /b/, /d/, … features [+voiced], [-rounded], …

K.N. Stevens JASA 2002

Acoustic Landmarks An early stage in the processing of speech

identifies acoustic landmarks in the signal such as syllable nuclei and acoustic discontinuities corresponding to consonantal closures and releases.

Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.

Lexical access from features The acoustic signal in the vicinity of these

landmarks is processed by a set of modules, each of which identifies a phonetic feature.

Lexical access from features From these landmarks and features, lexical

(word) hypotheses are evaluated.

This is done using analysis-by-synthesis, in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.

Burst spectrum

Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)

The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.

Static spectral shape model

Step 1: Find consonant burst onset

Temporal window

Step 2: Position a 25.6 ms half-Hamming window

0 0.5 1 1.5 2 2.5 3 3.5-50

Frequency (kHz)

Acoustic landmarks

Step 3: Compute spectrum of windowed segment

LPC spectrum

FFT spectrum

0 0.5 1 1.5 2 2.5 3 3.5-50

Frequency (kHz)

0 0.5 1 1.5 2 2.5 3 3.5-50

Frequency (kHz)

0 0.5 1 1.5 2 2.5 3 3.5-50

Frequency (kHz)

Diffuse falling Diffuse rising Compact

/ ba / / da / / ga /

Burst spectra of syllable-initial stop consonants

Predictions:

1) spectral shape for a given place of articulation is the same for different talkers and in different phonetic contexts.

2) acoustic modifications that distort other aspects of the syllable but preserve the spectral shape near the burst do not affect consonant identity.

Dynamic properties

Kewley‐Port (1983) hypothesized that the

identification of stop consonants is based on

time‐varying changes in the spectrum from the

onset of the burst into the transition region.

Locus Equations

Delattre et al. (1954) described the locus as the frequency location of F2 extrapolated back in time to the consonant release.

Onset of voiced formant transition

Onset of vowel target

F2 of /i/

F2 of /u/

Problems with the locus concept

No physical energy exists at the time/frequency position of the locus; actual extrapolation of the formant transition may lead to a change in consonant identity.

Onset of voiced formant transition

Onset of vowel target

Locus equations

Sussman et al. (1991, 1993) proposed that locus equations provide invariantrelational cues for the perception of place of articulation in stop consonants.

Locus equations describe the relationship between the frequency of F2 at burst onset and F2 in the vowel.

Time (ms)

0 100 200 300 4000

Locus equationsda

Time (ms)

0 50 100 150 200 250 3000

Locus equations

Time (ms)

0 50 100 150 200 250 3000

Locus equations Locus equations

Sussman (1991)

Smits, ten Bosch, and Collier (1996)

Gross cues

spectral features that are distributed across frequency or time

overall spectral shape or its relative change over time

Detailed cues

features that are narrowly localized in frequency or time

center frequency of prominent spectral peaks and their dynamic transitions.

Cues for voicing

• VOT – voice onset timeShort lag (0-20 ms) = voiced sounds /b,d,g/

Long lag (80-100 ms) = unvoiced sounds /p,t,k/

Time (ms)

0 50 100 150 200 250 3000

short lagpi

Time (ms)

0 100 200 300 4000

long lag

Time (ms)

0 50 100 150 200 250 3000

short lagti

Time (ms)

0 50 100 150 200 250 300 3500

long lag

Time (ms)

0 50 100 150 200 250 300 3500

short lagki

Time (ms)

0 100 200 300 4000

long lag

Time (ms)

0 100 200 300 400 5000

Time (ms)

0 100 200 300 400 5000

Voicing• Lisker and Abramson (1970)

Studied voice onset time (VOT) in several languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g. Thai, Hindi, Korean)

• Prevoiced

• Voiced

• Voiceless aspirated

• Voiceless unaspirated

VOT in Dutch

/bɛn/ /pɛn/

Prevoiced and unaspirated stops

VOT in an English-Dutch bilingual child

E. Simon (2010).Child L2 development: A longitudinal case study on Voice Onset Times in word-initial stops.J. Child Lang. 37 159–173.

Voicing

• Identification functions for English VOT continuum.

• Frequency histograms of measured VOT in English stops.

Voicing

• Identification functions for Thai VOT continuum.

• Frequency histograms of measured VOT in Thai stops.

Identification and discrimination of English VOT continuum

-150 -100 -50 0 50 100 1500

Voice onset time (ms)

Identification function

Discriminationfunction

Hypothetical data

Categorical Perception

• Relationship between VOT and identification functions is nonlinear, with steep transitions and a clearly defined boundary.

• Discrimination functions show a peak near the phoneme boundary; near chance levels elsewhere.

/ a d a /

Time (ms)

1. VOICE ONSET TIME

3. BURST AMPLITUDE

2. F1 CUTBACK

6. PREVOICING5. VOWEL

DURATION

4. FUNDAMENTALFREQUENCY

0 100 200 300 400 5000

Voicing cues in stop consonants

Voicing cues in stop consonants/ a t a /

Time (ms)

1. VOICE ONSET TIME

3. BURST AMPLITUDE

2. F1 CUTBACK

6. PREVOICING5. VOWEL

DURATION

4. FUNDAMENTALFREQUENCY

0 100 200 300 400 5000

phonetics hcs 7367 speech perceptionassmann/hcs6367/lec3.pdfthe standard phonetic classification...

Documents

autocad2008 lec1,lec2,lec3

hcs 7367 speech perception - university of texas at...

week5 lec3-bscs1

researchtopic lec3

lec3 capital budgeting

304 lec3.pdf

machine design lec3

lec3 semiconductor physics

485 lec3 history_review_ii

mit gis lec3

phonetics hcs 7367 speech perception - the university of...

lec3 intro turb

week2 lec3-bscs1

week3 lec3-bscs1

mechancanical test lec3

lec3 number systems_number_theory

lec3=internal variables

1st yr lec3

neurotransmitter2 lec3

lec3 final