
More Motor Theory + Obstruent Acoustics

March 31, 2009

To Begin With…
• Homework exchange!

• Today:

• Some more thoughts on perception

• And then a brief review of obstruent acoustics

• On Thursday, we’ll be doing:

• A brief description of vocal tract musculature.

• Static palatography demo!

• You’re welcome to bring in a camera, if you so desire.

• Also, a link:

• http://sakurakoshimizu.blogspot.com/

Motor Theory Review
• Last time, we discussed the basics of the motor theory of speech perception.

• Some basic precepts:

• Humans have a special neurological module for speech perception.

• (And other species don’t.)

• Speech perception doesn’t work on the basis of general principles.

• We perceive speech as gestures, not sounds.

• Some basic evidence:

• Categorical perception

A Modular Mind Model

[Diagram: external, physical reality → transducers (eyes, ears, skin, etc.) → modules (vision, hearing, touch, speech) → central processes (judgment, imagination, memory, attention)]

More Evidence for Modularity
• It has also been observed that speech is perceived multi-modally.

• i.e.: we can perceive it through vision, as well as hearing (or some combination of the two).

• We’re perceiving “gestures”

• …and the gestures are abstract.

• Interesting evidence: McGurk Effect

McGurk Effect, revealed

  Audio   Visual   Perceived
  ba      ga       da
  ga      ba       ba

• Some interesting facts:

• The McGurk Effect is exceedingly robust.

• Adults show the McGurk Effect more than children.

• Americans show the McGurk Effect more than Japanese.

Original McGurk Data

• Stimulus: auditory ba-ba + visual ga-ga

• Response types:

  Auditory: ba-ba      Fused: da-da
  Visual:   ga-ga      Combo: gabga, bagba

  Age     Auditory   Visual   Fused   Combo   (% of responses)
  3-5     19         36       81      0
  7-8     36         0        64      0
  18-40   2          0        98      0

Original McGurk Data

• Stimulus: auditory ga-ga + visual ba-ba

• Response types:

  Auditory: ga-ga      Fused: da-da
  Visual:   ba-ba      Combo: gabga, bagba

  Age     Auditory   Visual   Fused   Combo   (% of responses)
  3-5     57         10       0       19
  7-8     36         21       11      32
  18-40   11         31       0       54

Audio-Visual Sidebar
• Visual cues affect the perception of speech in non-mismatched conditions, as well.

• Scientific studies of lipreading date back to the early twentieth century

• The original goal: improve the speech perception skills of the hearing-impaired

• Note: visual speech cues often complement audio speech cues

• In particular: place of articulation

• However, training people to become better lipreaders has proven difficult…

• Some people get it; some people don’t.

Sumby & Pollack (1954)
• First investigated the influence of visual information on the perception of speech by normal-hearing listeners.

• Method:

• Presented individual word tokens to listeners in noise, with simultaneous visual cues.

• Task: identify spoken word

• Clear:

• +10 dB SNR:

• + 5 dB SNR:

• 0 dB SNR:

Sumby & Pollack data

[Graph: word identification accuracy, Auditory-Only vs. Audio-Visual]

• Visual cues provide an intelligibility boost equivalent to a 12 dB increase in signal-to-noise ratio.
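
For reference (my addition, not from the slides), here is what that 12 dB figure means in linear terms:

```python
# A 12 dB change in signal-to-noise ratio, expressed as linear ratios.
db = 12
power_ratio = 10 ** (db / 10)       # ~15.8x more signal power relative to the noise
amplitude_ratio = 10 ** (db / 20)   # ~4.0x greater signal amplitude relative to the noise
print(round(power_ratio, 1), round(amplitude_ratio, 1))   # 15.8 4.0
```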

Tadoma Method

• Some deaf-blind people learn to perceive speech through the tactile modality, by using the Tadoma method.

Audio-Tactile Perception
• Fowler & Dekle: tested ability of (naive) college students to perceive speech through the Tadoma method.

• Presented synthetic stops auditorily

• Combined with mismatched tactile information:

• Ex: audio /ga/ + tactile /ba/

• Also combined with mismatched orthographic information:

• Ex: audio /ga/ + orthographic /ba/

• Task: listeners reported what they “heard”

• Tactile condition biased listeners more towards “ba” responses

Fowler & Dekle data

[Graph: proportion of “ba” responses in the orthographic mismatch condition (read “ba”) vs. the tactile mismatch condition (felt “ba”)]

Another Piece of the Puzzle
• Another interesting finding which has been used to argue for the “speech is special” theory is duplex perception.

• Take an isolated F3 transition:

and present it to one ear…

Do the Edges First!
• While presenting this spectral frame to the other ear:

Two Birds with One Spectrogram

• The resulting combo is perceived in duplex fashion:

• One ear hears the F3 “chirp”;

• The other ear hears the combined stimulus as “da”.

Duplex Interpretation
• Check out the spectrograms in Praat.

• Mann and Liberman (1983) found:

• Discrimination of the F3 chirps is gradient when they’re in isolation…

• but categorical when combined with the spectral frame.

• (Compare with the F3 discrimination experiment with Japanese and American listeners)

• Interpretation: the “special” speech processor puts the two pieces of the spectrogram together.

fMRI data
• Benson et al. (2001)

• Non-Speech stimuli = notes, chords, and chord progressions on a piano

fMRI data
• Benson et al. (2001)

• Difference in activation for natural speech stimuli versus activation for sinewave speech stimuli

Mirror Neurons
• In the 1990s, researchers in Italy discovered what they called mirror neurons in the brains of macaques.

• Macaques had been trained to make grasping motions with their hands.

• Researchers recorded the activity of single neurons while the monkeys were making these motions.

• Serendipity:

• the same neurons fired when the monkeys saw the researchers making grasping motions.

• a neurological link between perception and action.

• Motor theory claim: same links exist in the human brain, for the perception of speech gestures

Motor Theory, in a nutshell
• The big idea:

• We perceive speech as abstract “gestures”, not sounds.

• Evidence:

1. The perceptual interpretation of speech differs radically from the acoustic organization of speech sounds

2. Speech perception is multi-modal

3. Direct (visual, tactile) information about gestures can influence/override indirect (acoustic) speech cues

4. Limited top-down access to the primary, acoustic elements of speech

Moving On…
• One important lesson to take from the motor theory perspective is:

• The dynamics of speech are generally more important to perception than static acoustic cues.

• Note: visual chimerism over the weekend.

• Back to phonetics:

• (static) internal cues to stop place are mostly useless.

• During a stop closure, the surrounding solid tissue filters all but the lowest frequencies from the signal.

Closure Voicing
• The low frequency information that passes through the stop “filter” appears as a “voicing bar” in a spectrogram.

• This acoustic information provides hardly any cues for place of articulation.

Armenian:

[bag]
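
A minimal sketch of the stop “filter” idea (my own illustration, not from the slides): low-pass filtering a harmonically rich source at a few hundred Hz leaves only the sort of low-frequency energy seen in a voicing bar. The 300 Hz cut-off and the pulse-train source are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000                                   # sampling rate (Hz), arbitrary
t = np.arange(fs) / fs
# A crude glottal-like source: a 120 Hz pulse train, rich in harmonics.
source = (np.sin(2 * np.pi * 120 * t) > 0.99).astype(float)

# Low-pass filter around 300 Hz: a stand-in for what the closed vocal
# tract lets through during a stop closure.
b, a = butter(4, 300 / (fs / 2))
voicing_bar = lfilter(b, a, source)

# Nearly everything that distinguishes [b], [d], and [g] lives above this
# cut-off, which is why closure voicing alone says little about place.
```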

Stop Transition Cues (again)
• With the transition between stop closure and vowel, the perceptual task becomes much easier:

• Try the same with Peter’s productions:

• stop closures:

• with transitions:

• The moral of the story (again):

• Dynamic changes provide stronger perceptual cues to place than static acoustic information.

Release Bursts
• Note: along with transitions, stops have another cue for place at their disposal.

• = release bursts

• (nasals do not have these)

• Here’s a waveform of a [p] release burst:

(duration ≈ 5 msec)

• What do you think the [p] burst spectrum will look like?

Burst Spectrum
• [p] bursts tend to have very diffuse spectra, with energy spread across a wide range of frequencies.

• Also: [p] bursts are very weak in intensity.

• The extremely short duration of bursts requires heavy damping in the waveform…

• …which spreads the energy over a broader frequency range.

Release Bursts
• In a spectrogram:

• bilabial release bursts have a very diffuse spectrum, weakly spread across all frequencies.

[p] burst

• [p] bursts are relatively close to pure transient sounds.

Transients
• A transient is:

• “a sudden pressure fluctuation that is not sustained or repeated over time.”

• An ideal transient waveform:

A Transient Spectrum
• An ideal transient spectrum is perfectly flat:
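
A tiny numpy check (my addition, not from the slides): a one-sample impulse, the discrete idealization of a transient, really does have a perfectly flat magnitude spectrum.

```python
import numpy as np

x = np.zeros(512)
x[0] = 1.0                            # an ideal transient: one sudden, unrepeated fluctuation

spectrum = np.abs(np.fft.rfft(x))     # magnitude spectrum

# Every frequency bin carries exactly the same energy: a flat spectrum.
print(spectrum.min(), spectrum.max())   # 1.0 1.0
```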

Burst Filtering
• The spectra of more posterior release bursts may be filtered by the cavity in front of the burst.

• Ex: [t] bursts tend to lack energy at the lowest end of the frequency scale.

• And higher frequency components are somewhat more intense.

[t] burst

Release Bursts: [k]
• Velar release bursts are relatively intense.

• They also often have a strong concentration of energy in the 1500-2000 Hz range (F2/F3).

• There can often be multiple [k] release bursts.

[k] burst

Another Look
• [k] bursts tend to be intense right where F2 and F3 meet in the velar pinch:

Armenian:

[bag]

Finally, Fricatives
• The last type of sound we need to consider in speech is an aperiodic, continuous noise.

• (Transients are aperiodic but not continuous.)

• Ideally:

• Q: What would the spectrum of this waveform look like?

White Noise Spectrum
• Technical term: white noise

• has an unlimited range of frequency components

• Analogy: white light is what you get when you combine all visible frequencies of the electromagnetic spectrum
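
A small sketch (assumed, not part of the lecture) generating white noise and confirming that its long-term average spectrum is roughly flat:

```python
import numpy as np

noise = np.random.normal(0, 1, 512 * 64)     # Gaussian white noise samples

# Average the magnitude spectrum over many short frames to smooth out
# the frame-to-frame randomness.
frames = noise.reshape(64, 512)
avg_spectrum = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

# The relative variation across frequency bins is small: on average,
# every frequency band carries the same energy (like white light).
print(avg_spectrum.std() / avg_spectrum.mean())   # roughly 0.05-0.1
```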

Turbulence
• We can create aperiodic noise in speech by taking advantage of the phenomenon of turbulence.

• Some handy technical terms:

• laminar flow: a fluid flowing in parallel layers, with no disruption between the layers.

• turbulent flow: a fluid flowing with chaotic property changes, including rapid variation in pressure and velocity in both space and time

• Whether or not airflow is turbulent depends on:

• the volume velocity of the fluid

• the area of the channel through which it flows

Turbulence
• Turbulence is more likely with:

• a higher volume velocity

• less channel area

• All fricatives therefore require:

• a narrow constriction

• high airflow

Fricative Specs
• Fricatives require great articulatory precision.

• Some data for [s] (Subtelny et al., 1972):

• alveolar constriction 1 mm

• incisor constriction 2-3 mm

• Larger constrictions result in -like sounds.

• Generally, fricatives have a cross-sectional area between 6 and 12 mm².

• Cross-sectional areas greater than 20 mm² result in laminar flow.

• Airflow = 330 cm³/sec for voiceless fricatives

• …and 240 cm³/sec for voiced fricatives
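
To tie these numbers together, here is a rough back-of-the-envelope sketch (my own, not from the slides), using a common Reynolds-number-style criterion in which turbulence is expected above a critical value of roughly 1800; the exact formulation and threshold vary across sources.

```python
import math

NU_AIR = 0.15        # kinematic viscosity of air, cm^2/s (approximate)
RE_CRITICAL = 1800   # commonly cited critical value; sources differ

def reynolds(volume_velocity_cm3s, area_cm2):
    # Particle velocity = volume velocity / channel area; use sqrt(area)
    # as the characteristic width of the channel.
    particle_velocity = volume_velocity_cm3s / area_cm2
    return particle_velocity * math.sqrt(area_cm2) / NU_AIR

# [s]-like settings from the slides: ~330 cm^3/s through ~10 mm^2 (0.10 cm^2)
print(reynolds(330, 0.10) > RE_CRITICAL)   # True  -> turbulent: frication
# The same flow through a wide, vowel-like channel (~3 cm^2)
print(reynolds(330, 3.0) > RE_CRITICAL)    # False -> laminar: no frication
```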

Turbulence Sources
• For fricatives, turbulence is generated by forcing a stream of air at high velocity through either a narrow channel in the vocal tract or against an obstacle in the vocal tract.

• Channel turbulence

• produced when airflow escapes from a narrow channel and hits inert outside air

• Obstacle turbulence

• produced when airflow hits an obstacle in its path

Channel vs. Obstacle
• Almost all fricatives involve an obstacle of some sort.

• General rule of thumb: obstacle turbulence is much noisier than channel turbulence

• [f] vs. [ɸ]

• Also: obstacle turbulence is louder, the more perpendicular the obstacle is to the airflow

• [s] vs. [x]

• [x] is a “wall fricative”

Sibilants
• Alveolar, dental and post-alveolar fricatives form a special class (the sibilants) because their obstacle is the back of the upper teeth.

• This yields high intensity turbulence at high frequencies.

[ʃ] vs. [θ]

“shy” vs. “thigh”

Fricative Noise
• Fricative noise has some inherent spectral shaping

• …like “spectral tilt”

• Note: this is a source characteristic

• This resembles what is known as pink noise:

• Compare with white noise:
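
As an illustration (mine, not the lecture’s), one way to give white noise a pink-style spectral tilt is to scale its spectrum so that power falls off as 1/f:

```python
import numpy as np

fs, n = 16000, 2 ** 14
white = np.random.normal(0, 1, n)            # flat-spectrum source

spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1 / fs)
tilt = 1 / np.sqrt(np.maximum(freqs, 1.0))   # amplitude ~ 1/sqrt(f) => power ~ 1/f
pink = np.fft.irfft(spectrum * tilt, n)

# "pink" now slopes downward (about -3 dB per octave), while "white"
# stays flat; compare their spectra in Praat or with a spectrogram.
```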

Fricative Shaping
• The turbulence spectrum may be filtered by the resonating tube in front of the fricative.

• (Due to narrowness of constriction, back cavity resonances don’t really show up.)

• As usual, resonance is determined by length of the tube in front of the constriction.

• The longer the tube, the lower the “cut-off” frequency.

• A basic example:

• [s] vs. [ʃ]

“sigh” vs. “shy”

[s]
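
A quick worked example of the tube-length idea (my own numbers, purely illustrative): treat the cavity in front of the constriction as a tube closed at one end, whose lowest resonance is the quarter-wavelength frequency f = c / (4L).

```python
SPEED_OF_SOUND = 35000   # cm/s in warm, humid vocal-tract air (approximate)

def front_cavity_resonance(length_cm):
    # Lowest resonance of a tube closed at one end: f = c / (4 * L)
    return SPEED_OF_SOUND / (4 * length_cm)

# Hypothetical front-cavity lengths:
print(front_cavity_resonance(1.5))   # ~5800 Hz: short cavity, high cut-off, [s]-like
print(front_cavity_resonance(2.5))   # ~3500 Hz: longer cavity, lower cut-off, more [ʃ]-like
```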

Sampling Rates Revisited
• Remember: digital representations of speech can only capture frequency components up to half the sampling rate

• the Nyquist frequency

• Speech should be sampled at at least 44100 Hz

(although there is little frequency information in speech above 10,000 Hz)

• [s] has most of its acoustic energy at higher frequencies, from about 3500 to 10000 Hz

• Note: telephones sample at 8000 Hz

• 44100 Hz

• 8000 Hz
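
A short sketch (mine, not from the slides; uses scipy) of why telephone-rate sampling hurts [s]: a component above the new Nyquist frequency simply does not survive downsampling.

```python
import numpy as np
from scipy.signal import resample_poly

fs = 44100
t = np.arange(fs) / fs                 # one second of signal
x = np.sin(2 * np.pi * 6000 * t)       # a 6000 Hz component, typical [s] territory

y = resample_poly(x, up=80, down=441)  # 44100 Hz -> 8000 Hz ("telephone" rate)

# At 8000 Hz the Nyquist frequency is 4000 Hz, so the anti-aliasing filter
# removes the 6000 Hz component instead of passing it along.
print(np.abs(np.fft.rfft(y)).max())    # close to 0: the energy is gone
```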

Further Back

[xoma]

palatal vs. velar

• In more anterior fricatives, turbulence noise is generally shaped like a vowel made at the same place of articulation.

Even Further Back
• Examples from Hebrew:

At the Tail End
• [h] exhibits a lot of coarticulation

• [h] is not really a “fricative”;

• it’s more like a whispered or breathy voiced vowel.

“heed” “had”

Aspirated Fricatives
• Like stops, fricatives can be aspirated.

• [h] follows the supraglottal frication in the vocal tract.

• Examples from Chinese:

[tsa] vs. [tsʰa]

Back at the Ranch
• There is not much of a resonating filter in front of labial fricatives…

• so their spectrum is flat and diffuse

• (like bilabial stop release bursts)

• Note: labio-dentals are more intense than bilabial fricatives

• (channel vs. obstacle turbulence)

Fricative Internal Cues
• The articulatory precision required by fricatives means that they are less affected by context than stops.

• It’s easy for listeners to distinguish between the various fricative places on the basis of the frication noise alone.

• Result of both filter and source differences.

• Examples:

• There is, however, one exception to the rule…

Huh?
• The two most confusable consonants in the English language are [f] and [θ].

• (Interdentals also lack a resonating filter)