
RLE Progress Report Number 140


Chapter 1. Speech Communication

Academic and Research Staff

Professor Kenneth N. Stevens, Professor Jonathan Allen, Professor Morris Halle, Professor Samuel J. Keyser, Dr. Krishna K. Govindarajan, Dr. Helen M. Hanson, Dr. Joseph S. Perkell, Dr. Stefanie Shattuck-Hufnagel, Dr. Reiner Wilhelms-Tricarico, Seth M. Hall, Jennell C. Vick, Majid Zandipour

Visiting Scientists and Research Affiliates

Dr. Ashraf S. Alkhairy,1 Dr. Corine A. Bickley, Dr. Suzanne E. Boyce,2 Dr. Carol Y. Espy-Wilson,3 Dr. David Gow,4

Dr. Frank Guenther,5 Dr. Robert E. Hillman,6 Dr. Caroline Huang,7 Dr. Harlan Lane,8 Dr. John I. Makhoul,9 Dr. Sharon Y. Manuel,10 Dr. Melanie L. Matthies,11 Dr. Richard S. McGowan,12 Dr. Alice E. Turk,13 Dr. Rajesh Verma,14 Dr. Lorin F. Wilde,15 Astrid Hagen,16 Jane Wozniak17

Graduate Students

Helen Chen, Harold A. Cheyne, Jeung-Yoon Choi, Erika Chuang, Michael P. Harms, Mark A. Hasegawa-Johnson, Lekisha S. Jackson, Hong-Kwang J. Kuo, Kelly L. Poort, Adrienne Prahler, Andrew I. Russell, Janet L. Slifka, Jason L. Smith, Yong Zhang

Undergraduate Students

Laura C. Dilley, Emily J. Hanna, Dameon Harrell, Stefan H. Hurwitz, Mariya A. Ishutkina, Genevieve R. Lada, Teresa K. Lai, Adrian D. Perez, Dawn Perlner, Delsey M. Sherrill, Erik Strand, Jeremy Y. Vogelmann, Sophia C. Yuditskaya

Technical and Support Staff

Katherine W. Kwong, Arlene E. Wint

1 KACST, Riyadh, Saudi Arabia.

2 Department of Communication Disorders, University of Cincinnati, Cincinnati, Ohio.

3 Department of Electrical Engineering, Boston University, Boston, Massachusetts.

4 Department of Psychology, Salem State College, Salem, Massachusetts.

5 Department of Cognitive and Neural Systems, Boston University, Boston, Massachusetts.

6 Massachusetts Eye and Ear Infirmary, Boston, Massachusetts.

7 Altech, Inc., Cambridge, and Boston University, Boston, Massachusetts.

8 Department of Psychology, Northeastern University, Boston, Massachusetts.

9 Bolt, Beranek and Newman, Inc., Cambridge, Massachusetts.

10 Department of Communication Sciences and Disorders, Emerson College, Boston, Massachusetts.

11 Department of Communication Disorders, Boston University, Boston, Massachusetts.

12 Sensimetrics Corporation, Cambridge, Massachusetts.

13 Department of Linguistics, University of Edinburgh, Edinburgh, Scotland.

14 CEERI Centre, CSIR Complex, New Delhi, India.

15 Lernout and Hauspie Speech Products, Burlington, Massachusetts.

16 University of Erlangen-Nurnberg, Erlangen, Germany.

17 Concord Area Special Education (CASE) Collaboratives, Concord, Massachusetts.


Sponsors

C.J. Lebel Fellowship

Dennis Klatt Memorial Fund

National Institutes of Health

Grant R01-DC00075
Grant R01-DC01291
Grant R01-DC01925
Grant R01-DC02125
Grant R01-DC02978
Grant R01-DC03007
Grant R29-DC02525-01A1
Grant F32-DC00194
Grant F32-DC00205
Grant T32-DC00038

National Science Foundation

Grant IRI 93-14967 (under subcontract to Stanford Research Institute (SRI) Project ECU 5652)

Grant INT 94-21146 (U.S.-India Cooperative Science Program)

1.1 Studies of the Acoustics, Perception, Synthesis, and Modeling of Speech Sounds

1.1.1 Glottal Characteristics of Female and Male Speakers: Data from Acoustic and Physiological Recordings

The configuration of the vocal folds during the production of both normal and disordered speech has an influence on the voicing source waveform and, thus, affects perceived voice quality. Voice quality contains both linguistic and nonlinguistic information which listeners utilize in their efforts to understand spoken language and to recognize speakers. In clinical settings, voice quality also relays information about the health of the voice-production mechanism. The ability to glean voice quality from speech waveforms has implications for computer-based speech recognition, speaker recognition, and speech synthesis, and may be of value for diagnosis or treatment in clinical settings.

Theoretical models have been used to predict how changes in vocal-fold configuration are manifested in the output speech waveform. In previous work, we used acoustic-based methods to study variations in vocal-fold configuration among female speakers (nondisordered). We found a good correlation among the acoustic parameters and perceptions of breathy voice. Preliminary evidence gathered from fiberscopic images of the vocal folds during phonation suggests that these acoustic measures may be useful for categorizing the female speakers by vocal-fold configuration. That work has been extended in two ways.

First, we collected data from 21 male speakers (also nondisordered). Because male speakers are less likely than female speakers to have posterior glottal openings during phonation, we expect to find less parameter variation among male speakers, in addition to significant differences in mean values, when compared with the results from females. The data are consistent with those expectations. The average values of the acoustic parameters were significantly lower for males than for females. Although there are individual differences among the male speakers, the variation is smaller than that among female speakers. These differences in mean and variation are stronger for some parameters than for others. Several of the subjects displayed evidence of a secondary excitation, possibly occurring at glottal opening, which was not evident in the data from the female speakers. This observation is also in line with gender-based differences in glottal configuration.

In a further attempt to verify the correlation among the acoustic measures and actual glottal configurations, images of the vocal folds during phonation were collected in collaboration with the Research Institute for Logopedics and Phoniatrics at Tokyo University. The images were obtained using an endoscope and were recorded via a high-speed digital imaging system. As with the earlier fiberscopic data, preliminary analysis suggests that female speakers with relatively large posterior glottal openings also have relatively large degrees of spectral tilt and wider first-formant bandwidths.

1.1.2 Cues for Consonant Voicing for Postvocalic Consonants: The Writer-Rider Distinction

It is generally assumed that the voicing distinction of the alveolar consonant in writer versus rider is neutralized in the flap, but that the distinction is cued by the duration of the preceding vowel. As part of a larger study of the voicing distinction for consonants, we have been examining in detail the acoustic characteristics of the stressed vowel and the consonant in contrasting words like writer/rider, doughty/dowdy, ricing/rising, etc., produced by several talkers. In the case of writer/rider, the measured acoustic attributes include vowel length, time course of the first two formants in the vowel, and certain acoustic properties of the burst. An unexpected finding was that the trajectories of the formants of the diphthong were different for the two words. The offglide of the diphthong toward /i/ was more extreme for writer than for rider; that is, the first formant frequency was lower and the second formant frequency was higher immediately preceding the flap for writer than for rider. A perception experiment was prepared with synthesized versions of these words, in which the offglides of the formants were systematically varied. This experiment verified that listeners used these offglides to distinguish between writer and rider.

One interpretation of this result is that the more extreme vowel offglide for writer is a manifestation of a more extreme pharynx expansion, creating a condition that prevents further expansion during the following consonant. Since expansion of the vocal-tract volume is a prerequisite for a voiced obstruent consonant, the more extreme offglide creates a condition that inhibits voicing during the consonant. For rider, the pharyngeal expansion in the offglide is less extreme, and there is room for further expansion in the consonant. Similar patterns for the vowel offglide are also observed for pairs like ricing/rising and doughty/dowdy. This method of enhancing the voicing contrast for postvocalic obstruent consonants can apparently be added to the inventory of cues for voicing in such consonants.

1.1.3 Burst and Formant Transition Cue Integration and Spectral Asynchrony in Stop Consonant Perception

Two primary cues, formant transitions and burst cues, have been implicated in the perception of stop consonants. However, the exact nature of these cues, and how they are integrated, remains unknown. This study investigated the interaction and representation of these cues through two experiments.

In the first experiment, listeners identified synthetic syllable-initial consonant-vowel (CV) stimuli in which the second (F2) and third (F3) formant transitions and the burst were independently varied in both a front vowel (/e/) and a back vowel (/a/) context. The resulting identification surface shows similarities across the four subjects tested. When there was no burst cue, most listeners were reliably able to identify /b, d/ in the back vowel context and /b, g/ in the front vowel context, and most listeners had a difficult time identifying /g/ in the back vowel context, while half of the listeners had difficulty identifying /d/ in the front vowel context. With the addition of the burst, listeners' percepts were more consistent and showed a trading relation between the burst cue and the formant transition cue, especially for /d, g/. In the back vowel context, the influence of the burst was to increase /g/ responses for formant transitions corresponding to a /g/ percept, and to increase /g/ responses for formant transitions corresponding to a /d/ percept, when the burst frequency is close to F2 (near 1500 Hz). In addition, for a high burst center frequency, most subjects tended to hear /d/ even if the formant transitions corresponded to a /g/ percept. In the front vowel context, two of the subjects showed little influence of the burst. The other two listeners showed a trading relationship between /d/ and /g/ when the burst was near 1800 Hz, i.e., near F2.

Based on the identification surfaces, it was hypothesized that the burst center frequency is less critical to identification than which formant the burst is near, e.g., F2. Thus, a second experiment was performed using burstless /Ca/ stimuli in which the F2 transition was varied. In addition, the stimuli either had one formant start prior to the other formants or had all formants start at the same time. The results show that when F2 started prior to the other formants, listeners identified all stimuli as /g/ even though the formant transition cue corresponded to /b/ or /d/. However, if F3 started ahead of the other formants, then listeners based their identification primarily on the formant transition cue. Listeners appear to be interpreting the burst as a formant that starts prior to the other formants, i.e., a "leading" formant. Thus, the results suggest that listeners identify stop consonants based on spectral asynchrony in conjunction with formant transition cues.

1.1.4 A Longitudinal Study of Speech Production

Measurements have been made on a corpus of bisyllabic nonsense utterances spoken by three adult male talkers on two occasions separated by 30 years. The 1960 and 1990 recordings were processed similarly so that comparisons could be made directly. For the vowels, properties included measures of spectrum tilt, formant frequencies, fundamental frequency, and duration. Consonantal measures included durations, spectra of frication noise for stop bursts and fricatives, spectra of aspiration noise for voiceless stop consonants, and spectra of nasal consonants and of the adjacent nasalized vowels. In informal listening tests, subjects were presented with utterances from the two recording sessions and asked to determine which utterances were made by the older speakers.

For the most part, the acoustic properties remained remarkably stable over the 30-year interval. The stable properties included vowel and consonant durations and vowel formant frequencies. The fundamental frequency increased by a small but consistent amount for all talkers (2 to 8 Hz). Two talkers showed a significant decrease in high-frequency amplitude for vowels, suggesting a reduction in the abruptness of the glottal source at the time of glottal closure during phonation. For all talkers there was a reduction in high-frequency amplitude of aspiration noise in voiceless aspirated stop consonants. One talker also showed a reduction (10 dB or more) in high-frequency amplitude of frication noise in the tongue-blade consonants /d/, /s/, and /ʃ/. Spectrum measurements in nasal consonants indicated that for one talker there were changes in the acoustic properties of the nasal cavity between the two recording times. Listeners were somewhat inconsistent in judging which utterances were produced at the later age, but were more successful in making this judgment for the talkers who showed an increased spectral tilt with age.

The number of talkers is, of course, too small to provide reliable data on changes in the acoustic characteristics of speech with age. The data do show, however, that large individual differences can be expected in the attributes that undergo change, and in the amount of change of these attributes. It should also be noted that attributes such as duration patterns, fundamental frequency, and formant frequencies were quite different for the three talkers, and these differences that were unique to the talkers did not change over the 30-year period.

1.1.5 MEG Studies of Vowel Processing in Auditory Cortex

In collaborative research between MIT, the University of Delaware, the University of Maryland, and the University of California, San Francisco, the brain imaging technique magnetoencephalography (MEG) is being used to determine which attributes of vowels give rise to distinctive responses in auditory cortex. Previously, it was hypothesized that the M100 latency (the peak neuromagnetic activity that occurs at approximately 100 ms after the stimulus onset) was a function of the first formant frequency (F1) and not the fundamental frequency (F0). This hypothesis was derived from three-formant vowels, single-formant vowels, and two-tone complexes that matched F0 and F1 in the single-formant vowels.

In 1996, Ragout and LePaul-Ercole presented two-formant vowels to subjects and found that the M100 latency tracked the fundamental frequency (F0) and not F1. However, their stimuli did not match normal speech: they used stimuli in which the amplitude of F2 was 20 dB greater than the amplitude of F1, leading to a "duck-like" sound. In order to reconcile Ragout and LePaul-Ercole's results with the aforementioned results, a set of two-formant vowels with varying formant amplitudes was presented to subjects. Preliminary results indicate that when the second formant amplitude becomes greater than the first formant amplitude by 6 dB, the M100 no longer tracks F1 but tracks F0. Thus, the M100 latency shifts from tracking F1 to F0 as the amplitude of F2 becomes greater than that of F1, i.e., as the stimuli become less speechlike. Future experiments are planned to determine if this result provides evidence for a special speech processing module in the brain.

1.1.6 Synthesis of Hindi

A collaborative project on rule-generated synthesis of Hindi speech has been initiated with the Central Electronics Engineering Research Institute (CEERI) Center in New Delhi, under support from the Division of International Programs of the National Science Foundation. The project at CEERI has led to the development of an inventory of Hindi syllables synthesized with a Klatt formant synthesizer, and the formulation of procedures for concatenating the control parameters for these syllables to produce continuous utterances. Some collaboration in the fine-tuning of the syllables and in the formulation of rules for intonation has been provided by the Speech Communication group.

1.1.7 Modeling and Synthesis of the Lateral Consonant /l/

The lateral consonant in English is generally produced with a backed tongue body, a midline closure of the tongue blade at the alveolar ridge, and a path around one or both of the lateral edges of the tongue blade. In prevocalic lateral consonants, the release of the closure causes a discontinuity in the spectral characteristics of the sound. Past attempts to synthesize syllable-initial lateral consonants using formant changes alone to represent the discontinuity have not been entirely satisfactory. Data from prior research have shown rapid changes not only in the formant frequencies but also in the glottal source amplitude and spectrum and in the amplitudes of the formant peaks at the consonant release. New measurements of these parameters have been made from additional recordings of a number of speakers. The measurements have been guided by models of lateral-consonant production. Based on these data, new attempts at synthesis have incorporated changes in source amplitudes, formant bandwidths, and the location of a pole-zero pair. Including these additional parameters improves the naturalness of the synthesized lateral-vowel syllables in initial perception tests.

1.1.8 Modeling and Synthesis of Nasal Consonants

During the production of nasal consonants, the transfer function of the vocal tract contains zeros and poles with certain bandwidths and frequencies which change as a function of time. We are trying to refine the theory of how these values change. An objective is to develop improved methods for synthesizing these consonants, methods which are consistent with the theory. Some analysis of the spectrum changes in utterances of intervocalic nasal consonants has been carried out, and these data have been used to refine the parameters for numerical simulation of models of the vocal and nasal tracts. Synthesis of some brief utterances was carried out based on these simulations. The synthesized utterances sounded better than synthesis done using only the old methods. More work is being done to determine what acoustic characteristics of nasal consonants are perceptually important, so that focus can be directed to only the aspects of the theory that are most salient.

1.2 Studies of Normal Speech Production

1.2.1 Experimental Studies Relating to Speech Clarity, Rate, and Economy of Effort

Clarity Versus Economy of Effort in Speech Production I: A Preliminary Study of Intersubject Differences and Modeling Issues

This study explores the idea that clear speech is produced with greater "articulatory effort" than normal speech. Kinematic and acoustic data were gathered from seven subjects as they pronounced multiple repetitions of utterances in different speaking conditions, including normal, fast, clear, and slow. Data were analyzed within a framework based on a dynamical model of single-axis frictionless movements, in which peak movement speed is used as a relative measure of articulatory effort. There were differences in peak movement speed, distance, and duration among the conditions and among the speakers. Three speakers produced "clear" utterances with movements that had larger distances and durations than those for "normal" utterances. Analyses of these data within a peak-speed, distance, duration "performance space" indicated increased effort (reflected in greater peak speed) in the clear condition for the three speakers. The remaining speakers used other combinations of parameters to produce the clear condition. The validity of the simple dynamical model for analyzing these complex movements was considered by examining several additional parameters. Some movement characteristics departed from those required for the model-based analysis, presumably because the articulators are structurally complicated and interact with one another mechanically. More refined tests of control strategies for different speaking styles will depend on future analyses of more complicated movements with more realistic models.
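As a rough illustration of the performance-space quantities referred to above, the following Python sketch (not the group's analysis code; the sampling rate, segmentation into single movements, and variable names are assumptions) computes distance, duration, and peak speed for one one-dimensional movement trace.

import numpy as np

def movement_kinematics(position, fs):
    # position: 1-D array of articulator position (e.g., in mm) for a single
    # movement from onset to offset (segmentation assumed done elsewhere).
    # fs: sampling rate in Hz.
    position = np.asarray(position, dtype=float)
    duration = (len(position) - 1) / fs              # seconds
    distance = abs(position[-1] - position[0])       # mm (net displacement)
    speed = np.abs(np.gradient(position, 1.0 / fs))  # mm/s
    return distance, duration, speed.max()

# Example: a smooth 12-mm movement lasting 150 ms, sampled at 500 Hz.
fs = 500
t = np.linspace(0.0, 0.15, int(0.15 * fs) + 1)
pos = 12.0 * (1.0 - np.cos(np.pi * t / t[-1])) / 2.0
d, dur, v = movement_kinematics(pos, fs)
print("distance %.1f mm, duration %.0f ms, peak speed %.0f mm/s" % (d, dur * 1000, v))

Each movement then contributes one (duration, distance, peak speed) point to a speaker- and articulator-specific performance space.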


Clarity Versus Economy of Effort in Speech Production II: Kinematic Performance Spaces for Cyclical and Speech Movements

This study was designed to test the hypothesis that the kinematic manipulations used by speakers to control clarity are influenced by kinematic performance limits. A range of kinematic parameter values was elicited by having the same seven subjects produce cyclical CV movements of the lips, tongue blade, and tongue dorsum (/ba/, /da/, /ga/) at rates ranging from 1 to 6 Hz. The resulting measures were used to establish speaker- and articulator-specific kinematic performance spaces, defined by movement duration, displacement, and peak speed. These data were compared with speech movement data produced by the subjects in several different speaking conditions in the preceding study. The amount of overlap of the speech data and cyclical data varied across speakers from almost no overlap to complete overlap. Generally, speech movements were larger for a given movement duration than cyclical movements, indicating that the speech movements were faster and produced with greater effort, according to the performance space analysis. It was hypothesized that the cyclical movements of the tongue and lips were slower than the speech movements because they were more constrained by (coupled to) the relatively massive mandible. To test this hypothesis, a comparison was made of cyclical movements in maxillary versus mandibular frames of reference. The results indicate that the cyclical movements were not strongly coupled to mandible movements. The overall results indicate that the cyclical task did not succeed in defining the upper limits of kinematic performance spaces within which the speech data were confined for most speakers. Thus, the performance limits hypothesis could not be tested effectively. The differences between the speech and cyclical movements may be due to other factors, such as differences in speakers' "skill" with the two types of movement.

Variations in Speech Movement Kinematics and Temporal Patterns of Coarticulation with Changes in Clarity and Rate

This study tests the hypothesis that the relative timing of articulatory movements at sound segment boundaries is conditioned by a compromise between economy of effort and a requirement for clarity. On the one hand, articulatory movements (such as lip rounding movements from /i/ to /u/ in /iC(Cn)u/) are programmed to minimize effort (e.g., peak velocity); therefore, they cannot be too fast. On the other hand, if the movements are too slow, they begin too early or end too late (with respect to the /iC/ and /Cu/ boundaries) and produce less distinct vowels. Movement (EMMA) and acoustic data were collected from the same seven subjects. The speech materials were designed to investigate coarticulation in movements of the lips and of the tongue. They included V1C(Cn)V2 sequences embedded in carrier phrases, in which V1 and V2 were /i/ and /u/. For example: "Say leaked coot after it." (for lip movements), "Say he moo after it." (for tongue movements). They were spoken in three conditions: normal, clear, and fast. Each subject recorded about 1100 tokens. The analyses focused on the amount of coarticulation (overlap) of the /i-u/ transition movement within the acoustic interval of the /i/, along with several other measures. Consonant string duration was longest for the clear condition and shortest for the fast condition. Peak velocities were higher in the fast and clear conditions than in the normal condition. The coarticulation effects were small and were observed more for the lip than the tongue-body movements. Generally, there was more overlap in the fast condition than in the normal condition, but not less overlap in the clear condition than in the normal condition. The effects of overlap on formant values were small. Thus, producing the clear condition involved increases of consonant string duration and peak velocity but not coarticulation differences. While there were some small increases in coarticulation in the fast condition, they did not seem to affect the spectral integrity of the /i/. Even though there was evidence of increased effort (as indexed by peak velocity) in the clear and fast conditions, the hypothesized effects of a tradeoff between clarity and economy of effort were minimally evident in formant values for /i/ and measures of coarticulation (overlap).

Interarticulator Coordination in Achieving Acoustic-phonetic Goals: Motor Equivalence Studies

These studies are based on our preliminary findings of "motor equivalent" trading relations between lip rounding and tongue-body raising for the vowel /u/, which we have interpreted as supporting our hypothesis that speech motor control is based on acoustic goals. We have hypothesized that when two articulators contribute to producing an acoustic cue and the planned movement of one of the articulators might make the resultant acoustic trajectory miss the goal region for the cue, a compensatory adjustment is planned in the movement of the other articulator to help keep the acoustic trajectory within the goal region. Furthermore, due to economy of effort, the compensation is limited to an amount that makes the acoustic trajectory just pass through the edge of the goal region on its way to the next goal. Thus, we expect to observe such compensatory covariation mainly among tokens near the edge of the goal region, i.e., less canonical tokens. We also hypothesize that the most canonical tokens (near the center of the acoustic goal region) are produced with "cooperative coordination." A canonical token of /u/ would be produced by cooperative lip rounding and tongue body raising movements. Overall, we expect to find positive correlations of lip protrusion and tongue body raising among tokens of /u/ that are acoustically most canonical and negative correlations among tokens that are least canonical.

We have tested this hypothesis for the sounds /u/, /r/, and /ʃ/ pronounced in carrier phrases by the seven speakers. These sounds were chosen because they are produced with independently controllable constrictions formed by the tongue and by the lips, making it possible to look for motor-equivalent behavior with correlation analysis. Each subject pronounced a total of about 650 tokens containing the sounds embedded in carrier phrases. To avoid biasing the correlations used to test the hypothesis through partitioning the data set into more and less canonical tokens, we attempted to create less canonical subsets a priori, with manipulations of phonetic context and speaking condition.

In one analysis approach, we have extracted and analyzed all of the articulatory data (mid-sound tongue and lip transducer positions) and the acoustic data for /u/ (formants) and /ʃ/ (spectral median and symmetry).
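The correlation analysis itself can be illustrated with a minimal sketch. The variable names, the canonical/less-canonical split, and the use of a Pearson correlation are assumptions for illustration, not the group's actual procedure.

import numpy as np

def motor_equivalence_correlations(lip_protrusion, tongue_height, canonical_mask):
    # Correlate lip protrusion with tongue-body height separately for
    # more-canonical and less-canonical /u/ tokens.  A positive correlation
    # is read as cooperative coordination; a negative one as compensatory
    # (motor-equivalent) coordination.
    lip = np.asarray(lip_protrusion, dtype=float)
    tongue = np.asarray(tongue_height, dtype=float)
    mask = np.asarray(canonical_mask, dtype=bool)
    r_canonical = np.corrcoef(lip[mask], tongue[mask])[0, 1]
    r_less_canonical = np.corrcoef(lip[~mask], tongue[~mask])[0, 1]
    return r_canonical, r_less_canonical

# Synthetic illustration: canonical tokens covary positively,
# less-canonical tokens trade off against each other.
rng = np.random.default_rng(0)
n = 60
base = rng.normal(0.0, 1.0, n)
lip = np.concatenate([base + rng.normal(0, 0.3, n),    # cooperative
                      -base + rng.normal(0, 0.3, n)])  # compensatory
tongue = np.concatenate([base, base])
mask = np.array([True] * n + [False] * n)
print(motor_equivalence_correlations(lip, tongue, mask))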

The acoustic measures indicate that we were only partly successful in eliciting subsets that were more and less canonical. There is evidence supporting motor equivalence for all three sounds. For /u/, the findings are generally related to how canonical the tokens were: in most cases there was compensatory coordination (motor equivalence) for less canonical tokens and cooperative coordination for more canonical tokens. For /ʃ/, there were significant correlations in 28% of the possible cases. All reflected motor equivalence: when the tongue blade was further forward, the lips compensated with more protrusion, presumably to maintain a front cavity that was large enough to achieve a low spectral center of gravity. When there was a difference between a subject's utterance subsets in how canonical the tokens were, the motor-equivalent tokens were less canonical. No evidence of cooperative coordination was found for /ʃ/.

A study of acoustic variability during American English /r/ production also tests the hypothesis that speakers utilize an acoustic, rather than articulatory, planning space for speech production. Acoustic and articulatory recordings of the seven speakers reveal that speakers utilize systematic articulatory tradeoffs to maintain acoustic stability when producing the phoneme /r/. Distinct articulator configurations used to produce /r/ in various phonetic contexts show systematic tradeoffs between the cross-sectional areas of different vocal tract sections. Analysis of acoustic and articulatory variabilities reveals that these tradeoffs act to reduce acoustic variability, thus allowing large contextual variations in vocal tract shape; these contextual variations in turn apparently reduce the amount of articulatory movement required, in keeping with the principle of economy of effort in speech production.

1.2.2 Physiological Modeling of Speech Production

Studies of Vocal-tract Anatomy

In cooperation with Dr. Chao-Min Wu at the University of Wisconsin, software was developed for the interactive visualization of, and mapping between, two data sets: (1) anatomical sections from the Visible Human project and (2) a series of detailed drawings of histological sections of the tongue. One part of the software compiles a set of images into a stack and computes spatial display sections through the data. Another component can be used to identify homologous points between two 3D image data sets. These point pairs calibrate a mapping between the data sets to partially integrate them. These methods were incorporated into the current work, described below.

Prototyping an Accurate Finite-element Model of the Vocal Tract

Progress was made in prototyping an accurate finite-element model of the tongue and floor of the mouth, based mainly on cryo-section data from the Visible Human project. Software was developed for various visualization and measurement tasks. For example, multiple arbitrarily oriented cross sections of a 3D stack of section images are combined into a 3D view, making it possible to capture measurements of vocal-tract morphology from the cryo-section data. Commercial visualization and measurement programs are used for drafting a topologically accurate model, and locations of node points and spline control points are imported from the programs. These techniques will be used to incorporate data on individual vocal-tract morphology from MR images into vocal-tract simulations.

1.2.3 Theoretical Developments in Speech Production

In a theoretical paper on speech motor control, an overview and supporting data are presented about the control of the segmental component of speech production. Findings of "motor-equivalent" trading relations between the contributions of two constrictions to the same acoustic transfer function provide preliminary support for the idea that segmental control is based on acoustic or auditory-perceptual goals. The goals are determined partly by nonlinear, quantal relations (called "saturation effects") between motor commands and articulatory movements and between articulation and sound. Since processing times would be too long to allow the use of auditory feedback for closed-loop error correction in achieving acoustic goals, the control mechanism must use a robust "internal model" of the relation between articulation and the sound output that is learned during speech acquisition.

Studies of the speech of cochlear implant and bilateral acoustic neuroma patients provide evidence supporting two roles for auditory feedback in adults: (1) maintenance of the internal model, and (2) monitoring the acoustic environment to help assure intelligibility by guiding relatively rapid adjustments in "postural" parameters underlying average sound level, speaking rate, and amount of prosodically based inflection of F0 and SPL.

1.2.4 Laryngeal Behavior at the Beginning of Breath Groups During Speech

Each person speaks with a particular timing that obviously depends on his or her linguistic intent but also must depend on the physical system generating the speech. Speech is created with an aerodynamic energy source that must be repeatedly replenished. This repeated activity imposes gross timing constraints on speech, and the mechanics of creating this source may impose particular timing effects at the beginning and end of each breath. We have initiated an experimental study of the influence of respiratory constraints on the temporal patterns of speech. First, we are examining how utterances are initiated at the beginnings of breath groups.

When a person draws in a breath, the air pressure inside the lungs is less than atmospheric pressure, creating a pressure gradient that causes air to move into the lungs. To produce speech, the speaker needs to compress the air in the lungs to increase the pressure while managing to retain a volume of air in the lungs during that compression. Multiple strategies could be used by the speaker to meet these requirements. Pressure may be built up behind a complete blockage of the airway, such as closure of the glottis or lips. Pressure may also be built up behind an increased impedance of the vocal tract as its configuration is adjusted in anticipation of creating a speech segment. The particular method used may depend on the type of sound to be produced, the length of the utterance, the location of emphasis in the utterance, and the relative timing of the muscles of respiration.

In an attempt to determine which method or methods the speaker uses, concurrent recordings of the acoustic signal, the signal from an electroglottograph (which provides an estimate of the times of glottal openings and closings), a lung volume estimate (thorax and abdomen measures), and an unstrobed endoscopic videotape of the larynx were collected at the facilities of the Voice Laboratory at the Massachusetts Eye and Ear Infirmary under the direction of Dr. Robert Hillman. Preliminary data show that at the initiation of some utterances, a speaker creates a glottal closure, which is then released when the articulators are in place for the initial segment of the utterance.

1.3 Speech Research Relating to Special Populations

1.3.1 Speech Production of Cochlear Implant (CI) and Bilateral Acoustic Neuroma (NF2) Patients

Longitudinal Studies

We have made three baseline pre-implant recordings on each of five additional CI subjects in the second year of this project. Additionally, all six research subjects have returned for post-implant recordings one week, one month, and three months following processor activation. Three of our six implant subjects have returned for their six-month visits, and the remaining three subjects will complete six-month recordings by June. A total of 38 recordings have been made this year with CI subjects. Over 75% of these recordings have been digitized, and over 50% of the data from these recordings have been analyzed.

Short-term Stimulus Modification Studies

One CI subject has participated in a "stimulus modification" experiment. During the recording of the subject's speech, an experimental speech processor was used to modify the subject's auditory feedback. Feedback alternated between the subject's regular program and another program that simulated no hearing. This recording has been digitized and the data are currently being analyzed.

Perceptual Studies

All six CI subjects participated in this study at the time of each speech recording. Subjects are asked to discriminate eleven consonants and eight vowels from the natural speech of a same-gender speaker. We anticipate that this test of perceptual ability will result in a diagnostic measure of perceptual benefit from the implant and subsequently support or disconfirm our hypotheses regarding relations between speech perception and production.

Correlates of Posture

Four hypotheses inspired by our theory of the role of hearing in speech production were tested using acoustic and aerodynamic measures from seven CI subjects, three speakers who had a severe reduction in hearing following surgery for Neurofibromatosis-2 (NF2), and one hard-of-hearing control speaker. These speakers made recordings of the Rainbow Passage before and after intervention. Evidence was found that supports the four hypotheses:

1. Deafened speakers who regain some hearing from cochlear prostheses will minimize effort with reductions in speech sound level.

2. To help reduce speech sound level, they will assume a more open glottal posture.

3. They will also minimize effort by terminating respiratory limbs closer to tidal-end respiratory level, FRC.

4. An effect of postural changes will be to change average values of air expenditure toward normative values.

Coarticulation

Less coarticulation is presumably a feature of clear speech, and our theory predicts that deafened speakers will engage in clear speech. Therefore, we predict that deafened adults will show less coarticulation before and more coarticulation after their implant speech processors have been turned on. We have begun to test this prediction by extracting values of second formant frequency from readings of the vowel inventory in /bVt/ and /dVt/ contexts by seven implant users. Initial findings with one adult male reveal a statistically reliable overall increase in coarticulation following the activation of his implant speech processor. In particular, the vowels /i/, /a/, and /u/ showed marked and reliable increases in coarticulation in sessions following the first session after activation of the processor.

1.3.2 Modeling and Analysis of Vowels Produced by Speakers with Vocal-fold Nodules

Speakers with vocal-fold nodules commonly use greater effort to speak. This increased effort is reflected in their use of higher than normal subglottal pressures to produce a particular sound pressure level (SPL). The SPL may be low because of decreased maximum flow declination rate (MFDR) of the glottal flow waveform, increased first formant bandwidth, or increased spectral tilt.

A parallel acoustic and aerodynamic study of selected patients with vocal nodules has been conducted. At comfortable voice, aerodynamic features are significantly different on average for the nodules group (at p = 0.001), whereas acoustic features are not significantly different (at p = 0.001) except for SPL and H1 - A1, the amplitude difference between the first harmonic (H1) and the first formant prominence (A1). Pn, defined as the percentage of data associated with speakers with nodules which are more than 1.96 standard deviations from the normal mean, is a measure of how well separated the two populations are for a particular feature. Even though SPL and H1 - A1 are significantly different statistically, the effect size is small, as indicated by the small Pn (21 and 2 percent, respectively). The difference in means for SPL is 1.8 dB and for H1 - A1 is 2.3 dB. In contrast, Pn for the aerodynamic features ranges from 8 to 21 percent. Ranked in order of Pn, the aerodynamic features are subglottal pressure, average flow, open quotient, minimum flow, AC flow, and MFDR. These observations show that it is easier to differentiate the nodules group from the normal group using aerodynamic rather than acoustic features. By performing linear regression on the acoustic features with SPL as the independent variable, the two groups can be better separated. For example, the difference in group means for H1 - A1 increases to 3.9 dB from 2.3 dB, and Pn increases to 18 from 2 percent. The results of the aerodynamic measures agree with previous findings.
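The separation measure Pn can be stated compactly: for a given feature, it is the fraction of nodule-group values lying more than 1.96 normal-group standard deviations away from the normal-group mean. A minimal Python sketch (function names and the example values are our own illustration, not data from the study):

import numpy as np

def pn_separation(normal_values, nodule_values):
    # Percentage of nodule-group values lying more than 1.96 normal-group
    # standard deviations away from the normal-group mean.
    normal = np.asarray(normal_values, dtype=float)
    nodules = np.asarray(nodule_values, dtype=float)
    mu, sd = normal.mean(), normal.std(ddof=1)
    outside = np.abs(nodules - mu) > 1.96 * sd
    return 100.0 * outside.mean()

# Example with made-up subglottal pressure values (cm H2O).
normal = [6.1, 6.8, 7.0, 6.4, 7.2, 6.6, 6.9, 7.1]
nodules = [7.9, 8.4, 9.1, 7.2, 8.8, 9.5, 7.6, 8.2]
print("Pn = %.0f%%" % pn_separation(normal, nodules))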

The presence of a glottal chink can widen the first formant bandwidth. However, based on the glottal chink areas estimated from the minimum glottal flow, the bandwidth is increased on average by only a factor of 1.1. It is not clear from the acoustic features whether the spectral tilt is increased. After regression with SPL, the mean H1 - A3 (where A3 is the amplitude of the third formant prominence) is 2.4 dB larger and the mean A1 - A3 is 1.7 dB smaller for the nodules group compared with the normal group.

A modified two-mass model of vocal-fold vibration is proposed to simulate vocal folds with nodules. This model suggests that the MFDR can be decreased because of increased coupling stiffness between the masses and the presence of nodules, which interfere with the normal closure activity of the vocal folds. The model also suggests that increasing the subglottal pressure can compensate for the reduced MFDR, but the energy dissipated due to collision is increased, implying greater potential for trauma to the vocal-fold tissues.

In summary, the greater effort used by speakers with vocal nodules results in differences in the aerodynamic features of their speech. However, the acoustic features show a smaller difference, reflecting the achievement of relatively good sound characteristics despite aberrant aerodynamics. A modified two-mass model is able to explain the need for higher subglottal pressures to produce a particular SPL, and it also demonstrates increased trauma potential to the vocal folds as a result of increased pressures.

1.3.3 Fricative Consonant Production by Some Dysarthric Speakers: Interpretations in Terms of Models for Fricatives

The aim of this research is to develop an inventory of noninvasive acoustic measures that can potentially provide a quantitative assessment of the production of /s/ by talkers with speech motor disorders. It is part of a broader study whose goal is to develop improved models of the speech production of normal as well as dysarthric speakers and to use these models to interpret acoustic data obtained from the utterances of those speakers.

Utterances of eight dysarthric speakers and two normal speakers were used for this study. In particular, one repetition of each of nine words with an initial /s/ sound was selected from a larger database, which was available from the previous doctoral thesis work of Hwa-Ping Chang. The intelligibility of the words had been determined previously by Chang.

Eight measurements were made on the fricative portion of each of the words. These included three measurements that assessed the spectrum shape of the fricative and its spectrum amplitude in relation to the vowel, and five estimates of the degree to which the onset, offset, and time course of the fricative deviated from the normal pattern. These five estimates were made on a three-point scale and were based on observations of spectrograms of the utterances.

For purposes of analysis, the dysarthric speakers were divided into two groups according to their overall intelligibility: a low-intelligibility group (word intelligibility in the range 55-70 percent) and a high-intelligibility group (80-98 percent). The contributions of all eight measures to this classification were statistically significant. The measure that showed the highest correlation with intelligibility was a spectral measure (in the fricative) giving the difference (in dB) between the amplitude of the largest spectrum peak above 4 kHz (i.e., above the third-formant range) and the average spectrum peak corresponding to the second and third formants.
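As an illustration of that highest-ranking spectral measure, the sketch below computes the difference in dB between the largest spectral peak above 4 kHz and the average of the peaks in assumed second- and third-formant regions, from a single windowed fricative segment. The band boundaries, the windowing, and the function names are assumptions for the sketch, not the measurement procedure used in the study.

import numpy as np

def fricative_high_vs_f2f3(signal, fs, f2_band=(1000.0, 2500.0), f3_band=(2500.0, 4000.0)):
    # dB difference between the largest spectral peak above 4 kHz and the
    # mean of the spectral peaks in assumed F2 and F3 bands, for one token.
    x = np.asarray(signal, dtype=float) * np.hanning(len(signal))
    spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(x)) + 1e-12)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

    def band_max(lo, hi):
        band = (freqs >= lo) & (freqs < hi)
        return spec_db[band].max()

    a_high = band_max(4000.0, fs / 2.0)
    a_f2f3 = 0.5 * (band_max(*f2_band) + band_max(*f3_band))
    return a_high - a_f2f3

# Example: noise with a crude high-frequency emphasis, mimicking /s/ frication.
rng = np.random.default_rng(1)
fs = 16000
shaped = np.diff(rng.normal(size=4096), prepend=0.0)
print("peak(>4 kHz) - mean(F2, F3 peaks) = %.1f dB" % fricative_high_vs_f2f3(shaped, fs))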

A task for future studies is to automate the measurements that are based on the scaling methods used in this research and to apply the measures to a larger population of speakers and utterances. The objective measures should also be related to clinicians' assessments along several different dimensions. The ultimate aim is to specify an inventory of quantitative acoustic measures that can be used to accurately assess the effectiveness of interventions as well as the amount of speech degeneration in speakers with neuromotor disorders.


1.3.4 Stop-consonant Production by Dysarthric Speakers: Use of Models to Interpret Acoustic Data

Acoustic measurements have been made on stop consonants produced by several normal and dysarthric speakers. The acoustic data were previously recorded by Hwa-Ping Chang and Helen Chen at MIT. In the present study, various aspects of production following release of the oral closure were quantified through the use of acoustic measures such as spectra and durations of noise bursts and aspiration noise, as well as shifts in frequencies of spectral prominences. Through comparison of these measurements from the normal and dysarthric speech, and based upon models of stop-consonant production, inferences were drawn regarding articulator placement (by examining burst spectra), rate of articulator release (from burst duration), tongue-body movements (from formant transitions), and vocal-fold state (from low-frequency spectra). The dysarthric speakers deviated from normal speakers particularly with respect to alveolar constriction location, rate of release, and tongue-body movement into the following vowel. For example, the lowest front-cavity resonance in the burst spectrum of an alveolar stop is normally in the range 3500-5500 Hz. For three of the eight dysarthric speakers, this range was lowered to 1500-2800 Hz, indicating either placement of the tongue tip further back on the palate or formation of the constriction with the tongue body in a location similar to that of a velar stop.
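One plausible way to estimate the lowest front-cavity resonance is to take a short window at the release, compute a smoothed burst spectrum, and report the lowest prominent peak above roughly 1 kHz. The window length, smoothing, and prominence threshold below are assumptions for illustration; the study's own measurement procedure is not specified beyond examining burst spectra.

import numpy as np
from scipy.signal import find_peaks

def lowest_burst_resonance(burst, fs, fmin=1000.0, prominence_db=6.0):
    # Frequency (Hz) of the lowest prominent spectral peak above fmin in a
    # short stop-burst segment; a rough front-cavity resonance estimate.
    x = np.asarray(burst, dtype=float) * np.hanning(len(burst))
    spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(x)) + 1e-12)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    smooth = np.convolve(spec_db, np.ones(5) / 5.0, mode="same")  # light smoothing
    peaks, _ = find_peaks(smooth, prominence=prominence_db)
    candidates = [p for p in peaks if freqs[p] >= fmin]
    return freqs[candidates[0]] if candidates else None

# Example: a synthetic 10-ms burst with a resonance near 4200 Hz.
fs = 16000
t = np.arange(int(0.010 * fs)) / fs
burst = np.exp(-600.0 * t) * np.sin(2.0 * np.pi * 4200.0 * t)
print("estimated front-cavity resonance: %s Hz" % lowest_burst_resonance(burst, fs))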

1.4 Speech Production Planning and Prosody

1.4.1 Labeling Speech Databases

Speech labeling projects include the development and evaluation of several new transcription systems (for the perceived rhythm of spoken utterances and for the excitation source) which enrich existing labels for phonetic segments, part of speech, word boundaries, and intonational phrases and prominences. Several new types of speech have been added, including samples from the CallHome database of extremely casual spontaneous telephone dialogues between family members and close friends, and a growing sample of digitized speech errors. In addition, a subset of the CallHome utterances are being labeled for distinctive features, following the methods described in section 1.5.1 below. These labeled databases provide the basis for evaluation of acoustic correlates of prosodic structure, as well as a resource for other laboratories to use (for example, the Media Lab at MIT made use of the ToBI-labeled radio news sample earlier this year).

1.4.2 Acoustic-phonetic Correlates of Prosodic Structure

We evaluated the correlates of prosodic structure in several production experiments. In one, we examined durational cues to the direction of affiliation of a reduced syllable, as in "tuna choir" versus "tune acquire" versus "tune a choir." Results showed that rightward affiliation was reliably distinguished from leftward by patterns of syllable duration, but that the nature of the rightward affiliation (e.g., lexical, as in "acquire," versus phrasal, as in "a choir") was not. Moreover, the left-right distinction was most reliably observed for pitch-accented words. In the other, we explored the extent of boundary-related duration lengthening in critical pairs of utterances such as "(John and Sue) or (Bill) will stay" versus "(John) and (Sue or Bill) will stay." Preliminary results for a single speaker show that preboundary lengthening is most pronounced on the final syllable of an intonational phrase, but significant lengthening is also found to the left, up to and including the main stressed syllable of the final content word.

1.4.3 Evaluation and Application of a Transcription System for Regular Rhythm and Repeated F0 Contours

In an ongoing effort to investigate rhythm and intonation in running speech, a labeling system and an on-line tutorial have been developed for indicating perceived rhythm and repeated F0 contours. The consistency of use of this labeling system was recently evaluated, and a high degree of inter-labeler agreement was obtained. Furthermore, the system was applied to three minutes of continuous natural speech, and a number of regions of speech were identified for which multiple listeners heard regular rhythms. For these regions, the interval between syllables heard as "beats" was found by measuring the time between successive vowel midpoints. Beat intervals were found to be in the range of 200-800 ms, which is consistent with recent findings of other investigators. Moreover, we found that for regions where five out of five labelers agreed that the speech contained a perceived regular rhythm, successive beat intervals were less variable in duration than when fewer than five out of five labelers agreed the speech was rhythmic. This observation is consistent with a hypothesis that speech may in some cases be acoustically regularly timed, and fits well with informal observations that speech frequently sounds regularly rhythmic. Furthermore, regions where three or more listeners heard a regular rhythm were more likely than other regions of speech to have been also heard as containing repeated F0 contours, and vice versa. In other words, regions perceived as containing regular rhythms tended to be heard as also bearing repeated intonation. This finding is consistent with observations from the literature on music and other nonspeech auditory perceptual stimuli that rhythm and intonation may be interdependently perceived. We are currently working on a theory of prosody perception which unites observations from music, auditory stream analysis, and linguistics.
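The beat-interval measurement is simple to state in code: given the times of vowel midpoints for syllables heard as beats, the intervals are successive differences, and their relative variability can be compared across regions with different degrees of labeler agreement. The coefficient of variation used here as the variability statistic, and the example times, are assumptions; the report does not say which statistic was used.

import numpy as np

def beat_intervals(vowel_midpoints_s):
    # Intervals (ms) between successive vowel midpoints of perceived beats.
    t = np.sort(np.asarray(vowel_midpoints_s, dtype=float))
    return np.diff(t) * 1000.0

def interval_variability(intervals_ms):
    # Coefficient of variation of the beat intervals (lower = more regular).
    iv = np.asarray(intervals_ms, dtype=float)
    return iv.std(ddof=1) / iv.mean()

# Hypothetical vowel-midpoint times (s) for a region all five labelers heard
# as rhythmic, and for a region with weaker agreement.
strong = [0.12, 0.45, 0.79, 1.11, 1.44, 1.78]
weak = [0.10, 0.38, 0.81, 1.02, 1.47, 1.71]
for name, times in (("5/5 agreement", strong), ("<5/5 agreement", weak)):
    iv = beat_intervals(times)
    print("%s: intervals %s ms, CV = %.2f"
          % (name, np.round(iv).astype(int), interval_variability(iv)))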

1.4.4 Initial Studies of the Perception of Prosodic Prominence

Initial studies of prominence perception have indicated that large F0 excursions between two adjacent full-vowel syllables (as in "transport") are perceptually ambiguous as to the location of the perceived prominence, and that listeners can hear a sequence of F0 peaks and valleys on a sequence of full-vowel syllables (e.g., "We're all right now") as prominent either on the peaks or on the valleys. These results support an interpretive model of F0-governed prominence perception, in which listeners construct a pattern of syllable prominence using information of several kinds, both from the signal and from their language knowledge. Such a view is also supported by pilot results showing a difference in which syllables are perceived as prominent for a normal production of a syllable string, where lexical stress is known, versus its reiterant imitation, where knowledge can provide no such constraints on the interpretation of the ambiguous F0 contour.

1.5 Models of Lexical Representation and Lexical Access

1.5.1 Labeling a Speech Database with Landmarks and Features

The process of lexical access requires that a sequence of words be retrieved from the acoustic signal that is radiated from a speaker's mouth. In the lexical access process proposed here, an initial step is to transform an utterance into an intermediate representation in terms of a discrete sequence of sublexical or phonetic units. This representation consists of a sequence of landmarks at specific times in the utterance, together with a set of features associated with each landmark. The landmarks are of three kinds: acoustic discontinuities at consonantal closures and releases, locations of peaks in syllabic nuclei, and glide-generated minima in the signal. The inventory of features is the same as the distinctive features that are used in phonological descriptions of language.

Procedures have been developed for hand-labeling utterances in terms of these landmarks and features. The starting point in the labeling of a sentence is to generate (automatically) an idealized sequence of landmarks (without time labels) and features from the lexical representation of each individual word. This ideal labeling for each consonant would contain a sequence of two landmarks, each carrying the features of the consonant. Each vowel and glide is assigned a single landmark with the lexically specified features attached. The labeling of the sentence consists of assigning times to the landmarks and modifying the landmarks (by deleting or adding landmarks) and the features based on observations of the signal. These procedures involve a combination of observation of displays of waveforms, spectra, and spectrograms, together with listening to segments of the utterance. Labeling has been completed for 50-odd sentences containing about 2500 landmarks.
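To make the representation concrete, the following sketch shows the kind of data structure the labeling scheme implies: each landmark has a type (consonantal closure or release, vowel peak, or glide minimum), an optional time, and a bundle of distinctive features, and an idealized, time-less sequence can be generated from a word's lexical specification. The feature names and the tiny example entry are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Landmark:
    kind: str                      # "closure", "release", "vowel", or "glide"
    features: frozenset            # distinctive features, e.g. {"+nasal", "+labial"}
    time: Optional[float] = None   # seconds; None in the idealized sequence

def ideal_landmarks(segments):
    # segments: list of (segment_class, feature_set) pairs in lexical order,
    # where segment_class is "C", "V", or "G".  Each consonant yields a
    # closure landmark and a release landmark; vowels and glides yield one.
    out = []
    for seg_class, feats in segments:
        feats = frozenset(feats)
        if seg_class == "C":
            out.append(Landmark("closure", feats))
            out.append(Landmark("release", feats))
        elif seg_class == "V":
            out.append(Landmark("vowel", feats))
        else:
            out.append(Landmark("glide", feats))
    return out

# Illustrative lexical entry for "mat": nasal consonant + vowel + voiceless stop.
mat = [("C", {"+nasal", "+labial"}),
       ("V", {"+low", "-back"}),
       ("C", {"-nasal", "+coronal", "-voiced"})]
for lm in ideal_landmarks(mat):
    print(lm.kind, sorted(lm.features))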

There are several reasons for preparing this database:

1. Comparison of the hand-generated labels with the labels automatically generated from the sequence of lexical items provides quantitative information on the modifications that are introduced by a talker when running speech is produced with various degrees of casualness;

2. The landmarks and features for an utterance can be used as an input for testing and developing procedures for recovering the word sequence for the utterance; and

3. In the development of automatic procedures for extracting the landmarks and features for utterances, the hand-generated labels can be viewed as a goal against which potential automatically generated labels can be compared.

It is hoped that the database of utterances and labels can be made publicly available when a sufficiently large number of utterances have been labeled.


1.5.2 Detection of Landmarks and Features in Continuous Speech: The Voicing Feature

A major task in implementing a model for lexical access from continuous speech is to develop well-defined procedures for detection and identification of the hierarchy of landmarks and features from analysis of the signal. For each landmark and feature there is a specific inventory of acoustic properties that contribute to this detection process. The inventory of properties that must be tapped to make the voiced/voiceless distinction for consonants is especially rich. Some initial progress has been made in developing signal processing methods that are relevant to this distinction.

In this initial work we have examined the rate of decrease of low-frequency spectrum amplitude in the closure interval for a number of intervocalic voiced and voiceless obstruent consonants produced by two speakers (one male, one female). The data show the expected more rapid decrease in low-frequency amplitude for the voiceless consonants than for the voiced. The amounts of decrease following consonantal closure are consistent with predictions based on estimates of vocal-tract wall expansion and of transglottal threshold pressures for phonation. There is, however, considerable variability in this measure of low-frequency amplitude, and it is clear that additional measures are needed to increase the reliability of voicing detection.
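A rough sketch of one way to obtain such a measure: low-pass filter the signal, compute short-time energy in dB, and fit a line to the energy contour over the first few tens of milliseconds after the consonantal closure. The cutoff frequency, frame sizes, and measurement interval below are assumptions for illustration, not the procedure used in this work.

import numpy as np
from scipy.signal import butter, lfilter

def low_freq_decay_rate(signal, fs, closure_time, cutoff=400.0, frame_ms=10.0, span_ms=50.0):
    # Slope (dB per frame of frame_ms) of a straight-line fit to short-time
    # low-frequency energy over span_ms after the consonantal closure.
    b, a = butter(4, cutoff / (fs / 2.0), btype="low")
    low = lfilter(b, a, np.asarray(signal, dtype=float))
    frame = int(frame_ms * fs / 1000.0)
    start = int(closure_time * fs)
    n_frames = int(span_ms / frame_ms)
    levels = []
    for i in range(n_frames):
        seg = low[start + i * frame : start + (i + 1) * frame]
        levels.append(10.0 * np.log10(np.mean(seg ** 2) + 1e-12))
    return np.polyfit(np.arange(n_frames), levels, 1)[0]

# Example: a 120-Hz "voice bar" whose amplitude decays after closure at 0.1 s.
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
voice_bar = np.sin(2.0 * np.pi * 120.0 * t) * np.exp(-60.0 * np.maximum(t - 0.1, 0.0))
print("decay: %.1f dB per 10 ms" % low_freq_decay_rate(voice_bar, fs, closure_time=0.1))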

1.5.3 Deriving Word Sequences from a Landmark- and Feature-based Representation of Continuous Speech

Several steps are involved in designing and implementing a lexical access system based on landmarks and features. These include locating landmarks, determining the features associated with these landmarks, converting this landmark-feature representation to a segment-feature representation, and matching to a lexicon that is specified using the same inventory of features. This project has examined the conversion and matching steps of this process. It uses as input landmarks and features obtained by hand-labeling a number of sentences produced by several talkers. These sentences contain words drawn from a small lexicon of about 250 items. The landmarks in the sentences identify discontinuities at consonantal closures and releases, vowel peaks, and glide-generated minima in the signal.

The conversion from temporal locations of landmarks, together with feature labels, to a lexically-consistent segment/feature representation requires that temporal information and the feature labels be used to collapse consonantal closures and releases into sequences of segments. For example, sequences of two consonants between vowels (VC1C2V) are often produced with just two acoustically-evident landmarks (C1 closure and C2 release), and duration information may be needed to determine whether the landmarks signal one consonant or two consonants.
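
The sketch below (Python; the duration threshold and the feature comparison are purely illustrative assumptions) shows the general form such a conversion rule could take: a closure/release pair is mapped to one segment or two, depending on the duration of the closed interval and whether the feature labels at the two landmarks agree.

# Sketch: collapse a closure/release landmark pair into one segment or two,
# using the closed-interval duration and the feature labels at the two
# landmarks. The 80 ms threshold is an illustrative assumption only.

def segments_from_closure_release(closure, release, long_closure_s=0.080):
    """closure and release are dicts like {"time": t, "features": {...}}."""
    duration = release["time"] - closure["time"]
    same_features = closure["features"] == release["features"]
    if same_features and duration < long_closure_s:
        # Short closed interval with matching features: a single consonant.
        return [closure["features"]]
    # Long interval or mismatched features: treat as a C1 C2 sequence, with
    # C1 features taken from the closure and C2 features from the release.
    return [closure["features"], release["features"]]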

Matching of the segment/feature representation for running speech to sequences of words from a stored lexicon requires knowledge of rules specifying modifications that can occur in the features of the lexical items depending on the context. As a first step in accounting for these modifications, some of the features that are known to be potentially influenced by context are marked in the lexicon as modifiable. The matching process also requires that criteria be defined for matching of features, since many features remain unspecified, both in the labeled representation and in the stored lexical representation. Several criteria for accepting a match are possible when a feature is unspecified or is marked as modifiable.

In this project, the performance of the matcher was evaluated using several different criteria for matching individual features. For example, one criterion was that a lexical item containing the feature [+nasal] was not accepted unless the labeled representation also contained [+nasal]; that is, there would be no match if the feature was unspecified for [nasal]. Experiments with different matching criteria led to a number of candidates for word sequences for most sentences, and the correct sequence was almost always one of these candidates. The most effective matching criterion, which included a metric for the number of modifications that were needed to create the matches, led to word sequences that were in the top three candidates in 95 percent of the sentences.
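
A minimal sketch of a strict per-feature test of this kind is shown below (Python; the encoding of unspecified and modifiable features is an assumption, not the project's actual representation). A positively specified lexical feature such as [+nasal] is accepted only if the labeled representation carries the same value, while modifiable features may differ at the cost of counting one modification.

# Sketch: one strict per-feature matching criterion. "None" stands for an
# unspecified feature; modifiable features may differ but are penalized.
# This encoding is an illustrative assumption.

def match_segment(lexical, labeled, modifiable=frozenset()):
    """Return (accepted, n_modifications) for one segment. Both arguments
    map feature names to True, False, or None (unspecified)."""
    modifications = 0
    for feat, lex_val in lexical.items():
        if lex_val is None:
            continue                      # lexically unspecified: no constraint
        lab_val = labeled.get(feat)       # None if unlabeled in the signal
        if lab_val == lex_val:
            continue                      # exact agreement
        if feat in modifiable:
            modifications += 1            # allowed contextual change, penalized
            continue
        return False, modifications       # strict criterion: reject the item
    return True, modifications

# Example: a lexical [+nasal] segment is rejected if [nasal] is unlabeled.
print(match_segment({"nasal": True}, {"voice": True}))   # (False, 0)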

This exercise is leading to the formulation of an improved model that separates the lexicon from the rules and proposes matching procedures that invoke the rules as an integral part of the lexical access process.


1.5.4 A Model for the Enhancement of Phonetic Distinctions

Models for the production and perception of speech generally assume that lexical items are stored in memory in terms of segments and features. In the case of speech production, the feature specifications can be viewed as instructions to an articulatory component which produces an acoustic output. The speaker's perceptual knowledge of the minimal distinctions in the language plays an important role in this production process. Coupled with the speaker's awareness of the perceptual/acoustic manifestations of the minimal distinctions of the language is knowledge that certain articulatory gestures over and above those specified by the phonological features can contribute to shaping the properties of the sound. Recruiting these additional gestures can enhance the perception of these distinctions.

We have formulated a speech production model with components that incorporate these enhancing processes. Four types of enhancement have been examined: voicing for obstruent consonants, nasalization for vowels, place distinctions for tongue blade consonants, and tongue body features for vowels. In each case, there is a gesture that is implemented in response to specifications from a phonological feature, and a separate articulatory gesture is recruited to enhance the perceptual contrast for that feature.
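
Purely as an illustration of this two-part structure, the sketch below (Python) pairs a defining gesture, tied to the phonological feature itself, with an optional enhancing gesture that the production component may recruit. The gesture labels are placeholders, not the model's actual articulatory inventory.

# Sketch: the two-part structure of the enhancement model. Gesture names are
# placeholders, not the model's actual articulatory inventory.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureImplementation:
    feature: str                      # e.g. "[+voice] for an obstruent"
    defining_gesture: str             # gesture specified by the feature itself
    enhancing_gesture: Optional[str]  # extra gesture recruited for perceptual contrast

def plan_gestures(spec: FeatureImplementation, enhance: bool = True):
    """Return the gestures to execute for one feature specification."""
    gestures = [spec.defining_gesture]
    if enhance and spec.enhancing_gesture:
        gestures.append(spec.enhancing_gesture)
    return gestures

# Placeholder example for the obstruent voicing case (hypothetical labels).
voicing = FeatureImplementation(
    feature="[+voice] obstruent",
    defining_gesture="maintain_glottal_vibration",
    enhancing_gesture="expand_vocal_tract_walls",
)
print(plan_gestures(voicing))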

1.6 Laboratory Facilities for Speech Analysis and Experimentation

Facilities for performing direct-to-disk recordings, for transferring speech data from a digital audio tape (DAT) to a hard drive, and for performing automated speech perception experiments were set up in a sound-attenuated room using a Macintosh computer with a Digidesign soundcard and Psyscope software.

In addition, development of an in-house speech analysis software program, xkl, is continuing. Xkl is an X11/Motif-based program that performs real-time analysis of waveforms, plays and records sound files, creates spectrograms and other spectra, and synthesizes speech.

1.7 Publications

1.7.1 Journal Articles

Chen, M. "Acoustic Correlates of English and French Nasalized Vowels." J. Acoust. Soc. Am. 102: 2360-70 (1997).

Chen, M., and R. Metson. "Effects of Sinus Surgery on Speech." Arch. Otolaryngol. Head Neck Surg. 123: 845-52 (1997).

Guenther, F., C. Espy-Wilson, S. Boyce, M. Matthies, M. Zandipour, and J. Perkell. "Articulatory Tradeoffs Reduce Acoustic Variability during American English /r/ Production." Submitted to J. Acoust. Soc. Am.

Hillman, R.E., E.B. Holmberg, J.S. Perkell, J. Kobler, P. Guiod, C. Gress, and E.E. Sperry. "Speech Respiration in Adult Females with Vocal Nodules." J. Speech Hear. Res. Forthcoming.

Lane, H., J. Perkell, M. Matthies, J. Wozniak, J. Manzella, P. Guiod, M. MacCollin, and J. Vick. "The Effect of Changes in Hearing Status on Speech Level and Speech Breathing: A Study with Cochlear Implant Users and NF-2 Patients." J. Acoust. Soc. Am. Forthcoming.

Lane, H., J. Wozniak, M. Matthies, M. Svirsky, J. Perkell, M. O'Connell, and J. Manzella. "Changes in Sound Pressure and Fundamental Frequency Contours Following Changes in Hearing Status." J. Acoust. Soc. Am. 101: 2244-52 (1997).

Matthies, M., P. Perrier, J. Perkell, and M. Zandipour. "Variation in Speech Movement Kinematics and Temporal Patterns of Coarticulation with Changes in Clarity and Rate." Submitted to J. Speech, Lang. Hear. Res.

Perkell, J., and M. Zandipour. "Clarity Versus Economy of Effort in Speech Production: Kinematic Performance Spaces for Cyclical and Speech Movements." Submitted to J. Acoust. Soc. Am.

Perkell, J., M. Zandipour, M. Matthies, and H. Lane. "Clarity Versus Economy of Effort in Speech Production: A Preliminary Study of Inter-Subject Differences and Modeling Issues." Submitted to J. Acoust. Soc. Am.

Perkell, J.S., M.L. Matthies, H. Lane, F. Guenther, R. Wilhelms-Tricarico, J. Wozniak, and P. Guiod. "Speech Motor Control: Acoustic Goals, Saturation Effects, Auditory Feedback and Internal Models." Speech Commun. 22: 227-50 (1997).

Svirsky, M.A., K.N. Stevens, M.L. Matthies, J. Manzella, J.S. Perkell, and R. Wilhelms-Tricarico. "Tongue Surface Displacement During Obstruent Stop Consonants." J. Acoust. Soc. Am. 102: 562-71 (1997).

Wu, C., R. Wilhelms-Tricarico, and J.A. Negulesco. "Landmark Selection for Cross-Mapping Muscle Anatomy and Volumetric Images of the Human Tongue." Submitted to Clin. Anat.

1.7.2 Conference Papers

Dilley, L.C., and S. Shattuck-Hufnagel. "Ambiguity in Prominence Perception in Spoken Utterances in American English." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Govindarajan, K. "Latency of MEG M100 Response Indexes First Formant Frequency." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Perkell, J., M. Matthies, and M. Zandipour. "Motor Equivalence in the Production of //." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Poort, K. "Stop-Consonant Production by Dysarthric Speakers: Use of Models to Interpret Acoustic Data." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Prahler, A.M. "Modeling and Synthesis of Lateral Consonant /l/." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Shattuck-Hufnagel, S., and A. Turk. "The Domain of Phrase-Final Lengthening in English." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Stevens, K.N. "Toward Models for Human Production and Perception of Speech." Proceedings of the Joint Meeting of the International Congress on Acoustics and the Acoustical Society of America, Seattle, Washington, June 1998. Forthcoming.

Turk, A.E., and S. Shattuck-Hufnagel. "Duration as a Cue to Syllable Affiliation." Proceedings of the Conference on the Phonological Word, Berlin, Germany, October 1997. Forthcoming.

Wilhelms-Tricarico, R., and C.-M. Wu. "A Biomechanical Model of the Tongue." Proceedings of the 1997 Bioengineering Conference, BED-Vol. 35. Eds. K.B. Chandran, R. Vanderby, and M.S. Hefzy. New York: ASME (1997), pp. 69-70.

1.7.3 Chapter in a Book

Shattuck-Hufnagel, S. "Phrase-level Phonology in Speech Production Planning: Evidence for the Role of Prosodic Structure." In Prosody: Theory and Experiment: Studies Presented to Gösta Bruce. Ed. M. Horne. Stockholm, Sweden: Kluwer. Forthcoming.

1.7.4 Thesis

Hagen, A. Linguistic Functions of Glottalizations and their Language-Specific Use in English and German. Diplomarbeit (M.), Comput. Sci., University of Erlangen-Nürnberg, Germany.
