
Part IV Language, Speech and Hearing

Section 1 Speech Communication

Section 2 Sensory Communication

Section 3 Auditory Physiology

Section 4 Linguistics



Section 1 Speech Communication

Chapter 1 Speech Communication


Academic and Research Staff

Professor Kenneth N. Stevens, Professor Jonathan Allen, Professor Morris Halle, Professor Samuel J. Keyser, Dr. Marie K. Huffman, Dr. Michel T. Jackson, Dr. Melanie Matthies, Dr. Joseph S. Perkell, Dr. Stefanie Shattuck-Hufnagel, Dr. Mario A. Svirsky, Seth M. Hall

Visiting Scientists and Research Affiliates

Dr. Shyam S. Agrawal,1 Dr. Tirupattur V. Ananthapadmanabha,2 Dr. Vladimir M. Barsukov, Dr. Corine A. Bickley, Dr. Suzanne E. Boyce, Dr. Carol Y. Espy-Wilson,3 Dr. Richard S. Goldhor,4 Dr. Robert E. Hillman,3 Eva B. Holmberg,5 Dr. Caroline Huang,6 Dr. Harlan Lane,7 Dr. John Locke,8 Dr. John I. Makhoul,9 Dr. Sharon Y. Manuel, Dr. Carol Ringo,10 Dr. Anthony Traill,11 Dr. David Williams,12 Giulia Arman Nassi, Torstein Pedersen,13 Jane Webster14

Graduate Students

Abeer A. Alwan, Hwa-Ping Chang, Marilyn Y. Chen, Helen M. Hanson, Mark A. Johnson, Ronnie Silber, Lorin F. Wilde

Undergraduate Students

Kerry L. Beach, Venkatesh R. Chari, Anna Ison, Sue O. Kim, Laura Mayfield, Bernadette Upshaw, Monnica J. Williams

Technical and Support Staff

D. Keith North, Arlene E. Wint

1 CEERI Centre, CSIR Complex, New Delhi, India.

2 Voice and Speech Systems, Bangalore, India.

3 Boston University, Boston, Massachusetts.

4 Audiofile, Inc., Lexington, Massachusetts.

5 MIT and Department of Speech Disorders, Boston University, Boston, Massachusetts.

6 Dragon Systems, Inc., Newton, Massachusetts.

7 Department of Psychology, Northeastern University, Boston, Massachusetts.

8 Massachusetts General Hospital, Boston, Massachusetts.

9 Bolt, Beranek and Newman, Cambridge, Massachusetts.

10 University of New Hampshire, Durham, New Hampshire.

11 University of the Witwatersrand, South Africa.

12 Sensimetrics, Inc., Cambridge, Massachusetts.

13 University of Karlsruhe, Germany.

14 Massachusetts Eye and Ear Infirmary, Boston, Massachusetts.


1.1 Introduction

The overall objective of our research in speech communication is to gain an understanding of the processes whereby (1) a speaker transforms a discrete linguistic representation of an utterance into an acoustic signal, and (2) a listener decodes the acoustic signal to retrieve the linguistic representation. The research includes development of models for speech production, speech perception, and lexical access, as well as studies of impaired speech communication.

Sponsors

C.J. Lebel Fellowship

Dennis Klatt Memorial Fund

National Institutes of Health
Grant T32-DC00005
Grant R01-DC00075
Grant F32-DC00015
Grant R01-DC00266,15
Grant P01-DC00361,16
Grant R01-DC00776,17

National Science Foundation
Grant IRI 89-10561
Grant IRI 88-05680,15
Grant INT 90-24713,18

15 Under subcontract to Boston University.

16 Under subcontract to Massachusetts Eye and Ear Infirmary.

17 Under subcontract to Massachusetts General Hospital.

18 U.S.-Sweden Cooperative Science Program.

1.2 Studies of the Acoustics, Perception, and Modeling of Speech Sounds

1.2.1 Liquid Consonants

An attribute that distinguishes liquid consonants (such as /l/ and /r/ in English) from other consonants is that the tongue blade and tongue dorsum are shaped in such a way that there is more than one acoustic path from the glottis to the lips. One of these paths passes over the midline of the vocal tract over much of its length, whereas the other, shorter path passes around the side of the tongue dorsum and blade in the oral portion of the tract. Measurements of the acoustic spectra of sounds produced when the vocal tract is in such a position show some irregularities in the frequency range 1.5 to 3.5 kHz. Extra spectral peaks and valleys not normally seen in nonnasal vowels or in glides are often evident in the spectrum, and some of the peaks have amplitudes that deviate from the amplitudes expected for vowels.

In an effort to understand the acoustic basis for these spectral properties, we have initiated a theoretical study of the behavior of a vocal-tract model in which there is a bifurcation of the acoustic path. Equations for the vocal-tract transfer function (from glottal source to lip output) have been developed, and the transfer function has been calculated and displayed for dimensions that are estimated to be within the expected range for liquids. One outcome of this preliminary analysis is that an additional pole and zero are introduced into the transfer function in the expected frequency range (2.0 to 3.0 kHz) when the length of the tract over which there is more than one path is in the range 6-8 cm. This additional pole-zero pair is associated with the acoustic propagation time around the two portions of the split section of the tube. Further theoretical work must be supplemented by additional data on the configurations of the airways for liquids and by improved estimates of the acoustic losses for these types of configurations.
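The report gives only the qualitative outcome; as a rough numerical illustration of a bifurcated tract, the sketch below cascades transmission (ABCD) matrices for uniform tube sections and combines the two branch paths in parallel. The dimensions, the small damping factor, and the idealized glottal (volume-velocity source) and lip (zero-pressure) terminations are our own illustrative assumptions, not the parameters of the model described above.

```python
import numpy as np

C = 35400.0    # speed of sound in warm moist air, cm/s
RHO = 0.00114  # density of air, g/cm^3

def tube(length, area, f):
    """ABCD (transmission) matrix of one uniform tube section."""
    k = (2 * np.pi * f / C) * (1 - 0.005j)   # small damping keeps values finite
    zc = RHO * C / area                       # characteristic acoustic impedance
    return np.array([[np.cos(k * length), 1j * zc * np.sin(k * length)],
                     [1j * np.sin(k * length) / zc, np.cos(k * length)]])

def parallel(t1, t2):
    """Two branches sharing both end junctions: add admittance parameters."""
    def to_y(t):
        (a, b), (c, d) = t
        return np.array([[d / b, (b * c - a * d) / b], [-1 / b, a / b]])
    (y11, y12), (y21, y22) = to_y(t1) + to_y(t2)
    return np.array([[-y22 / y21, -1 / y21],
                     [-(y11 * y22 - y12 * y21) / y21, -y11 / y21]])

def transfer(f):
    """|U_lips / U_glottis| for pharynx + split section + front cavity."""
    t = tube(9.0, 3.0, f) @ parallel(tube(7.0, 1.5, f),   # path over the tongue
                                     tube(6.0, 0.5, f))   # path around its side
    t = t @ tube(2.0, 2.0, f)
    return abs(1.0 / t[1, 1])   # volume-velocity source, open (P = 0) lips

freqs = np.arange(100.0, 4000.0, 5.0)
h = np.array([20 * np.log10(transfer(f)) for f in freqs])
# The loop resonance of the split section (total path 7 + 6 = 13 cm) places an
# extra pole-zero pair near c/(l1 + l2) = 35400/13, i.e., roughly 2.7 kHz.
band = (freqs > 2000) & (freqs < 3400)
print(f"extra zero (spectral dip) near {freqs[band][np.argmin(h[band])]:.0f} Hz")
```

Plotting h against freqs shows the closely spaced dip and peak between 2 and 3 kHz that corresponds to the measured irregularities described above.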

1.2.2 Fricative Consonants

The need to further describe the acoustic properties of fricatives is evident in rule-based speech synthesis, where these sounds continue to present intelligibility problems. In order to provide a baseline measure of intelligibility, identification tests have been performed with natural fricative-vowel tokens and corresponding synthetic tokens generated using a research version of Klatt's text-to-speech rule-based system. All of the strident alveolar (s, z) and palatal (ʃ, ʒ) fricatives were identified correctly. The weaker labiodental (f, v) and dental (θ, ð) fricatives were frequently confused with each other. As expected, the intelligibility of the text-to-speech fricatives was poorer than that of natural speech.

Acoustic analysis was performed to determine differences between natural and synthetic stimuli that could account for the observed differences in intelligibility.



Our results confirmed the need for improved modelling of the source changes at the fricative-vowel boundary. Intelligibility of synthetic labiodental and dental fricatives is poorer than that of natural tokens, even when formant transitions appear to be reproduced accurately. Natural tokens showed subtle noise amplitude variations that were not found in the synthetic stimuli. Aspiration noise was found more frequently for [f] than [θ] in natural stimuli, while no text-to-speech tokens included aspiration. The offset of noise and onset of voicing at the fricative-vowel boundary were too abrupt for voiced synthetic stimuli. Preliminary results suggested speaker dependencies in the observed noise variations and in the time-dependent emergence of high-frequency peaks. We plan to apply our results to modify rules for manipulating speech sources and to evaluate the effect of these changes on the intelligibility and naturalness of synthetic fricatives.

1.2.3 Acoustic Properties Contributing to the Classification of Place of Articulation for Stops

The production of stop consonants generates several kinds of acoustic properties: (1) the spectrum of the initial transient and burst, indicating the size of the cavity anterior to the constriction; (2) place-dependent articulatory dynamics, leading to different time courses of the noise burst, onset of glottal vibration, and formant transitions; and (3) formant transitions, indicating the changing vocal-tract shape from the closed position of the stop to a more open configuration for the following vowel. A study has been initiated to measure the relative contributions of these acoustic properties to the classification of consonantal place of articulation using a semi-automatic procedure. The acoustic data consisted of a number of repetitions of voiceless unaspirated stops in meaningful words spoken by several female and male speakers. The spectra averaged over the stop release and at the vowel onset were used as the acoustic features. Speaker-independent and vowel-independent classification was about 80% using either the burst or the vowel-onset spectrum, and a combined strategy led to higher accuracy. These results, together with further acoustic analysis of the utterances, suggest that additional acoustic properties related to articulatory dynamics, such as the detailed acoustic structure of the burst, the time course of the formant transitions, and the voice onset time, should be included in a model for stop-consonant recognition.
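The report does not describe the semi-automatic procedure itself; as one concrete possibility, the sketch below classifies place from class-mean templates of the two spectral features and implements a combined strategy by summing the (negative squared) distances of the two cues. The feature dimensions and the stand-in random data are assumptions for illustration only.

```python
import numpy as np

PLACES = ["labial", "alveolar", "velar"]

def train_means(feats, labels):
    """One mean spectral template per place; feats: (n_tokens, n_bins) in dB."""
    return {p: feats[labels == p].mean(axis=0) for p in PLACES}

def classify(template_sets, token_feats):
    """Combined strategy: sum distance scores over burst and vowel-onset cues."""
    total = {p: 0.0 for p in PLACES}
    for templates, x in zip(template_sets, token_feats):
        for p in PLACES:
            total[p] -= np.sum((x - templates[p]) ** 2)
    return max(total, key=total.get)

# stand-in data: 50 tokens per place, 64-bin spectra with a small class offset
rng = np.random.default_rng(0)
labels = np.repeat(PLACES, 50)
offset = np.repeat(np.arange(3), 50)[:, None] * 0.5
burst = rng.normal(size=(150, 64)) + offset
onset = rng.normal(size=(150, 64)) + offset
m = (train_means(burst, labels), train_means(onset, labels))
pred = [classify(m, (b, o)) for b, o in zip(burst, onset)]
print("accuracy:", np.mean(np.array(pred) == labels))
```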

1.2.4 Modeling Speech Perception in Noise: The Stop Consonants as a Case Study

A doctoral thesis on this topic by Abeer A. Alwan was completed during the past year. The abstract for that thesis follows:

This study develops procedures for predicting perceptual confusions of speech sounds in noise by integrating knowledge of the acoustic properties which signal phonetic contrasts of speech sounds with principles of auditory masking theory. The methodology that was followed had three components: (1) quantifying acoustic correlates of some phonological features in naturally-spoken utterances and using the results to generate synthetic utterances, (2) developing a perceptual metric to predict the level and spectrum of the noise which will mask these acoustic correlates, and (3) performing a series of perceptual experiments to evaluate the theoretical predictions.

The focus of the study was the perceptual role of the formant trajectories in signalling the place of articulation for the stop consonants /b, d/ in consonant-vowel syllables, where the vowel was either /a/ or /ε/. Nonsense syllables were chosen for the perceptual study so that lexical effects such as word frequency did not bias subjects' responses. Computer-generated, rather than naturally-spoken, syllables were used to provide better control of the stimuli.

In the analysis/synthesis stage, the acoustic properties of the stop consonants /b, d/ embedded in naturally-spoken CV syllables were quantified, and the results were then used to synthesize these utterances with the formant synthesizer KLSYN88. In the context of the vowel /a/, the two synthetic syllables differed in the F2 trajectory: the F2 trajectory was falling for /da/ and was relatively flat for /ba/. In the C/ε/ context, both the F2 and F3 trajectories differed between the consonants: F2 and F3 were flat for /dε/, whereas they were rising for /bε/.

A metric was then developed to predict the level of noise needed to mask a spectral peak corresponding to a formant peak. The metric was based on a combination of theoretical and empirical results. Two types of masking were studied: within-band masking (where the formant was within the bandwidth of the noise masker) and above-band masking (where the formant was above the upper cutoff frequency of the masker). Results of auditory masking theory, which was established primarily for pure tones, were used successfully to predict within-band masking of formant peaks.


The predictive measure in this case was the signal-to-noise ratio in a critical band around the formant frequency. The applicability of the results of masking theory to formant peaks was tested by conducting a series of discrimination and detection experiments with synthetic, steady-state vowels.
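As an illustration of such a measure, the sketch below computes the signal-to-noise ratio in one critical band centered on a formant. The Zwicker-Terhardt bandwidth approximation is our choice here; the thesis does not say which critical-band definition was actually used.

```python
import numpy as np

def critical_band_snr_db(freqs, signal_psd, noise_psd, formant_hz):
    """SNR (dB) in the critical band around a formant peak.

    freqs: array of FFT bin frequencies (Hz); *_psd: power per bin.
    Bandwidth follows the Zwicker-Terhardt approximation of the
    critical band: 25 + 75 * (1 + 1.4 * (f/kHz)^2)^0.69 Hz."""
    bw = 25 + 75 * (1 + 1.4 * (formant_hz / 1000.0) ** 2) ** 0.69
    band = (freqs >= formant_hz - bw / 2) & (freqs <= formant_hz + bw / 2)
    return 10 * np.log10(signal_psd[band].sum() / noise_psd[band].sum())

# a formant peak would be predicted to be masked when this SNR falls below
# a detection threshold estimated from the discrimination experiments
```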

In the above-band masking case, it was found that predictions based on the two methods known for predicting aspects of this kind of masking (the ANSI standard and Ludvigsen's method) did not agree with the experimental results. An empirical algorithm was developed to account for the experimental data.

In the final stage of the study, a series of identification tests with synthetic CV utterances in noise was conducted. Two noise maskers were used in the experiments: white noise, and band-pass noise centered around the F2 region. The spectral prominences associated with F2 and F3 have a lower amplitude during the transitions from the consonant than in the steady-state vowel, so that it is possible, using a steady-state noise, to mask portions of a formant transition without masking the formant peak in the vowel. Subjects' responses were analyzed with the perceptual metric developed earlier. Results showed that when the F2 transition for C/a/ or the F2 and F3 transitions for C/ε/ were masked by noise, listeners interpreted the stimuli as though the formant transitions were flat. That is, /da/ was heard as /ba/ and /bε/ was heard as /dε/.

It was also found that when only the F2 trajectory was masked (achieved by selectively masking F2 with a band-pass noise masker), amplitude differences in the F3 and F4 regions could be used as cues for place information in the C/a/ case, even though the trajectories of these higher formants did not differ for the two consonants.

1.2.5 Unstressed Syllables

We are continuing acoustic and perceptual studies of the vowels and consonants in unstressed syllables in utterances produced with different speech styles. Our aim is to determine what attributes of these syllables are retained and what attributes are modified or deleted when reduction occurs. We hope that the results of these studies can contribute to the development of models for lexical access that are valid for various speech styles.

In one study, we have examined the acoustic manifestation of /ð/ in the definite article the when this article occurs in different phonetic environments. In particular, we have looked at the sequence in the in a large number of sentences produced by several speakers. In almost all of these utterances, acoustic analysis indicates that the underlying consonant /ð/ is produced as a nasal consonant. That is, the consonant is produced with a dental closure, and the velopharyngeal opening from the preceding nasal consonant spreads through the dental closure. Production of this underlying voiced fricative as a [-continuant] consonant is a common occurrence (e.g., in absolute initial position, or following an obstruent, as in at the or of the), and consequently the use of a stop-like closure following in is not unexpected. Comparison of the durations of the nasal murmur for the sequences in a and in the shows that the duration for the latter sequence is consistently longer, with the dividing point being at about 50 ms.

In a second group of studies, we are examining the acoustic evidence for the presence of unstressed vowels in bisyllabic utterances like below and support, so that they are distinguished from the monosyllabic words blow and sport. Acoustic measurements for several speakers producing these utterances show that the speakers' attempts to implement the unstressed vowels can take several forms. These forms are consistent with a view that there are two basic requirements for an unstressed vowel. One is that the airway above the glottis should be less constricted during the vowel than in the adjacent consonants, and the other is that the glottis assumes a configuration that is narrower than that in the adjacent consonants if those consonants are obstruents. Both of these adjustments contribute to the generation of a maximum amplitude in some part of the frequency range to signal the presence of the vowel. One manifestation of these intended movements in our data is an increase in the time from the release of the first consonant to the release of the second consonant for the two-syllable utterance relative to the one-syllable one. Other acoustic measurements (such as aspiration in the /p/ of support, and changes in spectrum amplitudes and formant frequencies between the two consonant releases in below) also provide evidence for the unstressed vowel. Preliminary listening tests with synthetic versions of one of these word pairs (support-sport) have demonstrated the perceptual importance of these various acoustic attributes.

A third set of experiments has compared the temporal and amplitude characteristics of stop consonants in post- and prestressed positions in an utterance. The data show consistent reductions in closure duration and in voice onset time for consonants in poststressed position. Burst amplitudes (in relation to the following vowel) for poststressed stops are less than those for prestressed stops when the consonants are voiceless, but not when they are voiced.


Poststressed velar stop consonants show considerable variability in duration and in the degree to which they exhibit imprecise releases with multiple bursts.

These studies of unstressed syllables are consistent with the notion that the intention to produce the components of these syllables can take many different acoustic forms. Models of lexical access must be sensitive to these different acoustic attributes.

1.2.6 Reductions in Casual Speech in German

For each language, there are modifications or reductions that occur in the production and acoustic patterns of utterances when they are spoken casually compared to when they are spoken carefully. Many of these reductions are language-specific, but it is expected that some of the principles governing the reductions might be universal across languages. In a preliminary effort to uncover some of these principles, a study of reductions in casual speech in German has been undertaken.

Several sentences were spoken by three speakers of German at four speech rates ranging from careful to rapid. Acoustic characteristics of a number of components of these sentences were measured. These included measurements of the durations of several types of units, and measurements of formant frequencies and glottal vibration characteristics for selected vowels and consonants. For the more casual or rapid speech modes, the sentence durations were shorter, with function words and unstressed vowels contributing more to the duration reduction than content words and stressed vowels. The durations of the nasal consonant regions of the sentences were reduced less than other portions of the sentences. The presence of a nasal consonant (or a sequence with nasal consonants) was always preserved in the rapid utterances, whereas the distinction between voiced and voiceless obstruent consonants was often neutralized. Vowel formant frequencies for stressed vowels in content words showed greater stability with changes in rate than did the formants for unstressed vowels or vowels in function words. Some of the reductions that occur in function words in German appear to be more extreme than reductions in function words in English.

1.3 Speech Synthesis

1.3.1 Synthesis of Syllables for Testing with Different Populations of Listeners

Ongoing improvements in procedures for synthesizing consonant-vowel and vowel-consonant syllables have led to the creation of a wide inventory of synthesized syllables. One use of these syllables has been testing the phonetic discrimination and identification capabilities of an aphasic population that is being studied by colleagues at Massachusetts General Hospital. Pairs of syllables have been synthesized which differ in one phonetic feature, such as [ba-va] (stop-continuant), [wa-va] (sonorant-obstruent), and [na-la] (nasal-nonnasal). Contrasts of these sorts have been studied infrequently in the past. More commonly used contrasts, such as place ([ba-da-ga], [am-an]), have also been synthesized. We attempted to create highly natural-sounding stimuli by manipulating every acoustic characteristic which has been observed to contribute to the phonetic contrast in question. The other acoustic characteristics of the pair of stimuli were identical. This synthesis approach facilitated the creation of continua between pairs of stimuli. Several continua have been generated by varying each acoustic parameter which differs between the endpoints in nine steps. These continua are being presented informally to normal-hearing listeners in order to assess their naturalness and will be used in studies at Massachusetts General Hospital. An initial goal of the research with aphasic listeners is to determine whether individuals or groups of individuals show a pattern in phonetic discrimination indicating that ability to discriminate some features is a prerequisite for discriminating others.
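A minimal sketch of how such a continuum can be generated, assuming endpoint stimuli are stored as dictionaries of synthesizer parameter values (the parameter names here are hypothetical, not KLSYN88's actual symbols): parameters identical at the two endpoints stay fixed, and each differing parameter is interpolated in nine steps.

```python
import numpy as np

def continuum(params_a, params_b, n_steps=9):
    """Stimuli interpolated between two endpoint parameter sets.

    params_a / params_b: dicts mapping a synthesizer parameter name to a
    value or a time track; only parameters that differ are varied."""
    steps = []
    for i in range(n_steps):
        w = i / (n_steps - 1)               # 0 = endpoint A, 1 = endpoint B
        step = {}
        for name, a in params_a.items():
            b = params_b[name]
            a, b = np.asarray(a, float), np.asarray(b, float)
            step[name] = a if np.array_equal(a, b) else (1 - w) * a + w * b
        steps.append(step)
    return steps

# usage: a hypothetical [ba]-[da] pair differing only in the F2 onset
ba = {"F2_onset_hz": 900.0, "F1_onset_hz": 400.0}
da = {"F2_onset_hz": 1700.0, "F1_onset_hz": 400.0}
for s in continuum(ba, da):
    print(s["F2_onset_hz"])
```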

1.3.2 Synthesis of Non-English Consonants

The flexibility of the glottal source in the KLSYN88 synthesizer should make it capable of generating obstruent consonants with a variety of laryngeal characteristics. Hindi (as well as a number of other languages) contrasts stop consonants on the dimensions of both voicing and aspiration. These stop consonants can be produced at four different places of articulation. All of these consonants have been generated (in consonant-vowel syllables) with the synthesizer. Of particular interest are the voiced aspirated stops. Synthesis of these consonants requires that vocal-fold vibration continue through much of the closure interval, with glottal spreading initiated just prior to the consonantal release.


The glottal parameters are adjusted to give breathy voicing in the few tens of milliseconds after release, followed by a return to modal voicing. All syllables were judged to be adequate by native listeners.

The synthesizer has also been used to generate click sounds for perception tests by speakers of a Khoisan language that uses an inventory of clicks. Generation of the abrupt clicks /!/ and /ǂ/ required that a transient source be used to excite the parallel resonant circuits of the synthesizer, with appropriate adjustment of the gains of the circuits to simulate the resonances of the oral cavity that are excited by the release of the tongue tip or tongue blade.

1.4 Studies of Speech Production

1.4.1 Degradation of Speech and Hearing with Bilateral Acoustic Neuromas

We have begun a project in which we are studying the relation between speech and hearing in people who become deaf from bilateral acoustic neuromas (NF2). The primary goal of this study is to add to the understanding of the role of hearing in the control of adult speech production. The rationale and approach of this work are similar to those of our ongoing work on the speech production of cochlear implant patients. Speech acoustic and physiological parameters will be recorded, and speech perception will be tested, in a group of (still hearing) NF2 patients who are at risk of losing their hearing. The same production parameters will be recorded at intervals for the subset of patients from the initially-recorded group who suffer further hearing loss. Thus far, we have begun the recruitment of patients from across the country, we have ordered several special-purpose transducers, and we have nearly completed the design of a comprehensive test protocol.

1.4.2 Trading Relations Between Tongue-body Raising and Lip Rounding in Production of the Vowel /u/

Articulatory and acoustic data are being used to explore the following hypothesis: the goals of articulatory movements are relatively invariant acoustic targets, which may be achieved with varying and reciprocal contributions of different articulators. Previous articulatory studies of similar hypotheses, expressed entirely in articulatory terms, have been confounded by interdependencies of the variables being studied (lip and mandible, or tongue body and mandible, displacements). One case in which this complication may be minimized is that of lip rounding and tongue-body raising (formation of a velo-palatal constriction) for the vowel /u/. Lip rounding and tongue-body raising should have similar acoustic effects for /u/, mainly on the second formant frequency, and could show reciprocal contributions to its production; thus, we are looking for negative correlations in measures of these two parameters. We are using an Electro-Magnetic Midsagittal Articulometer (EMMA) to track movements of midsagittal points on the tongue body, upper and lower lips, and mandible for large numbers of repetitions of utterances containing /u/ in controlled phonetic environments. Initial analyses from four subjects of articulatory displacements at times of minima in absolute velocity for the tongue body during the /u/ (i.e., "articulatory targets") show evidence against the hypothesis for one subject and weakly in favor of the hypothesis for the other three.
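A sketch of the correlation test implied by this design, with stand-in data in place of the transduced EMMA coil positions; the variable names and the simulated compensation are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def trading_relation(tongue_raise, lip_protrude):
    """Per-token measures at the tongue-body "articulatory target"
    (velocity minimum) across repetitions of /u/ by one subject.
    A negative correlation is evidence for reciprocal (trading)
    contributions toward a common acoustic (F2) target."""
    r, p = stats.pearsonr(tongue_raise, lip_protrude)
    return r, p

# stand-in data: tokens whose lip protrusion partially compensates for
# token-to-token variation in tongue-body height
rng = np.random.default_rng(1)
tongue = rng.normal(0.0, 1.0, 200)                       # mm, larger = higher
lips = -0.5 * tongue + rng.normal(0.0, 0.5, 200)         # mm, larger = rounder
print(trading_relation(tongue, lips))                    # r < 0 suggests trading
```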

We are currently examining: (1) trading relations between the two parameters as a function of time over the voiced interval of each vowel, and (2) relations among the transduced measures of tongue raising and lip protrusion, the resulting vocal-tract area-function changes (using measurements from dental casts and video recordings of facial movements), and changes in the vowel formants (using an articulatory synthesizer).

1.4.3 Refinements to Articulatory Movement Transduction

We have concluded that our EMMA system almost always produces sufficiently accurate measurements of the positions of articulatory structures, but we have observed in experiments with subjects (see section 1.4.2 above) that there can be circumstances under which measurement error may approach or exceed acceptable limits. We are currently running bench tests to explore this issue, with the goal of developing procedures and criteria for the objective assessment of measurement error.

In order to make the original EMMA electronics function with a revised transmitter configuration, it was necessary to design and implement a phase-correction circuit. This circuit has now been hard-wired and installed in a box which plugs into the main unit.


1.4.4 Modeling of the Lateral Dimension in Vocal-Fold Vibration

The forces which drive vocal-fold vibration can be simulated using the Ishizaka-Flanagan two-mass model, but the model misses some of the subtleties of actual fold vibration. In particular, acoustic studies by Klatt and Klatt indicate that during breathy voicing, the vocal folds close gradually along their length, rather than all at once as in the two-mass model. We have developed a model, based on the two-mass model, which treats the fold surface as a pair of beam elements which are acted upon by the glottal pressure, a compressive stress which is proportional to the fold displacement, and a shear stress which is proportional to the deformation of the fold's inner surface. Rather than assuming a single displacement for the entire length of the fold, the displacement is smoothly interpolated from the arytenoid cartilage, through the middle of the fold, to the anterior commissure. With two degrees of freedom, the model is capable of imitating the two-mass model for tightly adducted arytenoid cartilages, and of simulating gradual fold closure during breathy voicing.
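The following sketch illustrates only the geometric idea of interpolated displacement and the gradual closure it permits; the quadratic interpolation, the pinned anterior commissure, and the dimensions are our assumptions, not the beam-element mechanics of the model itself.

```python
import numpy as np

def glottal_area(x_post, x_mid, fold_len=1.4, n=200):
    """Glottal area (cm^2) when the lateral displacement of the fold edge
    varies smoothly along its length: a quadratic through the arytenoid
    end (x_post), the fold midpoint (x_mid), and the anterior commissure
    (pinned at zero). Displacements are in cm from the closed position;
    only points with positive displacement contribute open area."""
    y = np.linspace(0.0, 1.0, n)               # 0 = arytenoid, 1 = anterior
    x = x_post * (1 - y) * (1 - 2 * y) + x_mid * 4 * y * (1 - y)
    gap = np.maximum(2.0 * x, 0.0)             # both folds move symmetrically
    return gap.mean() * fold_len               # integrate the gap along the fold

# two-mass-like open phase: parted along essentially the whole length
print(glottal_area(x_post=0.05, x_mid=0.04))
# breathy voicing: membranous fold closed while a posterior chink stays
# open, i.e., closure is gradual along the length rather than all at once
print(glottal_area(x_post=0.05, x_mid=-0.02))
```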

1.5 Speech Production Planning

Our studies of the speech production planning process have focussed on developing and testing models of two aspects of phonological processing: segmental (primarily the serial ordering of sublexical elements) and prosodic (particularly the phenomenon of apparent stress shift).

1.5.1 Segmental Planning

Earlier analyses of segmental speech errors like "parade fad" → "farade pad" showed that word-position similarity is an important factor in determining which segments of an utterance will interact with each other in errors, even when word stress is controlled. This result supports the view that word structure constrains the serial ordering process (i.e., the process which ensures that the sub-parts of words occur in their appropriate locations in an utterance). In fact, for some kinds of speech errors, up to 80 percent occur in word-onset position. The predilection for word-onset errors, at least in English, may arise from a difference in processing for these segments. We are testing this hypothesis with a series of error elicitation experiments to determine whether the asymmetry is found only in the planning of syntactically-structured utterances.

1.5.2 Prosodic Planning

Recent theories of phonology have addressed the question of the location of prominences and boundaries in spoken language. These two aspects of utterances are not fully determined by syntactic and lexical structure, but instead suggest the necessity of a separate hierarchy of prosodic constituents, with intonational phrases at the top and individual syllables at the bottom. One aspect of this theoretical development has been the positing of a rhythmic grid for individual lexical items, in which the columns correspond to the syllables in the utterance and the rows to degrees of rhythmic prominence. It has been argued that this grid representation should be extended to entire utterances, partly on the basis of the occurrence of apparent "stress shift." Stress shift occurs when the major perceptual prominence of a target word is heard not on its main-stress syllable (as in "massaCHUsetts") but on an earlier syllable (as in "MAssachusetts MIRacle") in certain rhythmic configurations.

Our acoustic and perceptual analyses of this phenomenon have shown that in many cases the perception of apparent stress shift is the result of the disappearance of a pitch accent from the main-stress syllable ("CHU" in the above example) when the target word appears in the longer phrase. This disappearance of the pitch accent leaves an impression of relative prominence on the earlier syllable "MA-," even though its duration does not systematically increase. Although the rise in fundamental frequency on the early syllable may be greater when the target word occurs in the longer phrase, just as a stress shift account would predict, this increased rise is also found for phrases like "MONetary POLicy," where no shift can occur because the first syllable of the target word is the main-stress syllable. This result suggests that the increased F0 movement is associated with the length of the phrase, rather than with the migration of rhythmic prominence to the left in the target word.

In summary, it appears that apparent stress shift is at least in part a matter of where the pitch accents in an utterance are located, raising the possibility that speakers tend to place pitch accents near the beginnings of prosodic constituents. We are currently testing this hypothesis with paragraphs in which the target word occurs early vs. late in its intonational phrase. We are testing this possibility both in elicited laboratory speech read aloud and in a corpus of FM radio news speech.


If we can demonstrate that pitch accent placement is at least partially determined by the beginning of a new prosodic constituent, and that pitch accents can be reliably detected in the signal, we will have a basis for proposing that listeners make use of the occurrence of pitch accents to help them determine which portion of an utterance to process as a constituent.

1.5.3 Integration of Segmental and Prosodic Planning

We have begun a series of experiments to explore the possibility that the segmental and prosodic aspects of production planning interact. Initial results suggest that metrically regular utterances (with repeated use of the same stress pattern) provoke fewer segmental errors than metrically irregular utterances. This preliminary finding is compatible with Shattuck-Hufnagel's hypothesis that the serial ordering process for sublexical elements occurs during the integration of syntactic/morphological information with the prosodic framework of the planned utterance. It may be that, when the prosody of an utterance is partially predictable, with stresses occurring at regular intervals, the planning process is less demanding and serial ordering operates more reliably.

1.6 Speech Research Relating to Special Populations

1.6.1 A Psychophysically-based Vowel Perception Model for Users of Pulsatile Cochlear Implants

Pulsatile multichannel cochlear implants that encode second formant frequency as electrode position (F0-F2 strategy) represent different vowels by stimulating different electrodes across the electrode array. Thus, the ability of a subject to identify different vowels is limited by his ability to discriminate stimulation delivered to different electrodes. A vowel perception model was developed for users of such devices. This model is an extension of the Durlach-Braida model of intensity resolution. It has two parts: an "internal noise" model that accounts for vowel confusions made as a consequence of poor perceptual sensitivity, and a "decision" model that accounts for vowel identification errors due to response bias. The decision model postulates that each stimulus is associated with a "response center" which represents the sensation expected by the subject in response to that stimulus. To test the vowel perception model, we used d' scores from an electrode identification experiment with a male subject and predicted this subject's vowel confusion matrix employing three sets of response centers. The "natural" set was determined by the places that would be maximally stimulated by each vowel in a normal cochlea; the "optimal" set was determined by the places actually stimulated by each vowel during the vowel identification experiment; and the "good fit" set used response centers that were manually adjusted to improve the model's prediction of the real matrix. Model predictions were tested against a vowel confusion matrix obtained from the same subject in a separate study.

Results using the optimal set of response centers are encouraging: the model successfully predicted some characteristics of the confusion matrix, in spite of a five-year lapse between the experiment that provided the sensitivity data and the experiment in which the confusion matrix was obtained. Nevertheless, there were discrepancies between the predicted and real confusion matrices. These discrepancies are interestingly biased: the subject frequently mistook some vowels for other vowels with higher F2. These errors are consistent with a set of response centers that are close to the optimal set, but somewhat shifted in the direction of the natural set. One implication of these preliminary results is that cochlear implant users make use of the central nervous system's plasticity, but at the same time this plasticity may not be unlimited.

1.6.2 The Effect of Fixed Electrode Stimulation on Perception of Spectral Information

This study was carried out in collaboration with Margaret W. Skinner and Timothy A. Holden of the Department of Otolaryngology-Head and Neck Surgery, Washington University School of Medicine, Saint Louis. The study attempts to find the underlying reasons for improvements in speech perception by users of the Nucleus cochlear implant when they employ the new Multipeak stimulation strategy instead of the older F0-F1-F2 strategy. For voiced sounds, the F0-F1-F2 strategy stimulates two channels every pitch period; these electrodes are chosen based on instantaneous estimates of F1 and F2. The Multipeak strategy stimulates two additional channels in the basal region of the cochlea. Channels 4 and 7 are typically used for voiced sounds, with delivery of pulses whose energy is proportional to the acoustic energy found in the 2.0-2.8 kHz and 2.8-4.0 kHz ranges.

We designed an experiment that employed a Multipeak-like condition, an F0-F1-F2 condition, and a control condition in which fixed-amplitude pulses were sent to channels 4 and 7. Two subjects were tested using vowel and consonant identification tasks. The control condition provided vowel identification performance at least as good as the Multipeak condition, and significantly better than the F0-F1-F2 condition.


Consonant identification results for the control condition were better than those for the F0-F1-F2 condition, but not as good as those for the Multipeak condition. These results suggest that part of the improvement observed with the Multipeak strategy is due to stimulation of the fixed basal channels, which act as perceptual "anchors" or references, allowing subjects to identify the position of electrode pairs that vary in position along the electrode array.

1.6.3 Acoustic Parameters of Nasal Utterance in Hearing-Impaired and Normal-Hearing Speakers

One of the more prevalent abnormalities in the speech of the hearing impaired that contributes to reduced intelligibility is inadvertent nasality. Theoretically, a wider first formant should reflect a greater loss of sound energy within the nasal cavity, which has a relatively large surface area compared to the oral cavity for vowels. A more prominent spectral peak in the vicinity of 1 kHz should reflect a larger velopharyngeal opening, according to a theoretical analysis of a nasal configuration based on admittance curves. From acoustic analysis of the speech of hearing-impaired children, reduced first-formant prominence and the presence of an extra pole-zero pair between the first and second formants characterize the spectra of nasalized vowels. The difference between the amplitude of the first formant and the amplitude of the extra peak, A1-P1, was a measure that correlated with listener judgments of the degree of vowel nasalization. To obtain further validation of these parameters as measures of nasalization, A1 and the pole-zero spacing were systematically manipulated in synthetic utterances. Perceptual experiments with synthesized, isolated static vowels [i], [I], [o], [u] showed that both parameters contributed to the perception of nasality. Another perceptual experiment with a number of synthesized words (of the form bVt) gave results indicating a somewhat different relative importance of the two parameters. Correlation of A1-P1 with the average nasality perception judgments of ten listeners was found for both groups of stimuli. Another acoustic characteristic of the speech of the hearing-impaired children is the change of the extra peak amplitude over time within the vowel, due to their lack of coordination of movement of the velum with movement of other articulators. From a pilot test, time variation of the first-formant bandwidth and of the frequency of the nasal zero had a greater effect on nasality judgments when the average A1-P1 was small. This finding suggests that a somewhat nasalized vowel is perceived to be even more nasal when there is large velum movement during production of the vowel.
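A sketch of how A1 - P1 might be measured from a smoothed vowel spectrum; the search band for the extra nasal peak and the window around F1 are assumed values, since the report does not specify the measurement procedure.

```python
import numpy as np

def a1_minus_p1(freqs, spec_db, f1_hz, search=(800.0, 1200.0)):
    """A1 - P1 nasality measure from a smoothed vowel spectrum.

    freqs: array of bin frequencies (Hz); spec_db: spectrum amplitude (dB).
    A1: amplitude of the spectral peak nearest the first formant f1_hz.
    P1: amplitude of the extra nasal peak, taken here as the strongest
    value in an assumed search band in the vicinity of 1 kHz."""
    def peak_db(lo, hi):
        sel = (freqs >= lo) & (freqs <= hi)
        return spec_db[sel].max()
    a1 = peak_db(f1_hz - 100, f1_hz + 100)
    p1 = peak_db(*search)
    return a1 - p1   # smaller values indicate a more nasalized vowel
```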

1.7 Models for Lexical Representation and Lexical Access

As we continue to develop models for lexical access from acoustic patterns of speech, we have been attempting to organize and refine the representation of words, segments, and features in the lexicon. The present version of the model of the lexicon organizes phonological segments into four classes according to the way the features are represented in the sound.

Two of these classes are (1) the vowels and (2) the glides and syllabic consonants. A distinguishing attribute of both these classes is that the spectrum of the sound that is produced when the segments are implemented is either relatively fixed or changes only slowly. There are no acoustic discontinuities, and all of the features of the segments are derived from the spectra sampled over a relatively long time interval and from how these spectra change with time. Both classes are produced with a source at the glottis. Within this group of segments, the vowels are considered to form a class in that the vocal tract does not have a narrow constriction that gives a distinctive acoustic attribute. They are characterized entirely by their formant pattern, and no articulator can be regarded as primary. The glides and syllabic consonants, on the other hand, have spectral characteristics that are imposed by forming a constriction with a particular articulator.

A third class of segments (traditionally classified as [+consonantal]) is produced by making a narrow constriction with a specific articulator. Formation or release of this constriction creates a discontinuity in the sound. The landmark generated by this discontinuity forms a nucleus around which acoustic information about the various features of the consonant is located, possibly over a time interval of ±50 milliseconds or more. The fourth class consists of the clicks. The sound is of high intensity and relatively brief. The features identifying a click are contained in the spectral characteristics within a few milliseconds of the click release. These clicks are not directly relevant to English, except that they highlight the contrast between sounds that are of high intensity and concentrated in time (the clicks) and sounds of lower intensity where evidence for the features is spread over a longer time interval around an acoustic discontinuity.

In the proposed lexical representation, the classes of segments and the features within these classes are organized in a hierarchical fashion.


In the case of consonants, a distinction is made between features designating the primary articulator that forms the constriction and features designating the activity of other articulators.

As a part of a project in which automatic procedures for lexical access are being developed, we are building a computer-based lexicon of words in which the representation of features for the segments in the words reflects the classification noted above. There are currently about 150 words in the lexicon. The marking of feature values within the lexicon includes the possibility of labeling a feature as being subject to modification by the context. For example, in the word bat, the features designating place of articulation for the final consonant are subject to modification (as in batman), but the feature [-continuant] is not. A second component of the implementation of the lexical access model is to develop automatic procedures for identifying landmarks and determining the features whose acoustic correlates appear in the vicinity of the landmarks. Initial efforts in this direction are in progress.
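One way such entries might be represented is sketched below; the feature names and the data structure are our own illustration rather than the actual format of the lexicon.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    """One feature value on a segment, with a flag marking whether the
    value may be modified by context (as place is in bat + man)."""
    name: str           # e.g., "continuant", "coronal"
    value: bool         # True for +, False for -
    modifiable: bool = False

@dataclass
class Segment:
    symbol: str
    features: list = field(default_factory=list)

# sketch of the final /t/ of "bat": place features are marked as
# context-modifiable, but [-continuant] is fixed
bat_final_t = Segment("t", [
    Feature("continuant", False, modifiable=False),
    Feature("coronal", True, modifiable=True),   # can assimilate, as in batman
])
```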

1.8 Speech Analysis and Synthesis Facilities

Several modifications have been made to the Klattools that are used for analysis and synthesis of speech. The KLSYN88 synthesizer has been augmented to include the possibility of selecting a new glottal source (designed by Dr. Tirupattur Ananthapadmanabha) that has potential advantages and flexibility relative to the current inventory of glottal sources in the synthesizer. An inverse-filtering capability has also been incorporated in the synthesizer program to allow for the possibility of generating a glottal source derived from a natural utterance.

The KLSPEC analysis program has been modified to include the possibility of obtaining a spectrum that is an average of a number of DFT spectra sampled at one-ms intervals over a specified time span. This type of averaging is especially useful when examining the spectra of noise bursts and fricative consonants. It also has application when using a short time window for measuring the spectrum of a voiced sound, to avoid the necessity of careful placement of the window within each glottal period.
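A sketch of this kind of spectrum averaging; the window length, FFT size, and power-domain averaging are illustrative choices rather than KLSPEC's actual settings.

```python
import numpy as np

def averaged_spectrum(x, fs, t_start, t_span, win_ms=6.4, nfft=512):
    """Average of DFT power spectra computed at 1-ms intervals over a span.

    x: waveform samples; fs: sampling rate (Hz); t_start/t_span in seconds."""
    wlen = int(win_ms * fs / 1000)
    assert len(x) >= int((t_start + t_span) * fs) + wlen, "waveform too short"
    window = np.hanning(wlen)
    specs = []
    for t in np.arange(0.0, t_span, 0.001):    # one spectrum per millisecond
        i = int((t_start + t) * fs)
        frame = x[i:i + wlen] * window
        specs.append(np.abs(np.fft.rfft(frame, nfft)) ** 2)
    # averaging power across frames smooths both the random fluctuations of
    # frication noise and the dependence on position within a glottal period
    avg_db = 10 * np.log10(np.mean(specs, axis=0))
    return np.fft.rfftfreq(nfft, 1 / fs), avg_db
```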

1.9 Publications

1.9.1 Published Papers

Alwan, A. "Modelling Speech Perception in Noise: A Case Study of the Place of Articulation Feature." Proceedings of the 12th International Congress of Phonetic Sciences 2: 78-81 (1991).

Bickley, C.A. "Vocal-fold Vibration in a Computer Model of a Larynx." In Vocal Fold Physiology. Eds. J. Gauffin and B. Hammarberg. San Diego: Singular, 1991, pp. 37-46.

Boyce, S.E., R.A. Krakow, and F. Bell-Berti. "Phonological Underspecification and Speech Motor Organisation." Phonology 8: 219-236 (1991).

Burton, M.W., S.E. Blumstein, and K.N. Stevens. "A Phonetic Analysis of Prenasalized Stops in Moru." J. Phonetics 20: 127-142 (1992).

Halle, M., and K.N. Stevens. "Knowledge of Language and the Sounds of Speech." In Music, Language, Speech and Brain. Eds. J. Sundberg, L. Nord, and R. Carlson. Basingstoke, Hampshire: Macmillan Press, 1991, pp. 1-19.

Kuhl, P.K., K. Williams, F. Lacerda, K.N. Stevens, and B. Lindblom. "Linguistic Experience Alters Phonetic Perception in Infants by Six Months of Age." Science 255: 606-608 (1992).

Lane, H., J. Perkell, M. Svirsky, and J. Webster. "Changes in Speech Breathing Following Cochlear Implant in Postlingually Deafened Adults." J. Speech Hear. Res. 34: 526-533 (1991).

Perkell, J.S. "Models, Theory and Data in Speech Production." Proceedings of the 12th International Congress of Phonetic Sciences 1: 182-191 (1991).

Perkell, J.S., E.B. Holmberg, and R.E. Hillman. "A System for Signal Processing and Data Extraction from Aerodynamic, Acoustic and Electroglottographic Signals in the Study of Voice Production." J. Acoust. Soc. Am. 89: 1777-1781 (1991).

Price, P.J., M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. "The Use of Prosody in Syntactic Disambiguation." J. Acoust. Soc. Am. 90: 2956-2970 (1991).


Price, P.J., M. Ostendorf, and S. Shattuck-Hufnagel. "Disambiguating Sentences Using Prosody." Proceedings of the 12th International Congress of Phonetic Sciences 2: 418-421 (1991).

Shattuck-Hufnagel, S. "Acoustic Correlates of Stress Shift." Proceedings of the 12th International Congress of Phonetic Sciences 4: 266-269 (1991).

Stevens, K.N. "Some Factors Influencing the Precision Required for Articulatory Targets: Comments on Keating's Paper." In Papers in Laboratory Phonology I. Eds. J.C. Kingston and M.E. Beckman. Cambridge: Cambridge University Press, 1991, pp. 471-475.

Stevens, K.N. "Vocal-fold Vibration for Obstruent Consonants." In Vocal Fold Physiology. Eds. J. Gauffin and B. Hammarberg. San Diego: Singular, 1991, pp. 29-36.

Stevens, K.N. "The Contribution of Speech Synthesis to Phonetics: Dennis Klatt's Legacy." Proceedings of the 12th International Congress of Phonetic Sciences 1: 28-37 (1991).

Stevens, K.N., and C.A. Bickley. "Constraints Among Parameters Simplify Control of Klatt Formant Synthesizer." J. Phonetics 19: 161-174 (1991).

Svirsky, M.A., and E.A. Tobey. "Effect of Different Types of Auditory Stimulation on Vowel Formant Frequencies in Multichannel Cochlear Implant Users." J. Acoust. Soc. Am. 89: 2895-2904 (1991).

Wilde, L.F., and C.B. Huang. "Acoustic Properties at Fricative-vowel Boundaries in American English." Proceedings of the 12th International Congress of Phonetic Sciences 5: 398-401 (1991).

1.9.2 Papers Submitted for Publication

Holmberg, E., R. Hillman, J. Perkell, and C. Gress. "Relationships Between SPL and Aerodynamic and Acoustic Measures of Voice Production: Inter- and Intra-speaker Variation." J. Speech Hear. Res.

Perkell, J., M. Svirsky, M. Matthies, and M. Jordan. "Trading Relations Between Tongue-body Raising and Lip Rounding in Production of the Vowel /u/." PERILUS, the working papers of the Department of Phonetics, Institute of Linguistics, Stockholm. Forthcoming.

Perkell, J., and M. Matthies. "Temporal Measures of Anticipatory Labial Coarticulation for the Vowel /u/: Within- and Cross-Subject Variability." J. Acoust. Soc. Am.

Perkell, J., H. Lane, M. Svirsky, and J. Webster. "Speech of Cochlear Implant Patients: A Longitudinal Study of Vowel Production." J. Acoust. Soc. Am.

Shattuck-Hufnagel, S. "The Role of Word and Syllable Structure in Phonological Encoding in English." Cognition. Forthcoming.

Stevens, K.N. "Lexical Access from Features." Workshop on Speech Technology for Man-Machine Interaction, Tata Institute of Fundamental Research, Bombay, India, 1990.

Stevens, K.N. "Speech Synthesis Methods: Homage to Dennis Klatt." In Talking Machines: Theories, Models, and Applications. Eds. G. Bailly and C. Benoit. New York: Elsevier. Forthcoming.

Stevens, K.N. "Models of Speech Production." In Handbook of Acoustics. Ed. M. Crocker. New York: Wiley. Forthcoming.

Stevens, K.N., S.E. Blumstein, L. Glicksman, M. Burton, and K. Kurowski. "Acoustic and Perceptual Characteristics of Voicing in Fricatives and Fricative Clusters." J. Acoust. Soc. Am. Forthcoming.

Stevens, K.N. "Phonetic Evidence for Hierarchies of Features." In LabPhon3. Ed. P. Keating. Los Angeles: University of California. Forthcoming.

Svirsky, M., H. Lane, J. Perkell, and J. Webster. "Effects of Short-Term Auditory Deprivation on Speech Production in Adult Cochlear Implant Users." J. Acoust. Soc. Am. Forthcoming.

Wightman, C., S. Shattuck-Hufnagel, M. Ostendorf, and P.J. Price. "Segmental Durations in the Vicinity of Prosodic Phrase Boundaries." J. Acoust. Soc. Am. Forthcoming.
