
Pronunciation Variation is Key to Understanding Spoken Language
Steven Greenberg, The Speech Institute

or: For Whom the Bell Tolls (with apologies to Ernest Hemingway and John Donne: "Do not ask for whom the bell tolls; it tolls for thee")

Acknowledgements and Thanks
Research funding: U.S. Department of Defense, U.S. National Science Foundation
Scientific collaborators: Takayuki Arai, Hannah Carvey, Shuangyu Chang, Ken Grant, Leah Hitchcock
For further information, consult the web site.

Setting the Stage

Take Home Messages
- Pronunciation variation lies at the heart of spoken language, and therefore provides the key to dramatically improving the quality of speech technology (particularly automatic recognition and synthesis).
- To realize this technological potential, it is essential to understand the principles of pronunciation variation from a SCIENTIFIC perspective.
- A detailed statistical analysis of spontaneous telephone dialogues (American English) provides the empirical basis for the following generalizations:
- The SYLLABLE, rather than the PHONE, is the basic organizational unit of spoken language; hence the difficulty for ANY phonetic orthography to accurately delineate pronunciation patterns in fine detail.
- The syllable carries prosodic weight (a.k.a. accent or prominence) that affects the manner in which its constituents are phonetically realized.
- The syllabic constituents (ONSET, NUCLEUS and CODA) behave dramatically differently from one another, and influence the phonetic character of the syllable; thus, syllable position may be as important as segmental identity for characterizing pronunciation.
- The MICROSTRUCTURE of the syllable can be delineated in terms of articulatory-acoustic features (e.g., voicing, articulatory manner and place).
- MANNER of articulation most closely parallels (in time and behavior) the classical concept of the phonetic segment, and sets the basic intensity mode for the sequence of syllabic constituents (the ENERGY ARC).
- The ENERGY ARC reflects cortical processing constraints on the acoustic (and visual) signal associated with the MODULATION SPECTRUM.
- PLACE of articulation is an inherently TRANS-SEGMENTAL feature that binds vocalic nuclei with preceding and following consonants.
- Articulatory PLACE provides the discriminative (entropic) basis for lexical identity, and is therefore important to model accurately.
- VOICING spreads from the nucleic core of the syllable both forward (towards the coda) and backward (towards the onset), the degree of temporal spreading reflecting the magnitude of prosodic prominence; in this sense, VOICING is a SYLLABIC rather than a phonetic-segment feature, in that it is sensitive to the prominence of the syllable.
- It is the PATTERN of INTERACTION among articulatory-feature dimensions across time that imparts to the syllable its specific phonetic identity.
- The specific realization of articulatory features is governed by their position within the syllable, as well as by prosodic prominence.
- The prosodic pattern of an utterance reflects the information contained within the utterance.
- Therefore, it is ultimately INFORMATION (and lexical discriminability) that governs the detailed phonetic properties of spoken language; pronunciation variation largely reflects the information contained in spoken language.
- For additional details, consult my paper in the ICPhS Proceedings.

Précis of the Presentation
Due to time constraints, I'll focus on just a few issues in this presentation:
- First, I'll examine the current state of automatic speech recognition (ASR) systems from the perspective of pronunciation modeling, discussing some of the problems confronting current-generation ASR technology.
- Then, I'll examine pronunciation variation as observed in a particular corpus of spontaneous dialogues, Switchboard, which has been manually annotated at the phonetic and prosodic levels.
- The key insight garnered from this material is that the variation observed is systematic when it is organized into syllabic units and syllable prominence is explicitly marked.
- Specifically, different parts of the syllable appear to play separate roles in pronunciation, roles that are particularly manifest in conjunction with prosodic prominence.
- Ultimately, it is the entropy (or information) associated with a syllable and its constituents that appears to account for the specific phonetic realization of the segments contained within.
- Syllable onsets are inherently more informative than codas, and are therefore more often canonically realized.
- The nucleus carries much of the prosodic weight of the syllable; vocalic identity is highly constrained by the syllable's prominence.
- These syllable-centric principles are illustrated within the context of a novel ASR system incorporating articulatory features and prosody.

The Current State of Automatic Speech Recognition

The Future of Speech Recognition: The Good
Gordon Moore on the future of technology (July 10, 2002): "Good speech recognition will be a transforming capability when it finally comes into being, where you'll be able to talk into your computer and it will be able to understand what you're saying in context. It will know if you mean 'to' or 'two' or 'too.' Once the computer understands speech at that level, you'll be able to have an intelligent conversation with your computer. That can change a lot of things. First of all, it will make computing available to the people who are scared off by keyboards and such. Secondly, it will change the way we use them completely. I don't know if that's 10 years away or 50 years away. I think it's something that certainly will be coming down the road and it will be really transforming when it does. ... I suspect that it's closer to the 50 years than the 10 years to get to the level that I'm talking about."

The Future of Speech Recognition: The Bad and The Ugly
Does Gordon Moore know what he's talking about? Probably; after all, this is the guy who literally defined Moore's Law (semiconductor-based microchips double in capacity every 18 months).
Why should one be so pessimistic, given the advances made by ASR technology over the past decade? Some (not-so-random) statistics:

Word Recognition Error Rates for State-of-the-Art Systems (American English)
  DIGITS                0.3%
  NUMBERS               4.0%
  READ TEXT (WSJ)       6.0%
  TELEPHONE DIALOGUES  20.0%

The sort of computer applications envisioned by Moore are closer to dialogues than to digits or read text. Why is conversational speech so much more difficult to recognize than other forms of spoken language?

The Central Challenge for Models of Speech Recognition: Phonemic Beads on a String

In traditional models of speech recognition (by machine), words are represented as mere sequences of phonetic segments (phones), strung together like beads on a string, analogous (in some measure) to the orthographic representation in a dictionary. Prosody and other (extra)syllabic properties play little (if any) role.

In the front end, each time frame is obligatorily associated with a rank-ordered set of phone (as well as non-speech) classes, along with a probability score. Each word in the ASR lexicon is associated with a pre-defined sequence of phonemic elements (the pronunciation models, which the slide dubs "Middle Earth" between front end and back end). In the back end, the language model provides the word co-occurrence statistics for over-riding the (inherently faulty) phonetic classifiers.

Acoustic models (phonetic classifiers), e.g. for a single frame:
  /f/ 0.40   /s/ 0.20   /sh/ 0.10   /z/ 0.05   /v/ 0.02   /zh/ 0.01   etc.

Pronunciation models:
  Five   /f/ /ay/ /v/
  Four   /f/ /ao/ /r/
  Six    /s/ /ih/ /k/ /s/
  One    /w/ /ah/ /n/
  Ten    /t/ /eh/ /n/
  Eight  /ey/ /t/
  etc.

Language models (word co-occurrence probabilities):
  Word 1  Word 2  Prob
  Five    Nine    0.20
  Five    One     0.10
  Four    Nine    0.05
  Four    One     0.02
  Six     Ten     0.01
  Six     Nine    0.01
  etc.
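To make the beads-on-a-string pipeline concrete, here is a minimal sketch; it is my illustration, not any of the systems discussed. The lexicon entries are taken from the slide's example; the per-frame phone posteriors and the uniform frame-to-phone allocation are invented stand-ins for real acoustic models and HMM alignment.

```python
import math

# Pronunciation models: each word is a fixed "beads on a string" phone sequence
# (entries taken from the slide's example lexicon).
LEXICON = {
    "five":  ["f", "ay", "v"],
    "four":  ["f", "ao", "r"],
    "six":   ["s", "ih", "k", "s"],
    "one":   ["w", "ah", "n"],
    "ten":   ["t", "eh", "n"],
    "eight": ["ey", "t"],
}

def word_log_score(frames, word):
    """Score a word by assigning an equal share of frames to each phone
    and summing log posteriors: a crude stand-in for HMM alignment."""
    phones = LEXICON[word]
    per_phone = max(1, len(frames) // len(phones))
    score = 0.0
    for i, phone in enumerate(phones):
        chunk = frames[i * per_phone:(i + 1) * per_phone] or frames[-1:]
        for posterior in chunk:
            score += math.log(posterior.get(phone, 1e-6))
    return score

# Hypothetical per-frame phone posteriors (one dict per 10-ms frame).
frames = [{"f": 0.40, "s": 0.20, "sh": 0.10},
          {"f": 0.35, "s": 0.25},
          {"ay": 0.50, "ao": 0.30},
          {"ay": 0.55, "eh": 0.10},
          {"v": 0.45, "f": 0.15},
          {"v": 0.40, "n": 0.10}]

best = max(LEXICON, key=lambda w: word_log_score(frames, w))
print(best)  # "five" wins on these made-up posteriors
```

Note how rigid the model is: a word can only score well if the realized phone sequence tracks its single canonical form, which is exactly the assumption the rest of the presentation calls into question.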
Is this an accurate way to model and characterize spoken language?
- If it were, then current speech recognition systems, which are predicated on such a perspective, would experience little difficulty decoding speech.
- In fact, such ASR systems require extensive, time-consuming training on material similar to that of the task at hand in order to function well.
- Moreover, ASR systems require detailed statistical knowledge of the WORDS spoken in the task (top-down processing) in order to do well, implying that phonetic decoding (bottom-up) is insufficient, by itself, in ASR.
- Why?

A Challenge for the Phonemic Beads-on-a-String Approach to Speech Recognition: Pronunciation Variability

Empirical Basis of the Statistical Analyses
The data described in the remainder of this presentation derive from the SWITCHBOARD corpus, consisting of hundreds of brief (5-10 minute) telephone dialogues. There is a lot of diversity in the transcribed material: it spans the speech of both genders (ca. 50/50%) and reflects a wide range of American dialectal variation, speaking rate and voice quality.
This material has been MANUALLY annotated by highly trained transcribers:
- 1 hour LABELED and SEGMENTED at the phonetic-segment level
- 4 hours LABELED at the phone level and SEGMENTED with respect to syllable boundaries; the latter material was SEGMENTED into PHONES using AUTOMATIC methods
Transcription system: a variant of Arpabet, a fairly broad phonetic transcription orthography.
Phonetic Transcription of Spontaneous English: the data are available on the web.
Annotation of Stress Accent
Forty-five minutes of the phonetically annotated portion of the Switchboard corpus was manually labeled with respect to stress accent (prominence). Three levels of accent were distinguished: Heavy (1), Light (0.5), None (0). These data are also available on the web.

Pronunciation Variability of Real Speech
A statistical analysis of the Switchboard corpus reveals that there are literally dozens of ways in which common words may be (and are) pronounced, as illustrated for the 20 most frequent words in the corpus, which together account for 35% of the word tokens.
(Table: for each of the 20 most frequent words, the number of tokens, the number of distinct pronunciations, the most common pronunciation, and its share of the tokens.)

Two Questions
How DO listeners decode the speech signal, given the HUGE variation in pronunciation? And can machine algorithms be designed to emulate the strategies used by humans?
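The per-word statistics just described (token count, number of distinct pronunciations, most common pronunciation) are straightforward to compute once the corpus is available as word-aligned phonetic transcripts. A minimal sketch, assuming the data arrive as (word, phone-string) pairs; the example tokens below are invented:

```python
from collections import Counter, defaultdict

# Hypothetical word-aligned phonetic transcripts: (orthographic word, realized phones).
tokens = [
    ("that", "dh ae t"),
    ("that", "dh ae"),
    ("that", "ae t"),
    ("that", "dh ae t"),
    ("and",  "ae n d"),
    ("and",  "en"),
]

# Tally the distinct pronunciations observed for each word.
pron_counts = defaultdict(Counter)
for word, phones in tokens:
    pron_counts[word][phones] += 1

for word, counts in pron_counts.items():
    n = sum(counts.values())
    mcp, mcp_n = counts.most_common(1)[0]
    print(f"{word}: {n} tokens, {len(counts)} pronunciations, "
          f"most common [{mcp}] = {100 * mcp_n / n:.0f}%")
```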
How Important is Pronunciation Variation in ASR?
Does pronunciation variation have a significant impact on speech recognition performance? It all depends.
For limited-vocabulary tasks (such as digit strings, numbers, airplane flight information, etc.), where the recognition system knows what words to expect, it is not so important to model pronunciation variation, because it is possible to model the variability relatively well with statistical models based on comprehensive training material.
However, reliance on extensive training means that recognition performance requires the training material to be highly representative of the task domain. If the training material is unrepresentative of the task (in terms of pronunciation and the identity of the words spoken, both in isolation and in conjunction with other words in the utterance), ASR performance suffers significantly.
Thus, commercial recognition systems require extensive training material, which is both time-consuming and expensive to collect and annotate. This is one of the major reasons why so few companies have yet made a profit DIRECTLY from automatic speech recognition.

For large-vocabulary tasks, pronunciation variation is probably one of the key parameters determining performance. How do we know this? There are two separate lines of evidence.
First, one can compare the performance of several different ASR systems on the same task (as was done by Greenberg and Chang, 2000). The best single overall predictor of word-recognition performance (on a spontaneous telephone dialogue task, Switchboard) is the complexity of the pronunciation models within the system; the correlation (r) is 0.84.
(Figure: percent word correct plotted against average pronunciations per word for the submission sites.)

Second, one can artificially control the relationship of the input data (in this instance, sequences of phonetic segments from the Switchboard transcription material described earlier) with the word models in the recognition lexicon (as was done by McAllaster and Gillick, 1998):
- The baseline system (no changes from the typical) error rate was 40%.
- If a perfect phonetic transcript is provided to the recognizer (using a manual annotation of the corpus), the error rate is reduced significantly; however, the amount of error reduction depends on the relationship between the phonetic sequences and the word models.
- If there is a PERFECT match between the phonetic sequences and all of the word models, the error rate is reduced by 88% (i.e., to 5% WER).
- If the word models include phonetic sequences other than those encountered in the data, the error rate is reduced by only 50% (to 20% WER).

Importance of Pronunciation Models for ASR
These two studies imply that the capability of predicting pronunciation variability with precision could significantly improve ASR performance, particularly if it were possible to match sequences of phonetic segments with word models stored in the lexicon with a minimum of ambiguity. To achieve such a capability, it is first necessary to delineate the factors underlying pronunciation variation in spontaneous speech. This topic forms the focus of the remainder of the presentation.
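A rough way to see why the match between realized phone sequences and word models matters so much is to measure, for a reference transcript, how often the realized pronunciation of a word token is actually present in the recognition lexicon. The sketch below is only in the spirit of the McAllaster and Gillick manipulation, not a reconstruction of their experiment; the lexicon and tokens are invented.

```python
# Hypothetical lexicon: each word may list several pronunciation variants.
lexicon = {
    "that": {"dh ae t", "dh ae", "dh ax t"},
    "and":  {"ae n d", "ax n d"},
}

# Hypothetical manually transcribed tokens: (word, realized phone string).
tokens = [("that", "dh ae t"), ("that", "ae t"),
          ("and", "en"), ("and", "ae n d")]

covered = sum(phones in lexicon.get(word, set()) for word, phones in tokens)
print(f"lexicon covers {covered}/{len(tokens)} realized pronunciations "
      f"({100 * covered / len(tokens):.0f}%)")
# Tokens whose realized form is absent from the lexicon can only be rescued
# top-down by the language model, which is one reading of why spontaneous
# dialogue degrades ASR performance so sharply.
```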
Pronunciation Variability of Real Speech
As mentioned earlier, there are literally dozens of ways in which common words are pronounced. From casual examination of the data, the situation appears chaotic and unsystematic, with no real pattern to the variation, as exemplified by the word "that."
(Table: the observed pronunciations of "that" with their token counts.)

Linguistic Dissection: Pronunciations
But this initial impression is deceptive. These 226 instances of "that" exhibit 63 DIFFERENT pronunciations (from ca. 1 hour of speech). This is a factor of 10 greater than the number of pronunciations found in any of the ASR systems evaluated; the bottom line is that most sites mis-recognized about 25% of the "that"s.
If we assume that [dh ae t] is the canonical pronunciation of "that," we can compare the actual to the canonical pronunciation. Among the 226 instances of "that" (678 canonical phones in all), there are:
- 176 phone substitutions (26% of the canonical phones)
- 101 phone deletions (15%)
- 3 phone insertions (0.4%)
What does this tell us about the pronunciation variation patterns?

The Importance of the Syllable
The analyses to follow are all linked, in some fashion, to the syllable. In order to highlight patterns germane to variation in segmental identity, it is necessary to partition the data in terms of ordinal position within the syllable, as well as stress accent (reflecting syllable prominence). As a consequence, we will examine syllable onsets, nuclei and codas separately in order to gain insight into the underlying patterns.

"That": Deviation from Canonical by Syllable Position
(Table: percent deviation from canonical, broken down into substitutions, deletions and insertions, for the onset [dh], nucleus [ae] and coda [t].)
- Onsets are the most canonical, with a greater number of substitutions than deletions.
- Nuclei have many substitutions but no deletions or insertions.
- Codas are the least canonical, and exhibit more deletions than substitutions.

The Importance of Syllable Structure
Now, let's examine some GENERAL patterns of pronunciation variation that are conditioned by BOTH syllable position and stress-accent level. In the analyses to follow, the phonetically realized data (from the phonetic transcripts) are compared directly to the canonical pronunciations (from a recognition lexicon); the analyses are therefore in terms of deviation from canonical pronunciation. Such data illustrate the sort of variation that is conditioned by position within the syllable (ONSET - NUCLEUS - CODA), and also gauge the impact of syllable prominence on phonetic patterning (HEAVY - LIGHT - NONE); a small alignment sketch follows.
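Substitution, deletion and insertion tallies of the kind reported above come from aligning each realized pronunciation against the canonical form. A minimal sketch: Python's difflib stands in for the dynamic-programming phone aligner a real analysis would use, and the realized tokens are invented.

```python
from difflib import SequenceMatcher

CANONICAL = {"that": ["dh", "ae", "t"]}

def deviation_counts(word, realized):
    """Align realized phones to the canonical form and tally deviations."""
    canon = CANONICAL[word]
    subs = dels = ins = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(None, canon, realized).get_opcodes():
        if op == "replace":
            paired = min(i2 - i1, j2 - j1)
            subs += paired
            dels += (i2 - i1) - paired   # unmatched canonical phones
            ins  += (j2 - j1) - paired   # unmatched realized phones
        elif op == "delete":
            dels += i2 - i1
        elif op == "insert":
            ins += j2 - j1
    return subs, dels, ins

# Hypothetical realized tokens of "that"
for realized in (["dh", "ae", "t"], ["dh", "eh"], ["ae", "t"], ["d", "ae", "t", "s"]):
    print(realized, deviation_counts("that", realized))
```

Summing these counts over all 226 tokens, and recording which syllable position each deviation falls in, reproduces the kind of breakdown shown in the tables.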
Pronunciation Variation: Syllable and Accent
All segments: stress accent has a direct impact on the probability of canonical pronunciation. Unaccented syllables are far more likely to be non-canonically pronounced than their accented counterparts. Both SYLLABLE STRUCTURE and STRESS-ACCENT LEVEL are required for delineating the deviation patterns in full.
- Substitutions are NUCLEUS territory: most of the substitution deviations occur in the nucleus. Stress-accent level has a profound impact on the probability of substitutions, particularly in the nucleus, but also in the onset (though not in the coda).
- Deletions are CODA territory: most of the deletion deviations occur in the coda. Stress accent has a significant impact on the probability of deletions, particularly in the coda, but in the onset as well.
- Insertions are ONSET territory: most of the insertions (relatively few in number) occur in the onset.
Summary: different components of the syllable are specialized with respect to pronunciation patterns (at least with respect to deviation from the canonical form). The NUCLEUS is associated with SUBSTITUTIONS, the CODA with DELETIONS, and the ONSET with INSERTIONS.
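Partitioning the deviation tallies by syllable position and stress-accent level is a simple cross-tabulation once each transcribed segment carries position, accent and outcome labels. A sketch, with an invented record format and made-up records:

```python
from collections import Counter

# Hypothetical per-segment records: (syllable position, accent level, outcome),
# where outcome is "canonical", "substitution", "deletion", or "insertion".
records = [
    ("onset",   "none",  "canonical"),
    ("onset",   "heavy", "canonical"),
    ("nucleus", "none",  "substitution"),
    ("nucleus", "heavy", "canonical"),
    ("coda",    "none",  "deletion"),
    ("coda",    "light", "deletion"),
    ("coda",    "heavy", "canonical"),
]

table = Counter((pos, accent, outcome) for pos, accent, outcome in records)
totals = Counter((pos, accent) for pos, accent, _ in records)

# Percent non-canonical per (position, accent) cell.
for (pos, accent), n in sorted(totals.items()):
    devs = n - table[(pos, accent, "canonical")]
    print(f"{pos:7s} {accent:5s}: {100 * devs / n:5.1f}% non-canonical")
```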
Pronunciation Variation: Consonants
Let's now examine patterns of pronunciation from the segmental perspective, starting with consonants; there are a few surprises in store. We will analyze the pronunciation patterns with respect to place of articulation (anterior, central and posterior), as shown below.

Place-of-Articulation-Based Analysis
The tongue contacts (or nearly contacts) the roof of the mouth in producing many of the consonantal sounds in English; place of articulation can also be associated with the lips (after Daniloff, 1973).
  ANTERIOR:  Labial [p] [b] [m]; Labio-dental [f] [v]; Inter-dental [th] [dh]
  CENTRAL:   Alveolar [t] [d] [n] [s] [z]
  POSTERIOR: Palatal [sh] [zh]; Velar [k] [g] [ng]
  CHAMELEON: Rhoticized [r]; Lateral [l]; Approximant [hh]
The segments are also analyzed by stress-accent level (heavy, light and none).

Road Map: How to Interpret the Data
In the tables that follow, compare the canonical counts ("Can") with the transcribed, i.e. phonetically realized, counts ("Trans"). Most pairs of numbers will be similar, indicating that the phonetic realization of the segment is canonical (C). A large disparity indicates a significant effect on pronunciation and is labeled N (non-canonical); N_o marks segments that are non-canonical in unaccented syllables.

Pronunciation Patterns: Syllable Onsets
The ANTERIOR and POSTERIOR onsets are usually CANONICALLY realized, implying that these segments carry much of the information distinguishing among syllables (and often words). The CENTRAL and place-CHAMELEON onsets are often non-canonical.

Pronunciation Patterns: Syllable Codas
The ANTERIOR and POSTERIOR codas are also CANONICALLY realized, similar in pattern to the onsets, while the CENTRAL codas are often non-canonical. The following observations illustrate why this may be so.

Preponderance of Coda Coronals
Nearly three-quarters of the CODA consonants are CORONALS (all accent levels combined, canonical elements); in contrast, there is a far more equitable distribution across place among onsets. The disparity in place distribution in coda position implies that coronals are a default category, able to sustain deletion without undue impact on the information contained within the syllable. In this sense, codas carry far less information than onsets. Stress accent has relatively little impact on the distribution of place in either onset or coda segments, particularly with respect to the preponderance of coronal segments in codas, suggesting that codas are inherently less informative than onsets regardless of accent level.

Comparison of Syllable Onsets and Codas
Onsets tend to be more stable (i.e., more canonical) than codas. The coronal segments are unstable in both contexts, but more so in codas, as are the place chameleons (which tend to behave like vowels). The unstable anterior and posterior phones are mostly ambi-syllabic junctures.
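One way to quantify the claim that codas carry less information than onsets is the Shannon entropy of the place-of-articulation distribution in each position: a heavily coronal-skewed coda inventory is simply less discriminative. The proportions below are illustrative only, chosen to echo the "nearly three-quarters coronal" figure, and are not the measured Switchboard values.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative place distributions (not the measured corpus counts).
onset_place = {"anterior": 0.35, "central": 0.30, "posterior": 0.20, "chameleon": 0.15}
coda_place  = {"anterior": 0.10, "central": 0.72, "posterior": 0.10, "chameleon": 0.08}

print(f"onset place entropy: {entropy(onset_place):.2f} bits")
print(f"coda  place entropy: {entropy(coda_place):.2f} bits")
# The skewed coda distribution has lower entropy: a deleted coda coronal
# costs relatively little lexical information, consistent with the slide's
# "default category" interpretation.
```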
Pronunciation Variation: Vowels
Let's now examine patterns of pronunciation from the vocalic perspective. As we'll see, prosodic prominence and vocalic identity are intertwined. In the following analyses, the classical vowel space is replotted so that the y-axis, instead of representing height or F1, is associated with the probability of the vowel occurring in an accented syllable; the x-axis represents the (hypothetical) front-back tongue position (and hence remains constant throughout the plots).

Spatial Patterning of Vocalic Identity and Stress
- The high, lax monophthongs are almost always unstressed; the low vowels are rarely unstressed.
- The high vowels are rarely heavily stressed; the low vowels are far more likely to be heavily stressed.
- An intermediate degree of stress accounts for the other vocalic instances.
In HEAVILY STRESSED nuclei there is a relatively even distribution of segments across the vowel space (canonical vowels only), with a slight bias towards the front and central vowels. In UNSTRESSED syllables, vowels are confined largely to the high-front and high-central sectors of the articulatory space; the low and mid vowels get creamed.
Stress accent thus exerts a profound effect on the character of the vowel space: high vowels are largely associated with unaccented syllables, while low vowels are mostly associated with accented forms. This distinction between accented and unaccented syllables is of profound importance for understanding (and modeling) pronunciation variation.
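Computationally, the replotted vowel space is just the conditional probability of heavy accent given the vowel. A sketch, with invented observations; accent levels follow the slide's Heavy (1), Light (0.5), None (0) scheme:

```python
from collections import Counter

# Hypothetical (vowel, accent) observations; accent in {1.0, 0.5, 0.0}.
observations = [("ih", 0.0), ("ih", 0.0), ("ih", 0.5),
                ("ae", 1.0), ("ae", 1.0), ("ae", 0.5),
                ("ax", 0.0), ("aa", 1.0), ("aa", 0.0)]

by_vowel = Counter(v for v, _ in observations)
accented = Counter(v for v, a in observations if a == 1.0)

for vowel in sorted(by_vowel):
    p = accented[vowel] / by_vowel[vowel]
    print(f"P(heavy accent | {vowel}) = {p:.2f}")
# Plotting this probability on the y-axis against front-back position on the
# x-axis reproduces the accent-conditioned vowel space described above.
```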
Information's Role in Phonetic Patterning
There is something special about coronal segments in coda position: a significant proportion of these segments are phonetically unrealized. One potential explanation pertains to the trajectory of the second formant (reflecting the front-cavity resonance). The locus (target) frequency of ALVEOLAR consonants is similar to the second formant of the front and central vowels. Given the preponderance of non-back vowels in the corpus (particularly in unaccented syllables), the second formant for vocalic segments preceding a coda consonant is likely to lie between 1500 and 2500 Hz. Thus, the absence of a coda segment points, by implication, to the alveolar place of articulation under many circumstances.
(Figure: the hypothesis illustrated schematically for heavily stressed and unstressed syllables, under the title "Why do Coronal Coda Segments Delete So Often?")

A Multi-Tier, Syllable-Centric Automatic Speech Recognition System
For a limited-vocabulary task (OGI Numbers95), the speech signal is modeled as a sequence of syllables, each with a variable amount of prominence. Each syllable consists of a vocalic nucleus, and optionally contains onset and coda elements. Each syllabic constituent is specified in terms of articulatory-acoustic features (AFs) such as manner, place, voicing, height and rounding, most of which are inherently trans-segmental.
(Diagram: stress-accent level over Syllables 1-3, each decomposed into onset, nucleus and coda, with AF tiers (manner, place, voicing, height, rounding, etc.) unfolding over time.)

Articulatory-Acoustic Feature Classification
MLP-based AF classifiers are trained on log-compressed critical-band energy features, with one classifier for each of the AF dimensions.

Structure of the Multi-Tier System
Speech inputs feed stress labeling and AF classification and segmentation; syllable-based word-hypothesis scoring then combines these with word models and pronunciation statistics to produce word scores. Each AF dimension is scored separately for onset, nucleus and coda, and the AF-dimension scores are combined within the syllable. (See Shawn Chang's thesis, available as an ICSI Technical Report, for additional details.)

Word Models
One canonical base form is used for each word in the vocabulary; a toy sketch of this constituent-matching idea appears below. The approach is illustrated for the word "six," [s] [ih] [k] [s] (Syllable 1, stress = 0.75):

  Feature     Onset [s]   Nucleus [ih]   Coda [k]   Coda [s]
  Manner      Fricative   Vocalic        Stop       Fricative
  Place       Alveolar    Front          Velar      Alveolar
  Voicing     Unvoiced    Voiced         Unvoiced   Unvoiced
  Round       -           Unround        -          -
  Height      -           High           -          -
  Dynamic     -           Static         -          -
  Tense/Lax   -           Lax            -          -
  Duration    14 (5.0)    8 (3.7)        9 (3.2)    12 (5.9)

Some fundamental questions concerning the recognition model:
- What AF dimensions should be used, and what are their relative contributions to recognition?
- How do inaccuracies in classification and segmentation affect recognition?
- How much does stress-accent information help?
- How effective is the model in capturing pronunciation variation?

Relative Contribution of AF Dimensions to ASR
Shapley's index provides an importance index for each of the AF dimensions (summing to 1.0), derived from the trained fuzzy measures; the greater the value, the more important the dimension. Place and manner of articulation are most important for this task.

Controlled Experiments
Corpus: spontaneously spoken numbers (OGI Numbers95). Three conditions:
- Baseline: entirely automatic recognition
- Fabricated: all information derived from manually labeled phonetic transcripts (except stress accent, which comes from AutoSAL)
- Half-way House: partially automatic and partially fabricated (phonetic segmentation, and whether a segment is vocalic or not, taken from the transcripts; otherwise entirely automatic)
The system was trained on ca. 2.5 hours of data and tested on ca. 1 hour.
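Here is a highly simplified sketch of the constituent-matching idea behind syllable-centric scoring: each lexical entry is a syllable template of AF values per constituent, and a hypothesis is scored by how well classifier outputs agree with the template. This is my illustration of the architecture, not the scoring function actually used in the system (for that, see Chang's thesis); the observed classifier decisions are invented.

```python
# Word model for "six", transcribed from the slide's template (one canonical
# base form; AF values per syllabic constituent).
SIX = {
    "onset":   {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced"},
    "nucleus": {"manner": "vocalic",   "place": "front",    "voicing": "voiced"},
    "coda1":   {"manner": "stop",      "place": "velar",    "voicing": "unvoiced"},
    "coda2":   {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced"},
}

def match_score(template, observed):
    """Fraction of AF values agreeing between template and classifier output
    (a stand-in for the system's per-dimension score combination)."""
    hits = total = 0
    for constituent, feats in template.items():
        for dim, value in feats.items():
            total += 1
            hits += observed.get(constituent, {}).get(dim) == value
    return hits / total

# Hypothetical AF-classifier decisions for one spoken token.
observed = {
    "onset":   {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced"},
    "nucleus": {"manner": "vocalic",   "place": "central",  "voicing": "voiced"},
    "coda1":   {"manner": "stop",      "place": "velar",    "voicing": "unvoiced"},
    "coda2":   {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced"},
}
print(f"six: {match_score(SIX, observed):.2f}")  # 11 of 12 features agree
```

Because the template is organized by constituent rather than by phone string, a non-canonical nucleus costs only a fraction of the score instead of breaking the whole beads-on-a-string match.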
Experiments and Analysis: Overall

  Condition        Word Error Rate
  Baseline               5.6%
  Half-way House         2.0%
  Fabricated             1.3%

Half-way House performance is much closer to the fabricated-data results than to the baseline system, suggesting the importance of accurate segmentation and knowledge of the location of syllabic nuclei (at least for this implementation). In a separate study (Chang, Greenberg & Wester, 2001), vocalic recognition was between 93 and 98% correct (for TIMIT sentences), and segmentation can be performed using such manner classifiers.

Effect of Pronunciation Variation on ASR
Pronunciation-variation information was withheld for each syllable position separately, or for all positions concurrently ("Standard" refers to the regular system with all pronunciation variation intact). Onset constituents appear to be the most canonical; codas are the least canonical and benefit most from pronunciation-variation information.
(Table: word error rate, %, for the Baseline, Half-way House and Fabricated conditions under Standard, Onset, Nucleus, Coda and All withholding.)

Experiments and Analysis: Syllable Positions
The contribution of each syllable position was tested by neutralizing information pertaining to the onset, nucleus and coda separately. The onset and nucleus constituents are more important for accurate recognition than the coda.
(Table: word error rate, %, for the Baseline, Half-way House and Fabricated conditions under Standard, Onset, Nucleus and Coda neutralization.)

Conclusions
- The phonemic beads-on-a-string model prevalent in current ASR systems is of limited utility, particularly for spontaneous discourse.
- Statistical analysis of the Switchboard corpus suggests that a syllable-centric approach combining information from many linguistic levels (i.e., multi-tier) has the potential to capture much of the salient variation in pronunciation observed in spontaneous speech.
- This multi-tier approach has been embedded in a test application for a limited-vocabulary task, and shows promise for future development.
- Specifically, it is possible, through controlled experiments, to quantitatively ascertain the contribution of various articulatory features and syllable constituents to ASR performance.
- Such a micro-management approach may be useful for developing novel ASR architectures specialized for spontaneous material.

That's All. Many Thanks for Your Time and Attention.

Supplementary Slides

The Energy Arc Illustrated
Syllables rise and fall in energy over the course of their duration. Vocalic nuclei are highest in amplitude; onset consonants gradually rise in energy, arching towards the peak, while coda consonants decline in amplitude, usually more abruptly than onsets.
(Figure: spectro-temporal profile (STeP), spectrogram and waveform of "seven.")

The Energy Arc Reflects the Modulation Spectrum
Low-frequency modulation of acoustic energy reflects the fluctuation of energy in the syllable. These slow modulations of energy are essential for the brain to digest and decode the speech signal; the energy arc reflects this intrinsic property of cortical processing, and articulatory manner is constrained by the energy arc.
(Figure: modulation spectrum and syllable duration, from Arai & Greenberg, 1997.)
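The modulation spectrum referred to here is, in essence, the spectrum of the slow amplitude envelope of the speech signal. Arai and Greenberg computed it per critical band; the single-band sketch below conveys the idea but is a generic illustration, not their exact procedure.

```python
import numpy as np

def modulation_spectrum(signal, fs, frame_ms=10):
    """Spectrum of the low-frequency amplitude envelope of `signal`.
    Single-band sketch; a faithful version would do this per critical band."""
    hop = int(fs * frame_ms / 1000)
    # Amplitude envelope: RMS energy per 10-ms frame.
    n_frames = len(signal) // hop
    env = np.sqrt(np.mean(
        signal[:n_frames * hop].reshape(n_frames, hop) ** 2, axis=1))
    env -= env.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(env))
    mod_freqs = np.fft.rfftfreq(len(env), d=frame_ms / 1000)
    return mod_freqs, spectrum

# Synthetic "speech": a noise carrier modulated at 4 Hz, a typical syllable rate.
fs = 8000
t = np.arange(0, 2.0, 1 / fs)
sig = (1 + np.sin(2 * np.pi * 4 * t)) * np.random.randn(len(t))

freqs, spec = modulation_spectrum(sig, fs)
print(f"peak modulation frequency: {freqs[spec.argmax()]:.1f} Hz")  # ~4 Hz
```

The peak of this spectrum for real speech sits in the low single-digit hertz range, which is the connection the slides draw between the energy arc, syllable duration and cortical processing constraints.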
Linguistic Dissection: Pronunciations
The ASR systems are not optimized for phone recognition; phone output is constrained by the pronunciation models. Should we look at the pronunciation models themselves?
(Figure: percent word correct vs. average pronunciations per word across submission sites, r = 0.84.)
It appears that the more pronunciation models per word, the better the recognition. Why not simply add more pronunciations? It is difficult to come up with new pronunciations, and increased confusability may degrade recognition (McAllaster et al., 1998).

A Test-bed Implementation: A Syllable, AF and Stress-Accent Model of Speech
(Diagram: speech inputs feed stress labeling and AF classification and segmentation; syllable-based word-hypothesis scoring draws on word models and pronunciation statistics to produce word scores.)

Summary of the Presentation
Visual cues (speechreading) can combine with the acoustic speech signal, providing information analogous to the modulation patterns.
(Figure: video-leads and audio-leads conditions spanning 40 to 400 ms, compared against a synchronous A/V baseline condition.)
The summary then restates the take-home messages listed at the start of the presentation.
Syllables and Phonetic Segments Illustrated
(Figure: syllable and phone segmentation from the OGI Numbers95 corpus; J = JUNCTURE.)
Syllables generally consist of three constituents: ONSET, NUCLEUS and CODA. Virtually all syllables contain a NUCLEUS, which is VOCALIC (by definition). Most (but not all) syllables also contain an ONSET (usually a CONSONANT), and many contain a CODA (also typically a CONSONANT).

Annotation of Stress Accent
Forty-five minutes of the phonetically annotated portion of the Switchboard corpus was manually labeled with respect to stress accent (prominence); three levels of accent were distinguished: Heavy (1), Light (0.5), None (0). An example of the annotation (attached to the vocalic nucleus) is shown in the original slides, for a case where the accent levels could not be derived from a dictionary: most of the syllables are unaccented, two are labeled as lightly accented (0.5), and one other is labeled as very lightly accented (0.25).
Syllable Prominence (Stress Accent)
Syllables vary with respect to their perceptual (and linguistic) prominence. Some syllables are heavily accented, while others are completely unaccented; a certain proportion are accented, but not heavily (i.e., intermediate). In English, the accent system is associated with syllabic stress, which is based on a broad constellation of acoustic parameters that includes duration, amplitude, vocalic spectrum and fundamental frequency (among others). Accent level is an important parameter for understanding pronunciation variation (at least in American English).
A straightforward means of illustrating the difference between accented and unaccented syllables is a STeP (Spectro-Temporal Profile): a full-spectrum 3-D profile computed over hundreds of instances of the word "seven" ([s] [eh] [vx] [en], OGI Numbers95).
(Figure: STeP of "seven," marking the accented and unaccented syllables, the onset and nuclei, mean durations, and the ambi-syllabic and pure junctures.)
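The acoustic correlates of stress accent listed above (duration, amplitude, vocalic spectrum, fundamental frequency) suggest a simple feature vector per syllable for automatic prominence labeling. Below is a sketch of that bookkeeping only, assuming syllable boundaries and an F0 track are already available; the function name, data layout and toy values are all invented for illustration and are not the AutoSAL procedure.

```python
import numpy as np

def prominence_features(signal, fs, syllables, f0_track, f0_hop_s=0.010):
    """Per-syllable correlates of stress accent: duration, RMS amplitude,
    and mean F0. `syllables` is a list of (start_s, end_s) boundaries and
    `f0_track` an array of F0 values sampled every `f0_hop_s` seconds
    (0 where unvoiced)."""
    feats = []
    for start, end in syllables:
        seg = signal[int(start * fs):int(end * fs)]
        f0 = f0_track[int(start / f0_hop_s):int(end / f0_hop_s)]
        voiced = f0[f0 > 0]
        feats.append({
            "duration_s": end - start,
            "rms": float(np.sqrt(np.mean(seg ** 2))) if len(seg) else 0.0,
            "mean_f0": float(voiced.mean()) if len(voiced) else 0.0,
        })
    return feats

# Toy example: two "syllables" in one second of synthetic audio,
# the first louder and higher-pitched (i.e., more prominent).
fs = 8000
signal = np.random.randn(fs) * np.concatenate(
    [np.full(fs // 2, 0.8), np.full(fs // 2, 0.3)])
f0_track = np.concatenate([np.full(50, 180.0), np.full(50, 120.0)])
for f in prominence_features(signal, fs, [(0.0, 0.5), (0.5, 1.0)], f0_track):
    print(f)
```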