Construction of phoneme-to-phoneme converters
P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a 4-step training procedure:
• Two-fold alignment: orthography ↔ initial transcription ↔ target transcription
• Transformation retrieval
• Generation of training examples: describe the linguistic context
  - Previous and next phonemes and graphemes
  - Lexical context (part of speech)
  - Prosodic context (stressed syllable or not)
  - Morphological context (word prefix/suffix)
  - External features: e.g. name type, name source, speaker tongue
• Rule induction: learn a decision tree per input pattern, with stochastic rules in the leaf nodes
  Rule formalism: if context → leaf node, then [input pattern] → [output pattern] with probability P
In generation mode the rules are applied to the initial G2P transcription of unseen names, yielding transcription variants with probabilities.
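The generation step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a simplified rule format (a focus phoneme plus one left and one right context symbol), and the rules and probabilities below are invented for the example.

```python
import itertools

# Hypothetical rules: (focus, left_ctx, right_ctx, output, probability).
# A rule fires when the focus phoneme and its immediate context match.
RULES = [
    ("A", "v", "n", "A", 0.7),   # keep /A/ in this context
    ("A", "v", "n", "$", 0.3),   # or reduce it to schwa
    ("I", "d", "r", "i", 0.6),   # e.g. Dutch /I/ realized as /i/
    ("I", "d", "r", "I", 0.4),
]

def variants(phonemes, rules=RULES):
    """Expand an initial G2P transcription into weighted pronunciation
    variants by applying the stochastic rewrite rules per position."""
    options = []
    for i, ph in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else "#"
        right = phonemes[i + 1] if i + 1 < len(phonemes) else "#"
        matches = [(out, p) for f, l, r, out, p in rules
                   if f == ph and l == left and r == right]
        options.append(matches or [(ph, 1.0)])  # no rule: keep phoneme
    for combo in itertools.product(*options):
        prob = 1.0
        out = []
        for sym, p in combo:
            prob *= p
            out.append(sym)
        yield " ".join(out), prob

for v, p in sorted(variants("d I r k".split()), key=lambda x: -x[1]):
    print(f"{p:.2f}  {v}")
```

With the toy rules above, "d I r k" expands into "d i r k" (0.60) and "d I r k" (0.40); the real converters learn such rules, with richer context features, from the aligned training tuples.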
Towards improved proper name recognition
Bert Réveil and Jean-Pierre Martens
DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{breveil,martens}@elis.ugent.be
Topic description
Automatic proper name recognition is a key component of multiple speech-based applications (e.g. voice-driven navigation systems). This recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:
• Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters cannot cope with archaic spellings and foreign name parts (e.g. Ugchelsegrensweg, Haînautlaan), and manual transcriptions are too costly
• Multiple plausible name pronunciations: within or across languages (e.g. Roger)
• Cross-lingual pronunciation variation: foreign names, foreign application users
To improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a better modeling of the expected utterance sounds; lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.
Experimental set-up
Database: Autonomata Spoken Name Corpus (ASNC)
• 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
• Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin
• Non-overlapping train and test sets (disjoint names and speakers)
• Human expert transcriptions
  - TY: typical Dutch transcription (one per name, from TeleAtlas)
  - AV: auditory verified Dutch transcription (one per name utterance)
This work: only Dutch native utterances + non-native utterances of Dutch names
Speech recognizer: state-of-the-art VoCon 3200 from Nuance
• Grammar: name loop with 21K different names (3.5K ASNC names + 17.5K others)
[Figure: recognition system overview — a spoken query ("Please guide me towards Austin") is decoded by HMM acoustic models against a lexicon mapping each name to phonemic transcriptions, e.g. Austin → 'O.stIn.]
Table 1: Number of utterances for all (speaker, name) pairs in train and test set
(columns: name origin for the (DU,*) rows, speaker tongue for the (*,DU) rows)

(spkr,name)  Set    DU    EN    FR    MO    TU
(DU,*)       train  9960  1909   966  1245   943
             test   4440   851   414   555   437
(*,DU)       train  9960  3000  1680  3360  1560
             test   4440  1800   720  1440   840
Acknowledgments
The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.
Acoustic and lexical modeling strategies
The modeling approaches are conceived first for the primary targeted users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).
Strategy 1: Incorporating NN1 language knowledge
• Acoustic modeling: two model sets
  - AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
  - AC-MULTI: trained on Dutch (20%) and NN1 data (English, French and German)
• Lexical modeling
  - G2P transcribers for the NAT and NN1 languages (Nuance RealSpeak TTS);
    foreign transcriptions are nativized in combination with AC-MONO
  - Data-driven selection of one extra G2P converter per name origin
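The data-driven selection step above amounts to picking, per name origin, the one extra G2P converter that minimizes error on held-out data. A minimal sketch; the converter names and error figures below are invented for illustration:

```python
# Hypothetical dev-set name error rates (%) per candidate extra G2P
# converter, broken down by name origin; numbers are illustrative only.
dev_error = {
    "EN": {"EN-G2P": 21.0, "FR-G2P": 35.0, "DE-G2P": 33.0},
    "FR": {"EN-G2P": 18.0, "FR-G2P": 8.0,  "DE-G2P": 16.0},
}

def select_extra_g2p(dev_error):
    """Pick, per name origin, the single extra G2P converter whose
    transcriptions yield the lowest dev-set name error rate."""
    return {origin: min(scores, key=scores.get)
            for origin, scores in dev_error.items()}

print(select_extra_g2p(dev_error))
```

For the numbers above this selects EN-G2P for English names and FR-G2P for French names; limiting each name to one extra converter keeps the lexicon small and limits the degradation on native names reported below.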
Strategy 2: Creating pronunciation variants (lexical modeling)
• Computed per (speaker, name) combination
• Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters
[Figure: P2P training pipeline — the orthography, initial transcription and target transcription enter two alignment processes (letter-to-sound and sound-to-sound); the aligned data feed transformation learning, training example generation with high-level features, morphological class learning, and stochastic rule induction. Example alignment for "Dirk Van Den Bossche":
  orthography:            ~ D i r k () V a n () D e n () ~ B o ssch e
  initial transcription:  ' d I r k _  f A n _  d E n _  ' b O . s $
  target transcription:   ' d i r k _  v A n _  d $ m _  ' b O . s $ ]
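The sound-to-sound alignment between an initial and a target transcription can be sketched as a standard edit-distance (dynamic programming) alignment. This is a generic Levenshtein alignment with unit costs, not the authors' exact alignment procedure:

```python
def align(src, tgt, sub_cost=1, indel_cost=1):
    """Edit-distance alignment of two phoneme sequences; returns a list
    of (source, target) pairs, with '_' marking insertions/deletions."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]  # DP cost table
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if src[i - 1] == tgt[j - 1] else sub_cost),
                d[i - 1][j] + indel_cost,   # delete from source
                d[i][j - 1] + indel_cost,   # insert into target
            )
    # Backtrace the cheapest path into aligned symbol pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1]
                + (0 if src[i - 1] == tgt[j - 1] else sub_cost)):
            pairs.append((src[i - 1], tgt[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + indel_cost:
            pairs.append((src[i - 1], "_")); i -= 1
        else:
            pairs.append(("_", tgt[j - 1])); j -= 1
    return pairs[::-1]

print(align("f A n".split(), "v A n".split()))
# pairs like (f, v) become candidate phoneme transformations
```

Aligned pairs where source and target differ (here f → v) are exactly the transformations that the retrieval step collects, together with their context, as training examples for rule induction.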
Experimental assessment
Incorporating NN1 language knowledge
• Including extra G2P transcriptions (acoustic model = AC-MONO)
  - Boost for (DU, non-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names
  - Degradation for (DU,DU): reduced by selecting only one extra G2P
• Decoding with the multilingual acoustic model
  - NAT speakers: loss for NAT names, boost for English names only.
    Dutch sounds are not as well modeled as before; is English better known than French, or do the English and Dutch sound inventories differ more than the French and Dutch ones?
  - Foreign speakers: boost for both NN1 name origins, since mother tongue sounds are better modeled
• Plain multilingual G2P transcriptions bring no improvement
Creating pronunciation variants
• Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
• Alternative P2Ps for the (DU,NN1) and (NN1,DU) cells
  - create an additional P2P that starts from NN1 G2P transcriptions
  - combine the most probable variants generated by both P2P converters
• P2P variants lead to significant improvements for all (speaker, name) cells
  - 10 to 25% relative for NAT speakers + foreign names, 5 to 17% for foreign speakers
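Combining the most probable variants of two P2P converters, as in the alternative set-up above, can be sketched as merging two weighted variant lists and keeping the top-N entries for the lexicon. A minimal illustration; the transcriptions and probabilities are made up, and the tie-handling (keeping the higher probability per duplicate) is an assumption:

```python
import heapq

def combine_variants(variants_a, variants_b, top_n=4):
    """Merge two weighted variant lists (e.g. from the Dutch-initial and
    the NN1-initial P2P converter), de-duplicating transcriptions by
    keeping the higher probability, then keep the top_n for the lexicon."""
    merged = {}
    for trans, p in list(variants_a) + list(variants_b):
        merged[trans] = max(merged.get(trans, 0.0), p)
    return heapq.nlargest(top_n, merged.items(), key=lambda kv: kv[1])

du_p2p = [("'O.stIn", 0.5), ("'A&u.stIn", 0.3)]   # Dutch-initial P2P output
en_p2p = [("'O:.stIn", 0.6), ("'O.stIn", 0.2)]    # English-initial P2P output
print(combine_variants(du_p2p, en_p2p, top_n=3))
```

The merged list then replaces the single G2P transcription in the name-loop lexicon, which is how the "+ 4 P2P variants" systems in Table 3 are built.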
Table 2: Name Error Rate (%) for systems with G2P lexicons

(spkr,name)  System                                DU    EN    FR    MO    TU
(DU,*)       AC-MONO + DUN G2P                     6.5  38.5  21.3  14.6  28.4
             AC-MONO + 4G2P (nativized)            7.2  22.7   9.9   9.5  17.2
             AC-MONO + G2P-selection (nativized)   6.5  20.8   7.2   9.0  18.1
             AC-MULTI + G2P-selection (nativized)  8.5  14.9   7.2   8.3  16.2
             AC-MULTI + G2P-selection (plain)      8.5  14.0   7.7   8.6  18.1
(*,DU)       AC-MONO + DUN G2P                     6.5  25.1  33.2  26.9  40.8
             AC-MONO + 4G2P (nativized)            7.2  22.8  32.2  27.0  40.6
             AC-MONO + G2P-selection (nativized)   6.5  22.8  31.1  25.3  38.5
             AC-MULTI + G2P-selection (nativized)  8.5  17.6  22.6  25.2  38.6
             AC-MULTI + G2P-selection (plain)      8.5  18.2  22.6  25.8  40.4
Table 3: Name Error Rate (%) for systems with P2P transcription variants

(spkr,name)  System                                DU    EN    FR    MO    TU
(DU,*)       AC-MULTI + G2P-selection (nativized)  8.5  14.9   7.2   8.3  16.2
             + 4 P2P variants (baseline)           7.7  13.2   6.3   7.0  11.9
             + 4 P2P variants (alternative)        7.7  12.2   6.3   7.0  11.9
(*,DU)       AC-MULTI + G2P-selection (nativized)  8.5  17.6  22.6  25.2  38.6
             + 4 P2P variants (baseline)           7.7  17.2  19.9  24.0  35.2
             + 4 P2P variants (alternative)        7.7  16.4  18.8  24.0  35.2