
Construction of phoneme-to-phoneme converters
---------------------------------------------

P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a four-step training procedure:

• Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription

• Transformation retrieval
• Generation of training examples: describe the linguistic context

  - Previous and next phonemes and graphemes
  - Lexical context (Part of Speech)
  - Prosodic context (stressed syllable or not)
  - Morphological context (word prefix/suffix)
  - External features: e.g. name type, name source, speaker tongue

• Rule induction
  - Learn a decision tree per input (pattern): stochastic rules in the leaf nodes
  - Rule formalism: if context → leaf node, then [input pattern] → [output pattern] with probability P

In generation mode, the rules are applied to the initial G2P transcription of unseen names to generate pronunciation variants with probabilities.
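To make the rule formalism and generation mode concrete, the Python sketch below applies a small set of stochastic P2P rules to an initial G2P transcription and returns the most probable variants. The rule class, the position-wise application and the example devoicing rule are assumptions made for the sake of illustration; they are not the Autonomata implementation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List

@dataclass
class P2PRule:
    input_pattern: str                            # phoneme matched in the initial transcription
    output_pattern: str                           # rewrite proposed by the leaf node
    prob: float                                   # probability attached to the leaf node
    context: Callable[[list, list], bool]         # predicate over (left, right) phoneme context

def generate_variants(initial: List[str], rules: List[P2PRule], top_n: int = 4):
    """Apply the rules position by position to an initial G2P transcription and
    return the top_n most probable pronunciation variants with their probability."""
    options = []
    for i, phone in enumerate(initial):
        left, right = initial[:i], initial[i + 1:]
        cands = [(r.output_pattern, r.prob) for r in rules
                 if r.input_pattern == phone and r.context(left, right)]
        keep_prob = max(1.0 - sum(p for _, p in cands), 1e-6)   # "leave unchanged" option
        cands.append((phone, keep_prob))
        options.append(cands)
    variants = {}
    for combo in product(*options):
        phones = " ".join(p for p, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants[phones] = max(variants.get(phones, 0.0), prob)
    return sorted(variants.items(), key=lambda kv: -kv[1])[:top_n]

# Hypothetical rule: devoice /v/ to /f/ after a word boundary ('_'), with probability 0.6
rules = [P2PRule("v", "f", 0.6, lambda left, right: not left or left[-1] == "_")]
print(generate_variants(["d", "I", "r", "k", "_", "v", "A", "n"], rules))
```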

Towards improved proper name recognition
Bert Réveil and Jean-Pierre Martens

DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

{breveil,martens}@elis.ugent.be

Topic description
-----------------

Automatic proper name recognition is a key component of many speech-based applications (e.g. voice-driven navigation systems). Recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:

• Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters can’t cope with archaic spelling and foreign name parts, and manual transcriptions are too costly (e.g. Ugchelsegrensweg, Haînautlaan)

• Multiple plausible name pronunciations: within or across languages (e.g. Roger)
• Cross-lingual pronunciation variation: foreign names, foreign application users

In order to improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling aims at better models of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.

Experimental set-up
-------------------

Database: Autonomata Spoken Name Corpus (ASNC)
• 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
• Every speaker reads 181 names with either Dutch, English, French, Moroccan or Turkish origin
• Non-overlapping train and test set (disjoint names and speakers)
• Human expert transcriptions

  - TY: typical Dutch transcription (one for each name from TeleAtlas)
  - AV: auditory verified Dutch transcription (one for each name utterance)

This work: only Dutch native utterances + non-native utterances of Dutch names

Speech recognizer: state-of-the-art VoCon 3200 from Nuance
• Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)

[Figure: recognition system of a voice-driven GPS — the spoken query “Please guide me towards ‘A&u.stIn” is decoded with HMM acoustic models and a name lexicon containing entries such as Austin → ‘O.stIn]

Table 1: Number of utterances for all (speaker,name) pairs in train and test set

Set            DU     EN     FR     MO     TU
(DU,*) train   9960   1909   966    1245   943
       test    4440   851    414    555    437
(*,DU) train   9960   3000   1680   3360   1560
       test    4440   1800   720    1440   840

Acknowledgments
---------------

The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.


Acoustic and lexical modeling strategies
----------------------------------------

The modeling approaches are first conceived for the primary target users, also called the native (NAT) users (in our case, Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).

Strategy 1: Incorporating NN1 language knowledge

• Acoustic modeling: two model sets
  - AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
  - AC-MULTI: Dutch (20%) and NN1 training data (English, French and German)

• Lexical modeling
  - G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)
  - Foreign transcriptions are nativized in combination with AC-MONO (see the sketch below)
  - Data-driven selection of one extra G2P converter per name origin
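To make the nativization idea concrete, the minimal Python sketch below maps NN1 phonemes that are absent from the Dutch inventory onto close Dutch phonemes. The mapping table and the phone inventories shown are hypothetical fragments for illustration only; the actual mapping depends on the phone sets of the RealSpeak G2P converters and of the Dutch acoustic models.

```python
# Hypothetical English-to-Dutch phoneme mapping (assumed, for illustration only)
EN_TO_DU = {
    "T": "t",    # English 'th' in "thin"  -> closest Dutch plosive (assumed)
    "D": "d",    # English 'th' in "this"  -> closest Dutch plosive (assumed)
    "{": "E",    # English vowel in "cat"  -> closest Dutch vowel   (assumed)
    "r\\": "r",  # English approximant r   -> Dutch r               (assumed)
}
# Assumed fragment of the native Dutch phone inventory
DUTCH_PHONES = {"p", "t", "k", "b", "d", "f", "v", "s", "z", "m", "n", "r", "l",
                "A", "E", "I", "O", "a", "e", "i", "o", "u", "@", "."}

def nativize(nn1_phones):
    """Map every NN1 phoneme that is absent from the Dutch phone set onto a close
    Dutch phoneme, so the transcription can be decoded with the AC-MONO models."""
    return [p if p in DUTCH_PHONES else EN_TO_DU.get(p, p) for p in nn1_phones]

# e.g. an English-flavoured transcription containing /{/ and /T/ becomes Dutch-only
print(nativize(["{", "n", ".", "T", "O", "n"]))   # -> ['E', 'n', '.', 't', 'O', 'n']
```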

Strategy 2: Creating pronunciation variants (lexical modeling)
  - Computed per (speaker, name) combination
  - Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters

[Figure: P2P construction pipeline — the orthography (e.g. “Dirk Van Den Bossche”) is aligned letter-to-sound with the initial transcription, which is in turn aligned sound-to-sound with the target transcription; transformation learning, example generation with high-level features, learning of morphological classes and stochastic rule induction complete the training.]
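The sound-to-sound (and, analogously, the letter-to-sound) alignment can be pictured as a standard edit-distance alignment of two symbol strings. The sketch below is a generic dynamic-programming alignment under unit costs, not the exact alignment or cost function used in the Autonomata toolkit.

```python
def align(src, tgt, sub_cost=1, gap_cost=1):
    """Align two phoneme sequences with a Levenshtein-style dynamic program and
    return aligned (initial, target) pairs; '-' marks an insertion or deletion."""
    n, m = len(src), len(tgt)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_cost
    for j in range(1, m + 1):
        D[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if src[i - 1] == tgt[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + match,
                          D[i - 1][j] + gap_cost,      # delete a source phoneme
                          D[i][j - 1] + gap_cost)      # insert a target phoneme
    pairs, i, j = [], n, m                             # backtrace
    while i > 0 or j > 0:
        match = sub_cost if i == 0 or j == 0 or src[i - 1] != tgt[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + match:
            pairs.append((src[i - 1], tgt[j - 1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap_cost:
            pairs.append((src[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", tgt[j - 1])); j -= 1
    return pairs[::-1]

# e.g. initial "f A n" vs. target "v A n" yields the substitution pair (f, v)
print(align(["f", "A", "n"], ["v", "A", "n"]))
```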

Experimental assessment
-----------------------

Incorporating NN1 language knowledge

• Including extra G2P transcriptions (acoustic model = AC-MONO)
  - Boost for (DU, non-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names
  - Degradation for (DU,DU): reduced by selecting only one extra G2P

• Decoding with multilingual acoustic model
  - NAT speakers: loss for NAT names, boost for English names only
    - Dutch sounds not as well modeled as before
    - English better known than French?
    - English and Dutch sound inventories differ more than French and Dutch?
  - Foreign speakers: boost for both NN1 name origins (mother tongue sounds better modeled)

• Plain multilingual G2P transcriptions bring no improvement

Creating pronunciation variants

• Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
  - Alternative P2Ps for (DU,NN1) and (NN1,DU) cells:
    - create an additional P2P that starts from NN1 G2P transcriptions
    - combine the most probable variants generated by both P2P converters (see the sketch below)
• P2P variants lead to significant improvements for all (speaker, name) cells
  - 10..25% relative for NAT + foreign names, 5..17% for foreign speakers
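A minimal sketch of the variant combination step, assuming each P2P converter returns (transcription, probability) pairs and that at most four variants per name are kept, as in Table 3. The example transcriptions and probabilities are invented for illustration.

```python
def combine_variants(baseline_variants, alternative_variants, max_variants=4):
    """Merge the variant lists produced by the baseline (Dutch-initial) and the
    alternative (NN1-initial) P2P converter, keep the best probability per
    transcription, and return the max_variants most probable ones."""
    merged = {}
    for transcription, prob in baseline_variants + alternative_variants:
        merged[transcription] = max(merged.get(transcription, 0.0), prob)
    return sorted(merged.items(), key=lambda kv: -kv[1])[:max_variants]

# Hypothetical variant lists for the name "Austin" (probabilities are invented)
baseline = [("'O.stIn", 0.55), ("'Au.stIn", 0.25), ("'o.stIn", 0.10)]
alternative = [("'O:.stIn", 0.40), ("'O.stIn", 0.35)]
print(combine_variants(baseline, alternative))
```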

Table 2: Name Error Rate (%) for systems with G2P lexicons

(spkr,name)  System                                 DU    EN    FR    MO    TU
(DU,*)       AC-MONO + DUN G2P                      6.5   38.5  21.3  14.6  28.4
             AC-MONO + 4G2P (nativized)             7.2   22.7   9.9   9.5  17.2
             AC-MONO + G2P-selection (nativized)    6.5   20.8   7.2   9.0  18.1
             AC-MULTI + G2P-selection (nativized)   8.5   14.9   7.2   8.3  16.2
             AC-MULTI + G2P-selection (plain)       8.5   14.0   7.7   8.6  18.1
(*,DU)       AC-MONO + DUN G2P                      6.5   25.1  33.2  26.9  40.8
             AC-MONO + 4G2P (nativized)             7.2   22.8  32.2  27.0  40.6
             AC-MONO + G2P-selection (nativized)    6.5   22.8  31.1  25.3  38.5
             AC-MULTI + G2P-selection (nativized)   8.5   17.6  22.6  25.2  38.6
             AC-MULTI + G2P-selection (plain)       8.5   18.2  22.6  25.8  40.4

Table 3: Name Error Rate (%) for systems with P2P transcription variants

(spkr,name)  System                                 DU    EN    FR    MO    TU
(DU,*)       AC-MULTI + G2P-selection (nativized)   8.5   14.9   7.2   8.3  16.2
             + 4 P2P variants (baseline)            7.7   13.2   6.3   7.0  11.9
             + 4 P2P variants (alternative)         7.7   12.2   6.3   7.0  11.9
(*,DU)       AC-MULTI + G2P-selection (nativized)   8.5   17.6  22.6  25.2  38.6
             + 4 P2P variants (baseline)            7.7   17.2  19.9  24.0  35.2
             + 4 P2P variants (alternative)         7.7   16.4  18.8  24.0  35.2