
© 2005, IT - Instituto de Telecomunicações. All rights reserved.

Arlindo Veiga (1,2)

Sara Candeias (1)

Fernando Perdigão (1,2)

(1) Instituto de Telecomunicações, Polo de Coimbra, Portugal

(2) Universidade de Coimbra, DEEC, Portugal

STIL 2011: 8th Symposium in Information and Human Language Technology

Oct. 24-26, 2011, Cuiabá, Brazil

GENERATING A PRONUNCIATION DICTIONARY

FOR EUROPEAN PORTUGUESE

USING A JOINT-SEQUENCE MODEL

WITH EMBEDDED STRESS ASSIGNMENT


SUMMARY

• Goal

• Problem Statement

• G2P System

• Joint-Sequence Model

• Stressed Vowel Assignment

• Results

• Conclusions

STIL 2011 - Cuiabá, Brazil - Oct.24-26 2011


GOAL


• To Generate a Pronunciation Dictionary for EP

• To Develop a G2P System for EP


PROBLEM STATEMENT

How? By implementing an automatic G2P converter.

What approaches?

• Linguistic rules

  • Portuguese orthography is roughly phonologically based, so rules give good coverage of the association between graphemes and phonemes

  • No natural language fully satisfies this assumption: the association between G and P is not quite one-to-one, so a list of exceptions is required

  • Very complex, hard and tiresome


• Statistics

  • Using pronunciation examples, it may be possible to predict the pronunciation of unseen words by analogy

  • Is not smart enough on its own:

    vaga -> v "a g 6 vs. vagarosa -> v 6 g 6 r "O z 6



• MIXED

G2P SYSTEM

System based on a mixed approach founded on:

• a stochastic model: the joint-sequence model

• rules for stressed vowel assignment


JOINT-SEQUENCE MODEL

Alignment between graphemes and phonemes: “one-to-one”

< B r a s i l >

/ b r 6 z i l /

The two sequences do not always have the same length (6 graphemes vs. 4 phonemes, 3 graphemes vs. 5 phonemes):

< c h a m o u >

/ S 6 m o /

< t ê m >

/ t 6~ i~ 6~ i~ /





• Implementing the Levenshtein algorithm (“1-01”)

• Defining alternative symbols

  • Graphemes: digraphs are replaced by single symbols

    < c h a m o u >

    < S a m º >

    / S 6 m o /

  • Phonemes: SAMPA symbols are replaced by single characters (UniChar)

    < t ê m >

    / t 6~ i~ 6~ i~ /

    / t i i /

    / t Æ i /
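
The substitution and alignment steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the digraph table contains only the two substitutions visible in the <chamou> example, the names `DIGRAPHS`, `collapse_digraphs` and `align` are hypothetical, and the unit edit costs are an assumption (the slides do not say how the “1-01” variant weights its operations).

```python
# Only the two digraph substitutions shown in the <chamou> example;
# the full table used by the authors is not given in the slides.
DIGRAPHS = {"ch": "S", "ou": "º"}


def collapse_digraphs(word, table=DIGRAPHS):
    """Replace each known digraph by a single alternative symbol."""
    symbols, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in table:
            symbols.append(table[pair])
            i += 2
        else:
            symbols.append(word[i])
            i += 1
    return symbols


def align(graphemes, phonemes, gap="_"):
    """Levenshtein alignment with unit costs (an assumption).

    Returns (grapheme, phoneme) pairs; `gap` marks an empty side.
    """
    n, m = len(graphemes), len(phonemes)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(graphemes[i - 1] != phonemes[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner, preferring one-to-one pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(graphemes[i - 1] != phonemes[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            pairs.append((graphemes[i - 1], phonemes[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((graphemes[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, phonemes[j - 1]))
            j -= 1
    return pairs[::-1]


graphemes = collapse_digraphs("chamou")
pairs = align(graphemes, "S 6 m o".split())
```

For <chamou>, the digraph step yields four symbols, so the alignment pairs each of them with exactly one phoneme.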


< c h a m o u >

< S a m º >

/ S 6 m o /

< t ê m >

/ t Æ i /

Graphonemes

GOAL: to compute the most probable pronunciation of a word given the word's graphoneme form

TECHNIQUE: using n-grams
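
As a rough illustration of the n-gram technique, the sketch below builds graphonemes from aligned symbol pairs and estimates a bigram model over them. It is a simplification under stated assumptions: the slides do not give the toolkit, the smoothing method or the search procedure, so this uses plain maximum-likelihood bigrams with no smoothing and omits decoding. The names `graphonemes`, `train_bigrams` and `log_prob` are hypothetical.

```python
import math
from collections import Counter


def graphonemes(graphemes, phonemes):
    """Join aligned grapheme/phoneme symbols into graphoneme units."""
    if len(graphemes) != len(phonemes):
        raise ValueError("sequences must already be aligned one-to-one")
    return [f"{g}:{p}" for g, p in zip(graphemes, phonemes)]


def train_bigrams(sequences):
    """Count bigrams of graphonemes, with start and end markers."""
    context, bigram = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram


def log_prob(seq, context, bigram):
    """Unsmoothed log-probability of a graphoneme sequence."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        count = bigram[(prev, cur)]
        if count == 0:
            return float("-inf")  # unseen bigram; a real model would smooth
        total += math.log(count / context[prev])
    return total


chamou = graphonemes(["S", "a", "m", "º"], ["S", "6", "m", "o"])
tem = graphonemes(["t", "ê", "m"], ["t", "Æ", "i"])
context, bigram = train_bigrams([chamou, tem])
score = log_prob(chamou, context, bigram)
```

A full converter would search over candidate graphoneme sequences for a given spelling and return the phoneme side of the highest-scoring one.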







G2P SYSTEM

• Several errors due to incorrect stress assignment:

solidamente, incansavelmente






G2P SYSTEM

Marking the stressed vowel (Vstressed) improved the statistical model by expressing graphoneme classes unequivocally

6 rules


STRESSED VOWEL ASSIGNMENT

For adverbs ending in <mente> (<rápido> → <rapidamente>, fast → quickly):

• An algorithm divides the word into two parts, <ROOT> and <mente>.

• The <ROOT> part is handled by a specific module (a list of graphematic patterns in which the stressed vowel is identified).

To generate a univocal graphoneme, we assigned special symbols to the stressed vowels.
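
The <ROOT> / <mente> split can be sketched as below. This is only an illustration: the authors' list of graphematic patterns is not given in the slides, so `ROOT_STRESS` is a hypothetical two-entry stand-in, and using an uppercase letter as the "special symbol" for the stressed vowel is likewise an assumption.

```python
SUFFIX = "mente"

# Hypothetical stand-in for the pattern list: root -> index of stressed vowel.
ROOT_STRESS = {"rapida": 1, "solida": 1}


def split_mente(word):
    """Split an adverb in -mente into (root, suffix); otherwise (word, '')."""
    if word.endswith(SUFFIX) and len(word) > len(SUFFIX):
        return word[:-len(SUFFIX)], SUFFIX
    return word, ""


def mark_stress(word):
    """Mark the stressed vowel of the root with a special symbol.

    The special symbol here is simply the uppercase vowel (an assumption).
    Words without a known root are returned unchanged.
    """
    root, suffix = split_mente(word)
    index = ROOT_STRESS.get(root)
    if not suffix or index is None:
        return word
    return root[:index] + root[index].upper() + root[index + 1:] + suffix


marked = mark_stress("rapidamente")
```

Because the marked vowel is a distinct symbol, it forms its own graphoneme class in the statistical model.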

VOCABULARY

To estimate the graphoneme model:

• SpeechDat pronunciation dictionary

  • 15k entries

• Deletion of foreign words

• Change of some transcriptions

• Standardization of the pronunciation

Applied to the CETEMPúblico vocabulary: 40k words, 40k pronunciations

DICTIONARY

CETEMPúblico 40k pronunciations:

• Iterative procedure:

  • Long manual verification

  • Correction of the transcriptions

• Comparison to the pronunciations of LOQUENDO:

  • The majority of the transcriptions agreed.

  • The transcriptions from our dictionary were the correct ones most of the time.

This dictionary was used for the training and test procedures.

EXPERIMENTS

All experiments were based on the dictionary of 40k pronunciations, in two versions:

• with stress marking

• without stress marking

To train and test the model, each of these two dictionaries was partitioned into five folds for a cross-validation procedure.

Final results were obtained by averaging the five partial results.
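
The five-fold procedure can be sketched as follows. The slides do not say how entries were assigned to folds, so the shuffled round-robin split, the fixed seed, and the names `k_fold_splits` and `cross_validate` are assumptions; `evaluate` stands for any function that trains on one list and returns an error rate on the other.

```python
import random


def k_fold_splits(entries, k=5, seed=0):
    """Yield (train, test) pairs; each entry is tested exactly once."""
    items = list(entries)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


def cross_validate(entries, evaluate, k=5):
    """Average the k partial results returned by `evaluate(train, test)`."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(entries, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(range(10)))
mean_test_size = cross_validate(range(10), lambda train, test: len(test))
```

The same call would be made once for the dictionary with stress marking and once for the dictionary without it.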

RESULTS

The performance of the G2P conversion system was expressed as two average error rates: the phoneme error rate (PER) and the word error rate (WER).
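
The two error rates can be computed as in the sketch below. The slides do not define them precisely, so this assumes the common convention: PER is the total phoneme edit distance divided by the number of reference phonemes, and WER is the fraction of words whose predicted transcription differs from the reference. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (PER, WER) for parallel lists of phoneme sequences."""
    phone_errors = sum(edit_distance(r, h)
                       for r, h in zip(references, hypotheses))
    phone_total = sum(len(r) for r in references)
    word_errors = sum(list(r) != list(h)
                      for r, h in zip(references, hypotheses))
    return phone_errors / phone_total, word_errors / len(references)


refs = ["S 6 m o".split(), "b r 6 z i l".split()]
hyps = ["S a m o".split(), "b r 6 z i l".split()]
per, wer = error_rates(refs, hyps)
```

In this toy case one of ten reference phonemes is wrong and one of two words contains an error.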

The following figures summarize the results obtained using n-grams with n between 2 and 8.


The use of n-grams with large contexts (n greater than 5) did not improve the system. In fact, there was a slight increase in the error rates, attributed to the lack of samples for estimating large contexts.


Marking the stressed vowel contributed to a significant improvement in system performance.



CONCLUSIONS


The joint-sequence model with embedded stress assignment achieved good results.

By inspecting the test errors, we observed that most of them resulted from uncommon grapheme patterns or from compound words without graphic stress marks.

The most frequent errors involved the pronunciation of stressed <e> and <o>, which can be pronounced as /E/ vs. /e/ (<selo>: verb vs. noun) and /O/ vs. /o/ (<ovos>, plural, vs. <ovo>, singular) without any systematic rule.

Thank you

Our system is freely available at http://www.co.it.pt/~labfala/g2p/ and includes the models, the dictionaries and the G2P converter.


INTRODUCTION


Generate a Pronunciation Dictionary for EP

• Grapheme-to-Phoneme conversion (G2P)

  Bom dia → b"o~ d"i6 (English: Good morning)

• Applications: component of ASR and TTS systems, e.g. in language learning, machine translation, …

• For correct pronunciation we need:

  • G2P, stress assignment

• Contributions of this paper:

  • Show phonological constraints (stressed vowel)

  • Evaluate a mixed approach for a G2P system

  • Make the dictionary (with the model and the converter) publicly available
