
Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience

Helen Meng, Wai-Kit Lo, Alissa M. Harrison, Pauline Lee, Ka-Ho Wong, Wai-Kim Leung and Fanbo Meng

Presented by Chun-Yu Chen

Design and collection of the Chinese Learners of English corpus

• Speech data for deriving salient mispronunciations made by Chinese learners of English

• The corpus includes data from 100 Cantonese subjects and 111 Mandarin subjects

• Sounds similar to those in the learner’s first language will be easy for the learner to acquire, while different sounds will present difficulty

• We summarize the observed errors by deriving phonological rules for them

Capturing language transfer effects through contrastive phonological analyses

• To model such mispronunciations, we make use of context-sensitive phonological rules, of the format:


Mispronunciation prediction with manually and automatically derived phonological rules

• Manually written phonological rules

• We first developed a list of 43 context-insensitive rules and generated hypothesized pronunciation variants that may appear in the learners’ speech

• The resulting dictionary grows exponentially, and many of the pronunciations generated are rare or implausible in the learner’s speech

• To reduce the number of implausible pronunciations, context-sensitive rules were compiled

• The context-sensitive rules were developed using the immediate neighboring segments and symbols for various linguistic classes

─ such as consonants and vowels

• The extended pronunciation dictionary (EPD) was developed with 51 context-sensitive rules
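
The effect of context on dictionary expansion can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the tuple layout `(target, replacement, left, right)`, the class symbols `V`, `C`, `#` and `*`, the small vowel set, and the two example rules are hypothetical and are not taken from the 43 or 51 rules mentioned above. For simplicity, contexts are checked against the canonical phones only.

```python
from itertools import product

# Small illustrative vowel inventory (assumption, not the full phone set).
VOWELS = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uw"}


def matches(symbol, phone):
    """Return True if a context symbol accepts a neighbouring phone.

    phone is None at a word boundary. Symbols: "*" any context,
    "#" word boundary, "V" vowel, "C" consonant, otherwise a literal phone.
    """
    if symbol == "*":
        return True
    if symbol == "#":
        return phone is None
    if phone is None:
        return False
    if symbol == "V":
        return phone in VOWELS
    if symbol == "C":
        return phone not in VOWELS
    return symbol == phone


def expand(canonical, rules):
    """Generate pronunciation variants from (target, replacement, left, right) rules.

    An empty replacement string stands for a deletion.
    """
    options = []
    for i, phone in enumerate(canonical):
        left = canonical[i - 1] if i > 0 else None
        right = canonical[i + 1] if i + 1 < len(canonical) else None
        alternatives = [phone]
        for target, replacement, left_ctx, right_ctx in rules:
            if phone == target and matches(left_ctx, left) and matches(right_ctx, right):
                alternatives.append(replacement)
        options.append(alternatives)
    # Every combination of per-phone choices is one variant; drop deleted phones.
    return [tuple(p for p in combo if p) for combo in product(*options)]


# Hypothetical rules: "th" may become "f" anywhere; "r" may be deleted
# only between a vowel and a consonant.
RULES = [("th", "f", "*", "*"), ("r", "", "V", "C")]

north = expand(("n", "ao", "r", "th"), RULES)  # four variants
red = expand(("r", "eh", "d"), RULES)          # only the canonical form
```

Because the deletion rule is restricted to a vowel-consonant context, the word-initial "r" in the second word is left alone, which is how context sensitivity keeps implausible variants out of the dictionary.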


• Automatically derived phonological rules

• Manually authoring phonological rules requires expertise in both the mother language and the L2 being learned

─ the feasible language pairs will be limited

• Our approach is based on a few assumptions

1. differences between the phonetic transcriptions and the canonical pronunciations are due to negative language transfer

2. interferences such as misread prompts, unknown words and transcription errors are assumed to be negligible


• The automatic rule derivation

1. align the canonical pronunciations with the manual transcriptions

2. obtain the set of all phonetic substitutions, insertions, and deletions

3. perform rule selection by keeping the top-N rules in the basic rule set, and evaluate the coverage of the top-N rules by computing the F1-score
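
The three steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' procedure: the alignment uses the standard-library `difflib.SequenceMatcher` as a stand-in for a proper phonetic alignment, rules are context-free `(canonical, observed)` pairs, and precision and recall are defined here over held-out data in one plausible way, since the slides do not spell out the exact definition. All data and function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import zip_longest


def edits(canonical, transcribed):
    """Return (canonical_phone, observed_phone) pairs; "" marks an insertion or deletion."""
    found = []
    matcher = SequenceMatcher(None, canonical, transcribed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            found.extend(zip_longest(canonical[i1:i2], transcribed[j1:j2], fillvalue=""))
    return found


def derive_rules(pairs):
    """Count every substitution, insertion and deletion over (canonical, transcribed) pairs."""
    counts = Counter()
    for canonical, transcribed in pairs:
        counts.update(edits(canonical, transcribed))
    return counts


def f1_of_top_n(train_counts, heldout_counts, n):
    """F1 of the n most frequent training rules against held-out errors.

    Assumed definitions: precision is the share of kept rules that occur in the
    held-out data; recall is the share of held-out error tokens the rules cover.
    """
    rules = {rule for rule, _ in train_counts.most_common(n)}
    if not rules or not heldout_counts:
        return 0.0
    precision = sum(1 for rule in rules if rule in heldout_counts) / len(rules)
    covered = sum(count for rule, count in heldout_counts.items() if rule in rules)
    recall = covered / sum(heldout_counts.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (canonical pronunciation, manual transcription).
train = [
    (("th", "ih", "ng"), ("f", "ih", "ng")),
    (("th", "ih", "n"), ("f", "ih", "n")),
    (("r", "eh", "d"), ("r", "eh")),
    (("v", "ae", "n"), ("w", "ae", "n")),
    (("th", "r", "iy"), ("f", "r", "iy")),
]
heldout = [
    (("th", "ae", "ng", "k"), ("f", "ae", "ng", "k")),
    (("v", "eh", "r", "iy"), ("w", "eh", "r", "iy")),
]

train_counts = derive_rules(train)
heldout_counts = derive_rules(heldout)
scores_by_n = {n: f1_of_top_n(train_counts, heldout_counts, n) for n in (1, 2, 3)}
```

Sweeping N and keeping the value with the best F1 is one way to trade off rule coverage against over-generation.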


Mispronunciation detection and diagnoses

• In this system, ASR is used to detect mispronunciations with the extended pronunciation dictionary, which holds the predicted mispronunciations for each given word

• The steps in this system:

1. Phonological rules are applied to the canonical pronunciations; the process is repeated for all rules to generate the extended pronunciation dictionary

2. The recognized phone sequences are then aligned with the canonical phone sequences. Phones that cannot be aligned properly can then be easily identified as deletions, insertions and substitutions

3. Diagnostic feedback is provided
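
Steps 2 and 3 can be illustrated with a short Python sketch. It assumes the recognizer has already returned a phone sequence, and it uses `difflib.SequenceMatcher` from the standard library as a simple alignment; the function name `diagnose` and the feedback wording are hypothetical.

```python
from difflib import SequenceMatcher


def diagnose(canonical, recognized):
    """Align two phone sequences and describe each mismatch as feedback text."""
    feedback = []
    matcher = SequenceMatcher(None, canonical, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        expected = " ".join(canonical[i1:i2])
        heard = " ".join(recognized[j1:j2])
        if tag == "replace":
            feedback.append(f"substitution: expected /{expected}/, heard /{heard}/")
        elif tag == "delete":
            feedback.append(f"deletion: /{expected}/ was not pronounced")
        elif tag == "insert":
            feedback.append(f"insertion: extra /{heard}/")
    return feedback


# Hypothetical recognizer outputs for two words.
three = diagnose(["th", "r", "iy"], ["f", "r", "iy"])
cold = diagnose(["k", "ow", "l", "d"], ["k", "ow", "d"])
```

An empty list means the recognized sequence matched the canonical pronunciation.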

• Representation of Extended Pronunciations

• We devise the Extended Recognition Network (ERN) as a compact representation of the same information

• We use the finite state transducer as a vehicle to represent the rules
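
To see why a network can be more compact than an enumerated dictionary, consider the following simplified Python sketch. It is not the authors' finite state transducer implementation: the network is a plain list of arcs with one state per phone position, the alternatives are context-insensitive, and the `ALTERNATIVES` table and phone sequence are hypothetical.

```python
from math import prod


def build_ern(canonical, alternatives):
    """Return arcs (source_state, target_state, label); "<eps>" marks a deletion."""
    arcs = []
    for state, phone in enumerate(canonical):
        for label in [phone] + alternatives.get(phone, []):
            arcs.append((state, state + 1, label))
    return arcs


def count_paths(arcs, n_phones):
    """Count start-to-end paths, i.e. the pronunciations the network encodes."""
    return prod(
        sum(1 for source, _, _ in arcs if source == state)
        for state in range(n_phones)
    )


# Hypothetical alternatives: "th" may surface as "f"; "r" may be deleted.
ALTERNATIVES = {"th": ["f"], "r": ["<eps>"]}
phones = ("th", "r", "iy", "th", "r", "iy")

arcs = build_ern(phones, ALTERNATIVES)
n_variants = count_paths(arcs, len(phones))
```

Here the arc count grows linearly with the number of alternatives, while the number of pronunciations that an explicit dictionary would have to list grows multiplicatively.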


Enhancing mispronunciation detection by fusion with pronunciation scoring

• We refer to the detection of salient mispronunciations as the linguistically motivated approach

• Not all possible mispronunciations are predicted by this approach

─ the expansion rule may be absent due to pruning or lack of relevant language transfer knowledge

─ the acoustic models may be of poor quality, which hinders recognition accuracy

─ the mispronunciations may be caused by factors other than language transfer

• Conventional pronunciation scoring is based on the posterior probability of a speech unit being produced by the speaker

• To minimize the total detection error, we combine the two techniques

• We first optimize individual thresholds for every English phone, and define a backoff list


• The backoff list is a list of phones that are better handled by the pronunciation scoring approach
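
The fusion described above can be sketched in Python as follows. This is a minimal illustration under several assumptions: pronunciation scores are treated as values where lower means worse, a phone is flagged when its score falls below a per-phone threshold, the threshold is chosen by exhaustive search on labelled development data, and the recognized phones are already aligned one-to-one with the canonical phones. The function names and all numbers are hypothetical.

```python
def best_threshold(scores, is_mispronounced):
    """Pick the score threshold with the fewest detection errors on labelled data.

    A phone is flagged as mispronounced when its score is below the threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]

    def errors(threshold):
        return sum(
            (score < threshold) != bad
            for score, bad in zip(scores, is_mispronounced)
        )

    return min(candidates, key=errors)


def detect(canonical, recognized, scores, thresholds, backoff):
    """Return one True/False mispronunciation flag per canonical phone.

    Phones on the backoff list are judged by pronunciation scoring; all other
    phones are judged by comparing the recognized phone with the canonical one.
    """
    flags = []
    for phone, heard, score in zip(canonical, recognized, scores):
        if phone in backoff:
            flags.append(score < thresholds[phone])
        else:
            flags.append(heard != phone)
    return flags


# Hypothetical development data for the phone "th".
th_threshold = best_threshold([0.9, 0.8, 0.3, 0.2], [False, False, True, True])

flags = detect(
    canonical=("th", "ih", "ng"),
    recognized=("th", "ih", "n"),
    scores=(0.2, 0.9, 0.7),
    thresholds={"th": 0.5},
    backoff={"th"},
)
```

In the usage example, "th" is recognized as canonical but still flagged because its score is below the threshold, which is the kind of case the backoff list is meant to catch.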


Utterance rejection for pre-filtering

• Grossly erroneous utterances, such as false starts or pressing the <stop> button too early, should be appropriately handled by a pre-filtering mechanism

• We use a statistical phone duration model to pre-filter for intact utterances

• If forced alignment produces phone durations that are overly long or short, as compared with their inherent values, it may suggest that the input utterance is not intact


• In phone duration scoring, we incorporate an anti-model to increase the discriminative power of the phone duration model

• The “catch-all” anti-model

1. We first shuffle the utterances in the corpus so that the recordings no longer match the prompting texts

2. Forced alignment is then performed using these intentionally shuffled prompts

3. A Gamma distribution is then trained using all aligned phone durations in the shuffled corpus
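
Phone duration scoring with an anti-model can be sketched in Python as a log-likelihood ratio. The assumptions here are the sketch's own: Gamma parameters are fitted by the method of moments (the slides do not say how the distributions are trained), each phone has its own Gamma model, the score is averaged over the phones in an utterance, and the durations in seconds are invented for illustration.

```python
import math
from statistics import mean, pvariance


def fit_gamma(durations):
    """Method-of-moments Gamma fit: returns (shape, scale)."""
    m = mean(durations)
    v = pvariance(durations)
    return m * m / v, v / m


def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return (
        (shape - 1) * math.log(x)
        - x / scale
        - shape * math.log(scale)
        - math.lgamma(shape)
    )


def duration_score(aligned, phone_models, anti_model):
    """Average log-likelihood ratio of phone model versus anti-model.

    aligned is a list of (phone, duration) pairs from forced alignment.
    """
    ratios = [
        gamma_logpdf(d, *phone_models[p]) - gamma_logpdf(d, *anti_model)
        for p, d in aligned
    ]
    return sum(ratios) / len(ratios)


# Hypothetical durations: matched prompts for "ae", and a shuffled corpus.
phone_models = {"ae": fit_gamma([0.10, 0.12, 0.11, 0.09, 0.13])}
anti_model = fit_gamma([0.02, 0.30, 0.05, 0.45, 0.15, 0.01])

intact = duration_score([("ae", 0.11)], phone_models, anti_model)
cut_off = duration_score([("ae", 0.01)], phone_models, anti_model)
```

A positive score means the durations look more like a correctly aligned utterance than like a mismatched one; where to place the rejection threshold is a tuning decision not covered by this sketch.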


Ongoing work and future directions

• The approach described above uses the ERN to provide explicitly modeled mispronunciations that capture learners’ errors

• We are exploring the use of discriminatively trained acoustic models, with reference to predicted mispronunciations
