Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience

Helen Meng, Wai-Kit Lo, Alissa M. Harrison, Pauline Lee, Ka-Ho Wong, Wai-Kim Leung and Fanbo Meng

Represented by Chun-Yu Chen

Post on 18-Dec-2015

Design and collection of the Chinese Learners of English corpus

• Speech data for deriving salient mispronunciations made by Chinese learners of English

• The corpus includes data from 100 Cantonese subjects and 111 Mandarin subjects

Capturing language transfer effects through contrastive phonological analyses

• Sounds similar to those of the learner’s first language will be easy for the learner to acquire, while different sounds will present difficulty

• We summarize the observed errors by deriving phonological rules for them

• To model such mispronunciations, we make use of context-sensitive phonological rules (the rule format shown on the original slide was not preserved in this transcript)

Mispronunciation prediction with manually and automatically derived phonological rules

• Manually written phonological rules

─ We first developed a list of 43 context-insensitive rules and used them to generate hypothesized pronunciation variants that may appear in the learners’ speech

─ The dictionary grows exponentially, and many of the generated pronunciations are rare or implausible in the learners’ speech

─ To reduce the number of implausible pronunciations, context-sensitive rules were compiled

• Context-sensitive rules were developed using the immediately neighboring segments and symbols for various linguistic classes, such as consonants and vowels

• An extended pronunciation dictionary (EPD) was developed with 51 context-sensitive rules
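
As an illustration of how context-sensitive rules limit the growth of the extended pronunciation dictionary, the sketch below applies rules of the conventional form "target becomes replacement between a left and a right context" to a canonical phone sequence. The rule tuple layout, the class symbols `C`, `V`, `#` and `*`, the toy vowel inventory and the two example rules are all assumptions made for illustration; they are not the 51 rules or the notation used by the authors.

```python
from itertools import product

# Toy vowel inventory (assumed); every other phone is treated as a consonant.
VOWELS = {"aa", "ae", "ah", "ao", "ay", "eh", "er", "ih", "iy", "uw"}

def matches(symbol, phone):
    """Check a context symbol against a neighbouring phone (None = word boundary)."""
    if symbol == "*":          # any context
        return True
    if symbol == "#":          # word boundary
        return phone is None
    if phone is None:
        return False
    if symbol == "V":
        return phone in VOWELS
    if symbol == "C":
        return phone not in VOWELS
    return symbol == phone     # a literal phone

def expand(canonical, rules):
    """Return every pronunciation variant licensed by the rules.

    Each rule is (target, replacement, left, right); replacement "" is a deletion.
    Contexts are checked against the canonical neighbours.
    """
    options = []
    for i, phone in enumerate(canonical):
        left = canonical[i - 1] if i > 0 else None
        right = canonical[i + 1] if i + 1 < len(canonical) else None
        alts = [phone]
        for target, repl, lctx, rctx in rules:
            if target == phone and matches(lctx, left) and matches(rctx, right):
                if repl not in alts:
                    alts.append(repl)
        options.append(alts)
    return [[p for p in combo if p] for combo in product(*options)]

RULES = [
    ("th", "f", "*", "*"),  # hypothetical substitution in any context
    ("d", "", "C", "#"),    # hypothetical word-final deletion after a consonant
]

print(expand(["k", "ay", "n", "d"], RULES))
```

Because the deletion rule only fires after a consonant at the end of a word, a word-initial /d/ produces no extra variant, which is the pruning effect that context sensitivity is meant to provide.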

• Automatically derived phonological rules

─ Manually authoring phonological rules requires expertise in both the learner’s mother language and the L2 being learned, which limits the feasible language pairs

• Our approach is based on a few assumptions:

1. Differences between the phonetic transcriptions and the canonical pronunciations are due to negative language transfer

2. Interferences such as misread prompts, unknown words, transcription errors

• The automatic rule derivation:

1. Align the canonical pronunciations with the manual transcriptions

2. Obtain the set of all phonetic substitutions, insertions, and deletions

3. Perform rule selection by keeping the top-N rules in the basic rule set, and evaluate the coverage of the top-N rules by computing the F1-score
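
The three derivation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain Levenshtein alignment with unit costs, context-insensitive rules represented as (canonical phone, observed phone) pairs, and an F1-score computed by comparing the selected rule types against the error types seen in a held-out set. The slides do not say how precision and recall are defined, so that part in particular is an assumption.

```python
from collections import Counter

def align(canonical, observed):
    """Levenshtein alignment; returns (canonical_phone, observed_phone) pairs,
    with None marking the missing side of a deletion or insertion."""
    n, m = len(canonical), len(observed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diag = (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1]))
        if diag:
            pairs.append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))   # deletion
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))    # insertion
            j -= 1
    return pairs[::-1]

def derive_rules(data):
    """Count every substitution, insertion and deletion over
    (canonical, transcription) pairs."""
    counts = Counter()
    for canonical, observed in data:
        for c, o in align(canonical, observed):
            if c != o:
                counts[(c, o)] += 1
    return counts

def f1_of_top_n(train, held_out, n):
    """Keep the N most frequent rules and score them against held-out errors."""
    top = {rule for rule, _ in derive_rules(train).most_common(n)}
    seen = set(derive_rules(held_out))
    hits = len(top & seen)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(top), hits / len(seen)
    return 2 * precision * recall / (precision + recall)
```

Sweeping `n` and plotting `f1_of_top_n` would show the trade-off behind rule selection: a small N misses real errors (low recall), while a large N admits rare or spurious rules (low precision).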

Mispronunciation detection and diagnoses

• In this system, ASR is used to detect mispronunciations, using the extended pronunciation dictionary that holds the predicted mispronunciations for the given word

• The steps in this system:

1. The rule-application process is repeated for all rules to generate the extended pronunciation dictionary

2. The recognized phone sequences are then aligned with the canonical phone sequences. Phones that cannot be aligned properly can then be easily identified as deletions, insertions and substitutions

3. Diagnostic feedback is provided to the learner
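
The alignment and feedback steps above can be sketched with Python's standard `difflib.SequenceMatcher`, which labels unaligned stretches as deletions, insertions or replacements. The function name `diagnose` and the wording of the feedback strings are hypothetical; the actual system's aligner and feedback format are not described in the slides.

```python
from difflib import SequenceMatcher

def diagnose(canonical, recognized):
    """Align two phone sequences and describe each deletion, insertion
    and substitution found in the recognized sequence."""
    feedback = []
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        expected = " ".join(canonical[i1:i2])
        produced = " ".join(recognized[j1:j2])
        if tag == "delete":
            feedback.append(f"deleted /{expected}/")
        elif tag == "insert":
            feedback.append(f"inserted /{produced}/")
        elif tag == "replace":
            feedback.append(f"said /{produced}/ instead of /{expected}/")
    return feedback

print(diagnose(["th", "ih", "n"], ["f", "ih", "n"]))
```

An empty list means the recognized sequence matched the canonical pronunciation, so no corrective feedback is needed for that word.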

• Representation of extended pronunciations

─ We devise the Extended Recognition Network (ERN) as a compact representation of the same information as the extended pronunciation dictionary

─ We use finite-state transducers as the vehicle to represent the rules
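
To see why a network is more compact than an enumerated dictionary, the sketch below stores one slot per canonical phone, each holding the canonical arc plus its predicted alternatives. It is a deliberately simplified stand-in: a real ERN built from finite-state transducers can also encode context and share structure across words, and the alternatives listed for the example word are hypothetical.

```python
from itertools import product
from math import prod

def build_network(canonical, alternatives):
    """One slot per canonical phone: the canonical arc first, then any
    predicted mispronunciations ("" stands for a deletion arc)."""
    return [[phone] + alternatives.get(phone, []) for phone in canonical]

def count_arcs(network):
    """Storage needed by the network grows with the sum of slot sizes."""
    return sum(len(slot) for slot in network)

def count_paths(network):
    """An enumerated dictionary needs one entry per path (product of slot sizes)."""
    return prod(len(slot) for slot in network)

def enumerate_paths(network):
    """Expand the network back into an explicit pronunciation list."""
    return [[p for p in path if p] for path in product(*network)]

# Hypothetical alternatives for the word "north".
net = build_network(["n", "ao", "r", "th"],
                    {"n": ["l"], "r": [""], "th": ["f", "s"]})
print(count_arcs(net), count_paths(net))
```

The arc count grows additively with each new alternative while the number of enumerated pronunciations grows multiplicatively, which is the exponential dictionary growth mentioned earlier in the slides.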

Enhancing mispronunciation detection by fusion with pronunciation scoring

• The detection of salient mispronunciations described above is referred to as the linguistically motivated approach

• Not all possible mispronunciations are predicted by this approach:

─ The expansion rule may be absent due to pruning or a lack of relevant language transfer knowledge

─ The quality of the acoustic models may be poor, which hinders recognition accuracy

─ The mispronunciations may be caused by factors other than language transfer

• Conventional pronunciation scoring is based on the posterior probability of a speech unit being produced by the speaker

• To minimize the total detection error, we combine the two techniques: the linguistically motivated approach and pronunciation scoring

• We first optimize individual thresholds for every English phone, and define a backoff list
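
A per-phone threshold search of the kind described here might look like the sketch below. It assumes that a lower score means a less confident pronunciation, that a phone is flagged when its score falls below the threshold, and that "total detection error" is the count of false acceptances plus false rejections on labelled data. These definitions and the function names are assumptions for illustration.

```python
def best_threshold(scores, is_mispronounced):
    """Pick the threshold that minimizes false acceptances plus false rejections.

    A phone is flagged as mispronounced when its score is below the threshold.
    """
    # Try every observed score, plus one value above them all (flag everything).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_err = None, None
    for t in candidates:
        pairs = list(zip(scores, is_mispronounced))
        false_accept = sum(1 for s, bad in pairs if bad and s >= t)
        false_reject = sum(1 for s, bad in pairs if not bad and s < t)
        err = false_accept + false_reject
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

def thresholds_per_phone(samples):
    """samples maps each phone to a list of (score, is_mispronounced) pairs."""
    result = {}
    for phone, pairs in samples.items():
        scores = [s for s, _ in pairs]
        labels = [bad for _, bad in pairs]
        result[phone] = best_threshold(scores, labels)
    return result
```

Optimizing each phone separately lets phones whose scores are naturally low or noisy receive a looser threshold than phones the acoustic models score reliably.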

• The backoff list is the list of phones that are better handled by the pronunciation scoring approach
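
One way to read the fusion scheme is sketched below: the ERN decision is used by default, and phones on the backoff list fall back to the thresholded pronunciation score. The membership criterion used here (pronunciation scoring has a lower detection error than the ERN on development data) and the error figures in the example are assumptions; the slides do not state how the list is chosen.

```python
def build_backoff_list(ern_error, scoring_error):
    """Phones for which pronunciation scoring made fewer detection errors
    than the ERN-based detector (assumed criterion)."""
    return {p for p in ern_error
            if scoring_error.get(p, float("inf")) < ern_error[p]}

def detect(phone, ern_flagged, score, thresholds, backoff):
    """Return True if the phone is judged to be mispronounced."""
    if phone in backoff:
        # Back off to pronunciation scoring for this phone.
        return score < thresholds[phone]
    # Otherwise trust the linguistically motivated (ERN) decision.
    return ern_flagged

# Hypothetical per-phone error rates measured on development data.
backoff = build_backoff_list({"th": 0.10, "r": 0.30}, {"th": 0.20, "r": 0.15})
thresholds = {"th": -1.0, "r": -2.0}
print(detect("r", False, -3.0, thresholds, backoff))
```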

Utterance rejection for pre-filtering

• Grossly erroneous utterances, such as false starts or pressing the <stop> button too early, should be appropriately handled by a pre-filtering mechanism

• We use a statistical phone duration model to pre-filter for intact utterances

• If forced alignment produces phone durations that are overly long or short, as compared with their inherent values, it may suggest that the input utterance is not intact

• In phone duration scoring, we incorporate an anti-model to increase the discriminative power of the phone duration model

• The “catch-all” anti-model:

1. We first shuffle the utterances in the corpus so that the recordings do not match the prompting texts

2. Forced alignment is then performed using these intentionally shuffled prompts

3. A Gamma distribution is then trained using all aligned phone durations in the shuffled corpus
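
The scoring side of the anti-model can be sketched as a log-likelihood ratio between per-phone duration models and the single catch-all Gamma distribution. Several details are assumptions made for illustration: the per-phone models are also taken to be Gamma distributions, parameters are fitted by the method of moments rather than maximum likelihood, and the example durations (in seconds) are invented. Forced alignment itself is outside the scope of the sketch.

```python
import math
from statistics import mean, pvariance

def fit_gamma(durations):
    """Method-of-moments estimate of the Gamma shape and scale parameters."""
    m, v = mean(durations), pvariance(durations)
    return m * m / v, v / m

def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def duration_score(durations, phones, phone_models, anti_model):
    """Average log-likelihood ratio of the phone duration models
    against the catch-all anti-model."""
    total = 0.0
    for d, p in zip(durations, phones):
        total += gamma_logpdf(d, *phone_models[p]) - gamma_logpdf(d, *anti_model)
    return total / len(durations)

# Invented durations: matched alignments for /iy/, and durations pooled
# over all phones after aligning recordings to shuffled prompts.
phone_models = {"iy": fit_gamma([0.09, 0.10, 0.11, 0.12, 0.08])}
anti_model = fit_gamma([0.02, 0.30, 0.05, 0.45, 0.10, 0.25])

plausible = duration_score([0.10], ["iy"], phone_models, anti_model)
implausible = duration_score([0.45], ["iy"], phone_models, anti_model)
print(plausible > 0 > implausible)
```

An utterance whose average ratio falls below some tuned cut-off would be rejected as not intact; the cut-off value is not given in the slides and would have to be chosen on held-out data.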

Ongoing work and future directions

• The approach described above uses the ERN to provide explicitly modeled mispronunciations that capture learners’ errors

• We are exploring the use of discriminatively trained acoustic models, with reference to the predicted mispronunciations