Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience
Helen Meng, Wai-Kit Lo, Alissa M. Harrison, Pauline Lee, Ka-Ho Wong, Wai-Kim Leung and Fanbo Meng
Represented by Chun-Yu Chen
Design and collection of the Chinese Learners of English corpus
• Speech data for deriving salient mispronunciations made by Chinese learners of English
• The corpus includes data from 100 Cantonese subjects and 111 Mandarin subjects
• Sounds similar to those in the learner’s first language are expected to be easy for the learner to acquire, while different sounds are expected to present difficulty
• We summarize the observed errors by deriving phonological rules
Capturing language transfer effects through contrastive phonological analyses
• To model such mispronunciations, we make use of context-sensitive phonological rules, of the format:
Mispronunciation prediction with manually and automatically derived phonological rules
• Manually written phonological rules
• We first developed a list of 43 context-insensitive rules and used them to generate hypothesized pronunciation variants that may appear in the learners’ speech
• The dictionary grows exponentially, and many of the generated pronunciations are rare or implausible in the learners’ speech
• To reduce the number of implausible pronunciations, context-sensitive rules were compiled
• The context-sensitive rules were developed using the immediate neighboring segments and symbols for various linguistic classes
─ such as consonants and vowels
• The extended pronunciation dictionary (EPD) was developed with 51 context-sensitive rules
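To make the effect of context concrete, here is a minimal sketch of rule-based dictionary expansion. Everything in it is an assumption made for illustration: the tuple encoding of a rule, the symbols `*` (any context), `#` (word boundary), `C` and `V`, the small vowel set, and the two example rules, none of which are taken from the 43 or 51 rules mentioned above. Only substitution rules are handled.

```python
from itertools import product

# Illustrative vowel subset; every phone outside it is treated as a consonant.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw"}

def context_matches(symbol, phone):
    """Match a context symbol against a neighbouring phone (None marks a word boundary)."""
    if symbol == "*":            # any context, i.e. a context-insensitive rule
        return True
    if symbol == "#":
        return phone is None
    if phone is None:
        return False
    if symbol == "V":
        return phone in VOWELS
    if symbol == "C":
        return phone not in VOWELS
    return symbol == phone       # a literal phone

def expand(canonical, rules):
    """Return the canonical pronunciation plus every variant licensed by the rules."""
    options = []
    for i, phone in enumerate(canonical):
        left = canonical[i - 1] if i > 0 else None
        right = canonical[i + 1] if i + 1 < len(canonical) else None
        alternatives = [phone]
        for target, replacement, left_ctx, right_ctx in rules:
            if (phone == target and context_matches(left_ctx, left)
                    and context_matches(right_ctx, right)
                    and replacement not in alternatives):
                alternatives.append(replacement)
        options.append(alternatives)
    # Each rule application is optional, so take the product over positions.
    return [tuple(combo) for combo in product(*options)]

# Hypothetical rules: (target, replacement, left context, right context).
insensitive = [("th", "f", "*", "*"), ("r", "l", "*", "*")]
sensitive = [("th", "f", "*", "*"), ("r", "l", "#", "V")]
three = ["th", "r", "iy"]
print(len(expand(three, insensitive)), len(expand(three, sensitive)))
```

The number of variants is the product of the alternatives at each position, which is why the dictionary grows so quickly with context-insensitive rules; restricting a rule to a context removes whole factors from that product.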
Mispronunciation prediction with manually and automatically derived phonological rules
• Automatically derived phonological rules
• Manually authoring phonological rules requires expertise in both the learner’s mother language and the L2 being learned
─ so the feasible language pairs will be limited
• Our approach is based on a few assumptions
1. differences between the phonetic transcriptions and the canonical pronunciations are due to negative language transfer
2. interferences such as misread prompts, unknown words, transcription errors
Mispronunciation prediction with manually and automatically derived phonological rules
• The automatic rule derivation
1. align the canonical pronunciations with the manual transcriptions
2. obtain the set of all phonetic substitutions, insertions, and deletions
3. perform rule selection by keeping the top-N rules in the basic rule set, and evaluate the coverage of the top-N rules by computing the F1-score
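The first two steps and the top-N selection can be sketched roughly as follows. This is an illustration under stated assumptions, not the derivation procedure itself: the alignment is a plain unit-cost edit-distance alignment, rules are counted without their phonetic context, the function names are hypothetical, and the toy corpus is made up. The F1-based evaluation is left out because its exact definition is not given here.

```python
from collections import Counter

def align(canonical, observed):
    """Unit-cost edit-distance alignment.

    Returns (canonical_phone, observed_phone) pairs; None marks a gap,
    so (p, None) is a deletion and (None, p) is an insertion.
    """
    n, m = len(canonical), len(observed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])):
            pairs.append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))
            j -= 1
    return pairs[::-1]

def count_rules(corpus):
    """Count every substitution, insertion and deletion seen in the aligned corpus."""
    counts = Counter()
    for canonical, observed in corpus:
        counts.update(pair for pair in align(canonical, observed) if pair[0] != pair[1])
    return counts

def top_rules(corpus, n):
    """Keep the n most frequent candidate rules."""
    return count_rules(corpus).most_common(n)

# Made-up (canonical, transcribed) pairs, for illustration only.
corpus = [
    (["th", "r", "iy"], ["f", "r", "iy"]),
    (["th", "ih", "n"], ["f", "ih", "n"]),
    (["k", "ae", "t"], ["k", "ae"]),
]
print(top_rules(corpus, 1))
```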
Mispronunciation detection and diagnosis
• In this system, ASR is used to detect mispronunciations, with the extended pronunciation dictionary supplying the predicted mispronunciations for the given word
• The steps in this system:
1. The phonological rules are applied to the canonical pronunciations, and the process is repeated for all rules to generate the extended pronunciation dictionary
2. The recognized phone sequences are then aligned with the canonical phone sequences. Phones that cannot be aligned properly can then be easily identified as deletions, insertions and substitutions
3. Diagnostic feedback is provided
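Steps 2 and 3 can be illustrated with a short sketch. It assumes the recognizer has already returned a phone sequence, and it uses Python's `difflib.SequenceMatcher` as a stand-in aligner; the function name `diagnose` and the tuple format of the feedback are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import zip_longest

def diagnose(canonical, recognized):
    """Label each misaligned phone as a substitution, deletion or insertion."""
    feedback = []
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pair the differing spans phone by phone; None fills the shorter side.
        for expected, heard in zip_longest(canonical[i1:i2], recognized[j1:j2]):
            if heard is None:
                feedback.append(("deletion", expected, None))
            elif expected is None:
                feedback.append(("insertion", None, heard))
            else:
                feedback.append(("substitution", expected, heard))
    return feedback

# Made-up recognition result for the word "three".
for kind, expected, heard in diagnose(["th", "r", "iy"], ["f", "r", "iy"]):
    print(kind, expected, heard)
```

Each tuple names the error type together with the expected and the recognized phone, which is the information a feedback message to the learner would need.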
• Representation of Extended Pronunciations
• We devise the Extended Recognition Network (ERN) as a compact representation of the same information
• We use the finite state transducer as a vehicle to represent the rules
Enhancing mispronunciation detection by fusion with pronunciation scoring
• Detection of salient mispronunciations is referred to as the linguistically-motivated approach
• Not all possible mispronunciations are predicted by this approach
─ the expansion rule may be absent due to pruning or lack of relevant language transfer knowledge
─ the quality of the acoustic models is poor, which hinders recognition accuracy
─ the mispronunciations may be caused by factors other than language transfer
• Conventional pronunciation scoring is based on the posterior probability of a speech unit being produced by the speaker
• To minimize the total detection error, we combine the two techniques
• We first optimize individual thresholds for every English phone, and define a backoff list
Enhancing mispronunciation detection by fusion with pronunciation scoring
• The backoff list is a list of phones that are better handled by the pronunciation scoring approach
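One way the fusion could work is sketched below. The details are assumptions made for illustration: a phone is flagged when its posterior score falls below its threshold, the threshold is picked by exhaustive search to minimise false acceptances plus false rejections on labelled data, and a phone joins the backoff list when scoring makes fewer errors on it than the ERN-based detector. All function names and numbers are hypothetical.

```python
def best_threshold(scores, is_mispronounced):
    """Search one phone's threshold; a score below the threshold is flagged.

    Returns (threshold, number of detection errors on the labelled data).
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best, best_errors = None, None
    for threshold in candidates:
        errors = sum((score < threshold) != bad
                     for score, bad in zip(scores, is_mispronounced))
        if best_errors is None or errors < best_errors:
            best, best_errors = threshold, errors
    return best, best_errors

def build_backoff(ern_errors, scoring_errors):
    """Phones on which pronunciation scoring makes fewer errors than the ERN."""
    return {phone for phone, errors in scoring_errors.items()
            if errors < ern_errors.get(phone, float("inf"))}

def fused_decision(phone, ern_flag, score, thresholds, backoff):
    """Use the score for backoff phones, the ERN-based decision otherwise."""
    if phone in backoff:
        return score < thresholds[phone]
    return ern_flag

# Made-up development data for a single phone.
threshold, errors = best_threshold([0.9, 0.8, 0.3, 0.2], [False, False, True, True])
backoff = build_backoff({"th": 5, "r": 1}, {"th": 2, "r": 3})
print(fused_decision("th", False, 0.4, {"th": threshold}, backoff))
```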
Utterance rejection for pre-filtering
• Grossly erroneous utterances, such as false starts or pressing the <stop> button too early, should be handled appropriately by a pre-filtering mechanism
• We use a statistical phone duration model to pre-filter for intact utterances
• If forced alignment produces phone durations that are overly long or short, as compared with their inherent values, it may suggest that the input utterance is not intact
Utterance rejection for pre-filtering
• In phone duration scoring, we incorporate an anti-model to increase the discriminative power of the phone duration model
• The “catch-all” anti-model
1. We first shuffle the utterances in the corpus so that the recordings do not match the prompting texts
2. Forced alignment is then performed using these intentionally shuffled prompts
3. A Gamma distribution is then trained using all aligned phone durations in the shuffled corpus
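A minimal sketch of duration scoring against an anti-model follows. It makes several assumptions that the text above does not state: the phone duration model is also a Gamma distribution, both distributions are fitted by the method of moments, the utterance score is the average log-likelihood ratio over its phones, and a single model stands in for what would be per-phone models. The duration values are made up; in practice they would come from forced alignment against matching and shuffled prompts.

```python
import math
from statistics import mean, pvariance

def fit_gamma(durations):
    """Method-of-moments Gamma fit; returns (shape, scale)."""
    m = mean(durations)
    v = pvariance(durations)
    return m * m / v, v / m

def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def duration_score(durations, model, anti_model):
    """Average log-likelihood ratio of the duration model over the anti-model."""
    ratios = [gamma_logpdf(d, *model) - gamma_logpdf(d, *anti_model)
              for d in durations]
    return sum(ratios) / len(ratios)

# Made-up durations in seconds: matched alignments vs. shuffled-prompt alignments.
model = fit_gamma([0.08, 0.10, 0.12, 0.09, 0.11])
anti_model = fit_gamma([0.01, 0.02, 0.40, 0.60, 0.03, 0.50])

intact = duration_score([0.10, 0.09], model, anti_model)
broken = duration_score([0.50, 0.02], model, anti_model)
print(intact > 0, broken < 0)
```

An utterance would then be rejected when its score falls below a threshold tuned on held-out data; zero is used here only as a placeholder.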
Ongoing work and future directions
• The approach described above uses the ERN to provide explicitly modeled mispronunciations that capture learners’ errors
• We are exploring the use of discriminatively trained acoustic models, with reference to predicted mispronunciations