carlasimoes i ms ws speech tech
TRANSCRIPT
-
8/9/2019 CarlaSimoes I MS WS Speech Tech
1/13
Acoustic Modeling: Introduction and Methodology
Fellowship in collaboration with Prof. Carlos Teixeira, FCUL
Carla Simões [email protected]
I Microsoft Workshop on Speech Technology - Building bridges between industry and academia, May 2 2007, MLDC
-
Overview
Introduction
  Speech Components
  What are Acoustic Models?
  Why use them?
Methodology
  Training Acoustic Models
Modelling English Spoken by Portuguese Speakers
Conclusion and Future Work
-
Speech Components
Diagram labels:
Corpus (Speech + Transcriptions)
Feature extraction
Feature vector
Acoustic model training
Acoustic Models (Hidden Markov Models)
Lexicon (phonetic dictionary)
Language Pack (contains core SR and TTS engines)
  Speech Recognition Engine (SR)
  Text-to-speech Engine (TTS)
SAPI (developers' Speech API)
Grammar + Lexicon (for SR apps; grammar defines the permitted sequence of words)
Speech Applications
  Telephony (Speech Server 2007, Exchange 12)
  Mobility (Voice Command)
  Desktop (Office 12, Vista)
  Home (TV, Kitchen)
-
What are Acoustic Models?
They reflect the way we pronounce a certain language
Speech can be broken into phonetic segments, phones
Acoustic Models are representations of speech segments
Acoustic model training involves mapping models to acoustic examples obtained from training data
-
Why use them?
Basis of an automatic speech recognition (ASR) system
Diagram labels: Speech Waveform; Front End; Sequence of observed speech vectors; model states S1, S2, S3; Sequence of symbols
The acoustic model gives the likelihood of a given feature vector being produced by a particular phoneme
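A minimal sketch of the likelihood idea on this slide, assuming each phoneme is described by a single diagonal-covariance Gaussian. The phoneme labels, means, and variances below are invented for illustration and do not come from the talk.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical two-dimensional phoneme models (all values are made up).
PHONEME_MODELS = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 2.0], "var": [1.0, 1.0]},
}

def score_frame(x):
    """Log-likelihood of one feature vector under every phoneme model."""
    return {p: log_gaussian(x, m["mean"], m["var"]) for p, m in PHONEME_MODELS.items()}

def best_phoneme(x):
    """Phoneme whose model assigns the highest likelihood to the feature vector."""
    scores = score_frame(x)
    return max(scores, key=scores.get)
```

A recognizer would combine such per-frame scores over time rather than decide frame by frame; the function `best_phoneme` only illustrates what "likelihood of a feature vector given a phoneme" means.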
-
Methodology
Our Acoustic Models are based on Hidden Markov Models (HMMs)
  Markov assumption: the probability of each state depends only on the previous one
Each HMM has 3 states
  Each state represents a short segment of speech, described mathematically by Gaussian probability distributions
Acoustically similar information is shared across HMMs; the shared states are called senones
Cross-word triphone system
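The 3-state structure can be sketched as a small left-to-right HMM scored with the forward algorithm. This is an illustration under simplifying assumptions, not the system from the talk: features are scalars, each state has one Gaussian, and the transition probabilities, means, and variances are invented.

```python
import math

# Left-to-right topology: a state either loops on itself or advances to the next one.
# All numbers are hypothetical.
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]
MEANS = [0.0, 2.0, 4.0]
VARS = [1.0, 1.0, 1.0]

def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def forward_likelihood(frames):
    """Likelihood of the frames, starting in state 0 and ending in state 2."""
    # Initialisation: all probability mass starts in the first state.
    alpha = [gaussian(frames[0], MEANS[0], VARS[0]), 0.0, 0.0]
    for x in frames[1:]:
        # Markov assumption: the new alpha depends only on the previous alpha.
        alpha = [
            sum(alpha[i] * TRANS[i][j] for i in range(3)) * gaussian(x, MEANS[j], VARS[j])
            for j in range(3)
        ]
    return alpha[2]
```

With these made-up parameters, a sequence that moves through the states in order, such as `[0.0, 2.0, 4.0]`, scores far higher than the same values reversed.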
-
Methodology
Training a cross-word triphone system for a new language
Acoustic model training involves mapping acoustic models (cross-word triphone or whole-word triphones) to their equivalent labels (transcriptions)
Diagram labels:
Corpus (speech + word-level transcriptions) + Lexicon
Phoneset + Question Set
Word-level transcriptions are converted into monophone-level transcriptions
Prototype monophone system is converted to an initial cross-word triphone system
Cross-word system is then updated to produce the final cross-word triphone system
Clustering triphones into acoustically similar groups
A phoneset file should never contain more than 50 phones
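The transcription steps above can be sketched as follows. The lexicon entries, phone symbols, and the `sil` boundary marker are hypothetical, and the `left-phone+right` label format follows the HTK-style convention, which the slides do not confirm as the one used.

```python
# Hypothetical lexicon and phoneset; entries are invented for illustration.
LEXICON = {"open": ["ow", "p", "ax", "n"], "file": ["f", "ay", "l"]}
PHONESET = {"ow", "p", "ax", "n", "f", "ay", "l", "sil"}
MAX_PHONES = 50  # limit stated on the slide

def check_phoneset(phoneset):
    """Reject phonesets larger than the limit stated on the slide."""
    if len(phoneset) > MAX_PHONES:
        raise ValueError(f"phoneset has {len(phoneset)} phones, limit is {MAX_PHONES}")

def to_monophones(words):
    """Expand a word-level transcription into monophones, padded with silence."""
    phones = ["sil"]
    for word in words:
        phones.extend(LEXICON[word.lower()])
    phones.append("sil")
    return phones

def to_crossword_triphones(phones):
    """Give each phone its left and right context, including across word boundaries."""
    labels = []
    for i, phone in enumerate(phones):
        if phone == "sil":
            labels.append(phone)  # silence is kept context-independent here
            continue
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i + 1 < len(phones) else "sil"
        labels.append(f"{left}-{phone}+{right}")
    return labels
```

For the two-word utterance "open file", the last phone of "open" becomes `ax-n+f`: its right context is the first phone of the next word, which is what makes the system cross-word.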
-
Modelling English Spoken by Portuguese Speakers
Normally a speech recognizer's precision is lower for non-native users
  Non-native accents are more problematic than dialects: more variability
Research on non-native accent modeling reveals large gains in performance when the acoustics and pronunciation of an accent are taken into account
A usage scenario: voice-controlled applications, where the Portuguese language is dominant but English terms are supported with the same accuracy
-
Modelling English Spoken by Portuguese Speakers
Experiments are being developed concerning this problem
Corpus description
  4689 utterances for a universe of 227 words
  Files are sampled at 8 kHz, 16-bit linear
  11 male speakers
Model settings
  3468 utterances for training
  1221 utterances for test
  43 minutes of speech
  1200 senones
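A quick arithmetic check of the figures on this slide. The storage estimate assumes single-channel (mono) uncompressed audio, which the slide does not state, and it is unclear whether the 43 minutes refers to the training set or the whole corpus.

```python
# Figures taken from the slide.
TOTAL_UTTERANCES = 4689
TRAIN_UTTERANCES = 3468
TEST_UTTERANCES = 1221
SPEECH_MINUTES = 43

def split_fractions(train, test):
    """Fractions of the corpus used for training and for testing."""
    total = train + test
    return train / total, test / total

def pcm_bytes(minutes, sample_rate_hz=8000, bytes_per_sample=2, channels=1):
    """Size of uncompressed linear PCM audio; channels=1 is an assumption."""
    return minutes * 60 * sample_rate_hz * bytes_per_sample * channels

train_frac, test_frac = split_fractions(TRAIN_UTTERANCES, TEST_UTTERANCES)
audio_bytes = pcm_bytes(SPEECH_MINUTES)
```

The training and test counts add up to the full 4689 utterances, which is roughly a 74% / 26% split, and 43 minutes at 8 kHz and 16 bits comes to 41,280,000 bytes under the mono assumption.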
-
Modelling English Spoken by Portuguese Speakers
Diagram 1: English spoken by Portuguese corpus -> Training -> New Model -> Testing (Test corpus)
Diagram 2: English spoken by Portuguese corpus -> Update (ENU Model) -> New Model -> Testing (Test corpus)
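The slides do not say which metric is used in the testing step. As one common possibility, the two resulting models could be compared by word error rate on the test corpus; the sketch below assumes that choice and uses invented reference and hypothesis strings.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]

def word_error_rate(pairs):
    """Total word errors divided by total reference words over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words
```

Running `word_error_rate` once per model on the same test pairs gives a single number for each experiment, so the trained-from-scratch model and the updated ENU model can be compared directly.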
-
Modelling English Spoken by Portuguese Speakers
Diagram labels:
ENU Phoneset + PTG Phoneset -> English to Portuguese mapped phoneset
English spoken by Portuguese corpus + PTG corpus -> Training -> New Model -> Testing (Test corpus)
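A sketch of how a mapped phoneset might be applied to a pronunciation. The actual English-to-Portuguese mapping is not given in the slides, so the table below is entirely hypothetical, as are the phone symbols and the decision to keep unmapped phones unchanged.

```python
# Hypothetical ENU -> PTG phone mapping; symbols and pairings are invented.
ENU_TO_PTG = {"th": "t", "ih": "i", "ae": "E", "r": "R"}

def map_pronunciation(phones, mapping):
    """Rewrite a pronunciation with the mapped phoneset.

    Phones without a mapping are kept as-is (assumed shared by both phonesets)
    and reported separately so they can be reviewed.
    """
    mapped, unmapped = [], []
    for phone in phones:
        if phone in mapping:
            mapped.append(mapping[phone])
        else:
            mapped.append(phone)
            unmapped.append(phone)
    return mapped, unmapped
```

Returning the unmapped phones alongside the result makes gaps in the mapping table visible, which matters if the combined phoneset has to stay within a fixed size.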
-
Future Work
Improving Acoustic Models requires gathering hundreds of hours of speech data
The amount of data would have to be even larger when dealing with non-native speakers, because accent variability is much higher
Possible solutions:
  Define new phonesets, which implies a phonetic study of how Portuguese speakers pronounce English
  Train the native models with the English spoken by Portuguese corpus
-
© 2007 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Carla Simões
Thank you very much for your attention!
www.microsoft.com/portugal/mldc