What I did on my Summer “Vacation”
Jeremy Morris
10/06/2006
Summer at AFRL - DAGSI
AFRL
• Air Force Research Labs
• Wright-Patterson AFB, Dayton OH
DAGSI Student/Faculty Research Fellowship program
• Dayton Area Graduate Studies Institute
• Effort to encourage collaboration between Ohio universities and AFRL
Summer at AFRL – SCREAM Lab
SCREAM Lab
• Speech and Communication Research, Engineering, Analysis and Modeling Lab
• Interest in a wide variety of speech research issues for the military
    Speech-to-speech translation, rapid development of speech recognition systems, etc.
Summer at AFRL – Why us?
SCREAM Lab members were interested in collaborating with OSU
SCREAM Lab working on research in using phonological features in speech recognition
• Perceived overlap with ASAT project
Review – Phonological Features
For the ASAT Project, we have been using phonological feature detectors
We train detectors on a particular phonological feature
• e.g. manner or place for consonants; height, frontness, etc. for vowels
We then combine these features together for ASR purposes
Phonological Features (cont.)
SCREAM Lab very interested in phonological feature detectors
• Need for quick development of new ASR systems for new languages
• A full set of phonological feature detectors would allow reuse of acoustic data for training across new languages
    Multi-lingual detectors are clearly needed to get full coverage of all features
Phonological Features (cont.)
Our phonological feature detectors
• Monolingual (English only)
• Trained using a set of multi-layer perceptron neural networks
• Output a set of phonological feature class probabilities
SCREAM lab feature detectors
• Monolingual and multilingual
• Trained using Gaussian Mixture Models
• Output a set of likelihoods
• Based on work by Tanja Schultz (CMU)
Summer at AFRL - Proposal
Besides acoustic models, new ASR systems for new languages have other needs
An ASR system needs a lexicon mapping phones-to-words
• Normally hand-constructed
• Requires time and expertise
Summer at AFRL - Proposal
Our proposal: look at methods of bootstrapping new lexicons from:
• Acoustic data
• Word-level transcripts
• Phonological feature detector outputs
How?
• Start by looking at work on deriving Acoustic Sub-Word Units
Summer at AFRL - Proposal
Acoustic Sub-Word Units (ASWUs)
• Similar to phones in that they are smaller pieces of words
• BUT – automatically derived from acoustics instead of manually defined
• Used to derive both a sub-word unit set and a lexicon for that set simultaneously
• Research in this area has been mainly to improve ASR performance
Summer at AFRL - Proposal
Can we use these methods along with phonological features as inputs to induce new lexicons?
• Using phonological features, the sub-word units may be mappable to standard IPA phone labels
Summer at AFRL - Proposal
The proposed system is inspired by an ASWU approach by (Singh et al., 2002)
• Notable for not requiring word boundaries to be marked for training
Start with a basic dictionary (including a starting phoneset size)
Train a set of acoustic models on the training data with that dictionary
Alter the basic dictionary in a manner that improves your pronunciations
Repeat until a stopping criterion is reached
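
The four steps above form a loop, which might be sketched as follows. This is only an outline under stated assumptions: `refine_lexicon`, `train_models`, `alter_lexicon`, `should_stop` and `max_iters` are hypothetical names introduced for illustration, and the slides do not say what the stopping criterion actually was.

```python
def refine_lexicon(lexicon, train_models, alter_lexicon, should_stop, max_iters=50):
    """Iteratively retrain models and alter the lexicon until a stopping test passes.

    All three callables are hypothetical placeholders supplied by the caller:
      train_models(lexicon) -> models
      alter_lexicon(lexicon, models) -> new lexicon
      should_stop(old_lexicon, new_lexicon, iteration) -> bool
    """
    for iteration in range(max_iters):
        models = train_models(lexicon)               # train on the current dictionary
        new_lexicon = alter_lexicon(lexicon, models)  # split/delete step
        if should_stop(lexicon, new_lexicon, iteration):
            return new_lexicon
        lexicon = new_lexicon
    return lexicon  # safety cap reached (max_iters is an assumed guard)
```

Passing the training and alteration steps in as functions keeps the loop independent of any particular HMM toolkit.
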
Summer at AFRL - Proposal
Start with a basic dictionary
• Start with an assumption that the number of phones in a word is related to the number of letters in the orthography
    Basic dictionary maps word to sequence of letters in that word:
    ABLE    A B L E
    BANNED    B A N N E D
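
A minimal sketch of building such a letter-based dictionary from a word list. The function name `build_basic_lexicon` is hypothetical, and dropping non-letter characters (apostrophes, hyphens) is an assumption; the slides do not say how those were handled.

```python
def build_basic_lexicon(words):
    """Map each word to its sequence of letters, used as placeholder 'phones'."""
    lexicon = {}
    for word in words:
        key = word.upper()
        # Assumption: only alphabetic characters become phone labels.
        lexicon[key] = [ch for ch in key if ch.isalpha()]
    return lexicon


lexicon = build_basic_lexicon(["able", "banned"])
for word, phones in lexicon.items():
    print(word, " ".join(phones))
```
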
Summer at AFRL - Proposal
Train a set of acoustic models
• Using the basic dictionary, map words in the transcript to these “pronunciations”
• Train an HMM using the output of the feature detectors as its input, and the above mapping as training labels
Summer at AFRL - Proposal
Alter the basic dictionary
• Using some metric, find a candidate “phone” to be modified
    We’ve looked at a couple of metrics – more on this later
• Once the phone is identified, see if the phone should be “split” or “deleted”
    A “split” indicates that the given phone label actually represents two different sounds, and so should be replaced with two different phone labels
    A “delete” indicates that for a particular word or words the model fits better if that phone label is removed from the pronunciation
Summer at AFRL - Proposal
Split example:
BE    B E
DEVELOP    D E1 V E1 L O P
Delete examples:
ABLE    A B L E :: ABLE    A B L
ABANDONED    A B A N D O N D
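
The split and delete examples above can be sketched as operations that enumerate alternative pronunciations for a word. This is an illustration only: `split_variants` and `delete_variants` are hypothetical names, and deleting one occurrence at a time is an assumption, not something the slides specify.

```python
from itertools import product


def split_variants(pron, phone, new_label):
    """Every pronunciation obtained by keeping or relabeling each occurrence of `phone`."""
    options = [(p, new_label) if p == phone else (p,) for p in pron]
    return [list(variant) for variant in product(*options)]


def delete_variants(pron, phone):
    """The original pronunciation plus each variant with one occurrence of `phone` removed."""
    variants = [list(pron)]
    for i, p in enumerate(pron):
        if p == phone:
            variants.append(list(pron[:i]) + list(pron[i + 1:]))
    return variants


print(split_variants(list("DEVELOP"), "E", "E1"))
print(delete_variants(list("ABLE"), "E"))
```

For DEVELOP, the split variants include the original spelling as well as `D E1 V E1 L O P`; which variant survives is decided later by the models, not by this enumeration.
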
Summer at AFRL - Proposal
For splits, all possible alterations are added to temporary lexicon
For deletes, we alter the HMM to add a possible deletion arc for the phone
After lexicon or HMM is altered, word transcript is force aligned using new possible pronunciations
• Best pronunciations are pulled from this alignment and used to build new lexicon
• Steps are repeated using the new lexicon in place of the basic lexicon
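
One way to pull the “best” pronunciations out of a forced alignment is sketched below. It assumes the aligner's output has already been reduced to a list of `(word, chosen_pronunciation)` pairs, one per word token, and it treats “best” as “most frequently chosen”. Both are assumptions for illustration; the slides do not define the selection rule, and `rebuild_lexicon` is a hypothetical name.

```python
from collections import Counter, defaultdict


def rebuild_lexicon(aligned_tokens):
    """Keep, for each word, the pronunciation the aligner chose most often."""
    counts = defaultdict(Counter)
    for word, pron in aligned_tokens:
        counts[word][tuple(pron)] += 1  # tuples are hashable, lists are not
    return {word: list(c.most_common(1)[0][0]) for word, c in counts.items()}


tokens = [
    ("ABLE", ["A", "B", "L"]),
    ("ABLE", ["A", "B", "L", "E"]),
    ("ABLE", ["A", "B", "L"]),
]
print(rebuild_lexicon(tokens))
```
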
Summer at AFRL - Proposal
How do we determine the candidate “phone label” to alter?
• Initially, modelled each phone with two Gaussians in the HMM
• Compared the two Gaussians to each other using their KL-divergences
    Took the phone label with the largest KL divergence as the one to alter
    Idea was that each Gaussian described a cluster – the further these centers were from each other, the more probable they were describing two different phones
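
A rough sketch of this selection rule, assuming diagonal-covariance Gaussians and a symmetrized KL divergence (the sum of both directions). Neither assumption is stated in the slides, and the function names are hypothetical.

```python
import math


def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def symmetric_kl(g1, g2):
    """Sum of the two directed divergences; each g is a (means, variances) pair."""
    return kl_diag(*g1, *g2) + kl_diag(*g2, *g1)


def pick_candidate(phone_gaussians):
    """Return the phone label whose two Gaussians are furthest apart."""
    return max(phone_gaussians, key=lambda ph: symmetric_kl(*phone_gaussians[ph]))


phones = {
    "A": (([0.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [1.0, 1.0])),
    "F": (([0.0, 0.0], [1.0, 1.0]), ([0.1, 0.0], [1.0, 1.0])),
}
print(pick_candidate(phones))
```

Because the divergence is a sum over feature dimensions, every dimension contributes to the score whether or not it is meaningful for the phone in question.
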
Summer at AFRL - ProposalSummer at AFRL - Proposal
KL-divergence metric did not work KL-divergence metric did not work wellwell• System would pick candidates that a System would pick candidates that a
human would find unreasonable (such human would find unreasonable (such as “F” or “Q”)as “F” or “Q”)
• System would split or delete these System would split or delete these phones multiple times, continually phones multiple times, continually returning to the same phone labelreturning to the same phone label
Summer at AFRL - ProposalSummer at AFRL - Proposal
Why did the KL divergence perform Why did the KL divergence perform this way?this way?• Suspcion: Large variations in the two Suspcion: Large variations in the two
Gaussians in areas that do not matter Gaussians in areas that do not matter for that phone pushed up the scores for that phone pushed up the scores (e.g. vowel features for consonants)(e.g. vowel features for consonants)
• Splitting these phones only allowed the Splitting these phones only allowed the coverage to spread wider, drawing the coverage to spread wider, drawing the system back to those phonessystem back to those phones
Summer at AFRL - Proposal
What next?
Tried Mahalanobis distance metric, with poor results also
Returned to Acoustic Sub-Word papers for inspiration
• Instead of looking at cluster stats, multiple papers use an average frame likelihood metric for each phone cluster to determine candidate phone for altering
• Have started moving my code to use this framework – preliminary passes show promise, but no results quite yet
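
An average frame likelihood metric might look like the sketch below. It assumes each frame has already been aligned to a phone label and scored with a log-likelihood, and it assumes the phone with the lowest average (the poorest fit) is the candidate to alter. The slides state neither detail, and the function names are hypothetical.

```python
from collections import defaultdict


def average_frame_loglik(frames):
    """Average per-frame log-likelihood for each phone label.

    `frames` is an iterable of (phone_label, log_likelihood) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phone, loglik in frames:
        totals[phone] += loglik
        counts[phone] += 1
    return {phone: totals[phone] / counts[phone] for phone in totals}


def worst_fit_phone(frames):
    """Assumed rule: the phone with the lowest average log-likelihood is the candidate."""
    averages = average_frame_loglik(frames)
    return min(averages, key=averages.get)


frames = [("A", -1.0), ("A", -3.0), ("F", -5.0)]
print(average_frame_loglik(frames), worst_fit_phone(frames))
```
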
Conclusion – It’s 75 miles to Dayton
Advice for those thinking of doing work at WPAFB
• Working in the SCREAM Lab was great
    Hundreds of processors, tons of multi-lingual corpora
    Friendly people, decent work environment (if a bit dark)
• Many hoops to jump through, even just for a summer student
    ID badges, computer usage training, etc.
• Sometimes feels like you’re working at a corporation… until the guys in uniform come around
• The base is built like a campus crossed with a prison
    Cinderblock is the building material of choice.
• Don’t forget your ID Badge
It’s 75 miles from Columbus to Dayton