Learning Latent Representations for Speech Generation and Transformation
Wei-Ning Hsu, Yu Zhang, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
Interspeech 2017
What to Expect in This Talk
1. A convolutional variational autoencoder framework to model a generative process of speech
2. A method to associate learned latent representations with physical attributes, such as speaker identity and linguistic content
3. Simple latent space arithmetic operations to modify speech attributes
(Figure: VAE diagram — an encoder predicting the mean μ_z and variance σ_z of q(z|x), a sampled z, and a decoder predicting the mean μ_x and variance σ_x of p(x|z))
Outline
1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Conclusion
Motivation
• We want to learn a generative process of speech
  1. What are the factors that affect speech generation?
  2. How do these factors play a role in speech generation?
  3. How can we infer these factors from observed speech?
• Why do we want to learn a generative process?
  • Synthesis (1, 2)
  • Recognition and verification (3)
  • Voice conversion and denoising (1, 2, 3)
Generative Model Backgrounds
• "Shallow" generative models
  • Hidden Markov model–Gaussian mixture models (HMM-GMMs)
• "Deep" generative models
  • Generative adversarial networks (GANs)
    • model p(x|z) and bypass the inference model (generator/discriminator)
  • Auto-regressive models (e.g., WaveNet)
    • model p(x_t | x_{1:t-1}) and abstain from using latent variables
  • Variational autoencoders (VAEs)
    • learn an inference model and a generative model jointly
Variational Autoencoders (VAEs)
• Define a probabilistic generative process between observation x and latent variable z
  • p(z), p(x|z), and q(z|x) are defined to be in some parametric family
• We define p(x|z) (decoder) and q(z|x) (encoder) to be diagonal Gaussians
  • Parameters (mean and variance) are predicted by neural networks
• p(z) is defined to be an isotropic Gaussian with unit variance
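The diagonal-Gaussian setup above can be sketched in a few lines. The following is a minimal numpy illustration, not the paper's actual model: the linear maps are hypothetical stand-ins for the encoder network, and only the reparameterized sampling of z and the closed-form KL term of the variational lower bound are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """q(z|x): diagonal Gaussian whose mean and log-variance are
    predicted from x (here by toy linear maps)."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_unit_gaussian(mu, logvar):
    """KL(q(z|x) || p(z)) in closed form, for p(z) = N(0, I)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

x = rng.standard_normal(8)                 # toy observation
W_mu = 0.1 * rng.standard_normal((4, 8))   # hypothetical encoder weights
W_logvar = 0.1 * rng.standard_normal((4, 8))
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)        # latent code fed to the decoder
```

Training maximizes the variational lower bound: expected reconstruction log-likelihood under p(x|z) minus this KL term.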
Convolutional Neural Network Architecture
• Encoder q(z|x): Conv1 → Conv2 → Conv3 → FC1 → Gaussian parameter layer → sample z
• Decoder p(x|z): FC2 → FC3 (+ reshape) → T-conv1 → T-conv2 → T-conv3 (Gaussian output)
*T-conv stands for transposed convolution
Experiment Setup
• Dataset: TIMIT (5.4 hr; standard 462-speaker sx/si training set)
• Speech segment dimension: T = 20 frames (with a shift of 8 frames) × F = 80 (FBank) or 200 (log magnitude spectrogram)
• Unsupervised training (i.e., no use of phonetic transcriptions)
• Training objective: variational lower bound
• Optimizer: Adam
Outline
1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Audio Demo
6. Conclusion
Speech Reconstruction Illustration
• The trained VAE is able to reconstruct speech segments
• Examples from 10 instances each of /aa/, /sh/, and /p/ (sampled at the center of the segment)
(Figure: original and reconstructed spectrograms for /aa/, /sh/, and /p/)
Latent Attribute Representations
• The VAE is encouraged to model independent factors using different latent dimensions, because the prior is assumed to be a diagonal Gaussian
• We want to associate particular dimensions with different physical attributes
• Example with a 5-dimensional latent code z = [0.3, -0.7, -0.2, 1.5, 0.4], where the first two dimensions capture speaker identity and the last three capture the phone:
  • Latent speaker representation: [0.3, -0.7, 0, 0, 0]
  • Latent phone representation: [0, 0, -0.2, 1.5, 0.4]
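Splitting a latent code into per-attribute representations amounts to zeroing out the dimensions that belong to the other attributes. A small sketch, using the slide's 5-dimensional example (the dimension assignment is the slide's illustration, not something the model declares):

```python
import numpy as np

# Example latent code from the slide: dims 0-1 ~ speaker, dims 2-4 ~ phone.
z = np.array([0.3, -0.7, -0.2, 1.5, 0.4])

def project(z, dims):
    """Keep only the chosen dimensions; zero out the rest."""
    out = np.zeros_like(z)
    out[dims] = z[dims]
    return out

z_speaker = project(z, [0, 1])     # [0.3, -0.7, 0, 0, 0]
z_phone = project(z, [2, 3, 4])    # [0, 0, -0.2, 1.5, 0.4]
```

Because the two sets of dimensions are disjoint, the per-attribute representations sum back to the original code.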
Latent Attribute Representations (cont.)
• Factors have normal distributions along their associated dimensions
• For example, if we want to estimate the latent phone representation for /aa/, we can take the mean of the latent representations of /aa/ segments across speakers:
  • Speaker A, /aa/: [0.3, -0.7, -0.2, 1.5, 0.4]
  • Speaker B, /aa/: [-0.4, 1.1, -0.2, 1.5, 0.4]
  • Speaker C, /aa/: [0.1, 0.4, -0.2, 1.5, 0.4]
  • Averaging cancels the speaker dimensions toward zero, leaving (schematically) the latent phone representation for /aa/: [0, 0, -0.2, 1.5, 0.4]
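The averaging step can be sketched directly with the slide's example vectors (the dimension assignment is illustrative; in practice the speaker dimensions only average toward zero over many speakers, so we zero them explicitly here):

```python
import numpy as np

# Latent codes of /aa/ from three speakers (values from the slide);
# the phone dims (last three) agree, the speaker dims (first two) differ.
codes = np.array([
    [ 0.3, -0.7, -0.2, 1.5, 0.4],   # speaker A, /aa/
    [-0.4,  1.1, -0.2, 1.5, 0.4],   # speaker B, /aa/
    [ 0.1,  0.4, -0.2, 1.5, 0.4],   # speaker C, /aa/
])

def latent_attribute(codes, dims):
    """Average the codes, then keep only the attribute's dimensions."""
    rep = np.zeros(codes.shape[1])
    rep[dims] = codes.mean(axis=0)[dims]
    return rep

phone_aa = latent_attribute(codes, dims=[2, 3, 4])  # [0, 0, -0.2, 1.5, 0.4]
```

The same recipe with dims=[0, 1], averaged over many segments of one speaker, yields that speaker's latent speaker representation.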
Empirical Study of the Assumptions
• We compute latent attribute representations of two attributes: latent speaker attributes and latent phone attributes
• We then compute the absolute cosine similarity between latent attribute representations
• Low similarity between representations of different attributes indicates that they occupy (nearly) disjoint latent dimensions
(Figure: example latent speaker attribute vectors, e.g. [0.3, -0.7, 0, 0, 0], and latent phone attribute vectors, e.g. [0, 0, -0.2, 1.5, 0.4])
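The similarity measure is straightforward; here is a sketch on the slide's example vectors, where speaker attributes live in the first two dimensions and phone attributes in the last three, so different-attribute pairs come out exactly orthogonal:

```python
import numpy as np

def abs_cosine(u, v):
    """Absolute cosine similarity between two vectors."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

speaker_attr = np.array([0.3, -0.7, 0.0, 0.0, 0.0])   # example from the slide
phone_attr = np.array([0.0, 0.0, -0.2, 1.5, 0.4])     # example from the slide

sim_cross = abs_cosine(speaker_attr, phone_attr)  # 0.0: disjoint dimensions
sim_self = abs_cosine(speaker_attr, speaker_attr) # 1.0
```

With real learned representations the dimensions are not exactly disjoint, so the cross-attribute similarity is small rather than zero.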
Arithmetic Operations to Modify Attributes
• The result suggests that we can modify a specific attribute without altering the others
• Suppose we want to convert the voice from speaker A to speaker B
• We can do the following operations:
  • Encode the segment to get its latent representation, e.g. z = [0.3, -0.7, -0.2, 1.5, 0.4]
  • Subtract speaker A's latent speaker representation [0.3, -0.7, 0, 0, 0] and add speaker B's [-0.4, 1.1, 0, 0, 0]
  • The result, [-0.4, 1.1, -0.2, 1.5, 0.4], leaves the phone dimensions intact; feed it to the decoder to generate the modified segment
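The voice-conversion arithmetic is a single vector update; a sketch using the slide's worked example (encode and decode steps omitted, values from the toy 5-dimensional latent space):

```python
import numpy as np

z_source = np.array([ 0.3, -0.7, -0.2, 1.5, 0.4])  # speaker A's segment, encoded
attr_A = np.array([ 0.3, -0.7,  0.0, 0.0, 0.0])    # latent speaker repr., A
attr_B = np.array([-0.4,  1.1,  0.0, 0.0, 0.0])    # latent speaker repr., B

# Replace the source speaker attribute with the target's; the phone
# dimensions are untouched, so the linguistic content is preserved.
z_modified = z_source - attr_A + attr_B   # [-0.4, 1.1, -0.2, 1.5, 0.4]
```

Decoding z_modified with p(x|z) then yields the converted segment; the same operation with phone attributes modifies the phone instead of the speaker.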
Magnitude Spectrogram Reconstruction
• The Griffin and Lim algorithm is used for waveform reconstruction
  • Iteratively estimates the phase
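The Griffin-Lim loop alternates between the time and frequency domains, keeping the given magnitudes and re-estimating only the phase. A minimal numpy sketch with a toy STFT; the window and hop sizes here are arbitrary choices for illustration, not the paper's settings:

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    """Overlap-add inverse STFT with window-energy normalization."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(S):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=256, hop=64, seed=0):
    """Recover a waveform from a magnitude spectrogram by iteratively
    re-estimating the missing phase (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)              # back to time domain
        S = stft(x, n_fft, hop)                         # re-analyze
        phase = S / np.maximum(np.abs(S), 1e-8)         # keep only the phase
    return istft(mag * phase, n_fft, hop)

t = np.arange(1024)
x = np.sin(2 * np.pi * 0.03 * t)         # toy waveform
y = griffin_lim(np.abs(stft(x)))         # reconstruct from magnitude only
```

Library implementations (e.g. librosa's `griffinlim`) do the same thing with a production-quality STFT.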
Modify the Phoneme
• Modifying /aa/ to /ae/, F2 goes up (back vowel → front vowel)
(Figure: three /aa/ → /ae/ spectrogram pairs)
Modify the Phoneme
• Modifying /s/ to /sh/, the cutoff frequency goes down (alveolar → palatal strident)
(Figure: three /s/ → /sh/ spectrogram pairs)
Modify the Speaker
• Modifying a female speaker to a male speaker, the pitch decreases
Modify the Speaker
• Modifying a male speaker to a female speaker, the pitch increases

Modify the Speaker for an Entire Utterance
• We choose an utterance from a male speaker (madc0)
• We modify it to another male speaker (mabc0), and to a female speaker (fajw0)
• Each speaker has only 8 utterances in the set (~4 s/utterance)
• We estimate the latent speaker representation using only 30 s of speech
(Figures: original speaker — original vs. reconstructed spectrogram; conversion to speaker mabc0 — original vs. modified spectrogram; conversion to speaker fajw0 — original vs. modified spectrogram)
Quantitative Evaluation
• We train discriminators for phone classification and speaker classification
• Posteriors serve as the quantitative metric: the discriminators' "mean opinion score" on the two attributes
  • The posterior of the target attribute increases; the posterior of the source attribute decreases
  • The posteriors of irrelevant attributes remain unchanged
Conclusion and Future Work
• We present a CNN-VAE to model the generation process of speech segments
• The framework leverages vast quantities of unannotated data to learn a general speech analyzer and a general speech synthesizer
• We demonstrate qualitatively and quantitatively the ability to modify speech attributes
• We have applied the modification operation to data augmentation for ASR and achieved significant improvement for domain adaptation (submitted to ASRU)
• For future work, we plan to investigate the use of VAEs for voice conversion and speech denoising under the setting of no parallel training data
Thanks for Listening. Q&A
Paper, slides, samples, and follow-up work can be found at http://people.csail.mit.edu/wnhsu/