Learning Latent Representations for Speech Generation and Transformation
Wei-Ning Hsu, Yu Zhang, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
Interspeech 2017
What to Expect in This Talk
1. A convolutional variational autoencoder framework to model a generative process of speech
2. A method to associate learned latent representations with physical attributes, such as speaker identity and linguistic content
3. Simple latent space arithmetic operations to modify speech attributes
(Figure: VAE diagram — an encoder predicting the mean μ_z and variance σ_z of q(z|x), a sampled z, and a decoder predicting the mean μ_x and variance σ_x of p(x|z))
Outline
1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Conclusion
Motivation
• We want to learn a generative process of speech
  1. What are the factors that affect speech generation?
  2. How do these factors play a role in speech generation?
  3. How can we infer these factors from observed speech?
• Why do we want to learn a generative process?
  • Synthesis (1, 2)
  • Recognition and verification (3)
  • Voice conversion and denoising (1, 2, 3)
Generative Model Backgrounds
• "Shallow" generative models
  • Hidden Markov model–Gaussian mixture models (HMM-GMMs)
• "Deep" generative models
  • Generative adversarial networks (GANs)
    • model p(x|z) and bypass the inference model (generator/discriminator)
  • Auto-regressive models (e.g., WaveNet)
    • model p(x_t | x_{1:t-1}) and abstain from using latent variables
  • Variational autoencoders (VAEs)
    • learn an inference model and a generative model jointly
Variational Autoencoders (VAEs)
• Define a probabilistic generative process between observation x and latent variable z
  • p(z), p(x|z), and q(z|x) are defined to be in some parametric family
• We define p(x|z) (decoder) and q(z|x) (encoder) to be diagonal Gaussians
  • Parameters (mean and variance) are predicted by neural networks
• p(z) is defined to be an isotropic Gaussian with unit variance
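The diagonal-Gaussian setup above can be sketched in a few lines. The following is a minimal numpy illustration, not the paper's actual model: the linear maps are hypothetical stand-ins for the encoder network, and only the reparameterized sampling of z and the closed-form KL term of the variational lower bound are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """q(z|x): diagonal Gaussian whose mean and log-variance are
    predicted from x (here by toy linear maps)."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_unit_gaussian(mu, logvar):
    """KL(q(z|x) || p(z)) in closed form, for p(z) = N(0, I)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

x = rng.standard_normal(8)                 # toy observation
W_mu = 0.1 * rng.standard_normal((4, 8))   # hypothetical encoder weights
W_logvar = 0.1 * rng.standard_normal((4, 8))
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)        # latent code fed to the decoder
```

Training maximizes the variational lower bound: expected reconstruction log-likelihood under p(x|z) minus this KL term.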
Convolutional Neural Network Architecture
• Encoder q(z|x): Conv1 → Conv2 → Conv3 → FC1 → Gaussian parameter layer → sample z
• Decoder p(x|z): FC2 → FC3 (+ reshape) → T-conv1 → T-conv2 → T-conv3 (Gaussian output)
*T-conv stands for transposed convolution
Experiment Setup
• Dataset: TIMIT (5.4 hr; standard 462-speaker sx/si training set)
• Speech segment dimension: T = 20 frames (with a shift of 8 frames) × F = 80 (FBank) or 200 (log magnitude spectrogram)
• Unsupervised training (i.e., no use of phonetic transcriptions)
• Training objective: variational lower bound
• Optimizer: Adam
Outline
1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Audio Demo
6. Conclusion
Speech Reconstruction Illustration
• The trained VAE is able to reconstruct speech segments
• Examples from 10 instances each of /aa/, /sh/, and /p/ (sampled at the center of the segment)
(Figure: original and reconstructed spectrograms for /aa/, /sh/, and /p/)
Latent Attribute Representations
• The VAE is encouraged to model independent factors using different latent dimensions, because the prior is assumed to be a diagonal Gaussian
• We want to associate particular dimensions with different physical attributes
• Example with a 5-dimensional latent code z = [0.3, -0.7, -0.2, 1.5, 0.4], where the first two dimensions capture speaker identity and the last three capture the phone:
  • Latent speaker representation: [0.3, -0.7, 0, 0, 0]
  • Latent phone representation: [0, 0, -0.2, 1.5, 0.4]
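Splitting a latent code into per-attribute representations amounts to zeroing out the dimensions that belong to the other attributes. A small sketch, using the slide's 5-dimensional example (the dimension assignment is the slide's illustration, not something the model declares):

```python
import numpy as np

# Example latent code from the slide: dims 0-1 ~ speaker, dims 2-4 ~ phone.
z = np.array([0.3, -0.7, -0.2, 1.5, 0.4])

def project(z, dims):
    """Keep only the chosen dimensions; zero out the rest."""
    out = np.zeros_like(z)
    out[dims] = z[dims]
    return out

z_speaker = project(z, [0, 1])     # [0.3, -0.7, 0, 0, 0]
z_phone = project(z, [2, 3, 4])    # [0, 0, -0.2, 1.5, 0.4]
```

Because the two sets of dimensions are disjoint, the per-attribute representations sum back to the original code.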
Latent Attribute Representations (cont.)
• Factors have normal distributions along their associated dimensions
• For example, if we want to estimate the latent phone representation for /aa/, we can take the mean of the latent representations of /aa/ segments across speakers:
  • Speaker A, /aa/: [0.3, -0.7, -0.2, 1.5, 0.4]
  • Speaker B, /aa/: [-0.4, 1.1, -0.2, 1.5, 0.4]
  • Speaker C, /aa/: [0.1, 0.4, -0.2, 1.5, 0.4]
  • Averaging cancels the speaker dimensions toward zero, leaving (schematically) the latent phone representation for /aa/: [0, 0, -0.2, 1.5, 0.4]
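The averaging step can be sketched directly with the slide's example vectors (the dimension assignment is illustrative; in practice the speaker dimensions only average toward zero over many speakers, so we zero them explicitly here):

```python
import numpy as np

# Latent codes of /aa/ from three speakers (values from the slide);
# the phone dims (last three) agree, the speaker dims (first two) differ.
codes = np.array([
    [ 0.3, -0.7, -0.2, 1.5, 0.4],   # speaker A, /aa/
    [-0.4,  1.1, -0.2, 1.5, 0.4],   # speaker B, /aa/
    [ 0.1,  0.4, -0.2, 1.5, 0.4],   # speaker C, /aa/
])

def latent_attribute(codes, dims):
    """Average the codes, then keep only the attribute's dimensions."""
    rep = np.zeros(codes.shape[1])
    rep[dims] = codes.mean(axis=0)[dims]
    return rep

phone_aa = latent_attribute(codes, dims=[2, 3, 4])  # [0, 0, -0.2, 1.5, 0.4]
```

The same recipe with dims=[0, 1], averaged over many segments of one speaker, yields that speaker's latent speaker representation.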
Empirical Study of the Assumptions
• We compute latent attribute representations of two attributes: latent speaker attributes and latent phone attributes
• We then compute the absolute cosine similarity between latent attribute representations
• Low similarity between representations of different attributes indicates that they occupy (nearly) disjoint latent dimensions
(Figure: example latent speaker attribute vectors, e.g. [0.3, -0.7, 0, 0, 0], and latent phone attribute vectors, e.g. [0, 0, -0.2, 1.5, 0.4])
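The similarity measure is straightforward; here is a sketch on the slide's example vectors, where speaker attributes live in the first two dimensions and phone attributes in the last three, so different-attribute pairs come out exactly orthogonal:

```python
import numpy as np

def abs_cosine(u, v):
    """Absolute cosine similarity between two vectors."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

speaker_attr = np.array([0.3, -0.7, 0.0, 0.0, 0.0])   # example from the slide
phone_attr = np.array([0.0, 0.0, -0.2, 1.5, 0.4])     # example from the slide

sim_cross = abs_cosine(speaker_attr, phone_attr)  # 0.0: disjoint dimensions
sim_self = abs_cosine(speaker_attr, speaker_attr) # 1.0
```

With real learned representations the dimensions are not exactly disjoint, so the cross-attribute similarity is small rather than zero.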
Arithmetic Operations to Modify Attributes
• The result suggests that we can modify a specific attribute without altering the others
• Suppose we want to convert the voice from speaker A to speaker B
• We can do the following operations:
  • Encode the segment to get its latent representation, e.g. z = [0.3, -0.7, -0.2, 1.5, 0.4]
  • Subtract speaker A's latent speaker representation [0.3, -0.7, 0, 0, 0] and add speaker B's [-0.4, 1.1, 0, 0, 0]
  • The result, [-0.4, 1.1, -0.2, 1.5, 0.4], leaves the phone dimensions intact; feed it to the decoder to generate the modified segment
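The voice-conversion arithmetic is a single vector update; a sketch using the slide's worked example (encode and decode steps omitted, values from the toy 5-dimensional latent space):

```python
import numpy as np

z_source = np.array([ 0.3, -0.7, -0.2, 1.5, 0.4])  # speaker A's segment, encoded
attr_A = np.array([ 0.3, -0.7,  0.0, 0.0, 0.0])    # latent speaker repr., A
attr_B = np.array([-0.4,  1.1,  0.0, 0.0, 0.0])    # latent speaker repr., B

# Replace the source speaker attribute with the target's; the phone
# dimensions are untouched, so the linguistic content is preserved.
z_modified = z_source - attr_A + attr_B   # [-0.4, 1.1, -0.2, 1.5, 0.4]
```

Decoding z_modified with p(x|z) then yields the converted segment; the same operation with phone attributes modifies the phone instead of the speaker.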
Magnitude Spectrogram Reconstruction
• The Griffin and Lim algorithm is used for waveform reconstruction
  • Iteratively estimates the phase
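The Griffin-Lim loop alternates between the time and frequency domains, keeping the given magnitudes and re-estimating only the phase. A minimal numpy sketch with a toy STFT; the window and hop sizes here are arbitrary choices for illustration, not the paper's settings:

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    """Overlap-add inverse STFT with window-energy normalization."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(S):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=256, hop=64, seed=0):
    """Recover a waveform from a magnitude spectrogram by iteratively
    re-estimating the missing phase (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)              # back to time domain
        S = stft(x, n_fft, hop)                         # re-analyze
        phase = S / np.maximum(np.abs(S), 1e-8)         # keep only the phase
    return istft(mag * phase, n_fft, hop)

t = np.arange(1024)
x = np.sin(2 * np.pi * 0.03 * t)         # toy waveform
y = griffin_lim(np.abs(stft(x)))         # reconstruct from magnitude only
```

Library implementations (e.g. librosa's `griffinlim`) do the same thing with a production-quality STFT.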
Modify the Phoneme
• Modifying /aa/ to /ae/, F2 goes up (back vowel → front vowel)
(Figure: three /aa/ → /ae/ spectrogram pairs)
Modify the Phoneme
• Modifying /s/ to /sh/, the cutoff frequency goes down (alveolar → palatal strident)
(Figure: three /s/ → /sh/ spectrogram pairs)
Modify the Speaker
• Modifying a female speaker to a male speaker, the pitch decreases
Modify the Speaker
• Modifying a male speaker to a female speaker, the pitch increases

Modify the Speaker for an Entire Utterance
• We choose an utterance from a male speaker (madc0)
• We modify it to another male speaker (mabc0), and to a female speaker (fajw0)
• Each speaker has only 8 utterances in the set (~4 s/utterance)
• We estimate the latent speaker representation using only 30 s of speech
(Figures: original speaker — original vs. reconstructed spectrogram; conversion to speaker mabc0 — original vs. modified spectrogram; conversion to speaker fajw0 — original vs. modified spectrogram)
Quantitative Evaluation
• We train discriminators for phone classification and speaker classification
• Posteriors serve as the quantitative metric: the discriminators' "mean opinion score" on the two attributes
  • The posterior of the target attribute increases; the posterior of the source attribute decreases
  • The posteriors of irrelevant attributes remain unchanged
Conclusion and Future Work
• We present a CNN-VAE to model the generation process of speech segments
• The framework leverages vast quantities of unannotated data to learn a general speech analyzer and a general speech synthesizer
• We demonstrate qualitatively and quantitatively the ability to modify speech attributes
• We have applied the modification operation to data augmentation for ASR and achieved significant improvement for domain adaptation (submitted to ASRU)
• For future work, we plan to investigate the use of VAEs for voice conversion and speech denoising under the setting of no parallel training data
Thanks for Listening. Q&A
Paper, slides, samples, and follow-up work can be found at http://people.csail.mit.edu/wnhsu/