Learning Latent Representations for Speech Generation and Transformation

Wei-Ning Hsu, Yu Zhang, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

Interspeech 2017


Page 1: Learning Latent Representations for Speech Generation and …people.csail.mit.edu/wnhsu/assets/pdf/is17_learning_v2... · 2019. 10. 18. · What to Expect in This Talk 1. A convolutional


What to Expect in This Talk

1. A convolutional variational autoencoder framework to model a generative process of speech
2. A method to associate learned latent representations with physical attributes, such as speaker identity and linguistic content
3. Simple latent space arithmetic operations to modify speech attributes

[Diagram: 𝒙 -> Encoder 𝑞(𝒛|𝒙) -> 𝝁𝒛, 𝝈𝒛 -> 𝒛 -> Decoder 𝑝(𝒙|𝒛) -> 𝝁𝒙, 𝝈𝒙]

Outline

1. Motivation
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Conclusion


Motivation

• We want to learn a generative process of speech
  1. What are the factors that affect speech generation?
  2. How do these factors play a role in speech generation?
  3. How can we infer these factors from observed speech?

• Why do we want to learn a generative process?
  • Synthesis (1, 2)
  • Recognition and verification (3)
  • Voice conversion and denoising (1, 2, 3)

[Figure text: "Welcome!"]


Generative Model Backgrounds

• "Shallow" generative models
  • Hidden Markov model - Gaussian mixture models (HMM-GMMs)

• "Deep" generative models
  • Generative adversarial networks (GANs)
    • model 𝑝(𝒙|𝒛) and bypass the inference model (generator/discriminator)
  • Auto-regressive models (e.g. WaveNets)
    • model 𝑝(𝒙_t | 𝒙_{1:t-1}) and abstain from using latent variables
  • Variational autoencoders (VAEs)
    • learn an inference model and a generative model jointly

Variational Autoencoders (VAEs)

• Define a probabilistic generative process between observation 𝒙 and latent variable 𝒛
  • 𝑝(𝒛), 𝑝(𝒙|𝒛), and 𝑞(𝒛|𝒙) are defined to be in some parametric family
• We define 𝑝(𝒙|𝒛) (decoder) and 𝑞(𝒛|𝒙) (encoder) to be diagonal Gaussians
  • Parameters (mean and variance) are described using some NN
• 𝑝(𝒛) is defined to be isotropic Gaussian with unit variance

[Diagram: 𝒙 -> Encoder 𝑞(𝒛|𝒙) -> 𝝁𝒛, 𝝈𝒛 -> 𝒛 -> Decoder 𝑝(𝒙|𝒛) -> 𝝁𝒙, 𝝈𝒙]
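The Gaussian choices on this slide make the variational lower bound cheap to evaluate. The sketch below is only an illustration in plain Python, not the authors' implementation: the function names are hypothetical, `decode` stands in for the decoder network, and standard deviations (rather than variances) are passed around as an assumption of this example.

```python
import math
import random


def sample_latent(mu_z, sigma_z, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_z, sigma_z)]


def kl_to_unit_gaussian(mu_z, sigma_z):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu_z, sigma_z))


def gaussian_log_likelihood(x, mu_x, sigma_x):
    """log N(x; mu_x, diag(sigma_x^2)), summed over observed dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(x, mu_x, sigma_x)
    )


def elbo(x, mu_z, sigma_z, decode, rng):
    """Single-sample estimate of the variational lower bound for one input.

    `decode` is a hypothetical callable mapping z to (mu_x, sigma_x).
    """
    z = sample_latent(mu_z, sigma_z, rng)
    mu_x, sigma_x = decode(z)
    return gaussian_log_likelihood(x, mu_x, sigma_x) - kl_to_unit_gaussian(mu_z, sigma_z)
```

The KL term vanishes exactly when the encoder output equals the prior (zero mean, unit standard deviation), which is the pull toward an isotropic latent space that the later slides rely on.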

Convolutional Neural Network Architecture

Encoder 𝑞(𝒛|𝒙): 𝒙 -> Conv1 -> Conv2 -> Conv3 -> FC1 -> Gauss (𝝁𝒛, 𝝈𝒛) -> Sample 𝒛
Decoder 𝑝(𝒙|𝒛): 𝒛 -> FC2 -> FC3 (+ reshape) -> T-conv1 -> T-conv2 -> T-conv3 (Gauss: 𝝁𝒙, 𝝈𝒙)

* T-conv stands for transposed convolution


Experiment Setup

• Dataset: TIMIT (5.4 hr) (standard 462-speaker sx/si training set)
• Speech Segment Dimension:
  • Unsupervised training (i.e., no use of phonetic transcription)
  • T = 20 frames (with shift of 8 frames)
  • F = 80 (FBank) or 200 (Log Magnitude Spectrogram)
• Training Objective: Variational Lower Bound
• Optimizer: Adam

[Figure: segmentation of an utterance, annotated with the 20-frame segment length and the 8-frame shift]
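The segment settings (T = 20 frames, shift of 8 frames) amount to a sliding window over each utterance. A minimal sketch, assuming an utterance is a list of per-frame feature vectors and that trailing frames which do not fill a whole window are dropped (the slides do not say how the tail is handled); `slice_segments` is a hypothetical helper name.

```python
def slice_segments(frames, seg_len=20, shift=8):
    """Return every full window of `seg_len` frames, advancing by `shift` frames."""
    return [frames[start:start + seg_len]
            for start in range(0, len(frames) - seg_len + 1, shift)]


# Integers stand in for feature frames here so the window boundaries are easy to read.
utterance = list(range(100))
segments = slice_segments(utterance)
```

With 100 frames the windows start at 0, 8, ..., 80, so consecutive segments overlap by 12 frames.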

Outline

1. Motivations
2. Background and Models
3. Latent Attribute Representations and Operations
4. Experiments
5. Audio Demo
6. Conclusion

Speech Reconstruction Illustration

• The trained VAE is able to reconstruct speech segments
• Examples from 10 instances of /aa/, /sh/, and /p/ (sampled at center of segment)

[Figure panels: /aa/, /sh/, /p/]


Latent Attribute Representations

• VAE is encouraged to model independent factors using different dimensions
  • Because the prior is assumed to be a diagonal Gaussian
• We want to associate particular dimensions with different physical attributes

[Diagram: Encoder -> latent vector (0.3, -0.7, -0.2, 1.5, 0.4) -> Decoder; the first two dimensions are labeled Speaker Identity, the last three Phone]

Latent Speaker Representation: (0.3, -0.7, 0, 0, 0)
Latent Phone Representation: (0, 0, -0.2, 1.5, 0.4)


Latent Attribute Representations

• Factors have normal distributions along their associated dimensions
• For example, if we want to estimate the latent phone representation for /aa/:
  • We can estimate the latent attribute by taking the mean of the latent representations

Speaker A, /aa/: (0.3, -0.7, -0.2, 1.5, 0.4)
Speaker B, /aa/: (-0.4, 1.1, -0.2, 1.5, 0.4)
Speaker C, /aa/: (0.1, 0.4, -0.2, 1.5, 0.4)
Average -> Latent Phone Representation for /aa/: (0, 0, -0.2, 1.5, 0.4)
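The averaging step can be tried directly on the toy vectors from this slide. The sketch below is illustrative only; `mean_latent` is a hypothetical helper, and the five-dimensional vectors are the slide's cartoon values, not real encoder outputs.

```python
def mean_latent(vectors):
    """Element-wise mean of equal-length latent vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


# /aa/ segments from three speakers (toy values copied from the slide).
aa_latents = [
    [0.3, -0.7, -0.2, 1.5, 0.4],   # speaker A
    [-0.4, 1.1, -0.2, 1.5, 0.4],   # speaker B
    [0.1, 0.4, -0.2, 1.5, 0.4],    # speaker C
]
aa_phone = mean_latent(aa_latents)
```

With only these three speakers the phone dimensions come out as (-0.2, 1.5, 0.4) and the first speaker dimension averages to about zero, but the second is still roughly 0.27. The zeros shown on the slide are presumably the idealized outcome of averaging over many speakers, since each factor is assumed to be zero-mean along its dimensions.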


Empirical Study of the Assumptions

• We compute latent attribute representations of two attributes:

Latent Speaker Attribute: (0.3, -0.7, 0, 0, 0), (-0.4, 1.1, 0, 0, 0), (0.1, 0.4, 0, 0, 0)
Latent Phone Attribute: (0, 0, -0.2, 1.5, 0.4), (0, 0, 0.8, -0.3, 0.2), (0, 0, -0.9, -0.2, -0.8)

• Compute the absolute cosine similarity between latent attribute representations
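The similarity check can be sketched on the toy vectors above. This is an illustration under the slide's idealized numbers, not the reported experiment; `abs_cosine` is a hypothetical helper name.

```python
import math


def abs_cosine(u, v):
    """Absolute cosine similarity |u . v| / (||u|| ||v||) of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


speaker_a = [0.3, -0.7, 0.0, 0.0, 0.0]
speaker_b = [-0.4, 1.1, 0.0, 0.0, 0.0]
phone_1 = [0.0, 0.0, -0.2, 1.5, 0.4]

cross_attribute = abs_cosine(speaker_a, phone_1)   # speaker vs. phone
same_attribute = abs_cosine(speaker_a, speaker_b)  # speaker vs. speaker
```

In this cartoon the speaker and phone vectors occupy disjoint dimensions, so their similarity is exactly zero, while the two speaker vectors are nearly collinear. Real latent attribute vectors would only be expected to approximate this pattern.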



Arithmetic Operations to Modify Attributes

• The result suggests that we can modify a specific attribute without altering the others
• Suppose we want to convert the voice from speaker A (light blue) to speaker B (dark blue)
• We can do the following operations:

Encoder output (Speaker Identity | Phone): (0.3, -0.7, -0.2, 1.5, 0.4)
  - latent speaker representation of A: (0.3, -0.7, 0, 0, 0)
  = (0, 0, -0.2, 1.5, 0.4)
  + latent speaker representation of B: (-0.4, 1.1, 0, 0, 0)
  = (-0.4, 1.1, -0.2, 1.5, 0.4) -> Decoder
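The conversion shown on this slide is a subtract-then-add in latent space. A minimal sketch with the slide's toy numbers; `convert_speaker` is a hypothetical helper, and in practice the result would be passed to the decoder rather than inspected directly.

```python
def convert_speaker(z, source_speaker, target_speaker):
    """Remove the source speaker's latent representation and add the target's."""
    return [zi - si + ti for zi, si, ti in zip(z, source_speaker, target_speaker)]


z_a = [0.3, -0.7, -0.2, 1.5, 0.4]        # encoder output for a segment from speaker A
speaker_a = [0.3, -0.7, 0.0, 0.0, 0.0]   # latent speaker representation of A
speaker_b = [-0.4, 1.1, 0.0, 0.0, 0.0]   # latent speaker representation of B

z_b = convert_speaker(z_a, speaker_a, speaker_b)
```

Only the two speaker dimensions change; the phone dimensions (-0.2, 1.5, 0.4) pass through untouched, which is the point of the slide's first bullet.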


Magnitude Spectrogram Reconstruction

• Griffin and Lim algorithm is used for waveform reconstruction
  • Iteratively estimate phase
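For reference, the phase-estimation loop is usually written as below. This is the textbook form of the Griffin-Lim iteration, not a detail taken from the slides; the symbols $|S|$ (the target magnitude spectrogram, e.g. the exponentiated decoder output when the model is trained on log magnitudes), $\phi^{(k)}$ (the phase estimate at iteration $k$), and the random initialization are assumptions of this sketch.

```latex
\begin{aligned}
\phi^{(0)} &\sim \text{random phase}, \\
x^{(k)} &= \operatorname{ISTFT}\!\left( |S| \odot e^{\,j\phi^{(k)}} \right), \\
\phi^{(k+1)} &= \angle\, \operatorname{STFT}\!\left( x^{(k)} \right), \qquad k = 0, 1, 2, \dots
\end{aligned}
```

Each pass keeps the magnitude fixed at $|S|$ and replaces only the phase, so the waveform $x^{(k)}$ moves toward one whose spectrogram is consistent with the target magnitude.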

Modify the Phoneme

• Modify /aa/ to /ae/, F2 goes up (back vowel -> front vowel)
  [Figure panels: /aa/ and /ae/, three pairs]
• Modify /s/ to /sh/, cutoff goes down (alveolar -> palatal strident)
  [Figure panels: /s/ and /sh/, three pairs]

Modify the Speaker

• Modify a female to a male, pitch decreases
• Modify a male to a female, pitch increases

Modify the Speaker for an Entire Utterance

• We choose an utterance from a male speaker (madc0)
  • Modify to another male speaker (mabc0), and a female speaker (fajw0)
• Each speaker has only 8 utterances in the set
  • ~4 s/utterance
• Estimate the latent speaker representation using only 30 s of speech

[Figure: Original Speaker: (top) original spectrogram, (bottom) reconstructed spectrogram]
[Figure: Convert to Speaker mabc0: (top) original spectrogram, (bottom) modified spectrogram]
[Figure: Convert to Speaker fajw0: (top) original spectrogram, (bottom) modified spectrogram]

Quantitative Evaluation

• We train discriminators for phone classification and speaker classification
• Posteriors as the quantitative metric
  • Discriminators' mean opinion score on the two attributes
  • Posterior of target attribute increases; posterior of source attribute decreases
  • Posteriors of irrelevant attributes unchanged
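One way to read the posterior metric is as a change in a classifier's average posterior for a given class before and after modification. The sketch below is an assumption about how such a number could be computed, not the authors' evaluation code; the helper names are hypothetical and the posterior values are made-up toy numbers, not results from the talk.

```python
def mean_posterior(posteriors, class_index):
    """Average posterior probability assigned to one class over a set of segments."""
    return sum(p[class_index] for p in posteriors) / len(posteriors)


def posterior_shift(before, after, class_index):
    """Change in mean posterior for a class after the latent modification."""
    return mean_posterior(after, class_index) - mean_posterior(before, class_index)


# Toy classifier outputs over classes (source, target, other); values are invented.
before = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
after = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]

source_shift = posterior_shift(before, after, 0)
target_shift = posterior_shift(before, after, 1)
other_shift = posterior_shift(before, after, 2)
```

A successful modification in the sense of this slide would show a negative shift for the source class, a positive shift for the target class, and a shift near zero for classes of the attribute that was not modified.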



Conclusion and Future Work

• We present a CNN-VAE to model the generation process of speech segments
• The framework leverages vast quantities of unannotated data to learn a general speech analyzer and a general speech synthesizer.
• We demonstrate qualitatively and quantitatively the ability to modify speech attributes.
• We have applied the modification operation to data augmentation for ASR and achieved significant improvement for domain adaptation. (submitted to ASRU)
• For future work, we plan to investigate the use of VAE on voice conversion and speech de-noising under the setting of no parallel training data.

Thanks for Listening. Q&A?

Paper, slides, samples and follow-up works can be found on http://people.csail.mit.edu/wnhsu/