
WaveNet Vocoder and its Applications in Voice Conversion

Wen-Chin Huang*, Chen-Chou Lo*, Hsin-Te Hwang*, Yu Tsao**, Hsin-Min Wang*

*Institute of Information Science, Academia Sinica

**Research Center for Information Technology Innovation, Academia Sinica


The 2018 Conference on Computational Linguistics and Speech Processing ROCLING 2018, pp. 96-110 ©The Association for Computational Linguistics and Chinese Language Processing



Abstract

Most voice conversion models rely on vocoders based on the source-filter model to extract

speech parameters and synthesize speech. However, the naturalness and similarity of the

converted speech are limited due to the various assumptions and constraints imposed by traditional

vocoders. In the field of deep learning, a network structure called WaveNet is one of the state-

of-the-art techniques in speech synthesis, which is capable of generating speech samples of

extremely high quality compared with past methods. One of the extensions of WaveNet is the

WaveNet vocoder. Its ability to synthesize speech of quality higher than traditional vocoders

has made it gradually adopted by several foreign voice conversion research teams. In this work,

we study the combination of the WaveNet vocoder with the voice conversion models recently

developed by domestic research teams, in order to evaluate the potential of applying the

WaveNet vocoder to these voice conversion models and to introduce the WaveNet vocoder to

the domestic speech processing research community. In the experiments, we compared the

converted speech generated by three voice conversion models using a traditional WORLD

vocoder and the WaveNet vocoder, respectively. The compared voice conversion models

include 1) variational auto-encoder (VAE), 2) variational autoencoding Wasserstein generative

adversarial network (VAW-GAN), and 3) cross domain variational auto-encoder (CDVAE).

Experimental results show that, using the WaveNet vocoder, the similarity between the

converted speech generated by all the three models and the target speech is significantly

improved. As for naturalness, only VAE benefits from the WaveNet vocoder.

Keywords: WaveNet, Vocoder, Voice Conversion, Variational Auto-Encoder.


1. Introduction

Voice conversion aims to convert a source speech signal into a target speech signal while keeping the linguistic content unchanged. Its applications include converting narrowband speech with a limited frequency range into wideband speech with a richer frequency range [1], post-processing the output speech of text-to-speech systems [2], and converting the abnormal speech of patients with impaired vocal organs into normal speech [3]. Among these applications, the most widely studied one is speaker voice conversion (hereafter referred to simply as voice conversion), which converts the speech of a source speaker into speech that sounds as if it were uttered by a target speaker, while preserving the linguistic content [4].

Conventional voice conversion methods are built upon the analysis-conversion-synthesis framework of vocoders. The speech waveform is first decomposed (i.e., analyzed) into the parameters required by the vocoder, such as the spectrum, the prosody, and the excitation (collectively referred to as the speech parameters); these speech parameters are then converted, and the converted parameters are finally fed back into the vocoder to reconstruct (i.e., synthesize) the speech waveform. In other words, a complete voice conversion system has to deal with three sub-problems: analysis, conversion, and synthesis. The naturalness of the converted speech and its similarity to the target speaker are therefore closely related not only to the conversion method itself, but also to the analysis and reconstruction performed by the vocoder. The design of vocoders is mostly based on the theory of the source-filter model, including the classic linear predictive coding [5] and the high-quality vocoders widely used in voice conversion and speech synthesis (e.g., text-to-speech), such as the STRAIGHT [6] and WORLD [7] vocoders. However, vocoders designed upon the source-filter theory impose many assumptions and constraints on the generation process of speech, so the quality and naturalness of the reconstructed speech remain to be improved.

In recent years, techniques based on deep neural networks (DNNs) have brought major breakthroughs to speech generation. Among them, WaveNet [8] is one of the state-of-the-art speech generation techniques. Applied to text-to-speech, the network directly predicts the speech samples from linguistic and acoustic features, and can thus directly generate high-quality speech waveforms remarkably close to real speech [8]. Subsequently, vocoder techniques based on WaveNet were proposed [9, 10]. The WaveNet vocoder is a data-driven vocoder, i.e., a vocoder learned from a large amount of natural speech data. Its most notable characteristic is that it does not rely on the waveform reconstruction procedures of traditional vocoders: specifically, taking the speech parameters extracted by a traditional vocoder as input, the WaveNet vocoder directly generates the speech waveform at its output. Since the WaveNet vocoder generates the waveform directly (which contains both the magnitude spectrum and the phase information), it avoids the errors introduced when traditional vocoders process the magnitude spectrum and the phase with signal processing techniques and then combine them to reconstruct the waveform, and is therefore able to generate speech more natural than that of traditional vocoders.

Several novel voice conversion techniques have been proposed domestically in recent years [11, 12], but most of them still use traditional vocoders, so the quality of the converted speech leaves room for improvement. Since the advent of the WaveNet vocoder, foreign voice conversion research teams have gradually adopted it; for example, in the Voice Conversion Challenge 2018 (VCC2018) [13], the best-performing teams used the WaveNet vocoder in their voice conversion systems [14, 15], effectively improving the naturalness of the converted speech and its similarity to the target speaker. This paper first gives a preliminary introduction to the WaveNet vocoder, then combines it with the voice conversion techniques recently developed by domestic research teams and compares the performance before and after the combination, hoping to provide a reference for domestic voice conversion and speech generation researchers.

The rest of this paper is organized as follows. Section 2 introduces the WaveNet vocoder. Section 3 introduces the several voice conversion techniques recently developed by domestic teams. Section 4 presents the experimental results and analysis. Finally, Section 5 gives the conclusion and future work.

2. The WaveNet Vocoder

2.1 Network Architecture

WaveNet [8] is a deep autoregressive network that generates a waveform sample by sample through conditional probabilities, and is capable of generating speech waveforms of extremely high quality:

P(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, \ldots, x_{t-R}, \mathbf{h}), (1)

where T denotes the number of samples of the waveform to be generated, R denotes the size of the receptive field, and x_t denotes the t-th speech sample to be generated. Each sample is represented as one of a finite number of quantized classes (e.g., with 16-bit quantization, there are 2^16 = 65536 classes), and \mathbf{h} denotes the auxiliary features. When WaveNet is used as a vocoder (e.g., [9, 10]), the auxiliary features consist of the speech parameters extracted by a traditional vocoder such as STRAIGHT [6] or WORLD [7], including the spectral features, the fundamental frequency (F0), and the aperiodicity parameters that describe the mixed excitation
(aperiodicity).
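The chain-rule factorization in Eq. (1) can be made concrete with a small sketch. The toy Python example below is purely illustrative: `toy_conditional` is a made-up stand-in for WaveNet's softmax output, not the actual network. It evaluates the log-likelihood of a short quantized sequence as the sum of per-sample conditional log-probabilities.

```python
import numpy as np

def toy_conditional(prev_samples, n_classes=8):
    """A dummy stand-in for WaveNet's softmax output: a categorical
    distribution that favors values close to the previous sample."""
    last = prev_samples[-1] if len(prev_samples) else n_classes // 2
    logits = -np.abs(np.arange(n_classes) - last).astype(float)
    probs = np.exp(logits)
    return probs / probs.sum()

def sequence_log_likelihood(x, n_classes=8):
    """Chain-rule evaluation of Eq. (1): the sequence log-probability is
    the sum of per-sample conditional log-probabilities."""
    return sum(np.log(toy_conditional(x[:t], n_classes)[x[t]])
               for t in range(len(x)))

x = [3, 3, 4, 4, 5]          # a short quantized "waveform" with 8 classes
ll = sequence_log_likelihood(x)
```

Swapping `toy_conditional` for a trained network that also consumes the auxiliary features \mathbf{h} recovers exactly the conditional model of Eq. (1).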

Given the auxiliary features, according to Eq. (1), WaveNet predicts the distribution of the speech samples one by one. As shown in Figure 1, a complete WaveNet vocoder is composed of a stack of residual blocks; each residual block contains a 2×1 dilated causal convolution, a gated activation function, and a 1×1 convolution. The dilated causal convolutions efficiently enlarge the receptive field of the network, and the gated activation function is formulated as:

\tanh(W_{f,k} * \mathbf{x} + V_{f,k} * \mathbf{h}) \odot \sigma(W_{g,k} * \mathbf{x} + V_{g,k} * \mathbf{h}), (2)

where \mathbf{x} is the output of the dilated causal convolution, W and V are learnable convolution filters, * denotes the convolution operation, \odot denotes element-wise multiplication, and \sigma(\cdot) denotes the sigmoid function. It is worth noting that WaveNet treats the generation of each speech sample as a classification problem, and the network is trained with the cross-entropy objective. Taking 16-bit quantized waveforms as an example, each speech sample can be represented as one of 65536 classes, and the training objective of WaveNet is to correctly classify each speech sample. To reduce the computational cost caused by such a huge number of classes, the waveform is usually first companded and re-quantized to a lower bit depth with a traditional companding technique (e.g., the μ-law algorithm), for example to 8 bits, so that each speech sample belongs to one of only 256 classes.
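The μ-law companding step mentioned above follows the standard formula f(x) = sign(x)·ln(1 + μ|x|)/ln(1 + μ) with μ = 255. A minimal NumPy sketch (not tied to any particular WaveNet implementation):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it to mu + 1 = 256 classes."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compand to [-1, 1]
    return np.rint((y + 1) / 2 * mu).astype(np.int64)         # map to {0, ..., 255}

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 100))  # a toy waveform in [-1, 1]
q = mu_law_encode(x)
x_hat = mu_law_decode(q)
```

Round-tripping a waveform through encode/decode keeps the error small, especially for low-amplitude samples, which is why μ-law companding loses less perceptual quality than uniform 8-bit quantization.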

Figure 1. Illustration of the WaveNet vocoder.
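As a complement to the architecture described above, the receptive-field growth of stacked dilated causal convolutions can be verified numerically. The sketch below is illustrative only (real WaveNet layers use learned filters and the gating of Eq. (2)): it feeds a unit impulse through 2-tap causal convolutions with dilations 1, 2, 4, 8, giving a receptive field of 1 + (1 + 2 + 4 + 8) = 16 samples.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """2-tap causal convolution: y[t] = w[0]*x[t-dilation] + w[1]*x[t],
    with zero padding on the left so the output keeps the same length."""
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:-dilation] + w[1] * padded[dilation:]

# Stack layers with exponentially increasing dilation, as in WaveNet.
x = np.zeros(32)
x[0] = 1.0                       # unit impulse at t = 0
y = x
for d in (1, 2, 4, 8):
    y = dilated_causal_conv(y, w=np.array([0.5, 0.5]), dilation=d)

receptive_field = np.flatnonzero(y).max() + 1   # impulse response length
```

Doubling the dilation at each layer is what lets the receptive field grow exponentially with depth while the number of parameters grows only linearly.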

2.2 Multi-Speaker WaveNet Vocoder and Speaker Adaptation

The original WaveNet vocoder is a speaker-dependent vocoder, i.e., it is trained with a large amount of speech data from a single speaker [9]. Since training a WaveNet vocoder requires a large amount of data, while collecting a large corpus from a single speaker is difficult and expensive, the multi-speaker WaveNet vocoder has been proposed, in which the speech data of multiple speakers are pooled in the training phase to train a single WaveNet vocoder [10]. When the multi-speaker WaveNet vocoder is applied to voice conversion, speaker adaptation is usually further performed towards the target speaker; that is, the data of the target speaker are used to fine-tune the multi-speaker WaveNet vocoder. Experiments have confirmed that, compared with the multi-speaker WaveNet vocoder without adaptation, this speaker adaptation strategy effectively improves the quality of the output speech and its similarity to the target speaker [14, 15, 16].

3. Voice Conversion Models

3.1 Voice Conversion Based on Variational Auto-Encoders

A variational auto-encoder (VAE) [17, 18] encodes an input speech frame x through an encoder (or encoding function) into a latent code z that represents the underlying phonetic content. The encoder is designed to be speaker-independent, so the extracted latent code is expected to contain only speaker-independent information, such as the phonetic content, and to be disentangled from the speaker information, which is instead represented by a speaker code y. After the latent code is combined with the speaker code, a decoder (or decoding function) reconstructs the speech frame of the designated speaker, which achieves the goal of voice conversion. The objective function of the VAE is formulated as follows. Since directly maximizing the likelihood is intractable, the VAE is trained by maximizing the variational lower bound of the log-likelihood, i.e., minimizing L_{vae} in Eq. (3):

\log p_\theta(x) \geq -L_{vae}(x) = -L_{recon}(x) - L_{KL}(x), (3)

L_{KL}(\phi; x) = D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z)), (4)

L_{recon}(\phi, \theta; x) = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z, y)], (5)

where \phi and \theta denote the parameters of the encoder and the decoder, respectively, and D_{KL}(\cdot \,\|\, \cdot) denotes the Kullback-Leibler (KL) divergence. The architecture of the VAE is illustrated in Figure 2, where x denotes the speech frame, z denotes the latent code of the frame, and y denotes the speaker code.

The model is trained as follows. In the training phase, the VAE learns to reconstruct the input speech frames: given the speaker code of the input frame, the encoder and the decoder are optimized to first decompose the input frame into the latent code and then reconstruct the frame from the latent code and the speaker code. When the model is trained on the speech data of multiple speakers, an encoder and a decoder shared by all speakers, as well as one speaker code per speaker, can be learned jointly. In the conversion phase, each speech frame x of the source speaker is encoded by the encoder into the latent code z, which is then combined with the speaker code y of the target speaker obtained in the training phase; the decoder then reconstructs the frame, so that the speech of the source speaker is converted into the speech of the target speaker. Since the VAE achieves conversion simply by combining the phonetic content z with the desired speaker code y, the source speech can be converted into the voice of any speaker seen in the training data by switching the speaker code. In contrast to conventional conversion methods, which require parallel corpora of the two speakers (i.e., recordings of the same sentences) and frame alignment between the paired utterances, the training of the VAE model learns the encoder, the decoder, and the speaker codes through self-reconstruction, and thus requires neither parallel corpora nor frame alignment. In other words, VAE-based conversion is a voice conversion technique applicable to non-parallel data conditions.
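For a diagonal-Gaussian encoder and a standard-normal prior, the KL term in Eq. (4) has the familiar closed form -½ Σ (1 + log σ² - μ² - σ²). The sketch below is a toy NumPy illustration under these common assumptions, not the authors' implementation; it evaluates both terms of Eq. (3) for one frame, using a squared-error reconstruction term as a stand-in for -log p_θ(x | z, y):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def vae_loss(x, mu, log_var, decode):
    """One-sample estimate of L_vae in Eq. (3): reconstruction + KL."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    x_hat = decode(z)
    recon = np.sum((x - x_hat) ** 2)              # stand-in for -log p(x|z,y)
    return recon + kl_to_standard_normal(mu, log_var)

x = rng.standard_normal(513)                      # a toy spectral frame
mu, log_var = np.zeros(64), np.zeros(64)          # toy encoder output
loss = vae_loss(x, mu, log_var, decode=lambda z: np.zeros(513))
```

In the full model, `decode` would additionally receive the speaker code y, which is exactly the hook that makes conversion possible at test time.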

Figure 2. Illustration of VAE-based voice conversion.

3.2 Voice Conversion with Generative Adversarial Networks

The core idea of the generative adversarial network (GAN) is to let a generator and a discriminator compete with each other, so that the output distribution of the generator is gradually improved [19]. During training, the generator tries to produce samples whose distribution is as close as possible to that of the real data, given its input features and model architecture, while the discriminator tries to judge whether an input sample was produced by the generator or drawn from the real data. In VAE-based voice conversion, the speech generated by the decoding function tends to be over-smoothed, which limits the quality of the converted speech; by augmenting the VAE with a GAN objective, the quality of the speech output by the VAE can be further improved [12].

Among the many variants of GAN, the Wasserstein GAN (W-GAN) is widely used for its training stability [20]. W-GAN is a GAN formulation that adopts the earth mover's distance (also known as the Wasserstein distance) as its objective. In [12], the authors measured the difference between the distribution of the converted speech and that of the real target speech with the earth mover's distance, and used the discriminator to sharpen the output speech of the VAE, in order to improve the quality of VAE-based voice conversion under non-parallel data conditions. In implementation, the objective is formulated with the Kantorovich-Rubinstein duality of the earth mover's distance:

W(p_{t^*}, p_{vc}) = \sup_{\|D\|_L \leq 1} \mathbb{E}_{x \sim p_{t^*}}[D(x)] - \mathbb{E}_{x \sim p_{vc}}[D(x)], (6)

where D is a critic function satisfying the 1-Lipschitz continuity constraint, playing the role of the discriminator in the GAN framework; p_{t^*} is the real distribution of the target speaker, and p_{vc} is the distribution of the converted speech. In practice, the critic function is parameterized by a deep neural network; the real distribution p_{t^*} can be estimated by sampling the speech of the target speaker, while p_{vc} is obtained from the output of the trained VAE, i.e., by sampling the converted speech. In the voice conversion application, the earth mover's distance can thus be written as:

\mathbb{E}_{x \sim p_{t^*}}[D_w(x)] - \mathbb{E}_{z \sim q_\phi(z \mid x_s)}[D_w(G_\theta(z, y_t))], (7)

where G_\theta is the decoding function of the VAE. As in the general GAN framework, the critic function D_w and the decoding function G_\theta (which plays the role of the generator in the GAN framework) are trained adversarially. It is worth noting that the critic is trained to distinguish the real target speech from the converted speech generated by the VAE decoding function, i.e., to assign higher scores to real speech and lower scores to generated speech; the decoding function, on the other hand, is trained so that the distribution of its generated speech matches that of the real target speech as closely as possible, i.e., to minimize the gap between the critic scores of the real speech and the converted speech. Combining the generative capability of the VAE with the adversarial framework of W-GAN to sharpen the converted speech of the VAE, the resulting framework is called VAW-GAN, illustrated in Figure 3, and its objective is:

L_{vawgan} = -D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z, y)]
+ \mathbb{E}_{x \sim p_{t^*}}[D_w(x)] - \mathbb{E}_{z \sim q_\phi(z \mid x_s)}[D_w(G_\theta(z, y_t))]. (8)
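Equations (6) and (7) are differences of two expectations, so in practice they are estimated on mini-batches. The toy NumPy sketch below uses a hypothetical fixed linear critic for illustration (the actual VAW-GAN critic is a deep network trained under a Lipschitz constraint) and estimates the critic objective for two Gaussian sample batches:

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(x, w, b):
    """A linear critic D_w(x) = w.x + b; a unit-norm w keeps it 1-Lipschitz."""
    return x @ w + b

def wgan_critic_objective(real, fake, w, b):
    """Mini-batch estimate of E[D(real)] - E[D(fake)] from Eq. (7)."""
    return critic(real, w, b).mean() - critic(fake, w, b).mean()

dim = 4
real = rng.normal(loc=1.0, size=(256, dim))     # samples standing in for p_t*
fake = rng.normal(loc=-1.0, size=(256, dim))    # samples standing in for p_vc
w = np.full(dim, 1 / np.sqrt(dim))              # unit-norm weights (1-Lipschitz)
obj = wgan_critic_objective(real, fake, w, b=0.0)
```

The critic is trained to maximize this objective, while the generator (the VAE decoding function) is trained to minimize it, which is exactly the adversarial game described above.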

Figure 3. Illustration of VAE-based voice conversion with a generative adversarial network (VAW-GAN).


3.3 Voice Conversion Based on Cross-Domain Variational Auto-Encoders

In VAE-based voice conversion, the premise is that the encoder can extract the speaker-independent phonetic content from the input speech. The motivation of voice conversion based on the cross-domain VAE (CDVAE) is that different kinds of spectral features extracted from the same speech frame, although of different characteristics, share the same underlying phonetic content, and exploiting them jointly should guide the encoders to extract more accurate speech representations [21]. We use the spectral envelope estimated by the STRAIGHT vocoder [6] (referred to as the STRAIGHT spectrum, SP) and the mel-cepstral coefficients (MCCs) [22] as the two different kinds of spectral features. As illustrated in Figure 4, CDVAE is a collection of auto-encoders: each kind of spectral feature has its own encoder and decoder. Besides the encoder-decoder pair of each feature being able to reconstruct its own input feature (paths (1) and (2) in Figure 4), each encoder must also extract a representation from which the decoder of the other feature can successfully reconstruct; likewise, each decoder must be able to reconstruct its feature from the representation encoded from the other feature (paths (3) and (4) in Figure 4). We further add a similarity criterion to the training objective, so that the representations extracted by the encoders of the different spectral features are close to each other (i.e., z_SP and z_MCC are encouraged to be similar).
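The four reconstruction paths and the latent-similarity criterion described above can be sketched with toy linear encoders and decoders (all shapes, weights, and the squared-error losses here are illustrative assumptions; the actual CDVAE in [21] uses deep networks trained with VAE objectives):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear encoders/decoders for the two feature domains (SP and MCC).
d_sp, d_mcc, d_z = 16, 8, 4
E_sp, D_sp = rng.normal(size=(d_z, d_sp)), rng.normal(size=(d_sp, d_z))
E_mcc, D_mcc = rng.normal(size=(d_z, d_mcc)), rng.normal(size=(d_mcc, d_z))

sp = rng.normal(size=d_sp)      # spectral envelope of one frame
mcc = rng.normal(size=d_mcc)    # MCCs extracted from the same frame

z_sp, z_mcc = E_sp @ sp, E_mcc @ mcc

losses = {
    "sp_from_sp":   np.sum((sp - D_sp @ z_sp) ** 2),     # path (1): within-domain
    "mcc_from_mcc": np.sum((mcc - D_mcc @ z_mcc) ** 2),  # path (2): within-domain
    "sp_from_mcc":  np.sum((sp - D_sp @ z_mcc) ** 2),    # path (3): cross-domain
    "mcc_from_sp":  np.sum((mcc - D_mcc @ z_sp) ** 2),   # path (4): cross-domain
    "latent_sim":   np.sum((z_sp - z_mcc) ** 2),         # z_SP ~ z_MCC criterion
}
total_loss = sum(losses.values())
```

Minimizing the cross-domain terms together with the latent-similarity term is what forces the two encoders toward a shared, feature-independent representation of the phonetic content.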

Figure 4. Illustration of cross-domain VAE-based voice conversion.

4. Experiments

4.1 Experimental Settings

We used the corpus released by the Voice Conversion Challenge 2018 (VCC2018), which contains the recordings of 12 professional speakers; each speaker provides 81 training utterances and 35 test utterances, sampled at 22050 Hz [13]. We used the WORLD vocoder [7] to extract the speech parameters. The speech waveform was first divided into frames with a 25-ms window and a 5-ms shift. Then, for each frame, we extracted the speech parameters, including a 513-dimensional spectral envelope, a 513-dimensional aperiodicity spectrum, and the F0; 35-dimensional mel-cepstral coefficients (MCCs, including the 0-th coefficient corresponding to the frame energy) were further extracted from the spectral envelope.

For the WaveNet vocoder, we followed Hayashi et al. [10] to generate the speech waveforms; specifically, we used their released open-source WaveNet vocoder¹, and all the model and training hyper-parameters were kept the same as in that implementation. We used all the training data of VCC2018, i.e., 972 utterances of about 54 minutes in total, to train the multi-speaker WaveNet vocoder. The auxiliary features of the WaveNet vocoder (i.e., the speech parameters fed into the WaveNet vocoder) include the 34-dimensional MCCs excluding the 0-th coefficient, the 2-dimensional coded aperiodicity, the F0, and the voiced/unvoiced flag. Before these auxiliary features are fed into the WaveNet vocoder, the time resolution adjustment technique [9, 10] is applied; that is, the frame-level auxiliary features are duplicated so that they have the same time resolution (i.e., the sampling rate) as the speech waveform to be generated. The multi-speaker WaveNet vocoder was trained for 200000 iterations, and speaker adaptation was then performed by further fine-tuning the model with the data of the target speaker for 5000 iterations.
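The time resolution adjustment amounts to repeating each frame-level auxiliary feature vector once per waveform sample within its frame shift. A minimal NumPy sketch (the frame count and feature dimension are arbitrary):

```python
import numpy as np

def adjust_time_resolution(features, sample_rate=22050, frame_shift_ms=5):
    """Repeat frame-level auxiliary features so that there is one feature
    vector per waveform sample, as required at the WaveNet vocoder input."""
    hop = int(sample_rate * frame_shift_ms / 1000)   # samples per frame shift
    return np.repeat(features, hop, axis=0)

frames = np.arange(10, dtype=float).reshape(10, 1)   # 10 frames, 1-dim feature
upsampled = adjust_time_resolution(frames)
```

With a 22050 Hz sampling rate and a 5-ms shift, each frame is repeated 110 times, so 10 frames yield 1100 sample-level conditioning vectors.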

For the voice conversion models, we used the open-source implementations of the VAE and VAW-GAN voice conversion models², and all the model and training hyper-parameters were kept the same as in those implementations. Our CDVAE implementation has not been released, but since it is conceptually similar to the VAE model, we used the same VAE architecture, and the training hyper-parameters also followed the settings of the open-source implementation. We likewise used all the training data of VCC2018. We normalized each frame of the spectral envelope to unit sum, and the resulting frame energy was excluded from training and conversion. In the conversion phase, the spectral features were converted by the voice conversion model, and the energy was restored afterwards; together with the unmodified aperiodicity spectrum and the F0 converted by a linear transformation in the log domain, the converted speech was then synthesized with either the WORLD vocoder or the WaveNet vocoder. For details of the VAE, VAW-GAN, and CDVAE models, such as the network architectures and the hyper-parameters, please refer to [12, 18, 21].
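The two feature-level operations described above, unit-sum normalization of the spectral envelope and the log-domain linear conversion of F0, can be sketched as follows (a NumPy illustration; the mean-variance form of the linear transform and the handling of unvoiced frames are common conventions assumed here rather than details stated in the text):

```python
import numpy as np

def normalize_unit_sum(spectral_envelope):
    """Per-frame unit-sum normalization; also returns the per-frame energy
    so that it can be restored after conversion."""
    energy = spectral_envelope.sum(axis=1, keepdims=True)
    return spectral_envelope / energy, energy

def convert_f0(f0_src, stats_src, stats_tgt):
    """Linear transformation of log-F0 matching source statistics to target
    statistics; unvoiced frames (f0 == 0) are left untouched."""
    mu_s, sigma_s = stats_src
    mu_t, sigma_t = stats_tgt
    voiced = f0_src > 0
    out = np.zeros_like(f0_src)
    out[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mu_s) / sigma_s * sigma_t + mu_t)
    return out

sp = np.abs(np.random.default_rng(3).normal(size=(4, 513))) + 1e-3
sp_norm, energy = normalize_unit_sum(sp)
f0 = np.array([0.0, 120.0, 130.0, 125.0])
f0_conv = convert_f0(f0, stats_src=(np.log(125.0), 0.1),
                     stats_tgt=(np.log(220.0), 0.1))
```

Multiplying `sp_norm` by the stored `energy` exactly restores the original envelope, which is how the energy is put back after conversion.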

4.2 Experimental Results and Discussion

We chose the SF1-to-TF1 conversion pair of VCC2018, randomly selected 10 utterances from the test data, converted them with the three voice conversion models (VAE, VAW-GAN, and CDVAE), and then synthesized the converted speech from the output features with the WORLD vocoder and the WaveNet vocoder, respectively.

1 https://github.com/kan-bayashi/PytorchWaveNetVocoder
2 https://github.com/JeremyCCHsu/vae-npvc


4.2.1 Spectrograms

Figure 5 shows the spectrograms of the converted speech. From Figure 5, we can observe that, in the spectrograms of the speech generated by the WaveNet vocoder, the structure of the high-frequency band is closer to that of real speech (i.e., relatively random); however, the structure of the low-frequency band is rather blurred and noisy, and lacks the clear formant structure that real speech possesses. As a result, the speech generated by the WaveNet vocoder sounds more natural, but contains more noise.

(a) VAE, WORLD (b) VAE, WaveNet

(c) VAW-GAN, WORLD (d) VAW-GAN, WaveNet

(e) CDVAE, WORLD (f) CDVAE, WaveNet

Figure 5. Spectrograms of converted speech generated with different vocoders.

4.2.2 Subjective Evaluation

We recruited 9 listeners to conduct subjective tests on the naturalness of the converted speech and its similarity to the target speaker. Table 1 shows the mean opinion scores (MOS) of the naturalness of the converted speech rated by the listeners (the natural speech of the target speaker is included as a reference). From the mean scores and the confidence intervals, we can observe that, compared with the WORLD vocoder, the WaveNet vocoder brings a significant improvement to the VAE voice conversion model and a slight but insignificant improvement to the CDVAE model, whereas for VAW-GAN the WORLD vocoder is slightly but insignificantly better than the WaveNet vocoder; these results are similar to those reported in [23]. We also observed that the WORLD vocoder tends to produce stable but relatively dull speech, while the WaveNet vocoder can indeed produce speech more lively and natural than that of the WORLD vocoder, but is sometimes unstable, producing hoarse sounds and some noise. We conjecture two possible causes. First, the waveform trajectories of the speakers in the VCC2018 corpus differ considerably from one another, which may make it difficult for the WaveNet vocoder to learn a consistent trajectory. Second, the WaveNet vocoder is trained (in both the multi-speaker training and the speaker adaptation phases) on the speech parameters extracted from real speech, but receives the converted speech parameters when it is actually applied to voice conversion, so there is a mismatch between the two phases. If the mismatch between training and conversion can be reduced in the future, the WaveNet vocoder is expected to generate speech of higher naturalness and quality.
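A mean opinion score with a 95% confidence interval, as reported in Table 1, can be computed from listener ratings as follows (a sketch with made-up ratings; the normal-approximation interval of 1.96 × standard error is an assumption, since the exact interval construction is not specified in the text):

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of the ~95% confidence
    interval under a normal approximation (z * standard error of the mean)."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    sem = r.std(ddof=1) / np.sqrt(len(r))
    return mean, z * sem

# Made-up 5-point ratings from 9 listeners for one system.
ratings = [4, 3, 4, 5, 3, 4, 4, 3, 4]
mos, half_width = mos_with_ci(ratings)
```

Two systems are then called significantly different when their intervals do not overlap, which is the reading applied to Table 1 above.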

Figure 6 shows the results of the ABX test on the similarity between the converted speech and the target speech, rated by the listeners. From Figure 6, we can clearly see that, compared with the WORLD vocoder, synthesizing the converted speech with the WaveNet vocoder effectively improves the similarity to the target speaker for all three voice conversion models. Combining the results of Table 1 and Figure 6, we can see that the naturalness of the speech generated by the WaveNet vocoder is comparable to that of the WORLD vocoder, while the similarity to the target speaker is notably better. It is worth noting that, when the WaveNet vocoder is combined with VAW-GAN, the naturalness of the generated speech is slightly worse than with the WORLD vocoder, and the improvement in the similarity to the target speaker is also smaller than that obtained with VAE and CDVAE; the reason remains to be investigated in the future.

Table 1. Mean opinion scores (MOS) of the naturalness of the converted speech; error bars denote the 95% confidence intervals.


Figure 6. Results of the ABX test on the similarity to the target speaker.

5. Conclusions

In this paper, we gave a preliminary introduction to the WaveNet vocoder and applied it to several voice conversion models recently developed by domestic research teams. According to the results of the subjective tests, the WaveNet vocoder indeed shows advantages over the traditional vocoder when combined with the existing voice conversion models. The most evident advantage is the improvement in the similarity between the converted speech and the speech of the target speaker, while in terms of naturalness the WaveNet vocoder performs comparably to the traditional WORLD vocoder. In the future, we plan to further improve the naturalness of the speech synthesized by the WaveNet vocoder, for example by enlarging the training data, investigating more effective speaker adaptation strategies, and reducing the mismatch of the input speech parameters of the WaveNet vocoder between the training phase and the actual conversion phase. We also hope that domestic speech researchers can take our attempt as a reference and try to adopt the WaveNet vocoder in related applications, so as to improve the quality of the generated speech.

References

[1] W. Fujitsuru, H. Sekimoto, T. Toda, H. Saruwatari, and K. Shikano, "Bandwidth Extension of Cellular Phone Speech Based on Maximum Likelihood Estimation with GMM," Proc. NCSP, 2008.

[2] C. C. Hsia, C. H. Wu, and J. Q. Wu, "Conversion Function Clustering and Selection Using Linguistic and Spectral Information for Emotional Voice Conversion," IEEE Trans. on Computers, 56(9), pp. 1225–1233, September 2007.

[3] H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano, "Alaryngeal Speech Enhancement Based on One-to-many Eigenvoice Conversion," IEEE/ACM Trans. on Audio, Speech, and Language Processing, 22(1), pp. 172–183, January 2014.

[4] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar 1998.

[5] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. America, vol. 50, no. 2, pp. 637–655, Mar. 1971.

[6] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveign, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.

[7] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.

[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.

[9] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," Proc. INTERSPEECH, pp. 1118–1122, 2017.

[10] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," Proc. ASRU, 2017.

[11] J. Chou, C. Yeh, H. Lee, and L. Lee, "Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations," Proc. INTERSPEECH, pp. 501–505, 2018.

[12] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks," in Proc. Interspeech, 2017, pp. 3364–3368.

[13] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey, 2018, pp. 195–202.

[14] L. Liu, Z. Ling, Y. Jiang, M. Zhou, and L. Dai, "WaveNet Vocoder with Limited Training Data for Voice Conversion," Proc. INTERSPEECH, pp. 1983–1987, 2018.

[15] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "NU voice conversion system for the voice conversion challenge 2018," in Proc. Odyssey, 2018, pp. 219–226.

[16] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, "The NU Non-Parallel Voice Conversion System for the Voice Conversion Challenge 2018," in Proc. Odyssey, 2018, pp. 211–218.

[17] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," CoRR, vol. abs/1312.6114, 2013.

[18] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. APSIPA ASC, 2016, pp. 1–6.

[19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.

[20] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," CoRR, vol. abs/1701.07875, 2017.

[21] W.-C. Huang, H.-T. Hwang, Y.-H. Peng, Y. Tsao, and H.-M. Wang, "Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders," in Proc. ISCSLP, 2018.

[22] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. ICASSP, 1992.

[23] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, "Statistical voice conversion with WaveNet-based waveform generation," Proc. INTERSPEECH, pp. 1138–1142, 2017.
