WaveNet Vocoder and its Applications in Voice Conversion

Wen-Chin Huang*, Chen-Chou Lo*, Hsin-Te Hwang*, Yu Tsao**, Hsin-Min Wang*

*Institute of Information Science, Academia Sinica
**Research Center for Information Technology Innovation, Academia Sinica
The 2018 Conference on Computational Linguistics and Speech Processing ROCLING 2018, pp. 96-110 ©The Association for Computational Linguistics and Chinese Language Processing
Abstract
Most voice conversion models rely on vocoders based on the source-filter model to extract
speech parameters and synthesize speech. However, the naturalness and similarity of the converted speech are limited by the assumptions and constraints imposed by traditional vocoders. In the field of deep learning, a network structure called WaveNet is one of the state-of-the-art techniques in speech synthesis, which is capable of generating speech samples of
extremely high quality compared with past methods. One of the extensions of WaveNet is the
WaveNet vocoder. Its ability to synthesize speech of quality higher than traditional vocoders
has made it gradually adopted by several foreign voice conversion research teams. In this work,
we study the combination of the WaveNet vocoder with the voice conversion models recently
developed by domestic research teams, in order to evaluate the potential of applying the
WaveNet vocoder to these voice conversion models and to introduce the WaveNet vocoder to
the domestic speech processing research community. In the experiments, we compared the converted speech generated by three voice conversion models using the traditional WORLD vocoder and the WaveNet vocoder, respectively. The compared voice conversion models include 1) the variational auto-encoder (VAE), 2) the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), and 3) the cross-domain variational auto-encoder (CDVAE).
Experimental results show that, using the WaveNet vocoder, the similarity between the converted speech generated by all three models and the target speech is significantly improved. As for naturalness, only the VAE benefits from the WaveNet vocoder.
Keywords: WaveNet, Vocoder, Voice Conversion, Variational Auto-Encoder.
1. Introduction

Voice conversion aims to convert a source speech signal into a target speech signal while preserving the linguistic content of the source speech. Applications of voice conversion techniques include converting narrowband speech signals of lower audio quality into wideband speech signals of better audio quality [1], post-processing in text-to-speech (TTS) systems [2], and converting the speech of patients with articulation disorders into normal speech [3]. Among them, the most widely studied application is speaker voice conversion (hereafter, voice conversion refers to speaker voice conversion), whose goal is to convert the speech of a source speaker into the speech of a target speaker while preserving the linguistic content of the source speech [4].
Traditional voice conversion methods are built on the vocoder framework. To perform voice conversion, the speech signal is first analyzed (i.e., parameterized) to obtain the parameters required by the vocoder, such as the spectrum, prosody, and excitation (hereafter, speech parameters). Then, the speech parameters are converted, and the converted speech parameters are sent back to the vocoder to reconstruct the speech signal. In short, a complete voice conversion system can be decomposed into three stages: analysis, conversion, and synthesis. The quality of the converted speech and its similarity to the speech of the target speaker are therefore closely related not only to the conversion method, but also to the analysis and reconstruction performed by the vocoder. The design of vocoders is mostly based on the theory of the source-filter model, including the classic linear predictive coder [5] and the high-quality vocoders widely used in modern voice conversion and speech synthesis, such as the STRAIGHT [6] and WORLD [7] vocoders. However, for vocoders designed on the basis of the source-filter model, due to the numerous assumptions and simplifications in the theory, the quality and naturalness of the generated speech remain limited.
In recent years, techniques based on deep neural networks (DNN) have been widely applied to speech generation. Among them, WaveNet [8] is one of the state-of-the-art speech generation techniques: applied to TTS, the network can, given linguistic and acoustic parameters, directly generate high-quality speech waveforms that sound very close to real speech [8]. Subsequently, WaveNet-based vocoder techniques were also proposed [9, 10]. The WaveNet vocoder is a data-driven vocoder, i.e., a vocoder learned from a large amount of speech data. One notable advantage is that it does not rely on the analysis and reconstruction processes of traditional vocoders. Specifically, the speech parameters extracted by a traditional vocoder are given as the input of the WaveNet vocoder, while the speech waveform is directly decoded from the output of the WaveNet vocoder. Since the WaveNet vocoder directly generates the speech waveform (including both spectral and phase information), it avoids the errors caused by traditional vocoders, which use signal processing techniques to process the spectrum and phase separately and then combine them to reconstruct the waveform; it is thus able to generate speech with higher naturalness than traditional vocoders.
Although several recently proposed voice conversion techniques achieve promising performance [11, 12], most of them still use traditional vocoders, so the quality of the converted speech remains limited. Since the introduction of the WaveNet vocoder, many foreign voice conversion teams have gradually adopted it; for instance, in this year's Voice Conversion Challenge 2018 (VCC2018) [13], the best-performing teams used WaveNet vocoders in their voice conversion systems [14, 15], greatly improving the naturalness of the converted speech and its similarity to the target speaker. This paper first gives an introduction to the WaveNet vocoder, then analyzes and discusses the results of combining it with voice conversion techniques recently developed by domestic research teams, so as to introduce the WaveNet vocoder to the domestic voice conversion community and other speech generation researchers for their reference.
The rest of this paper is organized as follows. Section 2 introduces the WaveNet vocoder. Section 3 introduces several voice conversion techniques developed by domestic research teams. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this paper.
2. The WaveNet Vocoder

2.1 Network Architecture
WaveNet [8] is a deep autoregressive network that factorizes the probability of a waveform into a product of conditional distributions, and can generate high-quality speech waveforms sample by sample:

p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_{t-R}, \mathbf{h}),  (1)

where T denotes the total number of samples to be generated, R denotes the size of the receptive field, and x_t denotes the waveform sample to be generated at time t, represented as a one-hot vector over a finite set of discrete amplitude values (e.g., for 16-bit quantization, there are 2^16 = 65536 amplitude classes); \mathbf{h} denotes the auxiliary features. When WaveNet is used as a vocoder (e.g., [9, 10]), the auxiliary features are the speech parameters extracted by a traditional vocoder such as STRAIGHT [6] or WORLD [7], namely the spectral features, the fundamental frequency (F0), and the aperiodicity of the mixed-excitation signal. Given the auxiliary features, WaveNet can generate the desired speech waveform sample by sample according to Eq. (1). As shown in Figure 1, a complete WaveNet vocoder is composed of several residual blocks; each residual block contains a 2×1 dilated causal convolution, a gated activation function, and a 1×1 convolution. The dilated causal convolutions enlarge the receptive field of the network, and the gated activation is formulated as:

\tanh(W_{f,k} \ast \mathbf{x} + V_{f,k} \ast \mathbf{h}) \odot \sigma(W_{g,k} \ast \mathbf{x} + V_{g,k} \ast \mathbf{h}),  (2)

where \mathbf{x} is the input of the dilated causal convolution, W and V are learnable convolution filters, \ast denotes the convolution operation, \odot denotes element-wise multiplication, and \sigma(\cdot) denotes the sigmoid function. Notably, WaveNet treats the prediction of each waveform sample as a classification problem and trains the network with the cross-entropy loss. For a 16-bit quantized waveform, each sample can belong to one of 65536 classes, and the training target of WaveNet is to correctly classify each sample. To reduce the computational cost and the classification difficulty caused by too many classes, the input waveform is usually first companded with a traditional quantization scheme (e.g., the μ-law algorithm) into a representation with fewer bits (e.g., 8 bits, i.e., 256 classes per sample).
Figure 1. Illustration of the WaveNet vocoder.
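The μ-law companding mentioned above can be sketched as follows; this is a minimal NumPy illustration (function names are ours, not taken from any released WaveNet code):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it to mu+1 classes."""
    # mu-law companding: logarithmic compression of the amplitude axis
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] to integer class indices 0..mu
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

wav = np.sin(np.linspace(0, 8 * np.pi, 1000))  # toy waveform in [-1, 1]
q = mu_law_encode(wav)                         # 256 classes per sample
rec = mu_law_decode(q)
print(q.min(), q.max())                        # indices stay within 0..255
print(np.abs(wav - rec).max())                 # small reconstruction error
```

With 256 classes the reconstruction error is largest near the waveform peaks, where the companding curve is steepest; this is the trade-off WaveNet accepts to keep the output softmax small.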
2.2 Multi-Speaker WaveNet Vocoder and Speaker Adaptation

The original WaveNet vocoder is a speaker-dependent vocoder, i.e., it uses the speech data of a single speaker as the training set [9]. Since training a WaveNet vocoder requires a large amount of data, and collecting a large amount of speech data from a single speaker is costly and impractical, the multi-speaker WaveNet vocoder was proposed, which uses the speech data of multiple speakers in the training phase [10]. When the multi-speaker WaveNet vocoder is applied to voice conversion, speaker adaptation toward the target speaker is usually performed, i.e., the multi-speaker WaveNet vocoder is further fine-tuned with the speech data of the target speaker. Experiments have confirmed that, compared with the multi-speaker WaveNet vocoder without adaptation, speaker adaptation can effectively improve the quality of the output speech and its similarity to the target speaker [14, 15, 16].
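The pretrain-then-adapt schedule can be illustrated on a toy problem. The sketch below is only an analogue: a linear model stands in for the vocoder, and `make_speaker_data`, the biases, and the iteration counts are invented; it merely shows why fine-tuning on the target speaker reduces the target-domain error:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_speaker_data(bias, n=200):
    """Toy 'speaker': a fixed feature-to-sample mapping plus a speaker-specific bias."""
    x = rng.normal(size=(n, 4))
    w_true = np.array([0.5, -1.0, 0.3, 0.8])
    y = x @ w_true + bias + 0.01 * rng.normal(size=n)
    return x, y

def train(w, b, x, y, steps, lr=0.1):
    """Plain gradient descent on the mean squared error."""
    for _ in range(steps):
        err = x @ w + b - y
        w -= lr * x.T @ err / len(y)
        b -= lr * err.mean()
    return w, b

# stage 1: multi-speaker pretraining on pooled data from several speakers
pool_x, pool_y = zip(*[make_speaker_data(bias) for bias in (-1.0, 0.0, 1.0)])
w, b = train(np.zeros(4), 0.0, np.vstack(pool_x), np.hstack(pool_y), steps=300)

# stage 2: adaptation, fine-tuning the pretrained weights on the target speaker only
tgt_x, tgt_y = make_speaker_data(bias=0.7)
loss_before = np.mean((tgt_x @ w + b - tgt_y) ** 2)
w, b = train(w, b, tgt_x, tgt_y, steps=300)
loss_after = np.mean((tgt_x @ w + b - tgt_y) ** 2)
print(loss_before, loss_after)  # adaptation lowers the target-speaker error
```

The pooled model captures what the "speakers" share, and the short second stage only has to absorb the target-specific offset, mirroring why adaptation needs far fewer iterations than pretraining.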
3. Voice Conversion Models

3.1 VAE-based Voice Conversion
The variational auto-encoder (VAE) [17, 18] encodes an input speech frame x through an encoder (encoding function) into a latent code z that represents the content of the frame. The encoder is designed to be speaker-independent, in the hope that the extracted latent code contains only speaker-independent information, such as the phonetic content, and no speaker information. The latent code is then combined with a speaker representation y and passed through a decoder (decoding function) to reconstruct a speech frame of the speaker specified by y, thereby achieving the goal of voice conversion. In short, the VAE network and its model parameters are trained with the following objective:

\log p_\theta(x) \ge -J_{vae}(x) = -J_{obs}(x) - J_{lat}(x),  (3)

J_{lat}(\phi; x) = D_{KL}(q_\phi(z \mid x) \,\|\, p(z)),  (4)

J_{obs}(\phi, \theta; x) = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z, y)],  (5)

where \phi and \theta denote the parameters of the encoder and the decoder, respectively, and D_{KL}(\cdot \,\|\, \cdot) denotes the Kullback-Leibler divergence. The VAE framework is illustrated in Figure 2, where x denotes a speech frame, z the latent code of that frame, and y the speaker representation. In the training phase, the VAE decomposes each input frame into its content code and reconstructs it, while at the same time optimizing the speaker representations; applying this self-encoding-and-decoding process to the speech data of many speakers yields an encoder and a decoder shared by all speakers, together with one speaker representation per speaker. In the conversion phase, each frame x of the source speaker is encoded by the encoder into the latent code z, which is then combined with the target speaker representation y learned in the training phase and passed through the decoder, thereby converting the source speech into the speech of the target speaker. Since the VAE performs conversion by combining the content code z with a chosen speaker representation y, the encoder and decoder learn the distribution characteristics shared across speakers, and new frames are formed simply by plugging in a different speaker representation. In contrast to traditional conversion methods, which require parallel corpora from the source and target speakers (i.e., both speakers uttering the same sentences) and frame alignment between the paired utterances, the VAE framework is trained through self-reconstruction and therefore requires neither parallel corpora nor frame alignment. In other words, the VAE is a conversion technique applicable under non-parallel data conditions.
Figure 2. Illustration of VAE-based voice conversion.
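To make Eqs. (4)-(5) concrete, the sketch below evaluates both terms under the common Gaussian parameterization (diagonal-Gaussian posterior, standard-normal prior, unit-variance Gaussian decoder); this is a generic VAE-loss sketch, not the code of [18]:

```python
import numpy as np

def vae_losses(x, x_recon, mu, log_var):
    """Compute the two terms of the VAE objective for one batch.

    mu, log_var parameterize the diagonal-Gaussian posterior q(z|x);
    the prior p(z) is the standard normal N(0, I).
    """
    # J_lat: KL(q(z|x) || N(0, I)), in closed form for diagonal Gaussians
    j_lat = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    # J_obs: negative log-likelihood; with a unit-variance Gaussian decoder
    # it reduces to a squared reconstruction error (up to a constant)
    j_obs = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
    return j_lat.mean(), j_obs.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # a batch of eight 16-dimensional "frames"
mu = np.zeros((8, 4))          # posterior identical to the prior ...
log_var = np.zeros((8, 4))
j_lat, j_obs = vae_losses(x, x, mu, log_var)
print(j_lat, j_obs)            # both terms vanish for a perfect match
```

During training, the two terms pull in opposite directions: J_lat keeps the codes close to the prior, while J_obs rewards faithful reconstruction through the decoder.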
3.2 Voice Conversion with Generative Adversarial Networks

The idea of the generative adversarial network (GAN) is to let a generator model and a discriminator model compete with each other, so that the distribution of the generator's outputs is improved [19]. During training, the generator tries, given its input information and model architecture, to produce outputs whose distribution is close to that of the real data, while the discriminator tries to judge whether a given sample was produced by the generator or drawn from the real data. In VAE-based voice conversion, the speech decoded by the decoding function often suffers from over-smoothing, which degrades the quality of the converted speech; by incorporating the GAN objective, the quality of the speech output by the VAE can be further improved [12].

Among the many variants of GAN, the Wasserstein GAN (W-GAN) is widely used because of its training stability [20]. The W-GAN is a GAN that adopts the earth mover's distance (also known as the Wasserstein distance) as its objective. In [12], the authors measure the deficiency of the conversion function by the earth mover's distance between the distribution of the converted speech and the real distribution, and use the discriminator to improve the quality of the speech output by the VAE. For voice conversion under non-parallel data conditions, the objective is formulated with the dual form of the earth mover's distance (the Kantorovich-Rubinstein duality):

W(p_t^*, p_{t|s}) = \sup_{\|D\|_L \le 1} \mathbb{E}_{x \sim p_t^*}[D(x)] - \mathbb{E}_{x \sim p_{t|s}}[D(x)],  (6)

where D is a critic function satisfying 1-Lipschitz continuity, which corresponds to the discriminator in the GAN framework; p_t^* denotes the real data distribution of the target speaker, and p_{t|s} denotes the distribution of the converted speech. A deep neural network can be used to approximate this critic function; the real distribution p_t^* is sampled from the speech of the target speaker, while p_{t|s} is obtained from the VAE described above, i.e., sampled from its converted speech output. In the voice conversion application, the earth mover's distance can therefore be written as:

\mathbb{E}_{x \sim p_t^*}[D_w(x)] - \mathbb{E}_{z \sim q_\phi(z \mid x_s)}[D_w(f_\theta(z, y_t))],  (7)

where f_\theta denotes the decoding function of the VAE. As in the GAN framework described above, the objectives of the critic function D_w and the decoding function f_\theta (which corresponds to the generator in the GAN framework) are adversarial. It is worth noting that this architecture asks the critic function to discriminate between the real target speech and the converted speech generated by the VAE decoding function, hoping to assign high scores to the real speech and low scores to the generated speech. The goal of the decoding function, on the contrary, is to make the distribution of the decoded speech as close as possible to the distribution of the real target speech, thereby minimizing the score gap the critic assigns to the real and the converted speech. The conversion model that combines the generative characteristics of the VAE with the adversarial W-GAN architecture to improve the quality of the VAE-converted speech is called VAW-GAN, as shown in Figure 3. Its objective is:

J_{vawgan} = -D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z, y)] + \mathbb{E}_{x \sim p_t^*}[D_w(x)] - \mathbb{E}_{z \sim q_\phi(z \mid x_s)}[D_w(f_\theta(z, y_t))].  (8)
Figure 3. Illustration of VAE-based voice conversion with a generative adversarial network (VAW-GAN).
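In practice, the dual form in Eqs. (6)-(7) is estimated from samples: the critic's mean score on real target data minus its mean score on converted data. The sketch below uses an invented, fixed linear critic (unit-norm weights, hence 1-Lipschitz) on toy 2-D data, rather than the trained network of [12]:

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    """Linear critic D_w(x) = x . w; it is 1-Lipschitz when ||w|| <= 1."""
    return x @ w

w = np.array([1.0, 0.0])                           # unit norm, hence 1-Lipschitz
real = rng.normal(loc=(2.0, 0.0), size=(1000, 2))  # samples standing in for p_t^*
fake = rng.normal(loc=(0.0, 0.0), size=(1000, 2))  # samples standing in for p_{t|s}

# empirical E_real[D] - E_fake[D]: a lower bound on W(p_t^*, p_{t|s})
w_estimate = critic(real, w).mean() - critic(fake, w).mean()
print(w_estimate)  # close to the distance between the two means, 2.0
```

In VAW-GAN the critic's weights are themselves trained to maximize this gap (subject to the Lipschitz constraint), while the decoder is trained to shrink it.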
3.3 Cross-Domain VAE-based Voice Conversion

In VAE-based voice conversion, an ideal encoder extracts speech content unrelated to the speaker characteristics. The motivation of cross-domain VAE (CDVAE) based voice conversion is that spectral feature parameters of different properties, extracted from the same observed speech frame, can be exploited to let the encoders extract more meaningful content codes [21]. We use the spectral envelope obtained by STRAIGHT analysis [6] (the STRAIGHT spectrum, SP) and the mel-cepstral coefficients (MCCs) [22] as the two different spectral feature parameters. As shown in Figure 4, the CDVAE is a collection of VAEs: each feature type has its own encoder and decoder. Between the encoder-decoder pairs of the two features, not only must each VAE faithfully reconstruct its own input feature (paths (1) and (2) in Figure 4), each encoder must also connect to the decoder of the other feature and produce a representation from which that decoder can reconstruct well; symmetrically, each decoder must be able to reconstruct its feature from the representation produced by the other encoder (paths (3) and (4) in Figure 4). We further add a similarity constraint to the objective, forcing the representations extracted by the encoders of the two features to be close to each other (i.e., z_SP and z_MCC should be similar).
Figure 4. Illustration of cross-domain VAE-based voice conversion.
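The four reconstruction paths and the latent-similarity constraint can be laid out schematically with placeholder linear encoders and decoders (the dimensions and weights here are invented, and plain squared errors replace the full VAE objectives of [21]; only the structure is faithful):

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder linear encoders/decoders for the two feature domains (SP, MCC)
enc = {"sp": rng.normal(size=(16, 4)), "mcc": rng.normal(size=(12, 4))}
dec = {"sp": rng.normal(size=(4, 16)), "mcc": rng.normal(size=(4, 12))}

sp = rng.normal(size=(1, 16))   # an SP frame
mcc = rng.normal(size=(1, 12))  # the MCC frame of the same observation

z_sp, z_mcc = sp @ enc["sp"], mcc @ enc["mcc"]

losses = {
    # paths (1) and (2): within-domain reconstruction
    "sp->sp":   np.mean((z_sp @ dec["sp"] - sp) ** 2),
    "mcc->mcc": np.mean((z_mcc @ dec["mcc"] - mcc) ** 2),
    # paths (3) and (4): cross-domain reconstruction
    "sp->mcc":  np.mean((z_sp @ dec["mcc"] - mcc) ** 2),
    "mcc->sp":  np.mean((z_mcc @ dec["sp"] - sp) ** 2),
    # similarity constraint: the two codes should agree
    "latent":   np.mean((z_sp - z_mcc) ** 2),
}
total = sum(losses.values())  # in training, all terms are minimized jointly
print(losses)
```

Minimizing the cross-domain and latent terms jointly is what forces the two encoders toward a shared, feature-agnostic content code.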
4. Experiments

4.1 Experimental Setup
We used the corpus released by the Voice Conversion Challenge 2018 (VCC2018) competition, which contains the speech of 12 professional speakers; each speaker recorded 81 training utterances and 35 testing utterances, with a sampling rate of 22050 Hz [13]. We used the WORLD vocoder [7] to extract the speech parameters. The speech signal was first segmented into frames of 25 ms with a 5 ms frame shift. For each frame, we extracted speech parameters including a 513-dimensional spectral envelope, a 513-dimensional aperiodicity signal, and the F0; 35-dimensional mel-cepstral coefficients (including the 0th coefficient, i.e., the frame energy) were further extracted from the spectral envelope.

For the WaveNet vocoder, we adopted the WaveNet vocoder proposed by Hayashi et al. [10] and implemented it with their open-source project¹; all model and training hyperparameters follow the settings of that project. We used all the training data in VCC2018, 972 utterances in total with a total length of about 54 minutes, to train the multi-speaker WaveNet vocoder. The auxiliary features of the WaveNet vocoder (i.e., the speech parameters input to the WaveNet vocoder) include the 34-dimensional mel-cepstral coefficients excluding the 0th coefficient, the aperiodicity signal compressed to 2 dimensions, the F0, and a voiced/unvoiced flag. To apply these auxiliary features to the WaveNet vocoder, we adopted the time resolution adjustment technique [9, 10], which simply duplicates the frame-level auxiliary features so that they have the same time resolution (i.e., sampling rate) as the speech waveform to be generated. The multi-speaker WaveNet vocoder was trained for 200000 iterations, and speaker adaptation was then performed by fine-tuning the model on the data of the target speaker for another 5000 iterations.

For the voice conversion models, we used the publicly released open-source project of the VAE and VAW-GAN models²; all model and training hyperparameters follow the settings of that project. Our CDVAE implementation has not been released, but since its concept is similar to the VAEs above, we used two identical VAE models whose training parameters also follow the settings of the open-source project. We used all the training data of VCC2018 for training. The spectral envelopes were normalized to unit sum, and the frame energy was taken out before the training and conversion phases. In the conversion phase, the spectral features are converted by the voice conversion model; the converted spectral features with the energy restored, together with the unmodified aperiodicity signal and the F0 converted by a linear transformation in the log domain, are then fed into the WORLD vocoder or the WaveNet vocoder to synthesize the speech. Note that VAE, VAW-GAN, and CDVAE all convert the speech parameters frame by frame; the detailed network architectures and parameters can be found in [12, 18, 21].
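The time-resolution adjustment above amounts to repeating each frame-level feature vector over its frame shift so that the conditioning features align one-to-one with the waveform samples; with a 5 ms shift at 22050 Hz, each frame covers 110 samples. A minimal sketch, assuming simple frame duplication as in [9, 10] (the function name is ours):

```python
import numpy as np

def adjust_time_resolution(feats, fs=22050, shift_ms=5):
    """Repeat each frame's feature vector across its frame shift in samples."""
    samples_per_frame = int(fs * shift_ms / 1000)  # 110 samples at 22050 Hz, 5 ms
    return np.repeat(feats, samples_per_frame, axis=0)

frames = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames of 3-dim features
upsampled = adjust_time_resolution(frames)
print(upsampled.shape)  # (440, 3): one conditioning vector per waveform sample
```

After this step, every generated sample x_t has an auxiliary vector h_t of matching time index, as required by Eq. (1).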
4.2 Experimental Results and Discussion

We selected the SF1-to-TF1 conversion pair in VCC2018, randomly drew 10 utterances from the testing data, converted them with each of the three voice conversion models (VAE, VAW-GAN, and CDVAE), and then synthesized the converted features with the WORLD vocoder and the WaveNet vocoder, respectively.
¹ https://github.com/kan-bayashi/PytorchWaveNetVocoder
² https://github.com/JeremyCCHsu/vae-npvc
4.2.1 Spectrograms

Figure 5 shows the spectrograms of the converted speech. It can be observed from Figure 5 that, in the speech generated with the WaveNet vocoder, the structure in the high-frequency band is closer to that of real speech (closer to random), while the structure in the low-frequency band is relatively blurred and noisy, lacking the clear formant structure observed in real speech. As a result, the speech generated by the WaveNet vocoder sounds natural, but contains some noise.
(a) VAE + WORLD  (b) VAE + WaveNet
(c) VAW-GAN + WORLD  (d) VAW-GAN + WaveNet
(e) CDVAE + WORLD  (f) CDVAE + WaveNet

Figure 5. Spectrograms of the speech generated with different vocoders.
4.2.2 Listening Tests

Nine subjects were invited to evaluate the naturalness of the converted speech and its similarity to the target speaker. Table 1 shows the mean opinion scores (MOS) given by the subjects on the naturalness of the converted speech (the real speech of the target speaker is included). Considering the mean scores and the confidence intervals together, we find that, compared with the WORLD vocoder, the WaveNet vocoder brings a significant improvement on the VAE model and a slight but insignificant improvement on the CDVAE model, whereas on VAW-GAN the WORLD vocoder is slightly but insignificantly better than the WaveNet vocoder; this is similar to the results of [23]. We observed that the WORLD vocoder tends to generate over-smoothed but clean speech, while the WaveNet vocoder can in general generate speech that sounds more natural than that of the WORLD vocoder, although the generated speech is sometimes less smooth, containing trembling and slight noise. We consider two possible reasons. First, the waveform trajectories of the speakers in the VCC2018 corpus differ considerably, so it is difficult for the WaveNet vocoder to learn the correct trajectories. Second, the WaveNet vocoder is trained (both multi-speaker training and speaker adaptation) on speech parameters extracted from real speech, whereas in the actual conversion phase its inputs are the converted speech parameters, so there is a mismatch between the two phases. If the mismatch between training and conversion can be reduced in the future, speech of higher naturalness and quality could be generated.

Table 2 shows the results of the ABX test on the similarity of the converted speech to the target speech. From Table 2, we can clearly see that, compared with the WORLD vocoder, synthesizing speech with the WaveNet vocoder effectively improves the similarity to the target speaker. Combining the results of Tables 1 and 2, we find that the speech generated by the WaveNet vocoder is comparable to that of the WORLD vocoder in naturalness, but clearly superior in similarity to the target speaker. Interestingly, when the WaveNet vocoder is applied to VAW-GAN, the naturalness of the generated speech is slightly inferior to that obtained with the WORLD vocoder, yet the improvement in similarity to the target speaker is larger than those obtained with VAE and CDVAE. The reason remains to be investigated in future work.
Table 1. Mean opinion scores on the naturalness of the converted speech; error bars denote 95% confidence intervals.
Table 2. Results of the similarity test with respect to the target speaker.
5. Conclusions

In this paper, we gave a brief introduction to the WaveNet vocoder and applied it to several voice conversion models developed by domestic research teams. According to the results of the listening tests, the WaveNet vocoder shows clear improvements over traditional vocoders on the converted speech of the voice conversion models. The most notable improvement lies in the similarity between the converted speech and the speech of the target speaker, while in terms of naturalness the WaveNet vocoder performs comparably to the traditional WORLD vocoder. In the future, we hope to further improve the naturalness of the speech synthesized by the WaveNet vocoder, for example by enlarging the training data, by exploring more effective speaker adaptation techniques, and by reducing the mismatch of the input speech parameters of the WaveNet vocoder between the training phase and the actual conversion phase. We also hope that this paper can serve as a reference for domestic speech researchers, and encourage more teams to adopt the WaveNet vocoder framework to improve the quality of generated speech.
References
[1] W. Fujitsuru, H. Sekimoto, T. Toda, H. Saruwatari, and K. Shikano, "Bandwidth Extension of Cellular Phone Speech Based on Maximum Likelihood Estimation with GMM," in Proc. NCSP, 2008.
[2] C. C. Hsia, C. H. Wu, and J. Q. Wu, "Conversion Function Clustering and Selection Using Linguistic and Spectral Information for Emotional Voice Conversion," IEEE Trans. on Computers, 56(9), pp. 1225–1233, September 2007.
[3] H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano, "Alaryngeal Speech Enhancement Based on One-to-many Eigenvoice Conversion," IEEE/ACM Trans. on Audio, Speech, and Language Processing, 22(1), pp. 172–183, January 2014.
[4] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar. 1998.
[5] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. America, vol. 50, no. 2, pp. 637–655, Mar. 1971.
[6] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[7] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.
[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[9] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. INTERSPEECH, pp. 1118–1122, 2017.
[10] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in Proc. ASRU, 2017.
[11] J. Chou, C. Yeh, H. Lee, and L. Lee, "Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations," in Proc. INTERSPEECH, pp. 501–505, 2018.
[12] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks," in Proc. INTERSPEECH, pp. 3364–3368, 2017.
[13] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey, 2018, pp. 195–202.
[14] L. Liu, Z. Ling, Y. Jiang, M. Zhou, and L. Dai, "WaveNet Vocoder with Limited Training Data for Voice Conversion," in Proc. INTERSPEECH, pp. 1983–1987, 2018.
[15] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "NU voice conversion system for the voice conversion challenge 2018," in Proc. Odyssey, 2018, pp. 219–226.
[16] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, "The NU Non-Parallel Voice Conversion System for the Voice Conversion Challenge 2018," in Proc. Odyssey, 2018, pp. 211–218.
[17] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," CoRR, vol. abs/1312.6114, 2013.
[18] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. APSIPA ASC, 2016, pp. 1–6.
[19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.
[20] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," CoRR, vol. abs/1701.07875, 2017.
[21] W.-C. Huang, H.-T. Hwang, Y.-H. Peng, Y. Tsao, and H.-M. Wang, "Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders," in Proc. ISCSLP, 2018.
[22] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. ICASSP, 1992.
[23] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, "Statistical voice conversion with WaveNet-based waveform generation," in Proc. INTERSPEECH, pp. 1138–1142, 2017.