
WAVENET FACTORIZATION WITH SINGULAR VALUE DECOMPOSITION FOR VOICE CONVERSION

Hongqiang Du1,2, Xiaohai Tian2, Lei Xie1*, Haizhou Li2

1School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2Department of Electrical and Computer Engineering, National University of Singapore, Singapore

[email protected], [email protected], [email protected], [email protected]

ABSTRACT

The WaveNet vocoder has a clear advantage over traditional vocoders in voice quality. However, it usually requires a relatively large amount of speech data to train a speaker-dependent WaveNet vocoder. Therefore, it remains a challenge to build a high-quality WaveNet vocoder for low-resource tasks, e.g. voice conversion, where speech samples are limited in real applications. We propose to use singular value decomposition (SVD) to reduce the number of WaveNet parameters while maintaining its output voice quality. Specifically, we apply SVD to the dilated convolution layers, and impose a semi-orthogonal constraint to improve performance. Experiments conducted on the CMU-ARCTIC database show that, compared with the original WaveNet vocoder, the proposed method maintains similar performance, in terms of both quality and similarity, while using much less training data.

Index Terms— Voice Conversion (VC), WaveNet, Singular Value Decomposition (SVD)

1. INTRODUCTION

Voice conversion (VC) is a technique that aims to modify a speech signal of a source speaker to make it sound as if it were uttered by a target speaker, without changing the linguistic content. This technique has many applications, including voice morphing, emotion conversion, speech enhancement, and movie dubbing, as well as other entertainment applications.

One of the challenges is to generate high-quality speech. Various statistical approaches have been proposed, e.g. Gaussian mixture model (GMM) [1–3], frequency warping [4–7] and exemplar-based methods [8–10]. Recently, deep learning has become mainstream in this area, including deep neural network (DNN) [11–13], long short-term memory (LSTM) [14], variational auto-encoder (VAE) [15], and generative adversarial networks (GAN) [16–18].

Despite much progress, the quality of converted speech cannot match that of the source speech. One major reason is that traditional vocoders with simplified assumptions

*Lei Xie is the corresponding author, [email protected].

are generally adopted for speech reconstruction. To address this problem, several neural vocoders [19–22] have been proposed to generate time-domain signals directly. The effectiveness of incorporating a speaker-dependent (SD) WaveNet [23, 24] into the VC task for waveform generation has been investigated. Recently, an SD WaveNet [25] was first adopted to replace the traditional vocoder for converted speech generation, while another attempt [26] used WaveNet to learn a mapping between phonetic posteriorgrams (PPG) and waveform samples to generate speech for a specific target speaker. While an SD WaveNet in general generates high-quality speech, it requires a relatively large amount of data for training. According to a recent study [27], an SD WaveNet needs to be trained with at least 320 utterances, which is not always available in practice. To reduce the data requirement from the target speaker, a speaker-adapted WaveNet has been proposed [28, 29], where a speaker-independent WaveNet is first trained on a corpus of various speakers and then adapted with a small amount of target data.

In this paper, we propose to build an SD WaveNet with limited data from the target speaker, where the singular value decomposition (SVD) [30] technique is employed to reduce the number of WaveNet parameters. Specifically, SVD is applied to the dilated convolution layers by inserting a 1 × 1 convolution layer, so that the original weight matrix is factorized into two small matrices. Inspired by [31], a semi-orthogonal constraint is imposed while updating the decomposed weight matrices, which leads to rapid convergence of network training.

The contributions of this paper are as follows:

1) We apply SVD to WaveNet, which significantly reduces the number of parameters;

2) We significantly reduce the amount of data required for SD WaveNet training, while maintaining the quality of the generated speech.

The rest of this paper is organized as follows. Section 2 gives an overview of the WaveNet vocoder. The details of our proposed method are discussed in Section 3.


Fig. 1. Diagram of the proposed factorized WaveNet using SVD. (a) The factorized WaveNet with SVD, where SC denotes the output channel of the SVD layer, DC denotes the output channel of the dilated convolution, and N and M indicate the two small matrices after decomposition; SC and DC are set to 256 and 32 as an instance. (b) The dilated convolution layer before applying SVD; the original weight matrix W has 102,400 parameters. (c) The dilated convolution layer after applying SVD; the decomposed weight matrices N and M have 12,800 and 8,192 parameters respectively, reducing the total number of parameters to 20,992.

In Sections 4 and 5, we describe the experimental setup and report the results. Finally, conclusions are given in Section 6.

2. WAVENET VOCODER

2.1. WaveNet Vocoder

The WaveNet vocoder [23–25] is an autoregressive generative model that directly models raw audio waveforms given acoustic features, e.g. spectral features, aperiodicity, and F0. Given a waveform \mathbf{x} = [x_1, x_2, \cdots, x_T], the joint probability is modeled as the product of conditional distributions as follows:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \cdots, x_{t-1}). \quad (1)

In order to learn the long-range dependencies among temporal waveform samples, WaveNet utilizes an architecture based on a stack of dilated causal convolution layers and a gated activation unit. The calculation at each dilated causal layer is expressed as follows:

z = \tanh(W_f * i + V_f * h) \odot \sigma(W_g * i + V_g * h), \quad (2)

where i is the input vector and h is the extra conditioning feature, e.g. spectral features. * and \odot represent the convolution and element-wise product operators, respectively. The subscripts f and g denote the filter and the gate, respectively. W and V indicate learnable convolution weights.
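To make Eq. (2) concrete, below is a minimal PyTorch sketch of one dilated causal convolution layer with a gated activation unit and local conditioning. It is an illustrative reconstruction, not the authors' implementation; the class name ResidualBlock and all parameter names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet residual block: dilated causal convolution + gated activation (Eq. 2)."""

    def __init__(self, residual_ch: int, skip_ch: int, cond_ch: int,
                 dilation: int, kernel_size: int = 2):
        super().__init__()
        # Left-pad by (kernel_size - 1) * dilation so the convolution stays causal.
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(residual_ch, residual_ch, kernel_size, dilation=dilation)
        # 1x1 convolutions project the conditioning features h (the V_f, V_g of Eq. 2).
        self.cond_filter = nn.Conv1d(cond_ch, residual_ch, 1)
        self.cond_gate = nn.Conv1d(cond_ch, residual_ch, 1)
        # 1x1 convolutions produce the residual and skip outputs.
        self.res_out = nn.Conv1d(residual_ch, residual_ch, 1)
        self.skip_out = nn.Conv1d(residual_ch, skip_ch, 1)

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        # x: (batch, residual_ch, T) waveform features; h: (batch, cond_ch, T) acoustic
        # features, assumed already upsampled to the waveform rate.
        xp = F.pad(x, (self.pad, 0))
        z = torch.tanh(self.filter_conv(xp) + self.cond_filter(h)) * \
            torch.sigmoid(self.gate_conv(xp) + self.cond_gate(h))
        return x + self.res_out(z), self.skip_out(z)
```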

2.2. Advantages and Limitations

The WaveNet vocoder directly establishes the mapping between acoustic features and time-domain waveforms in a data-driven manner. Therefore, some time-domain information that is usually discarded in conventional vocoders, e.g. phase information, can be recovered, which allows for high-fidelity voice reconstruction.

On the other hand, the WaveNet vocoder is a deep autoregressive neural network built mainly from stacked dilated convolutions. The dilation grows exponentially with the number of layers (e.g. 1, 2, ..., 512) and this pattern is then repeated 3 or 4 times, which makes the network very deep. Hence, WaveNet generally has a large number of parameters, which in turn requires a large amount of data for training.
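As a quick sanity check on that depth: with kernel size 2, each layer adds one dilation's worth of context, so the receptive field is 1 plus the sum of all dilations. A small numeric sketch, assuming the 3-block configuration used later in Section 4.1:

```python
# Receptive field of stacked dilated causal convolutions: with kernel size 2,
# each layer adds `dilation` samples of left context.
blocks = 3                                  # dilation pattern repeated 3 times (Section 4.1)
dilations = [2 ** i for i in range(10)]     # 1, 2, 4, ..., 512 within one block
kernel_size = 2

receptive_field = 1 + blocks * sum((kernel_size - 1) * d for d in dilations)
print(receptive_field)           # 3070 samples
print(receptive_field / 16000)   # ~0.19 seconds of context at 16 kHz
```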

3. WAVENET FACTORIZATION WITH SINGULARVALUE DECOMPOSITION

To reduce the data requirement of SD WaveNet training, in this section we propose to factorize WaveNet with the singular value decomposition (SVD) technique.

3.1. SVD for WaveNet

Singular value decomposition (SVD) is a factorization technique from linear algebra that has been successfully used to reduce model parameters in neural networks [31–34]. Generally, SVD is applied to decompose a learned weight matrix into two lower-dimensional matrices, discarding the smaller singular values [32–34]. With a reduced number of weight


parameters, the factorized matrices can still achieve performance close to that of the original large weight matrix.

Inspired by the success of [31], where SVD is employed in time delay neural networks (TDNNs) for automatic speech recognition (ASR), we investigate the effectiveness of applying the SVD technique to adjust the network structure of WaveNet. Since the dilated convolution layers are the dominant component of the WaveNet model, accounting for the majority of the weight parameters, we choose to apply SVD to these layers.

We use an example to show how it works. As shown in Fig. 1(a), a 1 × 1 convolution layer is added after each dilated convolution layer; we call it an SVD layer. DC is the output channel of the dilated convolution layer and SC is the output channel of the SVD layer. The original weight matrix is then decomposed into two matrices, denoted N and M respectively, whose sizes are controlled by DC and SC. For example, as shown in Fig. 1(b), we set the input dimension of the dilated convolution layer to 200, the filter size to 2, and the output channel to 256. The number of parameters of the original matrix is 200 × 2 × 256 = 102,400. By adding a 1 × 1 SVD layer with DC and SC of 32 and 256, as shown in Fig. 1(c), the numbers of parameters of the decomposed matrices N and M become 200 × 2 × 32 = 12,800 and 32 × 1 × 256 = 8,192. Consequently, the number of parameters is effectively reduced from 102,400 to 20,992.
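In implementation terms, the factorization replaces one wide dilated convolution with a narrow dilated convolution followed by a 1 × 1 convolution. The following PyTorch sketch reproduces the parameter counts of this example; the variable names are ours and the dilation value is arbitrary:

```python
import torch.nn as nn

in_ch, kernel_size, dilation = 200, 2, 4  # example sizes from Fig. 1; dilation is arbitrary
DC, SC = 32, 256                          # output channels of the dilated conv and SVD layer

# Original layer: the weight W has 200 * 2 * 256 = 102,400 parameters.
original = nn.Conv1d(in_ch, SC, kernel_size, dilation=dilation, bias=False)

# Factorized: N has 200 * 2 * 32 = 12,800 parameters and M (the 1 x 1 "SVD layer")
# has 32 * 1 * 256 = 8,192, i.e. 20,992 in total.
factorized = nn.Sequential(
    nn.Conv1d(in_ch, DC, kernel_size, dilation=dilation, bias=False),  # N
    nn.Conv1d(DC, SC, 1, bias=False),                                  # M
)

print(sum(p.numel() for p in original.parameters()))    # 102400
print(sum(p.numel() for p in factorized.parameters()))  # 20992
```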

3.2. Training with a Semi-orthogonal Constraint

To prevent numerical instability and accelerate convergence, the decomposed model is trained from a random initialization with the constraint that the matrix M be semi-orthogonal. Specifically, we apply an update after every four steps of back-propagation to keep the matrix close to a semi-orthogonal matrix. The update formula, whose derivation can be found in [31], is expressed as follows:

M \leftarrow M - \frac{1}{2\alpha^2}\left(MM^T - \alpha^2 I\right)M, \quad (3)

where \alpha is a scaling factor and I is the identity matrix. \alpha allows us to control how fast the parameters of different SVD layers change in a more consistent way. Let P \equiv MM^T; then we compute the scale \alpha as follows:

\alpha = \sqrt{\operatorname{tr}(PP^T) / \operatorname{tr}(P)}. \quad (4)
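A minimal sketch of this constrained update, following Eqs. (3) and (4), is given below. We treat the SVD-layer weight as a 2-D matrix with rows ≤ columns; the every-fourth-step schedule and the weight-access pattern in the usage comment are our assumptions:

```python
import torch

@torch.no_grad()
def semi_orthogonal_step(M: torch.Tensor) -> None:
    """Push M (rows <= columns) toward a scaled semi-orthogonal matrix, Eqs. (3)-(4)."""
    P = M @ M.t()                                      # P = M M^T, shape (rows, rows)
    alpha2 = torch.trace(P @ P.t()) / torch.trace(P)   # alpha^2 = tr(P P^T) / tr(P), Eq. (4)
    I = torch.eye(M.shape[0], device=M.device, dtype=M.dtype)
    M -= (1.0 / (2.0 * alpha2)) * (P - alpha2 * I) @ M # Eq. (3), applied in place

# Usage sketch: a Conv1d SVD layer has weight shape (SC, DC, 1); viewed as a
# DC x SC matrix it satisfies rows <= columns. Applied every fourth step:
# if step % 4 == 0:
#     semi_orthogonal_step(svd_layer.weight.squeeze(-1).t())
```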

4. EXPERIMENTAL SETUPS

4.1. Training of WaveNet Vocoder

A WaveNet vocoder takes acoustic features as input [24] and generates speech waveforms as output. The WORLD vocoder [35]

was employed for speech analysis to prepare the acoustic features for WaveNet training. We extracted a 513-dimensional spectrum, a 1-dimensional aperiodicity coefficient and a 1-dimensional F0 with a 5 ms frame shift. Then 40-dimensional mel-cepstral coefficients (MCCs) were calculated from the spectrum using the Speech Signal Processing Toolkit (SPTK)1. The 40-dimensional MCCs and the 3-dimensional logarithmic fundamental frequency (static, delta and delta-delta), together with the aperiodicity and the unvoiced/voiced (U/V) flag, form a 45-dimensional acoustic feature vector for each speech frame.
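For illustration, the following sketch performs a comparable analysis with the pyworld and pysptk Python bindings. The paper specifies only WORLD and SPTK; the choice of bindings, the F0 estimator, the mel-cepstrum warping factor, and the band-averaged 1-dimensional aperiodicity are our assumptions:

```python
import numpy as np
import pyworld as pw   # WORLD analysis bindings
import pysptk          # SPTK bindings for mel-cepstral analysis

def extract_features(wav: np.ndarray, fs: int = 16000):
    """WORLD analysis to per-frame features, loosely following Section 4.1."""
    wav = wav.astype(np.float64)
    f0, t = pw.harvest(wav, fs, frame_period=5.0)  # F0 with a 5 ms frame shift
    sp = pw.cheaptrick(wav, f0, t, fs)             # 513-dim spectral envelope
    ap = pw.d4c(wav, f0, t, fs)                    # aperiodicity, same shape as sp
    mcc = pysptk.sp2mc(sp, order=39, alpha=0.42)   # 40-dim MCCs (alpha is our choice)
    lf0 = np.where(f0 > 0.0, np.log(np.maximum(f0, 1e-10)), 0.0)  # log F0; deltas added later
    vuv = (f0 > 0.0).astype(np.float64)            # unvoiced/voiced flag
    bap = ap.mean(axis=1, keepdims=True)           # crude 1-dim aperiodicity coefficient
    return mcc, lf0, vuv, bap
```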

The WaveNet vocoder consists of 3 blocks with 10 dilation layers in each block. The dilation in each block starts from 1 and increases exponentially by a factor of 2 up to 512. The filter size of the causal dilated convolution is 2. The hidden units of the residual connection, the skip connection and the SVD layer before the softmax output layer are set to 256. We denote the output channel of the dilated convolution layer as DC and the output channel of the SVD layer as SC, as shown in Fig. 1. The network is optimized by the Adam optimizer with a constant learning rate of 0.0001 and a mini-batch size of 14,000. The speech waveforms are encoded with 8-bit µ-law.

We compare three WaveNet configurations, shown in Table 1. Different numbers of utterances, e.g. 100, 300 and 500, from a specific speaker are used for SD WaveNet vocoder training. To avoid over-fitting, we vary the number of iteration steps with the training data size, e.g. 400,000 iteration steps for 500 utterances, and 300,000 and 100,000 steps for 300 and 100 utterances, respectively.

4.2. Voice Conversion Experiments

Experiments were conducted on the CMU-ARCTIC dataset [36]. We selected two source speakers (clb and rms) and two target speakers (bdl and slt). Intra-gender and inter-gender conversions were conducted between the following pairs: rms to bdl (M2M), clb to slt (F2F), clb to bdl (F2M) and rms to slt (M2F). 100 utterances were used for training, and another 20 non-overlapping utterances of each speaker were used for evaluation. In total, 80 converted sentences were generated for the 4 speaker pairs.

We implemented the PPG approach to voice conversion [37] in the experiments. 42-dimensional PPG features of the CMU-ARCTIC dataset were extracted with a PPG extractor trained with Kaldi [38] on the Wall Street Journal corpus [39]. All audio files were downsampled to 16 kHz. The PPG conversion model consists of one feed-forward layer, two long short-term memory (LSTM) layers and one linear output layer. Each hidden layer has 128 units. The Adam optimizer is used for model training with a learning rate of 0.002. The PPG conversion model takes the PPG features as input and generates MCC features for the WaveNet vocoder.
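A minimal PyTorch sketch of a conversion model with this layout (one feed-forward layer, two LSTM layers, one linear output layer, 128 hidden units). The ReLU activation and all names are our assumptions; the paper does not publish its implementation:

```python
import torch
import torch.nn as nn

class PPGConversionModel(nn.Module):
    """Maps 42-dim PPG features to 40-dim MCCs (layout from Section 4.2)."""

    def __init__(self, ppg_dim: int = 42, hidden: int = 128, mcc_dim: int = 40):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(ppg_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, mcc_dim)

    def forward(self, ppg: torch.Tensor) -> torch.Tensor:
        # ppg: (batch, frames, 42) -> MCCs: (batch, frames, 40)
        h, _ = self.lstm(self.ff(ppg))
        return self.out(h)

# model = PPGConversionModel()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.002)  # learning rate from the paper
```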

1https://sourceforge.net/projects/sp-tk/


Table 1. WaveNet vocoder configurations, where DC and SC denote the output channel of the dilated convolution layer and the SVD layer, respectively. WaveNet-SVD64 and WaveNet-SVD32 are WaveNets with 64 and 32 output channels in the dilated convolution layer.

Vocoder         Dilation Layers   Filter Size   Residual Channel   Skip Channel   DC    SC    Model Size
WaveNet         30                2             256                256            256   N/A   49.9 MB
WaveNet-SVD64   30                2             256                256            64    256   32.4 MB
WaveNet-SVD32   30                2             256                256            32    256   16.7 MB

5. EVALUATIONS

We validated our proposed factorized WaveNet with two sets of experiments, covering the training of the WaveNet vocoder and the WaveNet vocoder in voice conversion. Both objective and subjective evaluation results are reported.

5.1. Evaluation Metrics

5.1.1. Objective Metrics

For objective evaluation, we employed the signal-to-noise ratio (SNR) and the root mean square error (RMSE) to evaluate the distortion between the original speech and the synthesized speech in the time domain and the frequency domain, respectively:

\mathrm{SNR} = 10 \log_{10}\left(\frac{\sum_{n=1}^{N} y(n)^2}{\sum_{n=1}^{N} (x(n) - y(n))^2}\right), \quad (5)

\mathrm{RMSE} = \sqrt{\frac{1}{F} \sum_{f=1}^{F} \left(20 \log_{10} \frac{|Y(f)|}{|X(f)|}\right)^2}, \quad (6)

where x(n) and y(n) represent the windowed signals of the synthesized speech and the natural speech, respectively; n denotes the sample position in the time domain and N is the frame length. |X(f)| and |Y(f)| represent the magnitudes of the synthesized speech and the natural speech at the f-th frequency bin, respectively, and F is the number of frequency bins.

For both the SNR and RMSE calculations, a time shift of up to ±200 samples is used to maximize the correlation between the original and synthesized speech. A higher SNR and a lower RMSE indicate smaller distortion.
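A numpy sketch of the two metrics and the shift search is given below; the framing, windowing, and FFT details are our assumptions, and the circular shift is a simplification of a proper cross-correlation alignment:

```python
import numpy as np

def snr_db(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (5): x is a windowed synthesized frame, y the natural frame."""
    return 10.0 * np.log10(np.sum(y ** 2) / np.sum((x - y) ** 2))

def rmse_db(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (6): RMSE of the log-magnitude spectra over F frequency bins."""
    X, Y = np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))
    eps = 1e-10  # guard against log of zero
    diff = 20.0 * np.log10((Y + eps) / (X + eps))
    return float(np.sqrt(np.mean(diff ** 2)))

def best_shift(x: np.ndarray, y: np.ndarray, max_shift: int = 200) -> int:
    """Time shift within +/-200 samples maximizing the correlation of x and y.
    np.roll wraps around at the edges, which is a simplification."""
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [float(np.dot(np.roll(x, s), y)) for s in shifts]
    return shifts[int(np.argmax(scores))]
```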

5.1.2. Subjective Metrics

For subjective evaluation, we conducted AB and XAB preference tests to assess speech quality and speaker similarity, respectively. In the AB tests, samples A and B were randomly selected from the proposed method and the reference method, and each listener was required to select the sample with better quality. In the XAB tests, X was the reference sample of the target, and A and B were converted samples randomly selected from the methods under comparison. Note that the language content of X, A and B was the same. Listeners were asked to choose the sample closer to the reference sample, or to indicate no preference.

We also conducted mean opinion score (MOS) tests to assess the naturalness of the converted speech. Each listener was asked to rate an opinion score on a five-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad).

For each system, 20 sentences, randomly selected from the 80 converted samples, were used for the listening tests. 10 proficient English listeners participated in all listening tests.

5.2. Evaluations of WaveNet Vocoder

5.2.1. Factorized WaveNet with different SVD configurations

We start by studying WaveNet as a vocoder, as presented in Section 4.1, with and without SVD.

First, we examined the model sizes of the different configurations. As shown in Table 1, the model size of the original WaveNet is 49.9 MB. By applying SVD to factorize WaveNet, the model size decreases to 32.4 MB and 16.7 MB with a DC of 64 and 32, respectively. We observe that the model size is effectively reduced through SVD.

Table 2. Comparison of average SNR and RMSE with (w/) and without (w/o) the semi-orthogonal constraint for different WaveNet configurations.

Configuration    Constraint   SNR (dB)   RMSE (dB)
WaveNet          w/           3.597      9.499
WaveNet-SVD64    w/           3.598      8.545
WaveNet-SVD32    w/           3.601      8.348
WaveNet-SVD32    w/o          3.605      8.465

Then, we evaluated the performance when varying the output channel size of the dilation layer, using 100 training utterances. As shown in Table 2, the RMSE decreases when SVD is applied. The best performance is achieved with a DC of 32. We do not observe a significant impact of SVD on the SNR among the various configurations.

Finally, we examined the effect of the semi-orthogonal constraint with WaveNet-SVD32 and 100 training utterances. As shown in Table 2, with the semi-orthogonal constraint, the RMSE decreases from 8.465 dB to 8.348 dB, while the SNR remains almost the same. This confirms the benefit of the semi-orthogonal constraint when updating the decomposed weight matrices.


Table 3. Comparison of the SNR and RMSE between WaveNet and WaveNet-SVD vocoders.

                      SNR (dB)                      RMSE (dB)
Vocoder       100utts   300utts   500utts   100utts   300utts   500utts
WaveNet       3.597     3.606     3.605     9.499     8.350     8.044
WaveNet-SVD   3.601     3.603     3.604     8.348     7.949     7.816

Based on the empirical observations above, we set the output channel of the dilated layer to 32 with the semi-orthogonal constraint in the rest of the experiments.

5.2.2. Objective Evaluation

Table 3 shows the objective evaluation results of the WaveNet vocoders with and without SVD. All systems achieve similar performance in terms of SNR. In terms of RMSE, system performance improves as training data increases, for both WaveNet and WaveNet-SVD. In addition, WaveNet-SVD consistently outperforms WaveNet for different amounts of training data. Moreover, WaveNet-SVD (100 utts) achieves results similar to WaveNet (300 utts) and close to WaveNet (500 utts). Fig. 2 gives an example of the spectrum generated from (a) the original speech, (b) WaveNet (500 utts), (c) WaveNet (100 utts) and (d) WaveNet-SVD (100 utts). The spectrum generated by WaveNet-SVD (100 utts) is clearly sharper than that of WaveNet (100 utts), and it is also closer to that of WaveNet (500 utts) and the original speech.

Fig. 2. Example spectrums of the same utterance 'b0436' from bdl, synthesized with different systems: (a) original, (b) WaveNet (500 utts), (c) WaveNet (100 utts), (d) WaveNet-SVD (100 utts).

5.2.3. Subjective Evaluation

We further conducted AB preference and MOS tests to assess the quality of speech generated by the different systems.

The subjective results of the AB tests are presented in Fig. 3. We first examined the effect of the amount of training data. As shown in Fig. 3(a), WaveNet (500 utts) significantly outperforms WaveNet (100 utts). Then, we compared various WaveNet configurations with different amounts of training data. Fig. 3(b) suggests that WaveNet-SVD (100 utts) achieves significantly better performance than WaveNet (100 utts). The results in Fig. 3(c) indicate that WaveNet-SVD (100 utts) slightly outperforms WaveNet (300 utts). In Fig. 3(d), the preference scores of WaveNet-SVD (100 utts) and WaveNet (500 utts) fall within each other's confidence intervals, which means the difference is not statistically significant. The results of the MOS test in Table 4 are consistent with those in Fig. 3. Hence, we conclude that WaveNet-SVD (100 utts) significantly outperforms WaveNet (100 utts) and achieves results similar to WaveNet (500 utts).

Fig. 3. Results of the speech quality preference tests with 95% confidence intervals for (a) WaveNet (100 utts) vs. WaveNet (500 utts), (b) WaveNet (100 utts) vs. WaveNet-SVD (100 utts), (c) WaveNet (300 utts) vs. WaveNet-SVD (100 utts), (d) WaveNet (500 utts) vs. WaveNet-SVD (100 utts).

Table 4. Mean opinion scores on the speech quality of waveforms reconstructed using different vocoders.

        WORLD   WaveNet-SVD (100 utts)   WaveNet (100 utts)   WaveNet (500 utts)
MOS     3.44    3.70                     3.28                 3.74

5.3. Evaluation of Voice Conversion

5.3.1. Objective Evaluation

We further studied the effect of the WaveNet vocoder in the PPG voice conversion systems discussed in Section 4.2. RMSE was used as the evaluation metric.

Table 5 shows the objective results for the different systems. The average RMSE of WaveNet decreases from 13.515 dB to 13.313 dB as the training data increases from 100 to 500 utterances. Moreover, the average RMSE of WaveNet-SVD (100 utts) is better than that of WaveNet (100 utts).


Table 5. Comparison of RMSE (dB) between VC results generated with the WaveNet and WaveNet-SVD vocoders.

Vocoder        Utts   Intra    Inter    Average
WaveNet        500    13.304   13.321   13.313
WaveNet        100    13.436   13.594   13.515
WaveNet-SVD    100    13.394   13.385   13.390

It is also close to that of WaveNet (500 utts). Similar trends are observed in both the intra-gender and inter-gender conversions.

5.3.2. Subjective Evaluation

We then conducted AB, XAB and MOS tests to assess the quality of the generated speech and the speaker similarity between WaveNet and WaveNet-SVD.

Fig. 4. Quality preference tests of converted speech samples with 95% confidence intervals for different vocoders: (a) WaveNet (100 utts) vs. WaveNet (500 utts), (b) WaveNet (100 utts) vs. WaveNet-SVD (100 utts), (c) WaveNet (500 utts) vs. WaveNet-SVD (100 utts).

The results of the AB tests are presented in Fig. 4. We first examined the effect of the amount of training data on speech quality. As shown in Fig. 4(a), WaveNet (500 utts) significantly outperforms WaveNet (100 utts). Then, we compared the speech quality of the proposed WaveNet-SVD and WaveNet with the same 100 training utterances. Fig. 4(b) suggests that WaveNet-SVD (100 utts) achieves significantly better performance than WaveNet (100 utts). Finally, we compared the speech quality of the proposed WaveNet-SVD and WaveNet with different amounts of training utterances. The results in Fig. 4(c) indicate that the preference scores of WaveNet-SVD (100 utts) and WaveNet (500 utts) fall within each other's confidence intervals, which means they are not significantly different. The results of the MOS test in Table 6 are consistent with those in Fig. 4.

Table 6. Mean opinion scores on the speech quality of converted speech reconstructed using different vocoders.

        WORLD   WaveNet-SVD (100 utts)   WaveNet (100 utts)   WaveNet (500 utts)
MOS     3.36    3.67                     3.21                 3.70

Fig. 5. Similarity preference tests of converted speech samples with 95% confidence intervals for different vocoders: (a) WaveNet (100 utts) vs. WaveNet (500 utts), (b) WaveNet (100 utts) vs. WaveNet-SVD (100 utts), (c) WaveNet (500 utts) vs. WaveNet-SVD (100 utts).

Fig. 5 shows the XAB results for the different systems. The similarity results are consistent with those of the voice quality tests. Both WaveNet (500 utts) and WaveNet-SVD (100 utts) significantly outperform WaveNet (100 utts), while WaveNet-SVD (100 utts) and WaveNet (500 utts) are not significantly different in terms of speaker identity.

Therefore, we conclude that WaveNet-SVD (100 utts) significantly outperforms WaveNet (100 utts) and achieves results similar to WaveNet (500 utts). Synthesized samples from the different vocoders can be found on the accompanying website2.

6. CONCLUSIONS

In this study, we proposed a factorized WaveNet using singular value decomposition. Specifically, a 1 × 1 SVD layer was added after each dilated convolution layer to decompose one parameter matrix into two smaller matrices. The experimental results demonstrated that 1) the proposed method reduces the model parameters effectively, and 2) it maintains similar performance with much less training data (100 utts) than the original WaveNet (500 utts), in terms of both quality and similarity.

7. ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-100E-2018-006), and by Programmatic Grant No. A18A2b0046 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain). It is also partially supported by the National Key Research and Development Program of China (No. 2017YFB1002102).

2https://dhqadg.github.io/ASRU-WaveNet-SVD/


8. REFERENCES

[1] Yannis Stylianou, Olivier Cappe, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.

[2] Alexander Kain and Michael W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1998, vol. 1, pp. 285–288.

[3] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.

[4] Daniel Erro, Asuncion Moreno, and Antonio Bonafonte, "Voice conversion based on weighted frequency warping," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–931, 2009.

[5] Elizabeth Godoy, Olivier Rosec, and Thierry Chonavel, "Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1313–1323, 2011.

[6] Xiaohai Tian, Zhizheng Wu, Siu Wa Lee, Nguyen Quy Hy, Eng Siong Chng, and Minghui Dong, "Sparse representation for frequency warping based voice conversion," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4235–4239.

[7] Xiaohai Tian, Zhizheng Wu, Siu Wa Lee, and Eng Siong Chng, "Correlation-based frequency warping for voice conversion," in The 9th International Symposium on Chinese Spoken Language Processing. IEEE, 2014, pp. 211–215.

[8] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, "Exemplar-based voice conversion in noisy environment," in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 313–317.

[9] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li, "Exemplar-based sparse representation with residual compensation for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1506–1521, 2014.

[10] Xiaohai Tian, Siu Wa Lee, Zhizheng Wu, Eng Siong Chng, and Haizhou Li, "An exemplar-based approach to frequency warping for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1863–1876, 2017.

[11] Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad, "Voice conversion using artificial neural networks," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3893–3896.

[12] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.

[13] Seyed Hamidreza Mohammadi and Alexander Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 19–23.

[14] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4869–4873.

[15] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.

[16] Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2017.

[17] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.

[18] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6820–6824.

[19] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.

[20] Jean-Marc Valin and Jan Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.

[21] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.

[22] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards achieving robust universal neural vocoding," in Proc. Interspeech 2019, 2019, pp. 181–185.

[23] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[24] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Interspeech, 2017, pp. 1118–1122.

[25] Kazuhiro Kobayashi, Tomoki Hayashi, Akira Tamamori, and Tomoki Toda, "Statistical voice conversion with WaveNet-based waveform generation," in Interspeech, 2017, pp. 1138–1142.

[26] Xiaohai Tian, Eng-Siong Chng, and Haizhou Li, "A speaker-dependent WaveNet for voice conversion with non-parallel data," in Interspeech, 2019.

[27] Nagaraj Adiga, Vassilis Tsiaras, and Yannis Stylianou, "On the use of WaveNet as a statistical vocoder," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5674–5678.

[28] Li-Juan Liu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou, and Li-Rong Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. Interspeech, 2018, pp. 1983–1987.

[29] Berrak Sisman, Mingyang Zhang, and Haizhou Li, "A voice conversion framework with tandem feature sparse representation and speaker-adapted WaveNet vocoder," in Interspeech, 2018, pp. 1978–1982.

[30] Gene H. Golub and Christian Reinsch, "Singular value decomposition and least squares solutions," in Linear Algebra, pp. 134–151. Springer, 1971.

[31] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Interspeech, 2018.

[32] Jian Xue, Jinyu Li, and Yifan Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Interspeech, 2013, pp. 2365–2369.

[33] George Tucker, Minhua Wu, Ming Sun, Sankaran Panchapagesan, Gengshen Fu, and Shiv Vitaladevuni, "Model compression applied to small-footprint keyword spotting," in Interspeech, 2016, pp. 1878–1882.

[34] Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Ian McGraw, "On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5970–5974.

[35] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[36] John Kominek and Alan W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.

[37] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.

[38] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," Tech. Rep., IEEE Signal Processing Society, 2011.

[39] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.