
A 450bps Speech Coding Algorithm Based on Multi-mode Matrix Quantization

Xia Zou, Xiongwei Zhang PLA Univ. of Sci.&Tech

Nanjing, P. R. China

Abstract—A 450bps speech coder based on a multi-frame structure and multi-mode matrix quantization is presented. A multi-frame structure consisting of four frames is adopted to reduce the algorithm delay. The parameter matrices are classified into different modes based on the voicing vector information of the superframe. To improve speech quality, a dynamic bit allocation scheme is developed. Experimental results show that the proposed vocoder produces intelligible speech with good naturalness.

Keywords-speech coding; matrix quantization; multi-frame; multi-mode

I. INTRODUCTION

In recent years, military applications and satellite communications have increased the interest in speech coding below 2400bps. In 2001, a 1200bps speech coder [1] was added to the 2.4kbps Mixed Excitation Linear Prediction (MELP) speech coder [2]. This enhanced MELP (MELPe) was adopted as the new MIL-STD-3005. In 2002, MELPe was tested against other candidates such as France's HSX (Harmonic Stochastic eXcitation) [3] and Turkey's SB-LPC (Split-Band Linear Predictive Coding) [4]. Subsequently, the US DoD MELPe was adopted as the NATO (North Atlantic Treaty Organization) standard known as STANAG-4591. In 2005, a new 600bps MELPe speech coder was added to the NATO standard STANAG-4591. There are now more advanced efforts to lower the bit rate below 600bps [5, 6], and comparable efforts have produced high-quality speech coders at 600bps [7-9]. In these coding algorithms, the correlation between consecutive speech frames is exploited with a multi-frame structure to further reduce the coding rate, and efficient quantization is applied to improve quality.

In this paper, a 450bps speech coding algorithm is proposed. In the proposed coder, the parameters of four consecutive frames are grouped together and jointly quantized, with different schemes for different frame combinations based on the statistical properties of voiced and unvoiced speech, to obtain high coding efficiency. Furthermore, a dynamic bit allocation scheme is employed to take advantage of the multi-frame structure. The PESQ score of the proposed 450bps speech coder is over 2.5.

II. OVERVIEW OF CODING SCHEME

Speech analysis is performed on 25ms frames. The parameters extracted from the speech signal are the LPC coefficients, pitch, gain, and band-pass voicing decisions. The encoder block diagram of the speech coder is shown in Figure 1.

Figure 1. Speech encoder block diagram

The input speech is high-pass filtered at 60Hz to remove the low frequency energy. A 10th order linear prediction analysis is performed employing the autocorrelation method. The M-best pitch candidates are calculated based on autocorrelation after low-pass filtering the speech signal to 800Hz bandwidth. An optimal pitch candidate is selected from the M-best candidates to minimize the cost function considering the time evolution of the pitch

$$\mathrm{index}^{l} = \arg\min_{0 \le i,\, j \le M-1} \left\{ \delta \times \Delta p_{i}^{l,l-1} + \delta \times \Delta p_{ij}^{l,l+1} - r_{i}^{l} - r_{j}^{l+1} \right\} \quad (1)$$

$$\Delta p_{i}^{l,l-1} = \left| p_{i}^{l} - p_{\mathrm{index}^{l-1}}^{l-1} \right|, \qquad \Delta p_{ij}^{l,l+1} = \left| p_{i}^{l} - p_{j}^{l+1} \right| \quad (2)$$

where $p_i^l$ is the $i$th candidate pitch value of the $l$th frame (the current frame), $r_i^l$ is the normalized autocorrelation value corresponding to $p_i^l$, and $\delta$ is a parameter to control the contribution of the pitch differentials. The speech is filtered into five frequency bands with pass-bands of 0-500, 500-1000, 1000-2000, 2000-3000 and 3000-4000Hz. The band-pass voicing decisions are made based on the pitch correlations for the band-pass signal and the time envelope of the band-pass signal. The calculation of gain operates on the LPC residual.



At the decoder, the voiced excitation is generated as

$$ex_{n} = \sum_{k=1}^{K} A_{k} \times \cos(\omega_{k} n + \varphi_{k}), \qquad 0 \le n \le N-1 \quad (3)$$

where $ex_n$ is the voiced excitation, $A_k$ are the spectral magnitudes, and $\omega_k$ and $\varphi_k$ are the voiced harmonics and phases, respectively. $K$ is the number of voiced harmonics and $N$ is the frame size. The unvoiced excitation is generated by a uniform random number generator. The voiced and unvoiced excitation signals are filtered and summed to form the excitation. The LPC filter and post filter are applied to the resultant excitation to form the synthetic speech. The decoder block diagram is shown in Figure 2.
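As a minimal illustration of Eq. (3), the sketch below synthesizes a voiced excitation as a sum of cosines and an unvoiced excitation from a uniform random number generator. The pitch, sampling rate, unit amplitudes, and zero phases are hypothetical; the coder derives the harmonic parameters from the decoded model parameters.

```python
import numpy as np

def voiced_excitation(amps, omegas, phases, frame_size):
    """Sum-of-cosines voiced excitation: ex_n = sum_k A_k * cos(w_k * n + phi_k), Eq. (3)."""
    n = np.arange(frame_size)                      # 0 <= n <= N-1
    ex = np.zeros(frame_size)
    for a_k, w_k, phi_k in zip(amps, omegas, phases):
        ex += a_k * np.cos(w_k * n + phi_k)
    return ex

def unvoiced_excitation(frame_size, rng):
    """Unvoiced excitation from a uniform random number generator."""
    return rng.uniform(-1.0, 1.0, frame_size)

# hypothetical harmonics of a 100 Hz pitch at 8 kHz sampling, 25 ms frame (N = 200)
f0, fs, N = 100.0, 8000.0, 200
K = int((fs / 2) // f0) - 1                        # number of harmonics below Nyquist
omegas = 2.0 * np.pi * f0 * np.arange(1, K + 1) / fs
ex_v = voiced_excitation(np.ones(K), omegas, np.zeros(K), N)
ex_u = unvoiced_excitation(N, np.random.default_rng(0))
```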

Figure 2. Speech decoder block diagram

III. BIT ALLOCATION AND PARAMETER QUANTIZATION

A. Bit allocation

To exploit the correlation between successive frames, the number of frames grouped together should be large. However, the memory requirements then become too large while the additional improvement in speech quality becomes small. Therefore, a multi-frame structure consisting of four consecutive frames is adopted at 450bps. Table 1 shows the bit allocation for the 450bps speech coder. The bit allocation for the pitch and gain parameters is dynamic, according to the voicing pattern.

TABLE I. BIT ALLOCATION FOR THE 450BPS SPEECH CODING

Parameters      Bits
LSF             24
Pitch & Gain    17
Voicing          4
Total           45
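As a sanity check on Table 1, each superframe carries 45 bits every 100ms (four 25ms frames), which gives exactly 450bps; the snippet below simply restates the table.

```python
# Bit allocation of Table 1, per 100 ms superframe (four 25 ms frames)
BITS = {"LSF": 24, "Pitch & Gain": 17, "Voicing": 4}

superframe_bits = sum(BITS.values())         # 45 bits
superframe_seconds = 4 * 0.025               # 0.1 s
print(superframe_bits / superframe_seconds)  # 450.0 bps
```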

B. Voicing pattern quantization

The probability of the band-pass voicing vector patterns is analyzed and the sixteen most probable voicing patterns are selected. Table 2 shows the probabilities of these sixteen patterns. The voicing vector is quantized with a weighted Euclidean distance, and the weight vector is set according to the importance of the different sub-bands.

TABLE II. THE PROBABILITIES OF THE SIXTEEN MOST PROBABLE VOICING PATTERNS

Voicing Patterns                Probability
10000, 10000, 10000, 10000      18.36%
10000, 10000, 10000, 11111      11.46%
10000, 10000, 11111, 11111       1.88%
10000, 11000, 11111, 11111       1.02%
10000, 11111, 11111, 11111       2.76%
11000, 11111, 11111, 11111       1.74%
11100, 11111, 11111, 11111       1.41%
11110, 11111, 11111, 11111       1.61%
11111, 10000, 10000, 10000       1.62%
11111, 11111, 10000, 10000       2.56%
11111, 11111, 11000, 10000       1.15%
11111, 11111, 11111, 10000       3.60%
11111, 11111, 11111, 11000       2.09%
11111, 11111, 11111, 11100       1.53%
11111, 11111, 11111, 11110       1.38%
11111, 11111, 11111, 11111      11.58%
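The pattern selection can be sketched as a nearest-neighbour search over the sixteen retained patterns of Table 2 under a weighted Euclidean distance. The band weights below are illustrative assumptions, since the paper only states that the weights reflect the importance of the different sub-bands.

```python
import numpy as np

# Sixteen retained voicing patterns (Table 2): four frames x five bands per superframe
PATTERNS = [
    "10000 10000 10000 10000", "10000 10000 10000 11111",
    "10000 10000 11111 11111", "10000 11000 11111 11111",
    "10000 11111 11111 11111", "11000 11111 11111 11111",
    "11100 11111 11111 11111", "11110 11111 11111 11111",
    "11111 10000 10000 10000", "11111 11111 10000 10000",
    "11111 11111 11000 10000", "11111 11111 11111 10000",
    "11111 11111 11111 11000", "11111 11111 11111 11100",
    "11111 11111 11111 11110", "11111 11111 11111 11111",
]
CODEBOOK = np.array([[int(b) for b in p.replace(" ", "")] for p in PATTERNS])

# Illustrative weights: lower bands weighted more heavily (repeated for the four frames)
BAND_WEIGHTS = np.tile([2.0, 1.5, 1.0, 0.8, 0.5], 4)

def quantize_voicing(voicing_vector):
    """Return the 4-bit index of the nearest pattern under the weighted Euclidean distance."""
    v = np.asarray(voicing_vector, dtype=float)
    dists = ((CODEBOOK - v) ** 2 * BAND_WEIGHTS).sum(axis=1)
    return int(np.argmin(dists))

# example: fully voiced superframe except the highest band of the last frame
idx = quantize_voicing([1] * 15 + [1, 1, 1, 1, 0])
print(idx, PATTERNS[idx])   # -> 14 "11111 11111 11111 11110"
```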

C. Pitch and Gain quantization

The voicing patterns are categorized into six modes according to the number of voiced frames and the transitions between unvoiced and voiced frames. Different bit allocation and quantization schemes for gain and pitch are used for the different modes. Table 3 shows the dynamic bit allocation scheme.

TABLE III. DYNAMIC BIT ALLOCATION FOR GAIN AND PITCH

Voicing pattern                 Pitch   Gain
UUUU                              0      12
UUUV, VUUU                        7      10
UUVV, VVUU                        8       9
VUUV, UVVU                        8       9
UVVV, VVVU, VVUV, VUVV            9       8
VVVV                              9       8

The pitch is quantized only for voiced frames. For the UUUU voicing pattern, no pitch information is transmitted. For the UUUV and VUUU voicing patterns, a scalar quantizer is applied to the pitch of the single voiced frame. For patterns containing two or more voiced frames, the pitches of the voiced frames are vector quantized. The pitch is quantized in the logarithmic domain.
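The log-domain pitch quantization can be sketched as below, following the bit counts of Table 3: a scalar quantizer when a single frame is voiced and a vector quantizer over the voiced-frame pitches otherwise. The uniform scalar levels, the assumed pitch-lag range, and the random VQ codebook are placeholders for the coder's trained codebooks.

```python
import numpy as np

PITCH_MIN, PITCH_MAX = 20.0, 160.0                 # assumed pitch-lag range in samples
SQ_BITS, VQ_BITS = 7, 9                            # bits for UUUV/VUUU and VVVV (Table 3)
SQ_LEVELS = np.linspace(np.log(PITCH_MIN), np.log(PITCH_MAX), 2 ** SQ_BITS)

def quantize_pitch_scalar(pitch):
    """Scalar-quantize a single voiced frame's pitch in the log domain."""
    idx = int(np.argmin(np.abs(SQ_LEVELS - np.log(pitch))))
    return idx, float(np.exp(SQ_LEVELS[idx]))

def quantize_pitch_vq(pitches, codebook):
    """Vector-quantize the log pitches of the voiced frames (two or more voiced frames)."""
    target = np.log(np.asarray(pitches, dtype=float))
    idx = int(np.argmin(((codebook - target) ** 2).sum(axis=1)))
    return idx, np.exp(codebook[idx])

# toy 2^9-entry codebook for a fully voiced superframe (VVVV mode)
rng = np.random.default_rng(0)
toy_codebook = rng.uniform(np.log(PITCH_MIN), np.log(PITCH_MAX), (2 ** VQ_BITS, 4))
print(quantize_pitch_scalar(92.0))
print(quantize_pitch_vq([90.0, 92.0, 95.0, 97.0], toy_codebook))
```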

The gain values are transformed into the logarithmic domain before quantization. The logarithmic values are quantized with gain-shape vector quantization to prevent sensitivity to the speech input level. The average of the four gains is calculated and quantized with a scalar quantizer. After the average is subtracted from the gain values, the shape is vector quantized. More shape bits are allocated to nonstationary superframes. Gain-shape vector quantization greatly reduces the memory requirements.
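The gain-shape quantization can be sketched as follows: the average of the four log gains is scalar-quantized and the mean-removed shape is vector-quantized. The uniform mean levels and the random shape codebook below are placeholders for the coder's trained codebooks.

```python
import numpy as np

rng = np.random.default_rng(1)
MEAN_LEVELS = np.linspace(0.0, 80.0, 2 ** 5)          # placeholder scalar codebook (dB)
SHAPE_CODEBOOK = rng.normal(0.0, 6.0, (2 ** 5, 4))    # placeholder shape codebook (dB)

def quantize_gains(log_gains):
    """Gain-shape VQ of the four frame gains (already in the logarithmic domain)."""
    g = np.asarray(log_gains, dtype=float)
    mean_idx = int(np.argmin(np.abs(MEAN_LEVELS - g.mean())))     # quantize the average
    shape = g - g.mean()                                          # level-independent shape
    shape_idx = int(np.argmin(((SHAPE_CODEBOOK - shape) ** 2).sum(axis=1)))
    return mean_idx, shape_idx

def decode_gains(mean_idx, shape_idx):
    return MEAN_LEVELS[mean_idx] + SHAPE_CODEBOOK[shape_idx]

print(decode_gains(*quantize_gains([55.0, 57.0, 60.0, 52.0])))
```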

D. LSF quantization

The LPC coefficients are converted to line spectral frequencies (LSFs). The LSF vectors of a superframe are quantized with predictive multistage matrix quantization. First, the LSF matrices are classified into six modes based on the voicing information, and the average LSF matrix of the training set belonging to the current mode is subtracted from the LSF matrix to obtain a differential LSF matrix. Then the correlation between successive superframes is removed using linear prediction to obtain the residual LSF matrix. The prediction coefficients differ for different voicing pattern transitions between superframes. The residual LSF matrices are quantized with a three-stage codebook of 256, 256, and 256 levels (8 bits per stage, 24 bits in total). The same multistage codebook is shared by the different modes to reduce the memory requirements. The prediction coefficients and codebooks are designed with multi-mode predictive multistage matrix quantization [10].
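The LSF quantization described above can be sketched as a predictive multistage matrix quantizer with a greedy stage-by-stage search. The per-mode mean matrices, the single prediction coefficient of 0.5, and the random 256-entry stage codebooks are placeholder assumptions standing in for the trained, mode-dependent values of [10].

```python
import numpy as np

ORDER, FRAMES, MODES = 10, 4, 6
rng = np.random.default_rng(2)
MODE_MEAN = rng.uniform(0.2, 2.8, (MODES, FRAMES, ORDER))   # per-mode mean LSF matrices (rad)
PRED_COEF = np.full(MODES, 0.5)                             # per-mode prediction coefficients
STAGES = [rng.normal(0.0, 0.05, (256, FRAMES, ORDER))       # shared 3 x 8-bit codebook (24 bits)
          for _ in range(3)]

def quantize_lsf(lsf_matrix, prev_residual, mode):
    """Predictive multistage matrix quantization of one 4x10 LSF matrix."""
    diff = lsf_matrix - MODE_MEAN[mode]                     # remove the mode-dependent mean
    residual = diff - PRED_COEF[mode] * prev_residual       # remove inter-superframe prediction
    indices, approx = [], np.zeros_like(residual)
    for cb in STAGES:                                       # greedy search, one index per stage
        err = residual - approx
        idx = int(np.argmin(((cb - err) ** 2).sum(axis=(1, 2))))
        indices.append(idx)
        approx = approx + cb[idx]
    return indices, approx                                  # approx also updates prediction memory

# toy usage with a random "LSF" matrix and empty prediction memory
indices, q_res = quantize_lsf(rng.uniform(0.2, 2.8, (FRAMES, ORDER)),
                              np.zeros((FRAMES, ORDER)), mode=0)
print(indices)
```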

IV. SUBJECTIVE AND OBJECTIVE TEST RESULTS

For speech quality assessment, PESQ tests [11] are performed. In the PESQ test, six sentences from three male and three female speakers are used. Each sentence is about 10s long.

The PESQ test results of the proposed 450bps speech coder are given in Table 4. The results show that the average PESQ score of the proposed coder is more than 2.5. Informal listening tests confirm that the synthetic speech is intelligible with good naturalness at 450bps. Figure 3 shows plots of the original speech and the 450bps synthesized speech. Figure 4 shows the corresponding speech spectrograms. The results show that the formants are well preserved by the proposed speech coder.

TABLE IV. PESQ TEST RESULTS

Rate      PESQ
450bps    2.57

Figure 3. Plots of the original speech and the 450bps synthesized speech

V. CONCLUSION

We present a 450bps speech coder that makes use of a multi-frame structure and multi-mode matrix quantization to obtain good speech quality at a very low bit rate. A real-time implementation of the coding algorithm has been accomplished on a single TMS320VC5509a digital signal processor.

Figure 4. Speech spectrograms

REFERENCES

[1] T. Wang, K. Koishida, V. Cuperman, A. Gersho, and J. S. Collura, “A 1200/2400bps coding suite based on MELP”, Proc. IEEE Workshop on Speech Coding, Tsukuba, Japan, pp. 90-92, 2002.

[2] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. McCree, “MELP: the new Federal standard at 2400bps”, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Munich, Germany, pp. 1591-1594, 1997.

[3] G. Guilmin, P. Gournay, and F. Chartier, “Description of the French NATO candidate”, Proc. IEEE Workshop on Speech Coding, Tsukuba, Japan, pp. 84-86, 2002.

[4] S. Villette, K. T. Al Naimi, C. Sturt, A. M. Kondoz, and H. Palaz, “A 2.4/1.2kbps SB-LPC based speech coder: the Turkish NATO STANAG candidate”, Proc. IEEE Workshop on Speech Coding, Tsukuba, Japan, pp. 87-89, 2002.

[5] A. McCree, “A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters”, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Toulouse, France, pp. 705-708, 2006.

[6] A. McCree, K. Brady, and T. F. Quatieri, “Multisensor very low bit rate speech coding using segment quantization”, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Las Vegas, NV, pp. 3997-4000, 2008.

[7] J. W. Zhang, T. H. Huo, J. L. Li, H. J. Cui, and K. Tang, “High quality 0.6kb/s speech coding algorithm”, J. Tsinghua Univ. of Sci. & Tech. (in Chinese), vol. 43, no. 4, pp. 449-452, 2003.

[8] X. Zou and X. W. Zhang, “High quality 0.6/1.2/2.4kbps multi-band LPC speech coding algorithm”, IEE International Conference on Wireless, Mobile & Multimedia Networks, Hangzhou, China, pp. 1061-1064, 2006.

[9] M. W. Chamberlain, “A 600 bps MELP vocoder for use on HF channels”, Proc. IEEE Military Communications Conference, pp. 447-453, 2001.

[10] X. Zou and X. W. Zhang, “Efficient coding of LSF parameters using multi-mode predictive multistage matrix quantization”, Proc. IEEE International Conference on Signal Processing, Beijing, China, pp. 542-545, 2008.

[11] ITU-T Recommendation P.862, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, 2001.