
Copyright: 978-1-4673-1711-5/12/$31.00 © 2012 IEEE

ICSES 2012 – International Conference on Signals and Electronic Systems, WROCLAW, POLAND, September 18 – 21, 2012

Evaluation of Time Domain Features for Voiced/Non-voiced Classification of Speech

F. Ykhlef
System Architecture and Multimedia Division, CDTA, Cite 20 Aout 1956, Baba Hassen, Algiers, Algeria, [email protected]

L. Bendaouia
Embedded System Laboratory, Cergy-Pontoise University, 6 rue du Ponceau, Cergy, France, [email protected]

Abstract— In this paper, we have performed an evaluation of several time domain features for voiced/non-voiced classification of speech signals. We have selected three features: the autocorrelation function (ACF), the average magnitude difference function (AMDF) and the weighted ACF (WACF), to form three different classifiers. Experiments were conducted on the TIMIT database in clean and noisy environments. White noise extracted from the NOISEX92 database was incorporated to validate the developed classifiers. We have established an overall ranking of these classifiers based on the average value of the percentage of classification accuracy (Pc).

I. INTRODUCTION

Accurate and reliable voiced/non-voiced classification of speech is a crucial preprocessing step in many speech processing applications and is essential in most analysis and synthesis systems. The goal of classification is to determine whether the speech production system involves vibration of the vocal folds or not. For example, voicing determination is vital in the pitch detection problem: the accuracy of this determination can significantly improve the performance of a pitch detector [1].

Voiced speech consists of periodic or quasi-periodic sounds produced when there is significant glottal activity (vibration of the vocal folds). Unvoiced speech is a non-periodic, random excitation sound caused by air passing through a narrow constriction of the vocal tract. Unvoiced sounds include the main classes of consonants, namely voiceless fricatives, occlusives and stops. When both quasi-periodic and random excitations are present simultaneously (so-called mixed excitation, as in voiced fricatives), the speech is classified as voiced because the vibration of the vocal folds is part of the speech act. In other contexts, mixed excitation could be treated as a separate class [2]. The non-voiced region includes both silence and unvoiced speech [3].

A variety of classifiers for robust voiced/non-voiced classification have been reported in the literature [1], [2], [4]. The majority of them use hybrid approaches for the voicing decision which combine time domain and frequency domain acoustical features.

Several features have been used in the literature, among them the energy of the signal (E), the zero-crossing rate (ZCR), the autocorrelation function (ACF), the average magnitude difference function (AMDF), the weighted ACF (WACF), the cepstral function (CEP), discrete wavelet transform (DWT) coefficients, the first coefficient of a pth-order linear prediction analysis and the harmonic measure [3], [4]. Evidence from multiple features can be combined using statistical models such as neural networks, Gaussian mixture models or hidden Markov models [4]. This combination can yield significantly stronger classifiers, depending on the number of features incorporated in the model. On the other hand, hardware implementation and real-time applications call for reducing the number of features in order to decrease the computational complexity. The performance of these classifiers, in terms of percentage of classification accuracy (Pc) and especially in noisy environments, depends on the choice of a suitable feature.

The ACF, AMDF and WACF are widely used as input attributes to such classification models. They are preferred for their low computational cost and precise estimation. Furthermore, these features are commonly used in the development of pitch detection techniques since they provide significant instants of pitch periods; thus, they can be used for both classification and pitch detection. In our study, we focus on their use for speech classification. They are also time-based functions, which is generally a preferable characteristic in real-time applications. The purpose of this paper is to study separately the performance of each feature for voiced/non-voiced classification in clean and noisy environments and to establish an overall ranking of these features. To this end, we have developed three classification schemes, each of which uses only one feature without the need for pre- or post-processing stages.

We have used manually segmented speech signals from the TIMIT database to measure the success of the classification into voiced and non-voiced frames. The performance of the developed classifiers is evaluated using additive white noise extracted from the NOISEX92 database, with input signal-to-noise ratios (SNRs) ranging from 30 dB down to −5 dB.
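To make this setup concrete, the following Python sketch illustrates one common way of mixing noise into a clean signal at a target SNR. It is not the authors' code: the paper draws its noise from NOISEX92, whereas this sketch generates white Gaussian noise, and the whole-utterance power-matching rule is our assumption.

```python
import numpy as np

def add_white_noise(clean, snr_db, rng=None):
    """Add white Gaussian noise to `clean` at the target SNR in dB.

    Assumption: the SNR is defined over the whole utterance; the paper
    uses white noise taken from NOISEX92 rather than generated noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale so that p_signal / (scale**2 * p_noise) = 10**(snr_db / 10).
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```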

Our paper is organized as follows. In Section II, the detailed implementations of the three classifiers are described. In Section III, the evaluation and comparison of the features for voiced/non-voiced classification are given. Section IV concludes and outlines future perspectives.

[Fig. 1: Block diagram of the correlation classifiers (i = 1: ACF, i = 3: WACF). Speech frame → LPF (0–700 Hz) → compute $\phi_i(\tau)$, $\tau = n_1 : n_2$ → find maximum peak $\beta_i$ → if $\beta_i \ge \beta_{0i}$: voiced, otherwise: non-voiced.]

II. VOICED/NON-VOICED CLASSIFIERS

In this section, detailed explanations of the three distinct classifiers are provided. Each classification scheme is based on a single acoustical feature: the ACF, the AMDF or the WACF.

A. ACF

The first classifier is based on the ACF. Fig. 1 (i = 1) shows a block diagram of the classification scheme. It is based on frame-by-frame processing of the speech signal using a stationary rectangular window of 22.5 ms duration. The speech signal has a sampling rate (Fs) of 16 kS/s. Each frame x(n) is low-pass filtered to 700 Hz (a 20-point linear-phase finite impulse response low-pass filter, LPF) in order to eliminate the formant structure of the speech signal. A sketch of this front end is given below.
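As a minimal, non-authoritative illustration of this front end, the following Python sketch splits a 16 kS/s signal into 360-sample rectangular frames (22.5 ms) and applies a 20-tap FIR low-pass filter at 700 Hz. The filter design method (a windowed-sinc design via scipy.signal.firwin) is our assumption, since the paper does not specify it.

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 16000          # sampling rate (samples/s)
FRAME_LEN = 360     # 22.5 ms at 16 kS/s
LPF_TAPS = 20       # 20-point linear-phase FIR, as stated in the paper
LPF_CUTOFF = 700.0  # cutoff frequency in Hz

def split_frames(x, frame_len=FRAME_LEN):
    """Split x into consecutive non-overlapping rectangular frames."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def low_pass(frame, fs=FS):
    """20-tap linear-phase FIR low-pass at 700 Hz (design method assumed)."""
    b = firwin(LPF_TAPS, LPF_CUTOFF, fs=fs)
    return lfilter(b, [1.0], frame)
```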

The first stage of the processing is the computation of the ACF, given by (1):

$$\phi_1(\tau) = \frac{1}{N} \sum_{n=0}^{N-\tau-1} x(n)\, x(n+\tau), \qquad \tau = n_1 : n_2 \qquad (1)$$

where N is the length of the frame, equal to 360 samples; $\tau$ is the lag number; and $n_1$ and $n_2$ define the range of the ACF computation, corresponding to the frequency band of the fundamental frequency (namely 70 to 600 Hz). They are set to 26 and 200 samples respectively for a sampling rate of 16 kS/s. The characteristic of $\phi_1(\tau)$ is that it exhibits a large peak for voiced frames, which decreases for non-voiced frames.

The second stage of the processing is the extraction of the largest peak of the ACF, denoted $\beta_1$. In the third stage, this peak is compared to a fixed threshold $\beta_{01}$: if $\beta_1$ is greater than $\beta_{01}$, the frame is classified as voiced; otherwise, it is classified as non-voiced.
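A compact sketch of this decision rule, assuming the frame has already been low-pass filtered as above; the threshold value is utterance-dependent (its selection is described in Section III):

```python
import numpy as np

N1, N2 = 26, 200  # lag range covering F0 from roughly 70 to 600 Hz at 16 kS/s

def acf(frame, n1=N1, n2=N2):
    """Non-normalized ACF phi_1(tau) of (1), evaluated for lags n1..n2."""
    N = len(frame)
    return np.array([np.dot(frame[:N - tau], frame[tau:]) / N
                     for tau in range(n1, n2 + 1)])

def classify_acf(frame, beta01):
    """Voiced if the largest ACF peak beta_1 reaches the threshold beta_01."""
    beta1 = acf(frame).max()
    return "voiced" if beta1 >= beta01 else "non-voiced"
```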

[Fig. 2: Block diagram of the AMDF classifier. Speech frame → LPF (0–700 Hz) → compute $\phi_2(\tau)$, $\tau = n_1 : n_2$ → find global minimum valley $\beta_2$ → if $\beta_2 \le \beta_{02}$: voiced, otherwise: non-voiced.]

It should be noted that it is possible to use a normalized version of the autocorrelation function to obtain the voicing decision. In that case, a silence detector would have to be added to eliminate the wrong decisions (silence frames classified as voiced) that could otherwise occur, as presented in [5].

In this study, we have preferred the non-normalized version of the ACF in order to reduce the number of features in the classification (no silence detector is needed for voiced/non-voiced classification).

B. AMDF

Fig. 2 shows a block diagram of the classification based on the AMDF. The classifier follows the same steps as the previous one, except that the ACF is replaced by the AMDF, given by (2).

$$\phi_2(\tau) = \frac{1}{N} \sum_{n=0}^{N-\tau-1} \left| x(n) - x(n+\tau) \right|, \qquad \tau = n_1 : n_2 \qquad (2)$$

The values of the variables N, $n_1$ and $n_2$ are kept the same as in the ACF classifier.

The AMDF is a variation of the ACF: instead of correlating the input speech at various delays, where multiplications and summations are performed at each lag value, a difference signal is formed between the delayed speech and the original, and at each delay value the absolute magnitude is taken. Unlike the ACF, however, the AMDF calculation requires no multiplications, a desirable property for hardware and real-time applications [6]. The characteristic of $\phi_2(\tau)$ is that it exhibits several valleys appearing periodically for voiced speech. The global minimum valley $\beta_2$ is used for the voicing decision: if it is lower than or equal to a fixed threshold $\beta_{02}$, the frame is classified as voiced; otherwise, it is classified as non-voiced. As with the ACF, the AMDF is the only feature used in the classification, and there is no need for a silence detector.
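A matching sketch of the AMDF classifier, again an illustration of (2) and the valley-based decision rather than the authors' implementation:

```python
import numpy as np

def amdf(frame, n1=26, n2=200):
    """AMDF phi_2(tau) of (2): a multiplication-free dissimilarity measure."""
    N = len(frame)
    return np.array([np.sum(np.abs(frame[:N - tau] - frame[tau:])) / N
                     for tau in range(n1, n2 + 1)])

def classify_amdf(frame, beta02):
    """Voiced if the global minimum valley beta_2 falls below beta_02."""
    beta2 = amdf(frame).min()
    return "voiced" if beta2 <= beta02 else "non-voiced"
```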

C. WACF

The WACF was first proposed by Shimamura [7] in 2001 for pitch detection. Exploiting the fact that the AMDF has characteristics similar to those of the ACF, the ACF is weighted by the reciprocal of the AMDF to form what is called the WACF.

In this paper, the WACF is used to classify the speech frames into voiced and non-voiced classes. The proposed classification scheme is shown in the block diagram of Fig. 1 (i = 3), where $\phi_3(\tau)$ denotes the WACF given by the following equation:

$$\phi_3(\tau) = \frac{\phi_1(\tau)}{\phi_2(\tau) + \kappa}, \qquad \tau = n_1 : n_2 \qquad (3)$$

where $\phi_1(\tau)$ is the ACF and $\phi_2(\tau)$ is the AMDF.

The values of the variables n1 and n2 are maintained the same as in the ACF classifier.

$\kappa$ is a fixed number used to avoid divergence of the directly inverted $\phi_2(\tau)$ at lag zero (because $\phi_2(0) = 0$). In our case, this constant is not of great importance because $\phi_3(\tau)$ is computed only between $n_1$ and $n_2$; in this study, it is set to 0.1. The voicing decision follows the same steps as in the ACF classifier: the maximum peak $\beta_3$ is this time compared to $\beta_{03}$.
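Combining the two previous sketches gives an illustrative WACF classifier following (3); acf() and amdf() are the hypothetical helpers defined in the sketches above, and kappa = 0.1 as stated in the text:

```python
KAPPA = 0.1  # avoids division blow-up at small AMDF; not critical on lags n1..n2

def wacf(frame, n1=26, n2=200, kappa=KAPPA):
    """WACF phi_3(tau) of (3): the ACF weighted by the reciprocal AMDF."""
    return acf(frame, n1, n2) / (amdf(frame, n1, n2) + kappa)

def classify_wacf(frame, beta03):
    """Voiced if the largest WACF peak beta_3 reaches beta_03."""
    beta3 = wacf(frame).max()
    return "voiced" if beta3 >= beta03 else "non-voiced"
```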

III. PERFORMANCE EVALUATION

A. Criteria of the test

The performance of the three classifiers was tested on a speech database which was hand labeled into voiced/non-voiced regions. Three measures were used:

• Voiced speech classified as non-voiced (VNV error);
• Non-voiced speech classified as voiced (NVV error);
• Percentage of classification accuracy (Pc):

$$P_c = 1 - (A \times \mathrm{VNV} + B \times \mathrm{NVV}) \qquad (4)$$

where A and B represent respectively the proportions of voiced and non-voiced frames in the speech utterance.

The two types of error listed above occur during the initial classification of speech into voiced and non-voiced regions, where non-voiced speech is misclassified as voiced or voiced speech is misclassified as non-voiced.
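For concreteness, a small helper computing (4) from per-frame decisions against a reference labeling is sketched below; the label convention ('V'/'N') and the function name are ours, and A = B = 0.5 corresponds to the balanced utterances described next:

```python
def classification_accuracy(pred, ref, a=0.5, b=0.5):
    """Pc of (4) from per-frame labels 'V' (voiced) and 'N' (non-voiced).

    VNV: fraction of voiced reference frames classified as non-voiced.
    NVV: fraction of non-voiced reference frames classified as voiced.
    a, b: proportions of voiced / non-voiced frames in the utterance.
    Returns Pc as a fraction; multiply by 100 for a percentage.
    """
    on_voiced = [p for p, r in zip(pred, ref) if r == "V"]
    on_nonvoiced = [p for p, r in zip(pred, ref) if r == "N"]
    vnv = on_voiced.count("N") / len(on_voiced)
    nvv = on_nonvoiced.count("V") / len(on_nonvoiced)
    return 1.0 - (a * vnv + b * nvv)
```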

B. Speech Data

The performance of the developed classifiers is evaluated on the TIMIT database [8]. The speech underwent extensive manual labeling before it could be used. The spoken material consisted of a set of 26 phonetically rich English sentences from the TIMIT database (13 female and 13 male speakers) covering several dialects and acoustic forms (weak voiced speech and fast voiced/non-voiced transitions). The percentage of voiced speech samples in each utterance is maintained at 50% (A and B equal to 0.5 in (4)) by appending the required duration of silence.

TABLE 1. PERFORMANCE OF VOICED/NON-VOICED CLASSIFICATION (%), WHITE NOISE
(each cell: Pc / VNV / NVV)

SNR    | ACF                  | AMDF                 | WACF
Clean  | 97.58 / 2.20 / 2.62  | 95.92 / 2.93 / 5.23  | 97.66 / 2.37 / 2.31
30 dB  | 97.11 / 3.28 / 2.49  | 95.85 / 2.89 / 5.39  | 97.62 / 2.74 / 2.01
20 dB  | 97.05 / 3.43 / 2.47  | 95.63 / 2.44 / 6.30  | 97.60 / 2.86 / 1.93
10 dB  | 96.52 / 3.60 / 3.34  | 83.90 / 0.28 / 31.91 | 96.57 / 5.25 / 1.60
5 dB   | 93.97 / 3.24 / 8.81  | 62.84 / 0.04 / 74.28 | 95.28 / 6.21 / 3.21
0 dB   | 63.74 / 0.21 / 72.31 | 50.00 / 0.00 / 100.0 | 87.29 / 7.71 / 17.69
-5 dB  | 50.75 / 0.04 / 98.45 | 50.00 / 0.00 / 100.0 | 67.08 / 4.55 / 61.28

Two experienced persons performed the manual classification of the spoken material. The original time waveform was used as the primary tool, with spectrograms used only in the few cases where the waveform was insufficient to make a decision.

C. Performance of the classifiers

The three classifiers developed in this paper are threshold-based classifiers. The performance of these classification schemes is directly related to the optimal choice of the thresholds, which are:

• $\beta_{01}$ for the ACF;
• $\beta_{02}$ for the AMDF;
• $\beta_{03}$ for the WACF.

The main purpose of this study is to assess each time domain feature for voicing classification of English speech using an optimal decision level. Consequently, we need to find an optimal threshold for each feature in order to evaluate the global performance of the classifiers. A practical approach is to seek a value that gives the optimal classification of each utterance in a clean environment and then to reuse it in the noisy environment. Once the peaks (or valleys) $\beta_1$, $\beta_2$ and $\beta_3$ are computed for each frame of the utterance, the thresholds $\beta_{01}$, $\beta_{02}$ and $\beta_{03}$ are respectively found by computing the median value of these peaks (or valleys). The thresholds are updated for every utterance in the speech database. The performance of the three classifiers is reported in Table 1 for clean and noisy speech. White noise from the NOISEX92 database was incorporated in order to obtain the noisy speech.
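A sketch of this per-utterance threshold rule, under the same illustrative assumptions as before; feature_stat is whichever frame statistic the classifier uses (e.g. the hypothetical acf(f).max() or amdf(f).min() helpers above):

```python
import numpy as np

def utterance_threshold(utterance_frames, feature_stat):
    """Per-utterance threshold: the median of the per-frame statistic.

    feature_stat maps one frame to its beta value, e.g.
    lambda f: acf(f).max() for the ACF/WACF classifiers, or
    lambda f: amdf(f).min() for the AMDF classifier.
    """
    stats = [feature_stat(f) for f in utterance_frames]
    return float(np.median(stats))
```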

All classifiers achieve a good Pc in the clean environment, which degrades as the SNR of the added noise decreases. The WACF has the best performance and ranks first at several SNRs.

The observed degradation of all classifiers is essentially due to the NVV error, which increases as the SNR decreases.

The ACF classifier ranks second, with a large NVV error at an SNR of −5 dB. Finally, the AMDF ranks third, with a 100% NVV error at an SNR of 0 dB.

IV. CONCLUSION

This paper reported the results of a performance evaluation of three voiced/non-voiced classification schemes, each of which uses only one time domain feature. Based on a variety of error measurements, the performance of the developed classifiers was assessed for different SNRs of white noise. It has been shown that the percentage of classification accuracy degrades as the SNR decreases.

The accuracy degradation of the classifiers is essentially due to the NVV error. The ranking of the classifiers was established based on the percentage of classification accuracy: the WACF classifier ranks first, followed respectively by the ACF and the AMDF.

Combining evidence from several features would improve the classification accuracy; however, the computational complexity would increase.

In future work, the performance of the studied features will be evaluated for other noise types. Furthermore, evidence from multiple features will be combined using a statistical model.

ACKNOWLEDGMENTS

The authors would like to thank D. Belabdi and Y. Boudjahfa, postgraduate English students at Batna University and the ENS school of Algiers respectively, for their helpful suggestions and language support.

REFERENCES

[1] S. Ahmadi and A. S. Spanias, "Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 333-338, 1999.

[2] Y. Qi and B. R. Hunt, "Voiced-Unvoiced-Silence Classification of Speech Using Hybrid Features and a Network Classifier," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 250-255, 1993.

[3] N. Dhananjaya and B. Yegnanarayana, "Voiced/Nonvoiced Detection Based on Robustness of Voiced Epochs," IEEE Signal Processing Letters, vol. 17, no. 3, pp. 273-276, 2010.

[4] B. S. Atal and L. R. Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 3, pp. 201-212, 1976.

[5] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and C. A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 399-418, 1976.

[6] Z. Yu-Min, W. Zhen-Yang, L. Hai-Bin and Z. Lin, "Modified AMDF Pitch Detection Algorithm," International Conference on Machine Learning and Cybernetics, vol. 1, pp. 470-473, Xian, 2003.

[7] T. Shimamura and H. Kobayashi, "Weighted Autocorrelation for Pitch Extraction of Noisy Speech," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 727-730, 2001.

[8] J. S. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus," Linguistic Data Consortium, Philadelphia, PA, 1993.