
Speaker Identification for enhanced Vector Quantization-Gaussian Mixture Model

Ritu Sharma, M.E. 4th sem, Communication, S.S.C.E.T.Bhilai. [email protected] Guided By Mr. Piyush Lotia, H.O.D. Of Electronics & Instrumentation

ABSTRACT

Gaussian Mixture Models (GMM) are the most common approach to speaker identification because they can be used in a completely text-independent setting. Although the GMM is effective for speaker identification, it requires long processing times in practice. In this paper, we propose a decision function based on vector quantization (VQ) that reduces the amount of training data each GMM must model, and so reduces the processing time. Adaptation techniques such as a VQ decision function for GMM are expected to be successful for speech recognition tasks, and GMMs can currently be adapted to a wide variety of situations. VQ approaches, however, can be more effective when the training data are large, since they are computationally simpler than modelling classes separately and combining the separate models. An advantage of VQ is that the problem of segmenting speech into phonetic units is avoided; its disadvantage lies in the complexity of the codebook search during recognition. The speaker identification problem lies in managing and processing huge speaker data sets within a short time limit. With a huge database, identification errors often occur when a speaker is mistaken for another speaker of the same gender, for example when male speaker A is misidentified as another male speaker B. We address such errors by combining a decision tree function with VQ to separate out highly confusable speakers beforehand. To overcome these shortcomings, we introduce a new hybrid VQ decision/GMM model. In its baseline form the VQ-based solution is less accurate than the GMM, but it is computationally simpler.
In our proposed hybrid model, VQ is used as a decision tree that distinguishes male from female speakers and groups them into smaller subgroups. The GMM is then run only within the selected subgroup to obtain the identification result.

INTRODUCTION

Speech is the most basic means of human communication. As technology advances and increasingly sophisticated tools become available for working with speech signals, they can be applied for the benefit of humankind. The speech signal conveys several levels of information. Primarily it conveys the words or message being spoken, but on a secondary level it also conveys information about the identity of the talker. The area of speaker recognition is concerned with extracting the identity of the person speaking the utterance. Statistical methods such as the GMM are the dominant scheme for speaker recognition, which includes both speaker identification and speaker verification. Here we use the GMM for speaker identification, where the goal is to determine which one of a group of known voices best matches the input voice sample. We will show that, in a text-independent speaker identification task, our hybrid method gives a significant reduction in processing time over the standard GMM approach. The rest of this paper presents the phases of speaker identification. In the first phase we enrol the voice samples and extract speech features, parameterizing the speech by cepstral analysis such as Mel-frequency cepstral analysis. We then train on and test the speaker utterances to identify the matching voice sample.

METHODOLOGY

A Gaussian mixture model is a weighted sum of M component Gaussian densities: each mixture component m is a unimodal Gaussian density with its own mean vector and covariance matrix, scaled by a mixture weight, and the mixture weights are non-negative and sum to one.

For our system, the GMM for each speaker was trained as a single-state HMM (Hidden Markov Model) with a Gaussian mixture observation density. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities. The covariance matrices, C_m, can be full rank or constrained to be diagonal. Additionally, parameters can be shared, or tied, among the Gaussian components, such as having a common covariance matrix for all components. The choice of model configuration (number of components, full or diagonal covariance matrices, and parameter tying) is often determined by the amount of data available for estimating the GMM parameters and by how the GMM is used in a particular biometric application. It is also important to note that, because the component Gaussians act together to model the overall feature densities, full covariance matrices are not necessary even if the features are not statistically independent. A linear combination of diagonal-covariance basis Gaussians is capable of modelling the correlations between feature vector elements. The effect of using a set of M full-covariance Gaussians can equally be obtained by using a larger set of diagonal-covariance Gaussians.
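The weighted-sum density described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: it assumes diagonal covariance matrices, strictly positive mixture weights, and statistically independent frames, and the function names (`log_gaussian_diag`, `gmm_log_density`, `gmm_log_likelihood`) are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of one diagonal-covariance Gaussian density evaluated at vector x."""
    n = len(x)  # dimension of the feature vector
    log_det = sum(math.log(v) for v in var)  # log-determinant of a diagonal covariance
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis)

def gmm_log_density(x, weights, means, variances):
    """Log of the weighted sum of the M component densities (log-sum-exp for stability)."""
    # Assumption: every weight is > 0, otherwise math.log(w) would fail.
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of a sequence of feature vectors, assuming independent frames."""
    return sum(gmm_log_density(x, weights, means, variances) for x in frames)
```

Working in the log domain avoids numerical underflow when the per-frame densities of a long utterance are multiplied together.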

Speaker identification via likelihood ratio detection - Given a segment of speech O and a hypothesized speaker S, the task is to determine whether O was spoken by S. This single-speaker detection task can be stated as a basic test between two hypotheses:

H0: O is from the hypothesized speaker S.
H1: O is not from the hypothesized speaker S.

Fig. 1. Likelihood ratio-based speaker detection system

The optimum test to decide between these two hypotheses is a likelihood ratio (LR) test: the ratio of the likelihood of H0 to the likelihood of H1 is compared with a decision threshold θ, and H0 is accepted if the ratio is at least θ, otherwise H1 is accepted.

Here p(O|H0) is the probability density function for the hypothesis H0 evaluated for the observed speech segment O, also referred to as the likelihood of hypothesis H0 given the speech segment. Similarly, p(O|H1) is the probability density function for the hypothesis H1 evaluated for the observed speech segment O, also referred to as the likelihood of hypothesis H1 given the speech segment.
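As a minimal sketch, the likelihood ratio decision can be written in the log domain, where the ratio becomes a difference of log-likelihoods. The function names and the default threshold below are assumptions made for illustration; the paper does not specify a threshold value.

```python
import math

def log_likelihood_ratio(log_p_h0, log_p_h1):
    """Log of p(O|H0) / p(O|H1), computed from the two log-likelihoods."""
    return log_p_h0 - log_p_h1

def decide(log_p_h0, log_p_h1, theta=1.0):
    """Return "H0" if the likelihood ratio is at least theta, otherwise "H1"."""
    # theta is the decision threshold on the ratio; theta = 1.0 is an illustrative default.
    if log_likelihood_ratio(log_p_h0, log_p_h1) >= math.log(theta):
        return "H0"
    return "H1"
```

For example, `decide(-100.0, -120.0)` accepts H0, because the segment is more likely under the hypothesized speaker's model than under the alternative.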

IMPLEMENTATION

1.) Vector quantization decision function for GMM - A speaker identification system involves two main stages, the enrolment stage and the verification stage. Each stage involves two main parts:

Feature Extraction.

Pattern Classification.

In our implementation, we use the MFCC technique to extract the speech features in order to obtain the best result for pattern classification. For the pattern classification part, we present a new model that applies a VQ decision function to the GMM approach.

2.) Baseline Vector Quantization Speaker Identification - Vector Quantization (VQ) is a pattern classification technique applied to speech data to form a representative set of features. It maps vectors to smaller regions called clusters. The centres of these clusters, the centroids, are collected to make up a codebook, and speaker identification relies on the codebook to identify a speaker. In the VQ training phase, vector quantization is executed with the MFCC features as input. The speaker identification engine then runs a nearest-neighbour search to find the codeword in the current codebook that is closest to each vector and assigns the vector to the corresponding cell. It then finds and updates the centroids for each speech signal, and the codebooks are created. In the testing phase, a function computes the Euclidean distance between the training data and the testing data. The system identifies which comparison yields the lowest value and checks this value against a threshold. If the value is lower than the threshold, the system outputs an answer. The figure below shows the speaker identification flow for VQ in the training and testing phases.

TRAINING PHASE: Compute MFCC -> Execute VQ -> Compute nearest neighbour -> Find centroids & create codebook
TESTING PHASE: Compute MFCC -> Compute nearest neighbour -> Find minimum distance -> Decision

VQ process flow
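The training and testing flow above can be sketched as follows. This is a toy illustration rather than the authors' code: it assumes the MFCC vectors are already computed, uses a simple Lloyd-style update with a deterministic initial codebook, and scores a test utterance by its average Euclidean distance to the nearest codeword. All names, the toy vectors, and the threshold values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vector, codebook):
    """Index of the codeword closest to vector (nearest-neighbour search)."""
    return min(range(len(codebook)), key=lambda k: euclidean(vector, codebook[k]))

def train_codebook(vectors, size, iterations=20):
    """Assign vectors to cells, then move each codeword to the centroid of its cell."""
    step = max(1, len(vectors) // size)
    codebook = [list(v) for v in vectors[::step][:size]]  # deterministic initial codewords
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(euclidean(v, codebook[nearest(v, codebook)]) for v in vectors) / len(vectors)

def identify(test_vectors, codebooks, threshold):
    """Speaker with the lowest distortion, or None if that value is not below the threshold."""
    scores = {name: average_distortion(test_vectors, cb) for name, cb in codebooks.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] < threshold else None

# Toy 2-D vectors standing in for MFCC features of two enrolled speakers.
codebooks = {
    "A": train_codebook([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]], size=2),
    "B": train_codebook([[10.0, 10.0], [10.2, 10.0], [11.0, 11.0], [11.2, 11.0]], size=2),
}
speaker = identify([[0.1, 0.1], [1.0, 0.9]], codebooks, threshold=1.0)
```

Only the centroids are kept after training, so each comparison at test time touches a handful of codewords instead of every training vector.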

3.) Vector quantization decision function for Gaussian Mixture Modelling - The decision tree is one of the most popular classification algorithms currently used in data mining and machine learning. In speaker identification decision analysis, a decision tree can be used to represent decisions and decision making visually and explicitly. Its advantages are that it is robust and that it performs well on large data in a short time. For VQ, the primary factor is the codebook size; an experiment indicates that the optimum size does not depend on the amount of training data. Once a codebook is generated, only the centroids remain, each representing a whole cluster. The amount of data is significantly smaller, since the number of centroids is at least ten times smaller than the number of vectors in the original sample. This reduces the amount of computation needed for comparisons in later stages. The VQ-based solution is, however, less accurate than the GMM. In our proposed hybrid model, we exploit the strength of VQ, its computational simplicity, to distinguish between male and female speakers. In addition, we combine the decision tree function with VQ classification in order to fix identification errors in a huge database; this novel approach is used to separate out the highly confusable speakers within the same gender group beforehand. We then make use of the merits of the GMM to identify the speaker within the smaller subgroup. The figure below shows the speaker identification system based on the vector quantization decision function for Gaussian mixture modelling. After MFCC feature extraction, the speech signal is transformed into feature vectors. In phase 1 of the classification, the VQ classifier clusters the speaker models into two subgroups, subgroup A and subgroup B. In phase 2 of the classification, the decision tree function can be used to separate speaker models with similar scores into four different groups.

MFCC feature extraction
-> VQ (classification phase 1)
-> Subgroup A (male) / Subgroup B (female)
-> GMM / GMM (classification phase 2)
-> Result (score) / Result (score)

Speaker identification system based on vector quantization decision function for Gaussian mixture modelling

This process aims to solve the problem of similar speakers in order to improve the accuracy rate when the application faces a huge database. In our experiment, however, the database is small, so this step is not required. That is why, in phase 2 of the classification, we rely on the GMM to obtain the accuracy rates. The GMM is applied only within the selected subgroup to identify the speaker. The GMM classification engine calculates the log-likelihood score for the subgroup's training speaker data and saves it in a speaker model. In the testing phase, the testing speaker is compared with the training speakers, and the GMM classification engine makes its decision by maximum a posteriori probability. Because the GMM only needs to be trained on the speaker data in the subgroup rather than on all speaker data, the computation time decreases.
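The two-phase decision can be sketched as below. It is a hedged illustration only: it assumes pre-trained subgroup codebooks and per-speaker diagonal-covariance GMMs stored as (weight, mean, variance) triples, and it assumes equal prior probabilities for all speakers, so that the maximum a posteriori decision reduces to picking the highest log-likelihood. The data layout, names and toy one-dimensional models are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vq_distortion(frames, codebook):
    """Average squared distance from each frame to its nearest codeword."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in frames) / len(frames)

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood under a diagonal GMM given as (weight, mean, var) triples."""
    total = 0.0
    for f in frames:
        terms = []
        for w, mean, var in gmm:
            log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
            expo = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(f, mean, var))
            terms.append(math.log(w) + log_norm + expo)
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify_hybrid(frames, subgroup_codebooks, subgroup_gmms):
    """Phase 1 picks a subgroup by VQ; phase 2 scores only that subgroup's GMMs."""
    subgroup = min(subgroup_codebooks,
                   key=lambda g: vq_distortion(frames, subgroup_codebooks[g]))
    scores = {spk: gmm_log_likelihood(frames, gmm)
              for spk, gmm in subgroup_gmms[subgroup].items()}
    # Equal priors assumed, so the MAP speaker is the one with the highest likelihood.
    return subgroup, max(scores, key=scores.get)

# Toy 1-D models: two subgroups with two enrolled speakers each.
subgroup_codebooks = {"A": [[0.0], [1.0]], "B": [[5.0], [6.0]]}
subgroup_gmms = {
    "A": {"spk1": [(1.0, [0.0], [1.0])], "spk2": [(1.0, [1.0], [1.0])]},
    "B": {"spk3": [(1.0, [5.0], [1.0])], "spk4": [(1.0, [6.0], [1.0])]},
}
result = identify_hybrid([[0.9], [1.1], [1.2]], subgroup_codebooks, subgroup_gmms)
```

With two equally sized subgroups, phase 2 evaluates roughly half of the speaker models, which is where the saving in computation comes from.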

RESULT

The processing times for 10 speakers using the baseline GMM and the hybrid VQ/GMM are shown in Table 1. The baseline GMM needs 60.5 seconds for the whole training and testing process, whereas our hybrid VQ/GMM needs only 48.55 seconds. Our implementation can thus be regarded as a simplified version of the classification techniques used in speaker identification systems. This is a reduction in identification time of about 20% compared with the baseline system. The results indicate that hybrid modelling improves the performance of the speaker identification system. Moreover, the speed of verification is significantly increased because the number of features is reduced by over 50%, which consequently decreases the complexity of our identification system.

Table 1. Comparison of processing time

ALGORITHM     GMM     VQ-GMM
Time (sec)    60.5    48.55
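The relative saving implied by the two reported times can be checked with a short calculation, using only the figures given in the table:

```latex
\[
\frac{60.5 - 48.55}{60.5} = \frac{11.95}{60.5} \approx 0.1975
\]
```

That is a reduction of roughly 19.8% in total training and testing time for the 10-speaker experiment.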

CONCLUSION

A new, hybrid, robust and computationally simple pattern classification technique for speaker identification systems is proposed. We observe that combining VQ, decision tree and GMM in a hybrid method is a good approach because they classify data in different ways. The proposed method is intended to improve the computation, the approximation quality and the accuracy of the speaker identification system. Future work will concentrate on investigating the effectiveness of the hybrid VQ decision/GMM for more robust speaker recognition. A new algorithm for accelerating GMM-based text-independent speaker identification systems is found superior to the earlier proposed optimized GMM only for applications characterized by short test utterances and matching train-test conditions. The experiments confirmed the enhancement resulting from the proposed optimization algorithm. The optimized GMM is expected to be very effective for the speaker identification task. The GMM maintains high identification performance with increasing population size. The advantages of using a GMM as a likelihood function are that it is computationally inexpensive and is based on a well-understood statistical model.

FUTURE SCOPE

Speaker recognition uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioural patterns (e.g., voice pitch, speaking style). This incorporation of learned patterns into the voice templates (the latter called "voiceprints") has earned speaker recognition its classification as a "behavioural biometric." Speaker recognition systems employ three styles of spoken input: text-dependent, text-prompted and text-independent. Most speaker verification applications use text-dependent input, which involves selection and enrolment of one or more voice passwords. Text-prompted input is used whenever there is concern about imposters. The various technologies used to process and store voiceprints include hidden Markov models, pattern matching algorithms, neural networks, matrix representation and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models and world models. Ambient noise levels can impede collection of both the initial and subsequent voice samples. Performance degradation can result from changes in behavioural attributes of the voice and from enrolment using one telephone and verification on another. Voice changes due to aging also need to be addressed by recognition systems. Many companies market speaker recognition engines, often as part of large voice processing, control and switching systems. Capture of the biometric is seen as non-invasive. The technology needs little additional hardware because it uses existing microphones and voice-transmission technology, allowing recognition over long distances via ordinary telephones (wire line or wireless). The MFCC-based SI system with VQ modelling has very good identification accuracy and is robust against noise.
After analyzing the results of both experiments, it is also concluded that the sampling frequency of the speech and the number of vectors in the VQ codebook greatly improve the identification accuracy. In future, a Multiple Classifier System (MCS), having more than one classifier, will be designed to further improve the identification accuracy. The Hidden Markov Model will be used as a classification technique. Finally, a suitable combination technique will be required to reach a consensus by combining the individual opinions of each classifier. In summary, we have selected a few of the most influential techniques that have been proven to work in practice in independent studies, or that have shown significant promise in the past few NIST technology evaluation benchmarks:

- Universal background modeling (UBM)

- Score normalization, calibration, fusion

- Sequence kernel SVMs

- Use of prosodics and high-level features with SVM

- Phonetic normalization using ASR

- Explicit session variability modeling and compensation

To transfer the technology into practice, it will therefore be important in future to make the methods less sensitive to the selection of the data sets. The methods also require computational simplification before they can be used in real-world applications such as smart cards or mobile phones. Finally, the current techniques require several minutes of training and test data to give satisfactory performance, which presents a challenge for applications where a real-time decision is desired. We should also address human-related error sources, such as the effects of emotions, vocal organ illness, aging, and level of attention. Further investigation is needed, especially into how temporal and prosodic features can capture high-level phenomena (robust) without using a computationally intensive speech recognizer (practical). It remains a great challenge in the near future to understand exactly what features to look for in speech.


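Gaussian mixture density (METHODOLOGY):

$$p(x_n) = \sum_{m=1}^{M} c_m \, b_m(x_n)$$

$$b_m(x_n) = \frac{1}{(2\pi)^{N/2} \, |C_m|^{1/2}} \exp\left\{ -\frac{1}{2} (x_n - \mu_m)^T C_m^{-1} (x_n - \mu_m) \right\}$$

where $c_m$ is the weight of mixture m, $b_m$ is a unimodal Gaussian density, $N$ is the dimension of the feature vector, $\mu_m$ is the mean vector of mixture m, and $C_m$ is the covariance matrix of mixture m. The mixture weights further satisfy the constraints

$$\sum_{m=1}^{M} c_m = 1, \qquad c_m \ge 0.$$

Likelihood ratio test (METHODOLOGY):

$$\frac{p(O \mid H_0)}{p(O \mid H_1)} \; \begin{cases} \ge \theta, & \text{accept } H_0 \\ < \theta, & \text{accept } H_1 \end{cases}$$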
