


Comparison of the Automatic Speaker Recognition Performance over Standard Features

Milan M. Dobrović*, Vlado D. Delić**, Nikša M. Jakovljević**, Ivan D. Jokić**

* Telekom Srbija/Function of Information Technology, Belgrade, Serbia
** University of Novi Sad/Faculty of Technical Sciences, Novi Sad, Serbia

[email protected], [email protected], [email protected], [email protected]

Abstract—This paper presents a study of speaker recognition accuracy depending on the choice of features, window width and model complexity. The standard features were considered: linear prediction coefficients (LPC), perceptual linear prediction coefficients (PLP) and mel-frequency cepstral coefficients (MFCC). The Gaussian mixture model (GMM), implemented with the HTK tools, was chosen for speaker modelling. The speech database S70W100s120, recorded at the Electrical Engineering Department of Belgrade University, was used for system training and testing. Ten speaker models and a universal background model (UBM) were trained.

I. INTRODUCTION

For people, speech is the most natural way to communicate. The characteristics of the human voice are unique for each person, so it is quite natural that people are recognized by their voice. Automatic speaker recognition is the field of digital signal processing concerned with machine recognition of people based on their voice. Depending on the purpose, speaker recognition can be divided into two categories: identification and verification. The task of speaker verification is the confirmation or rejection of a claimed identity based on the speaker's voice. Unlike speaker verification, speaker identification makes a non-binary decision: the system decides who the speaker is, which group he/she belongs to, or whether he/she is an unknown person [1].

Depending on whether the text pronounced by the speaker is known to the system, recognition can be text-dependent (the text is specified by the system or chosen by the speaker during the training process) or text-independent (the text is arbitrary and chosen by the speaker during the recognition process). This paper introduces a system for text-independent speaker recognition.

Section II presents the benefits of using voice as a biometric feature. Section III describes the speech database and the implemented speaker recognition system. The results are presented in Section IV, followed by the conclusion.

II. THE THEORETICAL BACKGROUND

A number of methods and characteristics used in biometric identification systems have been presented and investigated. Among the most popular biometric characteristics are fingerprints, face features and voice [2]. Each biometric feature has its advantages and drawbacks, and there are several reasons why the voice signal is used in biometrics [3]:

− There is no threat to privacy: people do not consider a request to speak a sequence of words to be an invasion of their privacy.

− There is a large number of applications in which speech is the main (if not the only) available signal, such as telephony.

− Data transfer is simple as a result of the widespread telephone network.

− The devices used for gathering the data are cheap and widely available. For applications related to the telephone network, there is no need to install special transmitters or networks at the access points, as mobile phones provide access almost everywhere. Even for applications that are not related to telephony, sound cards and microphones are cheap and easily available.

Human speech contains more information than the words themselves, such as information about the language, emotional state, gender and identity of the speaker. Although speech carries this wide range of information, people decode it easily. Which part of the speech signal is relevant depends on the specific purpose. For example, linguistic information is relevant if the goal is to identify the sequence of spoken words; the presence of irrelevant information (in that case, the environment and the identity of the speaker) can adversely affect system performance [1,3]. The characteristics essential for speaker identity change relatively slowly. Hence, feature extraction is a process of data compression that preserves the essential information about the speaker's identity [4]. The most commonly used features in automatic speaker recognition are LPC (Linear Prediction Coefficients), PLP (Perceptual Linear Prediction) and MFCC (Mel-Frequency Cepstral Coefficients), which are the standard features in speech recognition too, because the spectrum envelope (voice timbre) identifies speakers as well as phones.
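As an illustration of such feature extraction, the following minimal Python sketch computes MFCC and LPC features for one utterance. librosa is used here purely as a stand-in for the HTK feature extraction employed in the paper (librosa has no PLP implementation), and the file name is hypothetical.

```python
# Minimal feature-extraction sketch; librosa stands in for HTK's HCopy.
import librosa

y, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical file name

# MFCC: 13 coefficients, 25 ms analysis window, 10 ms shift
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# LPC: 10th-order linear prediction on a single 25 ms frame
# (librosa.lpc returns order + 1 coefficients, the first being 1.0)
frame = y[: int(0.025 * sr)]
lpc = librosa.lpc(frame, order=10)
```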

A. System description

Currently, the dominant approach to speaker modelling is a Gaussian mixture model (GMM) [5], where each speaker is modelled by a probability density function given by:

$$p(\mathbf{x}\mid\lambda)=\sum_{k=1}^{K}\frac{w_k}{(2\pi)^{n/2}\,\lvert\boldsymbol{\Sigma}_k\rvert^{1/2}}\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{T}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right]\qquad(1)$$



where x is an n-dimensional feature vector, $w_k$ is the weight of the k-th Gaussian component (such that $0 \le w_k \le 1$ and $\sum_{k=1}^{K} w_k = 1$), and $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the mean vector and the covariance matrix of the k-th Gaussian component. A speaker model is defined by the set of parameters λ = {w_k, μ_k, Σ_k, k = 1, 2, ..., K}. To reduce model complexity, the covariances of the Gaussian distributions are approximated by diagonal matrices.
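As a minimal sketch, Eq. (1) with diagonal covariances can be evaluated in the log domain as follows (the function name is illustrative; the paper itself uses HTK rather than custom code):

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log of Eq. (1) for one frame x (n,), given K diagonal Gaussians:
    weights (K,), means (K, n), variances (K, n) holding the diagonals."""
    n = x.shape[0]
    log_norm = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    m = np.max(log_comp)                      # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(log_comp - m)))
```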

B. UBM (Universal Background Model)

The implementation of a speaker recognition system often requires a trained UBM. The UBM is a large GMM trained to represent the speaker-independent distribution of features [6]. It is used as an alternative speaker model during the verification process, as well as in open-set speaker identification systems.

There are a number of different parameters involved in the UBM training process. They can be divided into two categories: a) algorithm parameters and b) data parameters [7]. The algorithm parameters govern the training process itself, such as the number of mixtures, training methods, the number of iterations, initialization methods, etc. The data parameters define the subset of data used for training: the speech database, the amount of data, the number of speakers, the amount of data per speaker, the method of speaker selection, the ways of using the feature vectors, and data balancing according to channel, microphone, language, or other variability.
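A minimal sketch of the data side of UBM training, assuming the per-speaker feature files already exist (the file names are hypothetical, and scikit-learn stands in for the HTK training actually used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Pool the frames of the 10 background speakers into one matrix
background = [np.load(f"bg_speaker_{i}.npy") for i in range(10)]  # each (T_i, n)
pooled = np.vstack(background)

# One large diagonal-covariance GMM over all pooled frames
ubm = GaussianMixture(n_components=32, covariance_type="diag",
                      max_iter=20, random_state=0).fit(pooled)
```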

III. SYSTEM IMPLEMENTATION

For the system training and testing, a part of the speech database S70W100s120 [8] was used. The database contains utterances of 120 speakers; every speaker pronounced 70 sentences and 100 isolated words. The database was originally recorded on tape, in the anechoic chamber of the Belgrade University Electrical Engineering Department. Later, within the AlfaNum project, the database was digitized with 16 bits per sample at a sampling frequency of 22050 Hz. From this speech database, 30 speakers were randomly selected. To achieve an adequate estimate of system performance with a relatively small number of speakers, the selected speakers were all of the same gender (male).

Ten speaker models, the UBM and a silence model were trained during the system training. Eleven utterances of spoken isolated words were used for training each speaker model. After silence removal (pauses between spoken words), approximately 40 seconds of training data per speaker remained for speaker model training. The UBM, which represents the collective identity of impostors, was trained on the utterances of another 10 speakers (a total of 110 utterances of isolated words). After silence removal, the amount of UBM training data was about 400 seconds. Since silence does not carry any information about the speaker identity, but can significantly degrade the actual speaker model characteristics, the training process started by labelling the data used for training the speaker models, the UBM and the silence model. A hundred utterances (5 from each of 20 speakers) were used in the system testing phase. The test set consisted of the utterances of the 10 speakers whose individual models were formed during the training process, as well as utterances of 10 speakers who did not participate in the system training. Thus, the training and testing data sets were disjoint. The average duration of the testing utterances was 3.5 s (approximately 1.8 s without pauses).

The HTK (Hidden Markov Model Toolkit) software tool, primarily designed for speech recognition systems based on hidden Markov models [9], was used for the system training and testing. Since this work is limited to speaker modelling by Gaussian mixture models, each HMM was restricted to a single emitting state.

The HTK tools HCopy, HCompV, HERest and HHEd were used for the system training. Fig. 1 shows the procedure of system training with HTK. Feature extraction was realized by the HCopy tool. In laboratory conditions, all audio files (from both the training and the test set) are available in advance, so feature extraction is usually performed at the beginning of the system training, as was the case in this work. The input consists of the speech database in waveform format and a configuration file with the parameters that govern the conversion of speech into parametric form. The output is the speech database in parametric form.
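A sketch of this step is shown below. The configuration keys are genuine HTK parameters (times are given in units of 100 ns, so 250000 is a 25 ms window and 100000 a 10 ms shift); the script file name is hypothetical.

```python
import subprocess

config = """SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0_D_A
WINDOWSIZE   = 250000.0
TARGETRATE   = 100000.0
USEHAMMING   = T
NUMCHANS     = 26
NUMCEPS      = 12
"""
with open("hcopy.conf", "w") as f:
    f.write(config)

# train.scp lists "source.wav target.mfc" pairs, one pair per line
subprocess.run(["HCopy", "-C", "hcopy.conf", "-S", "train.scp"], check=True)
```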

Initialization of the models was performed using the HTK tool HCompV. This tool loads the input HMM prototype and the training speech data, and outputs a new HMM whose mean and variance are set to the global mean and variance of the training data. In the subsequent steps, the global variance is used as a variance floor, which prevents some models from having variances that are too small due to an insufficient number of training observations.

Figure 1. HTK system training (flowchart: configuration and script files drive HCopy → HCompV → HERest → HHEd → HERest, producing the final HMM models)

Speaker model re-estimation was performed by the HERest tool. HERest simultaneously updates all the HMM models using the whole training speech database. In short, HERest works as follows: it loads all the HMM models, and every training file must have an associated label file which gives a transcription for that file. After loading a file into memory, HERest uses the associated transcription to construct a composite HMM which spans the whole utterance; this model is obtained by merging the individual HMMs corresponding to each label in the transcription. The Forward-Backward algorithm is then applied to the whole composite HMM. For re-estimation of the speaker models, HERest was run for five iterations.

The HHEd tool and an appropriate script file were used to increase the number of mixture components. Conversion of an HMM with one Gaussian component into an HMM with several Gaussian components is usually one of the last steps in system training. The mechanism for increasing the number of Gaussian components is called mixture splitting. This procedure is very flexible because it allows a gradual increase in the number of Gaussian components. During the system training, the number of Gaussian components was gradually increased by one, and each increase was followed by four iterations of the HERest tool.
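Conceptually, one splitting step works as sketched below (HTK's HHEd perturbs the means of the copied component by ±0.2 standard deviations; this stand-alone sketch mimics that, splitting the heaviest component):

```python
import numpy as np

def split_heaviest(weights, means, variances):
    """One mixture-splitting step on a diagonal GMM (K -> K + 1 components)."""
    k = np.argmax(weights)                    # component with the largest weight
    std = np.sqrt(variances[k])
    half = weights[k] / 2.0
    weights = np.concatenate([np.delete(weights, k), [half, half]])
    means = np.vstack([np.delete(means, k, axis=0),
                       means[k] + 0.2 * std,  # perturb the two copies apart
                       means[k] - 0.2 * std])
    variances = np.vstack([np.delete(variances, k, axis=0),
                           variances[k], variances[k]])
    return weights, means, variances
```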

The HParse tool creates a network based on a grammar in which the allowed speaker models are stated. Having formed the network that defines the possible states and transitions between them, the speaker recognition process is performed by the HVite tool. The task of speaker recognition is to find the most likely path through the network for a given test utterance and set of HMM models. The HResults tool is used to analyze the results and generate statistics on the system performance.
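Under the single-emitting-state restriction used here, the HVite search reduces conceptually to picking the model with the highest total log-likelihood over all frames. A minimal sketch of that decision rule, reusing the gmm_log_density function sketched earlier (this is not the actual HTK decoding):

```python
def identify(frames, models):
    """frames: (T, n) feature matrix; models: dict mapping a model name
    (ten speakers plus the UBM) to its (weights, means, variances) tuple."""
    scores = {name: sum(gmm_log_density(x, *params) for x in frames)
              for name, params in models.items()}
    return max(scores, key=scores.get)        # most likely speaker (or UBM)
```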

IV. RESULTS

This paper examined the speaker recognition performance for three types of features:
− MFCC,
− PLP,
− LPC,
and for a variable window width. During the system training, the following GMM parameter values were selected:
• K = 1, 2, 4, 8, 16, 32, 64 Gaussian mixture components,
• n = 39 for MFCC and PLP, and n = 30 for LPC, which are the standard dimensions of the feature vector x [10].

During the feature extraction, a Hamming window function was applied to the considered speech samples. The applied Hamming window widths were 20 ms, 25 ms, 30 ms, 40 ms, 50 ms and 100 ms, with a 10 ms window shift. Standard values of the window width are up to 35 ms; larger values were also selected because the characteristics of the speaker are known to change slowly, so there is a possibility that they are contained in the averaged spectrum of the signal. The window shift was not changed, in order to retain the same number of observations in the model training regardless of the width of the window function.
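A minimal sketch of this windowing scheme at the 22050 Hz sampling frequency used here (the function name is illustrative):

```python
import numpy as np

def frame_signal(y, sr=22050, win_ms=25, shift_ms=10):
    """Cut y into Hamming-windowed frames; win_ms varies per experiment."""
    win, shift = int(sr * win_ms / 1000), int(sr * shift_ms / 1000)
    window = np.hamming(win)
    starts = range(0, len(y) - win + 1, shift)
    return np.stack([y[s:s + win] * window for s in starts])  # (num_frames, win)
```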

TABLE I. SPEAKER RECOGNITION PERFORMANCE [%] FOR FEATURE VECTOR MFCC_0_D_A

        20ms  25ms  30ms  40ms  50ms  100ms
1GMM      41    40    37    39    36     36
2GMM      35    32    27    24    26     39
4GMM      35    38    37    37    37     50
8GMM      54    53    50    24    25     46
16GMM     93    94    82    58    40     67
32GMM     97    96    96    96    95     75
64GMM     97    97    94    93    94     53

TABLE II. SPEAKER RECOGNITION PERFORMANCE [%] FOR FEATURE VECTOR PLP_0_D_A

        20ms  25ms  30ms  40ms  50ms  100ms
1GMM      40    38    39    39    41     42
2GMM      39    57    60    60    58     54
4GMM      64    64    63    74    58     17
8GMM      79    82    51    47    46     49
16GMM     86    88    87    74    46     83
32GMM     95    97    99    94    86     56
64GMM     96    97    96    96    91     70

TABLE III. SPEAKER RECOGNITION PERFORMANCE [%] FOR FEATURE VECTOR PLP_D_A

        20ms  25ms  30ms  40ms  50ms  100ms
1GMM      32    32    35    38    39     38
2GMM      50    46    43    38    38     42
4GMM      54    25    23    23    22     62
8GMM      28    39    39    63    51     30
16GMM     87    85    84    70    82     62
32GMM     99    97    92    95    95     81
64GMM     86    89    86    83    78     79

TABLE IV. SPEAKER RECOGNITION PERFORMANCE [%] FOR FEATURE VECTOR LPC_D_A

        20ms  25ms  30ms  40ms  50ms  100ms
1GMM      11    11    12    12    12     15
2GMM      13    13    12    12    11      9
4GMM      10     8     8     6     6     13
8GMM      56    42    59    65    67     23
16GMM     37    40    39    25    42     35
32GMM     66    66    65    51    66     57
64GMM     71    70    38    73    70     20



Tables I, II, III and IV present the accuracy of the speaker recognition systems depending on the window width and the number of Gaussian mixture components. Table I refers to the experiments in which the first 12 MFCC coefficients and the so-called zeroth MFCC were used, as well as their first and second derivatives (MFCC_0_D_A). As can be seen in Table I, increasing the number of Gaussian components generally increases the accuracy. For windows that extend beyond the stationary segments (> 30 ms), this trend is not consistent; it is assumed this is related to the correlation between observations, which is significant in these cases.
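For reference, the "_D_A" suffix denotes the first (delta) and second (acceleration) derivatives appended to the static features. The following is a minimal sketch of the standard regression formula for the deltas (applied twice to obtain accelerations); it mirrors HTK's approach but is not code from the paper:

```python
import numpy as np

def deltas(feats, N=2):
    """feats: (T, n) static features; returns the (T, n) delta features
    computed over a +/- N frame regression window, with edge padding."""
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, N + 1))
    return np.stack([sum(t * (padded[i + N + t] - padded[i + N - t])
                         for t in range(1, N + 1)) / denom
                     for i in range(T)])
```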

For the PLP features, two types of feature vectors were used. In the first case, the system performance was tested with a feature vector containing the zeroth cepstral coefficient and the first 12 PLP coefficients, as well as their first and second derivatives (PLP_0_D_A). As can also be seen in Table II, increasing the number of Gaussian components generally increases the accuracy. The results presented in Table II are similar to those for the MFCC, except that they are better for a small number of Gaussian components.

In the case shown in Table III, the system performance was tested with a feature vector containing the first 13 PLP coefficients and their first and second derivatives (PLP_D_A). These results are also similar to the MFCC results. It is interesting to note that for all window widths there is a significant drop in accuracy for 64 Gaussian components in comparison with 32, probably due to overtraining.

Finally, as shown in Table IV, the speaker recognition performance was investigated for a feature vector containing the first 10 LPC coefficients and their first and second derivatives (LPC_D_A). The LPC features give 20% to 30% poorer speaker recognition results in comparison with the PLP and MFCC features. Since the noise level was negligible, a possible cause of the poorer results is the insufficient number of LPC coefficients.

Fig. 2 shows the comparative results for the different feature vectors (MFCC_0_D_A, PLP_0_D_A, PLP_D_A and LPC_D_A) at a 25 ms window width. The results show that LPC features give significantly poorer speaker recognition results in comparison with those for the PLP and MFCC features.

Figure 2. The accuracy of speaker recognition [%] for the different feature vectors and the 25 ms window width

V. CONCLUSION

As expected, the best performance of the speaker recognition systems was achieved with the MFCC and PLP features. The LPC features give significantly poorer speaker recognition results compared to the other feature sets, even though the speech was clean.

This study also showed that the optimal window width is between 20 ms and 40 ms. Although the characteristics of the speaker change slowly, a longer window does not result in better system accuracy. This indicates that the dynamics of speech are important for speaker recognition as well.

Increasing the number of Gaussian components (model complexity) does not always result in better system accuracy, even when the models are not over-trained (the accuracies of the systems with 1 and 32 Gaussian components per model are higher than for the systems with 4 components per model). A possible explanation is the following. With a single Gaussian per model, the mean values are sufficiently distant in the feature space, even though each model describes its speaker poorly. With a relatively small number of Gaussian components, the model is still too simple to describe a speaker correctly, but it splits the space corresponding to a single speaker, so some overlapping regions can be modelled better by another speaker's model. When the number of Gaussian components is further increased, the approximation of the feature distribution, in other words the coverage of the space, becomes better, and this reduces the likelihood of a speaker recognition error.

REFERENCES

[1] J. P. Campbell, Jr., "Speaker recognition," Department of Defense, Fort Meade, MD.
[2] L. Myers, "An Exploration of Voice Biometrics," April 2004.
[3] S. Furui, "Recent advances in speaker recognition," Pattern Recognition Letters, vol. 18, pp. 859–872, 1997.
[4] I. Jokić, S. Jokić, Z. Perić, M. Gnjatović, V. Delić, "Influence of the Number of Principal Components Used to the Automatic Speaker Recognition Accuracy," scheduled for publication in Electronics and Electrical Engineering – Kaunas: Technologija, ISSN 1392-1215, no. 7(123), September 2012.
[5] J. P. Campbell et al., "Forensic speaker recognition," IEEE Signal Processing Magazine, vol. 26, pp. 95–103, 2009.
[6] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, 2000.
[7] T. Hasan, J. H. L. Hansen, "A study on Universal Background Model training in Speaker Verification," IEEE Trans. on Audio, Speech and Language Processing, vol. 19, pp. 1890–1899, 2011.
[8] V. Delić, "Speech databases of Serbian collected under AlfaNum project," in Serbian, DOGS, 2000, pp. 29–32.
[9] S. Young et al., "The HTK Book (for HTK version 3.4)," Cambridge University Department of Engineering, 2009.
[10] T. Kinnunen, H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, pp. 12–40, 2010.

