

Page 1: [IEEE 2006 15th International Conference on Computing - Mexico city, Mexico (2006.11.21-2006.11.21)] 2006 15th International Conference on Computing - Feature Selection for a Fast

Feature Selection for a Fast Speaker Detection System with Neural Networks and Genetic Algorithms

Rocío Quixtiano-Xicohténcatl¹, Leticia Flores-Pulido², Orion Fausto Reyes-Galaviz³

Universidad Autónoma de Tlaxcala, Facultad de Ingeniería y Tecnología, Laboratorio de Sistemas Inteligentes

Apizaco, Tlaxcala, México. ¹[email protected], ²,³{aicitel,orionfrg}@ingenieria.uatx.mx

Abstract

Today, there is a great need for security systems in banks, laboratories, and similar facilities, especially those that have restricted areas or expensive equipment. Most of the time people use magnetic cards or similar technologies. However, these kinds of devices can be vulnerable, because they might be used by intruders if a device is misplaced. More advanced technologies use iris or voice detection, potentially increasing the security level against intruders. This work is focused on the latter group.

This paper proposes a hybrid method, for the speech processing area, to select and extract the best features that represent a speech sample. The proposed method makes use of a Genetic Algorithm along with Feed Forward Neural Networks in order to either accept or deny personal access in real time. Finally, to test the proposed method, a series of experiments was conducted using fifteen different speakers, obtaining an efficiency rate of up to 97% on intruder detection.

1. Introduction

Speaker Recognition refers to the problems of identifying a person or verifying her/his identity based on the information contained in a speech sample. These biometric techniques offer the possibility to control (deny or accept) access to many services such as bank accounts, restricted areas at research laboratories, and, in general, any place where identification of authorized personnel is required.

Automatic Speaker Authentication is the process of deciding, among a number of registered speakers, whether a person is who she/he claims to be. For example, if a person tries to impersonate another and gives a sample of her/his voice, the process determines whether there is a match, accepting or denying that person.

We will focus on this approach.

In the speaker recognition field, most of the techniques used deal with feature extraction, feature selection, and dimensionality reduction. These tasks generally analyze the signal in the time and frequency domains. Research in these fields generally deals with problems where the number of features to be processed is quite large. It also tries to eliminate environmental noise, noise produced by recording devices, redundant information, and other undesirable conditions which may decrease recognition accuracy.

This work proposes a novel method to select and extract the best features to represent a speech sample, which combines genetic algorithms with feed forward neural networks. The genetic algorithm uses elitism, crossover, and mutation operations to generate new populations of individuals, and uses neural networks to obtain the fitness value of each individual at every generation.

Several results were obtained through different kinds of experiments; they consisted of changing the crossover rate, mutation rate, number of generations, number of individuals, and number of output features. From these, a recognition percentage of up to 97% was reached on the authentication of different speakers.

In the next section we present a review of prior comparative studies in the speaker recognition field. Section 3 details the fundamentals of the speaker recognition/authentication process and describes our proposed system. Section 4 deals with acoustic processing and our feature extraction method, which uses the Mel Frequency Cepstral Coefficients (MFCCs) method [1]. The fundamental theory of speaker pattern classification, time delay feed forward neural networks, genetic algorithms, and our proposed hybrid system is given in Section 5. The complete system description is in Section 6. Our experimental results and comments are presented in Sections 7 and 8.

Proceedings of the 15th International Conference on Computing (CIC'06), 0-7695-2708-6/06 $20.00 © 2006

2. State of the Art

Recently, some research efforts have been made in the speaker recognition field, showing promising results. Miramontes de León used the Vector Quantization method in text-independent speaker recognition tasks, applied to phone threats; he reduced the search space and obtained a recognition percentage between 80 and 100% [2]. Pelecanos used the Gaussian Mixture Model combined with a Vector Quantization method, for relatively well-clustered data, obtaining a training time 20% shorter than that of the standard method, composed of Gaussian Mixture Models, and achieving an accuracy of 90% [3]. Hsieh used the Wavelet Transform combined with Gaussian Mixture Models to process Chinese language, using voice samples from phone calls, and reached an accuracy of 96.81% [4]. Oh accomplished a dimensionality reduction, applied to voice, text, number, and image patterns, by using a simple genetic algorithm, a hybrid genetic algorithm combined with a one-layer neural network, and sequential search algorithms; he obtained a recognition result of 96.72% [5]. Ha-Jin used MFCCs for feature extraction of voice samples, Principal Component Analysis for dimensionality reduction, and Gaussian Mixture Models for classification, obtaining an accuracy of 100%; this method classifies the voice samples but does not discriminate an intruder [6]. The results reported in these works highlight the advances made in the exploration of this field.

3. Speaker Authentication Process

The speaker authentication process is basically a pattern recognition problem, similar to speech recognition. The goal is to take the speaker's sound wave as input and, at the end, authenticate the speaker's identity. Generally, the speaker authentication process is done in two steps: the first is the acoustic processing, or feature extraction, while the second is known as pattern processing or classification. In the proposed system, we have added an extra step between them, called feature selection (Fig. 1). In our case, in the acoustic analysis, the speaker's signal is processed to extract relevant features as a function of time. The feature set obtained from each speech sample is represented by a vector, and each vector is taken as a pattern. Next, all vectors go to an acoustic feature selection module, which helps us select the best features for the training process and, at the same time, efficiently reduce the input vectors. The selection is done through the use of genetic algorithms. As for the pattern recognition methods, four main approaches have been traditionally used: pattern comparison, statistical models, knowledge based systems, and connectionist models. We focus on the use of the last one.

Figure 1. Block diagram of the proposed system: the speaker's speech signal is digitized by a microphone, passed through acoustic analysis, feature selection, and pattern classification; a decision rule applied to the trained model's output over the selected features vector accepts or denies the speaker.

4. Preprocessing Data

The acoustic analysis implies the application and selection of filtering techniques, feature extraction, signal segmentation, and normalization. With the application of these techniques the signal is described in terms of its fundamental components. A speech signal is complex and encodes more information than is needed to be analyzed and processed in real time applications. For this reason, in our speaker authentication system we use a feature extraction function as a first stage processor. Its input is a speech signal, and its output is a vector of features that characterizes key elements of the speech sound wave. In this work we used Mel Frequency Cepstral Coefficients (MFCCs) [1] as our feature extraction method.

4.1. Mel Frequency Cepstral Coefficients

The first step of speech processing, once we have the voice samples, is to obtain their spectral characteristics. This step is necessary because the important information of the sample is encoded in the frequency domain, while the speech samples are recorded by electronic devices in the time domain. When the time domain signal is converted to the frequency domain we obtain parameters which indicate the occurrence of each frequency.
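The time-to-frequency conversion described above is, at its core, a discrete Fourier transform of each speech frame. As a minimal illustration (not the FFT-based routine a real system would use), a naive DFT that returns the magnitude of each frequency bin can be written as:

```python
import cmath

def dft_magnitudes(x):
    """Naive discrete Fourier transform of a time-domain frame:
    returns the magnitude of each frequency bin k = 0 .. n-1."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A cosine completing two cycles over an 8-sample frame concentrates
# its energy in bin 2 (and its mirror bin 6).
mags = dft_magnitudes([1, 0, -1, 0, 1, 0, -1, 0])
```

In practice one would use an FFT (O(n log n)) instead of this O(n²) sketch, but the output is the same: a spectrum indicating how strongly each frequency occurs in the frame.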

There is a wide variety of ways to represent speech samples in parametric form. One of the most commonly used in speaker recognition tasks is MFCCs. The human ear decomposes the received sound signals into their fundamental frequencies. Located in the inner ear is the cochlea, which has a conic spiral form; it is one of the three cavities that form the physical structure of the ear [7]. The cochlea filters the frequencies in a natural way: sound waves introduced inside this structure bounce on its walls and travel into the spiral at low or high frequency, according to each frequency's wavelength [8]. MFCCs are based on the frequency response the human ear perceives. The method behaves as a filter bank linearly distributed over the low frequencies and with logarithmic spacing over the higher frequencies. This is called the Mel Frequency Scale, which is linear below 1000 Hz and logarithmic above 1000 Hz (Fig. 2) [2].

Figure 2. MFCC extraction: the voice signal is transformed to the frequency domain, mapped through the Mel-scale filter bank, and an inverse Fourier transform / Discrete Cosine Transform yields the Mel cepstral coefficients vector.
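The Mel scale itself is commonly computed with the formula mel(f) = 2595 · log10(1 + f/700), which the paper does not state explicitly but which matches the behavior described: approximately linear below 1000 Hz and logarithmic above. A small sketch:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale.
    Common formulation: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to (approximately) 1000 mel, which is why the scale is described as linear below 1000 Hz and logarithmic above.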

5. Speaker Pattern Classification

After extracting the acoustic features of each speaker sample, feature vectors are obtained. Each of these vectors represents a pattern. These vectors are later used in the feature selection and classification processes. For the present work, we focused on connectionist models, also known as neural networks, to classify these vectors (patterns). They are reinforced with genetic algorithms that select the best features of each vector in order to improve the training/testing process, obtaining a more efficient genetic-neural hybrid classification system.

5.1. Genetic Algorithms

In recent years, there has been growing interest in problem solving systems based on evolution and hereditary principles. Such systems maintain a population of potential solutions; they use selection processes based on the fitness of the individuals that compose the population, together with genetic operators. The evolutionary program is a probabilistic algorithm which maintains a population of individuals, P(t) = {x_1^t, ..., x_n^t}, at iteration t. Each individual represents a potential solution to the problem. Each solution x_i^t is evaluated to give a measure of its fitness. Then, a new population (iteration t + 1) is formed by selecting the fitter individuals (select step). Some members of the new population undergo transformations (alter step) by means of genetic operators to generate new solutions. There are unary transformations m_i (mutation type), which create new individuals by performing small changes in a single individual (m_i : S → S), and higher order transformations c_j (crossover type), which create new individuals by combining genetic information from several (two or more) individuals (c_j : S × ... × S → S). After a number of generations the program converges; it is hoped that the best individual represents a near-optimum (reasonable) solution [9].
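The select/alter loop described above can be sketched as a minimal evolutionary program. This is a generic illustration on binary individuals, not the authors' exact implementation: the function names and parameter values are ours, mutation here is per-bit rather than the paper's per-individual scheme, and the fitness is the trivial "count of 1s" (OneMax) rather than a neural network's accuracy.

```python
import random

def evolve(fitness, length=20, pop_size=30, generations=40,
           crossover_rate=0.6, mutation_rate=0.05, seed=0):
    """Minimal evolutionary program: truncation selection of the fitter
    half, one-point crossover c_j, and per-bit mutation m_i."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # select step: keep fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:      # crossover c_j
                cut = rng.randrange(1, length)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            for i in range(length):                # mutation m_i (bit flip)
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children                   # population P(t+1)
    return max(pop, key=fitness)

best = evolve(fitness=sum)  # OneMax: fitness is the number of 1s
```

Because the fitter half is carried over each generation, the best fitness is monotonically non-decreasing, and on OneMax the population quickly approaches the all-ones string.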

5.2. Neural Networks

Artificial neural networks (ANNs) are widely used in pattern classification tasks, showing good performance and high accuracy. In general, an artificial neural network is represented by a set of nodes and connections (weights). The nodes are a simple representation of natural neurons, while the connections represent the data flow between them. These connections, or weights, are dynamically updated during the network's training. In this work we use the Feed Forward Time Delay Neural Network model, selected because it has shown good results in the voice recognition field [10].

5.3. Feed-Forward Time Delay Neural Network

Input delay refers to a delay in time; in other words, if we delay the input signal by one time unit and let the neural network receive both the original and the delayed signals, we have a simple time delay neural network. This kind of network was developed to classify phonemes in 1987 by Waibel and Hanazawa [11].

The Feed-Forward Time Delay neural network does not fluctuate with input shifts; the features inside the sound signal can be detected no matter in which position they appear. The time delays let the neural network find temporal relations directly in the input signal and in a more abstract representation at the hidden layer. It does this by sharing the same weights across each step in time [10].
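The tapped-delay input described above can be illustrated with a small helper that, at each time step t, presents the network with the current sample and its delayed copies. This is an illustrative sketch (the function name and zero-padding convention are ours, not from the paper):

```python
def delayed_inputs(signal, delays=2):
    """Build time-delay input windows: at each step t the network sees
    [x[t], x[t-1], ..., x[t-delays]], zero-padded at the start."""
    padded = [0.0] * delays + list(signal)
    return [padded[t : t + delays + 1][::-1] for t in range(len(signal))]

windows = delayed_inputs([1, 2, 3], delays=2)
# windows → [[1, 0.0, 0.0], [2, 1, 0.0], [3, 2, 1]]
```

Because every window is processed with the same weights, a feature produces the same hidden-layer response wherever it appears in time, which is the shift tolerance the paragraph describes.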

5.4. Scaled Conjugate Gradient Back-Propagation

The neural network's training can be done through a technique known as back-propagation. Conjugate gradient methods are based on a general optimization strategy. We use scaled conjugate gradient back-propagation (SCGBP) to train the neural networks because this algorithm shows linear convergence on most problems. It uses a mechanism to decide how far to move in a specific direction, avoiding the time-consuming line search at each learning iteration, which makes it one of the fastest second order algorithms [14].

5.5. Hybrid System

We designed a genetic algorithm, an evolutionary program that reduces the number of input features by selecting the most relevant characteristics of the vectors used in the classification process. The whole feature selection process is described in the following paragraphs.

As mentioned before, we obtain an original matrix by putting together every vector obtained from each speech sample. Then, to work with our hybrid system, we first unite each speaker's samples. Each speaker has ten speech samples, and there are 15 different speakers; this means that we have 150 samples belonging to 15 different classes. After each class is united, we shuffle its columns randomly and separate 70% of that class for training and 30% for testing; each piece, together with those of the other classes, builds a training matrix and a testing matrix, providing a different training data set for each experiment. After building the training matrix, and before training the neural network, we shuffle it again, to ensure that the labels are not in sequential order.
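The per-class shuffle-and-split step above can be sketched as follows (an illustrative helper with names of our choosing; the paper does not specify the exact routine):

```python
import random

def split_class(samples, train_frac=0.7, rng=None):
    """Shuffle one speaker's sample vectors, then split them
    70% for training and 30% for testing."""
    rng = rng or random.Random(0)
    shuffled = samples[:]          # copy so the original list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# With ten samples per speaker this yields 7 training and 3 testing samples.
train, test = split_class(list(range(10)))
```

Applying this to each of the 15 classes and stacking the pieces gives the training and testing matrices; a second shuffle of the assembled training matrix then breaks any sequential ordering of the class labels.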

Having done this, a training matrix of size p × q is obtained, where p is the number of acoustic features of each speech sample and q is the number of samples. We want to reduce this matrix to size m × q, where m is the number of selected characteristics, given by the user. To do this, we first randomly generate an initial population of n individuals, each of length p. These individuals are represented by a randomly generated binary code; there are two options: the algorithm can automatically decide the number of "1s" and "0s" that the individuals will have, or the user can control the number of "1s" to obtain a specific dimensionality. These individuals are then used to generate n different matrices, simply by comparing each row of every individual with each row of the training matrix. If a "1" appears in a row of a given individual, we take the whole row of the matrix corresponding to that individual's row; if there is a "0", we do not take that particular row (Figure 3). In this way, the n individuals reduce the training matrix to n smaller matrices. Each reduced matrix's row dimension can be calculated simply by adding all the "1s" of the corresponding individual; the column dimension never changes.
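The row-selection rule just described amounts to using the binary individual as a mask over the rows of the training matrix. A minimal sketch (function name ours):

```python
def reduce_matrix(matrix, individual):
    """Keep only the rows (features) whose gene in the binary
    individual is 1; columns (samples) are left unchanged."""
    return [row for row, gene in zip(matrix, individual) if gene == 1]

# A 3-feature, 2-sample matrix masked by the individual [1, 0, 1]
# keeps the first and third feature rows.
reduced = reduce_matrix([[1, 2], [3, 4], [5, 6]], [1, 0, 1])
```

The reduced matrix has sum(individual) rows, which is exactly the m × q shape described in the text.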

Figure 3. Example of matrix reduction by a binary individual: feature rows whose genes are "1" are kept as selected features, while rows with "0" are discarded; the number of samples (columns) is unchanged.

After doing this, n different neural networks are generated and trained with the n reduced matrices, obtaining n different results. These results are then used, as the fitness function, to select the individuals that achieved the best accuracy in that generation. To choose which individuals pass to the next generation, we simply sort them from the best result to the worst. Next we eliminate the individuals that gave the worst results, applying elitism to the bottom half of the individuals of a given generation. Having chosen the best half of the individuals, we create a new generation using the roulette algorithm, and then perform crossover and mutation operations on the new generation of individuals. At this point, the best individual of each generation passes to the next generation unchanged.

Figure 4. Hybrid system flow: voice signals are converted to MFCC vectors and assembled into the speakers' sample matrix; feature selection with n individuals produces n reduced matrices from the randomized training and testing matrices, each used to train a Time Delay Neural Network, yielding n different results; the fitness function and genetic operators produce the next population, and the process is repeated for a given number of generations, keeping the n best results.

For every pair of newly selected parents, we generate a random number between 0 and 1 and compare it to our crossover rate; if the number is smaller than the crossover rate, those parents undergo a crossover operation. This means that we combine the information contained in both individuals to generate two offspring carrying information from the original parents. If the randomly generated number is larger than our crossover rate, we let those individuals pass to the next generation unchanged.

After doing so, for every offspring we generate a random number between 0 and 1 and compare it to our mutation rate; if the number is smaller than this rate, the individual is mutated. To do this, we generate a random number between 1 and p representing a row of the individual; if there is a "1" in that row, we change it to "0" and vice versa. If the number is larger than our mutation rate, we let the individual pass to the next generation unchanged.

We repeat these steps for a number of generations stated by the user. At the end we obtain an individual that represents the selected features to be kept in order to obtain good accuracy results (Fig. 4).
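The roulette algorithm mentioned above is fitness-proportionate selection: each individual is drawn with probability proportional to its fitness. A minimal sketch (names ours; fitness values are assumed non-negative):

```python
import random

def roulette_select(population, fitnesses, k, rng=None):
    """Fitness-proportionate (roulette wheel) selection of k parents.
    Each spin lands on an individual with probability fitness/total."""
    rng = rng or random.Random(0)
    total = sum(fitnesses)
    picks = []
    for _ in range(k):
        spin = rng.uniform(0, total)
        acc = 0.0
        for individual, fit in zip(population, fitnesses):
            acc += fit
            if spin <= acc:
                picks.append(individual)
                break
    return picks

# An individual with zero fitness is (almost) never selected.
parents = roulette_select(["a", "b"], [0.0, 1.0], k=4)
```

Combined with keeping the best individual unchanged (elitism), this gives fitter individuals more offspring while still allowing weaker ones a chance to contribute genetic material.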

6. System Implementation

The database is composed of 47,304 speech samples containing digits and another set of isolated words [12]. For the experiments, 15 different speakers were selected, considering the samples where external noise did not affect the pronunciation quality; these speakers recorded the numbers from 1 through 10, ten times each. With these samples we generated a four digit spoken code for each speaker. For example, if one speaker's code is 5879, we concatenated the spoken samples corresponding to numbers 5, 8, 7, and 9, obtaining ten spoken codes for each selected person, and constructing with these 150 three-second sound files in WAV format. For other experiments, ten different four digit codes were generated, making each speaker "say" the ten different codes, yielding another corpus composed of 150 different samples. Finally, one four digit code was generated and repeated ten times for every speaker. These combinations of codes were implemented to test the network's capability, efficiency, and robustness when recognizing the authorized speakers.

From these files, and using the freeware software Praat, version 4.3.09 [13], we extracted 16 MFCCs for every 50 milliseconds, obtaining 467 features per vector for each file. As a final part of the acoustic processing, we implemented an algorithm to clean the files produced by Praat; this was done because the output vector samples contain additional information that is not needed when training the neural networks.

With the clean vectors we construct the training and testing matrices that are used by the proposed algorithm. The training and testing matrices are reduced with the help of the genetic algorithm; at the end, with the help of the fitness function (described in the previous section), we obtain an individual which tells us which features have to be used in order to obtain higher accuracy at the speaker authentication stage. The number of selected features can be chosen by the user to decide the dimensionality of the final selected features.

The overall time the hybrid system took to complete one experiment, on the processor used in the experiments, was calculated by equation 1:

t = 3 min × N_gen × N_ind    (1)
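Equation 1 says the run time scales with the product of generations and individuals, at roughly 3 minutes of network training per individual per generation on the authors' hardware. A trivial helper makes the scaling concrete (the per-training constant is the paper's; the function name is ours):

```python
def experiment_minutes(n_generations, n_individuals, minutes_per_training=3):
    """Total experiment time per Eq. 1: every individual's network is
    retrained at every generation, at ~3 minutes per training."""
    return minutes_per_training * n_generations * n_individuals

# e.g. a 10-generation, 10-individual run takes about 300 minutes (5 hours).
total = experiment_minutes(10, 10)
```

This quadratic-in-parameters cost explains why the experiments in Table 1 use modest population sizes and generation counts.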

7. Experimental Results

In order to obtain several results and assess the reliability of our proposed algorithm, we experimented with different parameters to observe the algorithm's behavior, changing the mutation and crossover rates, the number of individuals, the number of generations, and the number of selected features. The results in Table 1 were obtained by changing the mentioned parameters and evaluating on the testing matrix. It can be seen that the best recognition percentages are obtained when selecting features from the 467 extracted from the voice samples of the authorized personnel.

Table 1. Results with different parameters.

Features   Gen./Ind.   Mut./Cross. Rates   Efficiency
60         5/10        0.09/0.05           63.44%
60         10/10       0.09/0.05           68.08%
60         5/10        0.09/0.15           74.14%
60         10/10       0.09/0.15           77.77%
70         10/5        0.15/0.09           74.66%
100        20/20       0.15/0.01           83.36%
150        10/5        0.15/0.09           88.12%
150        -           -                   80.00%
467        -           -                   86.66%

Table 2 shows, in its 1st column, the corpus used to perform the experiments; the 2nd column shows the description of the access code for each speaker, and the 3rd column shows the recognition percentage efficiency for intruder detection.

Table 2. Recognition results for the three corpora.

Corpus   Description                                    Recognition
1        10 four digit code samples, a different        97.98%
         code for each speaker
2        Four digit code samples, 10 different          50.34%
         codes for each speaker
3        The same four digit code for all speakers      80.00%

8. Concluding Remarks

The hybrid algorithm, which self-adapted to the three corpora, works considerably well. When we have a low number of selected characteristics the system's accuracy is low, but when we select more than 100 features the accuracy is higher. When a person is being recognized, the time spent recognizing the voice sample is less than 0.5 seconds. In other experiments, performed to test the system's robustness, four intruders were introduced, males and females, "saying" the same four digit codes as four authorized speakers (authenticated by the TDNN); 3 out of 4 tests gave good results, denying access to those intruders. We also want to experiment with other combinations of mutation and crossover rates, and to implement a uniform crossover and mutation algorithm. We need to compare the present results with results from other experiments. Furthermore, we want to use the vector's dimensionality as part of the fitness function.



References

[1] L. Rabiner, B. H. Juang. Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series. ISBN: 0-13-015157-2. (1993).

[2] Miramontes de León G., De la Rosa Vargas J. I. and García E., 'Application of an Annular/Sphere Search Algorithm for Speaker Recognition', Proceedings of the 15th International Conference on Electronics, Communications and Computers, 0-7695-2283-1/05 IEEE. CONIELECOMP, (2005).

[3] Pelecanos J., Myers S., Sridharan S. and Chandran V., 'Vector Quantization based Gaussian Modeling for Speaker Verification', Proceedings of the International Conference on Pattern Recognition (ICPR'00). IEEE Computer Society Press, ISBN: 0-7695-0750-6. (2000).

[4] Hsieh C., Lai E., Wang Y., 'Robust Speaker Identification System based on Wavelet Transform and Gaussian Mixture Model', Journal of Information Science and Engineering, Vol. 19, No. 2, 267-282, March (2003).

[5] Oh I. S., Lee J. S. and Moon B. R., 'Hybrid Genetic Algorithms for Feature Selection', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 11. (2004).

[6] Ha-Jin Yu, 'Speaker Recognition in Unknown Mismatched Conditions Using Augmented Principal Component Analysis', LNCS-3733, Springer Verlag, ISBN: 3-540-29414-7. (2005).

[7] A. J. Balague, Diccionario Enciclopédico Baber, Editorial Baber, 1119 E. Broadway Street, Glendale, California 91205. (1991).

[8] Bernal J., Bobadilla J., Gómez P., Reconocimiento de voz y fonética acústica, Ed. Alfaomega Ra-Ma, Madrid, España. (2000).

[9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition. Springer, ISBN: 3540606769. (1998).

[10] J. Hilera, V. Martínez, Redes Neuronales Artificiales: Fundamentos, modelos y aplicaciones. Alfaomega, Madrid, España. (2000).

[11] Waibel A., Hanazawa T., Hinton G., Shikano K., and Lang K. J., 'Phoneme Recognition Using Time Delay Neural Networks', IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-37: 328-339, (1989).

[12] Reyes-García Carlos A., Sistema Inteligente de Comandos Hablados para Control de Dispositivos Caseros, México. Project supported by COSNET, (2005).

[13] P. Boersma and D. Weenink, Praat: doing phonetics by computer, version 4.3.09, www.praat.org. Copyright 1992-2005, Institute of Phonetic Sciences, University of Amsterdam, Herengracht 338, 1016CG Amsterdam. (2005).

[14] Orozco García J., Reyes-García C. A., 'Clasificación de Llanto del Bebé Utilizando una Red Neuronal de Gradiente Conjugado Escalado', Memorias de MICAI/TAIA 2002, Mérida, Yuc., México, April, pp. 203-213, ISBN 970-18-7825-6. (2002).
