

A Speaker Verification System

Yao Xie∗ Xiayu Zheng∗

May 1, 2006

Course Project for EEL6825 (Pattern Recognition)

Instructor: Dr. Clint Slatton

∗Yao Xie and Xiayu Zheng are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611-6130, USA. Email: {xieyao, xiayu}@dsp.ufl.edu


1 Introduction

Speaker verification is defined as deciding if a speaker is whom he claims to be [1]. Automatic speaker verification (ASV) is the use of a machine to verify a person's claimed identity from his voice. A typical ASV setup consists of two steps. First, the claimant, who has previously enrolled in the system, presents an encrypted smart card containing his identification information. Then, he attempts to be authenticated by speaking a prompted phrase(s) into the microphone.

Prior to a verification session, the user must enroll in the system (typically under supervised conditions). During the enrollment session, voice models are generated and stored (to set up a user database) for use in later verification sessions. This is the training procedure, which mainly consists of feature extraction, feature compression, and clustering. Common features extracted from speech signals are wavelet coefficients, fast Fourier transform (FFT) coefficients, discrete cosine coefficients, linear predictive coefficients (LPC), and Mel-frequency cepstral coefficients (MFCC) [2], among which MFCC is the most popular. Feature reduction is sometimes required when the feature-space dimension is large and a simpler classifier is needed. Common approaches are Principal Component Analysis (PCA) and the Fisher Linear Discriminant (FLD) [3], with the latter directly finding the components that maximize the discriminating capability between classes. Although we may not need feature reduction in our system, since the feature-space dimension is quite small when using MFCC (e.g., 12), we could consider FLD when feature reduction is needed. Finally, clustering is used to construct a codebook for each user. Herein we use one of the most popular methods, the Linde-Buzo-Gray (LBG) algorithm [4], for clustering.

The verification session involves obtaining the speaker's phrases, extracting features (as in the training procedure), comparing the so-obtained features with the codebooks in the database, and making the decision. Generally, there is a tradeoff between verification accuracy and test-session duration. Many factors, due to humans and the environment, contribute to verification error; these factors are generally out of the control of the algorithms, but they are important: no matter how good a speaker-recognition algorithm is, human errors (e.g., misreading or mispronunciation) will ultimately limit the performance of the speaker verification system. We test the robustness of our system against these factors in our experiments.

In this project we present a simple speaker verification system. Our system could be used, for example, to verify members of an office (multi-class verification), or one person (one-class verification), to restrict public access to important equipment. We tried to realize the major stages of a speaker verification system and make it work in some simple scenarios. However, we do not aim to be complete in every aspect. Our purpose is to apply what we have learned from EEL6825 this semester to a particular application.

2 Speaker Verification System

2.1 System Configuration

Our system, as shown in Figure 1, mainly includes two parts: the training part and the speaker testing part. Each part consists of digital speech pre-processing, feature extraction, and feature reduction (optional, depending on the performance).

In the training part, we start from the user enrollment module. When a new user is added to the system, this module is used to "teach" the system the new user's voice. The input to this module is the identification words that the new user speaks into a microphone. These spoken words are used as the training data. After sampling the analog voice signal and converting it to a digital signal, we perform digital speech pre-processing, feature extraction, and feature reduction. The so-obtained features are then compressed by the feature compression (clustering)


block to form a codebook. The new codebook is stored in the database for future use and is assigned an index to indicate the new user.

With the training data and the new codebook, we can evaluate the system performance in the threshold generation module. This module is used to set a sensitivity level for the system towards each user. This sensitivity value is called the threshold and needs to be generated whenever a new user is enrolled. The threshold value can be reset, for example, when a user has received false rejections too many times and needs to adjust the sensitivity level.

In the speaker testing part, the speaker verification module is used to identify a user. First, a user informs the system that he or she is some user. The system then prompts the user to speak the verification words. This utterance of the words is referred to as the testing speech. The module proceeds with the same digital speech pre-processing, feature extraction, and feature reduction (optional, used only if the training part uses it) as the training part. The extracted features of the testing speech are then compared to the codebooks in the database. Based on some similarity metric, the system decides whether the user has passed or failed the voice verification test.

2.2 Low-level Functions

2.2.1 Digital Speech Pre-processing

In this sub-module, the user's voice is sampled at 22.05 kHz, with 16 bits per sample on a mono channel, and saved in ".wav" format. Each voice file contains one recording of the spoken verification words. First, we remove the silent period (before the speaker starts to speak) at the beginning of the raw file and take a segment of fixed length that follows it. In our system, we take a 1.5-second voice segment, which corresponds to 22050 × 1.5 = 33075 samples in the voice file. The segmented data is then filtered by a low-pass filter to remove some out-of-band noise and boost the signal-to-noise ratio (SNR). The transfer function of the filter we used is

$$A(z) = \frac{1}{1 - 0.95 z^{-1}}. \tag{1}$$

The resulting speech signal is a data vector.
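As an illustration, a minimal Python/NumPy sketch of this pre-processing step might look as follows; the silence-detection threshold is an assumed parameter, and the actual system is implemented in MATLAB (see the Appendix):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import lfilter

def preprocess(path, seg_seconds=1.5, silence_thresh=0.01):
    """Trim leading silence, keep a fixed-length segment, and low-pass filter."""
    fs, x = wavfile.read(path)                    # 22.05 kHz, 16-bit, mono
    x = x.astype(np.float64) / 32768.0            # scale samples to [-1, 1)
    # Crude silence removal: start at the first sample whose magnitude
    # exceeds the threshold (the threshold value is an assumption).
    start = int(np.argmax(np.abs(x) > silence_thresh))
    seg = x[start:start + int(seg_seconds * fs)]  # 1.5 s -> 33075 samples
    # Eq. (1): A(z) = 1 / (1 - 0.95 z^{-1}), a one-pole low-pass filter.
    return lfilter([1.0], [1.0, -0.95], seg), fs
```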

2.2.2 Feature Extraction

Feature extraction maps the preprocessed speech segments into a multidimensional feature space. Digital signal processing techniques [5] are applied to extract the features. We use MFCC for feature extraction. This method uses the absolute values of 12 coefficients from the Mel-frequency cepstral transform for each frame. Five steps are needed to obtain the MFCCs [5], as shown in Figure 2.

(1) The filtered data vector is framed into overlapping blocks. Each block contains 256 samples, with adjacent frames separated by 128 samples. This yields a minimum 50% overlap, ensuring that all sampled values are used by at least two blocks. Since speech signals are quasi-stationary between 5 ms and 100 ms, we choose 256 samples so that each block is 11.6 ms long. We also choose 256 because it is a power of 2, which facilitates the use of the FFT in subsequent stages.

(2) Each block is windowed to minimize spectral distortion and discontinuities. A Hamming window is used.

(3) The FFT is applied to each windowed block as the beginning of the Mel-cepstral transform. After this stage, the spectral coefficients of each block are generated.


(4) The Mel-frequency transform is applied to each spectral block to convert the scale to a Mel-scale. The Mel-scale is a logarithmic scale similar to the way the human ear perceives sound.

(5) Finally, the Discrete Cosine Transform (DCT) is applied to each Mel-spectrum to convert it back to a real-valued (cepstral) sequence.

After these five steps, we have obtained a set of 12-dimensional Mel-cepstral coefficient vectors.
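A minimal NumPy/SciPy sketch of these five steps follows. The number of mel filters (26) and the choice to keep cepstral coefficients 1-12 (dropping the 0th) are illustrative assumptions; the system itself relies on the Voicebox toolbox [6] for this step:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=22050, frame_len=256, hop=128, n_filters=26, n_coeffs=12):
    """Framing -> Hamming window -> FFT -> mel filterbank -> log -> DCT."""
    # (1) Frame into overlapping blocks (50% overlap).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop:i*hop + frame_len] for i in range(n_frames)])
    # (2) Hamming window on each block.
    frames = frames * np.hamming(frame_len)
    # (3) Magnitude spectrum via the FFT.
    spec = np.abs(np.fft.rfft(frames, frame_len))       # (n_frames, 129)
    # (4) Triangular mel-spaced filterbank between 0 Hz and fs/2.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(mel(0), mel(fs / 2), n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(spec @ fbank.T + 1e-10)
    # (5) DCT back to a real-valued cepstral sequence; keep 12 coefficients.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:n_coeffs + 1]
```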

2.2.3 Feature Reduction

This sub-module, which could also be called feature selection, is used to reduce the dimension of the speech vectors. Since we use MFCC for feature extraction, the 256-dimensional speech vector has already been reduced to a 12-dimensional Mel-cepstral coefficient vector, so this sub-module is optional. When feature reduction is needed, we can use Fisher's linear discriminant analysis [3] to further reduce the vector dimension.

For example, the Fisher linear discriminant for the two-class problem employs the linear function $w^T x$ that maximizes the criterion function

$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, \tag{2}$$

where $\tilde{m}_i$ and $\tilde{s}_i^2$ are the mean and scatter of the projected samples of class $i$. The solution is

$$w = S_w^{-1}(m_1 - m_2), \tag{3}$$

where $S_w$ is the within-class scatter matrix, and $m_1$ and $m_2$ are the means of class 1 and class 2, respectively.
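As a sketch, the weight vector in (3) can be computed as follows, assuming two matrices of training feature vectors (one row per sample, one matrix per class):

```python
import numpy as np

def fisher_weight(X1, X2):
    """Fisher linear discriminant weight w = Sw^{-1} (m1 - m2); see Eq. (3)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solve Sw w = m1 - m2 rather than forming the inverse explicitly.
    return np.linalg.solve(Sw, m1 - m2)

# One-dimensional projection of a feature matrix X: y = X @ w
```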

2.2.4 Feature Compression (Clustering)

As shown in Figure 3, the sets of 12-dimensional (or further reduced-dimension, when feature reduction is used) Mel-cepstral coefficient vectors are clustered using the LBG clustering algorithm [4]. Let the Mel-cepstral coefficient vectors be $\{x_n, n = 1, 2, \dots, N_t\}$, where $N_t$ is the number of vectors. Starting from an initial codebook (obtained via the splitting method [4]), we iteratively update the codebook until no further improvement in the minimum Euclidean distance is obtained, using the following two criteria.

(1) Nearest neighbor condition (NNC): for a given codebook $\{c_i\}_{i=1}^N$, assign a vector $x_n$ to the $i$th region
$$S_i = \{x_n : \|x_n - c_i\|^2 \le \|x_n - c_j\|^2, \ \forall j \ne i\}, \tag{4}$$
where $S_i$, $i = 1, 2, \dots, N$, is the partition set for the $i$th code vector $c_i$, and $N$ is the number of code vectors in the codebook.

(2) Centroid condition (CC): for a given partition $S_i$, we update the code vector $c_i$ as
$$\tilde{c}_i = \frac{1}{N_i} \sum_{x_n \in S_i} x_n, \tag{5}$$
where $N_i$, $i = 1, 2, \dots, N$, is the number of Mel-cepstral coefficient vectors in the partition set $S_i$. We then update the old codebook as $\{c_i\}_{i=1}^N = \{\tilde{c}_i\}_{i=1}^N$.

After clustering, each user corresponds to one codebook.
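A compact Python sketch of the LBG procedure with splitting initialization is given below; the perturbation factor and stopping tolerance are illustrative assumptions rather than values taken from [4] or the report:

```python
import numpy as np

def lbg(X, n_codes=16, eps=0.01, tol=1e-4):
    """Design an n_codes-vector codebook from the (Nt, d) data matrix X.

    n_codes should be a power of 2 (e.g., 16, 32, 64), since each
    splitting step doubles the codebook size.
    """
    codebook = X.mean(axis=0, keepdims=True)     # start from the global centroid
    while codebook.shape[0] < n_codes:
        # Splitting step: perturb every codevector in two opposite directions.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # NNC, Eq. (4): assign each vector to its nearest codevector.
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            dist = d2[np.arange(len(X)), labels].mean()
            # CC, Eq. (5): move each codevector to its partition's centroid.
            for i in range(codebook.shape[0]):
                if np.any(labels == i):
                    codebook[i] = X[labels == i].mean(axis=0)
            if prev - dist < tol * dist:         # no further improvement
                break
            prev = dist
    return codebook
```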


2.2.5 Feature Comparison

In the Threshold Generation module or the Speaker Verification module, each Mel-cepstral coefficient vector of the test speech is compared with the codebooks to calculate its distance (e.g., Euclidean distance) to each codebook. The codebook vector closest to the test vector is found, and the corresponding minimum Euclidean distance, or Distortion Factor, is stored until the Distortion Factor for each test vector has been calculated. The Average Distortion Factor is then found and normalized.
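In code, this comparison might be sketched as follows; normalizing by the codebook's average distortion on its own training data is our reading of the normalization step, not a detail stated explicitly in the report:

```python
import numpy as np

def avg_distortion(feats, codebook):
    """Average Distortion Factor: mean distance from each feature vector
    to its nearest codevector in the codebook."""
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1)).mean()

def normalized_distortion(test_feats, codebook, train_feats):
    # Normalization by the average training distortion (an assumption).
    return avg_distortion(test_feats, codebook) / avg_distortion(train_feats, codebook)
```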

3 Experiment Results

We ran several experiments to evaluate the performance of our speaker verification system. Two speakers, Xiayu and Yao, generate the speech data. The verification words take the form "My name is ×××". The recorded speeches are sampled at 22.05 kHz, with 16 bits per sample on a mono channel, and saved as ".wav" files. The raw speech signals from the two speakers are shown in Figure 4.

First, the experiments are performed without the feature reduction module. Both Xiayu and Yao repeat their verification words 20 times, and the recordings are stored in ".wav" files as the training data. In the digital speech pre-processing module, a 1.5-second segment following the silent period is taken from each speech file. In the feature extraction module, we use the openly available software toolbox "Voicebox" [6]. The parameters are set as described in Section 2.2.2. In the feature compression module, we choose the codebook length as N = 16, 32, 64, respectively, and compare the differences. The codebooks for both Xiayu and Yao are constructed and stored in a database, and the Average Distortion Factors are calculated.

In our first experiment, Xiayu and Yao said the right verification words (corresponding to their own names). They recorded the words 20 times, used for 20 test trials. The speakers spoke in a normal tone similar to that of their training recordings. This presents no significant difficulty for classification.

First, we set the codebook length N = 16. Figure 5 shows the normalized distortion versus the test times (trial number) for Xiayu and Yao. In Figure 5(a), over the 20 tests, Yao's testing speech has a very small distance to Yao's codebook, i.e., a very small normalized distortion (normalized by the Average Distortion Factor of Yao's codebook), whereas it has a relatively large normalized distortion to Xiayu's codebook. In Figure 5(b), we have a similar observation for Xiayu's testing voice. Thus, the system can easily make the right decision that the speaker is Yao in Figure 5(a), and Xiayu in Figure 5(b).

Then, we increase the codebook length to N = 32 and N = 64, respectively. Figures 6 and 7 show that the normalized distortion of a speaker's test words to his/her own codebook (e.g., Yao's test words to Yao's codebook) does not vary much, whereas the normalized distortion of a speaker's test words to the other's codebook (e.g., Yao's test voice to Xiayu's codebook) becomes much larger than when N = 16. Therefore, we conclude that the larger N is, the more accurate the codebook will be, at the cost of more memory.

We can also study the choice of the threshold values from Figures 5-7. In Figures 5-7, we can easily identify the speaker with carefully chosen threshold values, but with some threshold values (too low or too high), misses¹ can happen. Tables 1-3 show the hit and miss probabilities for N = 16, 32, 64. For N = 16 and N = 32, we can achieve hits with probability 1 (for the 20 testing speeches) by choosing the normalized threshold value as 1.3 for Xiayu and 1.5 for Yao. For N = 64, we can obtain hits with probability 1 (for the 20 testing speeches) by choosing the normalized threshold value as 1.4 for Xiayu and 1.7 for Yao.

¹We define the probability of acceptance of the correct speaker as a hit, the probability of rejection of the correct speaker as a miss, the probability of acceptance of a wrong tester as a false alarm, and the probability of rejection of a wrong tester as a correct rejection.


In the second experiment, Xiayu and Yao said the wrong words: they pretended to be each other, e.g., Yao said "My name is Xiayu Zheng". They recorded the words 20 times, used for 20 test trials. This adds some difficulty to the classification task.

In this experiment, we still use the previously obtained codebooks to calculate the normalized distortion. Figures 8-10 show the normalized distortion versus the test times. In this case, our speaker verification system can still tell whether the cheating words are from Xiayu or from Yao. Note from Figures 8-10 that the normalized distortions are higher than their counterparts in Figures 5-7. This observation motivates us to further improve our verification system by setting two different threshold values for each user, denoted $T_{i1}$ and $T_{i2}$, $i = 1, 2, \dots, N_u$, where $T_{i1} > T_{i2}$ and $N_u$ is the number of users. The system can then make decisions as follows (a code sketch of this rule is given after the list):

• if the normalized distortion of the speaker is larger than every $T_{i1}$, $i = 1, 2, \dots, N_u$, verification is denied;

• if the normalized distortion of the speaker is smaller than one of the $T_{i1}$, $i = 1, 2, \dots, N_u$, but larger than every $T_{i2}$, $i = 1, 2, \dots, N_u$, the speaker is given another chance to record the words;

• if the normalized distortion of the speaker is smaller than one of the $T_{i2}$, $i = 1, 2, \dots, N_u$, say $T_{12}$, then the speaker's identity is decided as that associated with $T_{12}$.
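A hypothetical implementation of this three-way rule (the function and variable names are ours, not from the report):

```python
import numpy as np

def decide(dists, T1, T2):
    """Three-way decision from per-user normalized distortions.

    dists[i] is the speaker's normalized distortion against user i's
    codebook; T1[i] > T2[i] are user i's two thresholds.
    """
    dists, T1, T2 = map(np.asarray, (dists, T1, T2))
    if np.any(dists < T2):            # confidently close to some codebook
        return f"accept as user {int(np.argmin(dists))}"
    if np.any(dists < T1):            # borderline: ask for another recording
        return "retry"
    return "reject"                   # far from every codebook
```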

Finally, we briefly study the performance of our system with the feature reduction module (realized using FLD). MFCC has reduced the feature dimension from 256 to 12, and the feature reduction module further reduces the dimension from 12 to 1. We use the same training data from Xiayu and Yao, and calculate the weight $w$ according to (3). The codebooks for Xiayu and Yao are also obtained based on the projected data. Note that each codevector is now only 1-dimensional. Figure 11 shows the normalized distortion in each trial when the two speakers say the right words. Comparing Figure 11 with the previous results, in which no feature reduction module is used, we find that in this case the feature reduction module increases the difference between the normalized distortion values of the two speakers. However, in the other case, when the two speakers say the wrong words and pretend to be each other, the system can make a wrong decision, as shown in Figure 12(b). This means that feature reduction may reduce the system's robustness.

4 Conclusion

In this project, we presented a simple speaker verification system. We have realized the fundamental stages of the system: the training session and the test session. More specifically, we realized everything from data acquisition, feature extraction, feature reduction, and clustering for feature compression, to finally making a decision. We also performed experiments to test the system performance. By building a skeleton for the system, we have gained a big picture of how different pattern classification algorithms play their roles and affect each other.

Some specific and interesting aspects we have learned include: how to choose thresholds for classifiers in different scenarios (speaking the right and the wrong words) to achieve the desired classification (we presented some ideas for this in the experimental section), and how to trade off efficiency (codebook length) against accuracy (difference in normalized distortion between speakers). Our system works well in all the settings we have tested so far, but it is only for the two-user problem. Generalizing our system to more users, and improving its robustness to speaker errors, will require more work.


A Appendix

We have coded up most of the parts of our system. Considering the length of the code, we do not include it here (please refer to the .zip package). A brief introduction to the functions of the subroutines follows (subroutines with similar names have similar functions, just with different parameter values):

• TrainingXiayu.m: training for Xiayu’s data.

• TrainingYao.m: training for Yao’s data.

• TrainingFisher.m: training with Fisher's linear discriminant (feature reduction).

• spchVerificationFisher64.m: for speech verification, codebook length 64; speakers speak their own names.

• misspchVerificationFisher64: for speech verification, codebook length 64; speakers speak each other's names.

• MinFrobiusDs.m: finds the codebook in the database with the minimum Frobenius-norm distance to the features extracted from the test speech.

• LBGFrobinus.m: implements the LBG algorithm.

• initialCodebook.m: generates the initial codebook from the training data.

• Fisher2LD.m: finds the weight vector for the FLD.


References

[1] J. P. Campbell, “Speaker recognition: a tutorial,” Proceedings of the IEEE, vol. 85, pp. 1437–1462, September 1997.

[2] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.

[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley and Sons, Inc., 2001.

[4] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, pp. 84–95, January 1980.

[5] Y. Zhang and L. M. Bruce, "Automated accident detection at intersections," Technical Report, 2004, http://www.mdot.state.ms.us/research/pdf/AADI.pdf.

[6] “Voicebox,” http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.


Table 1: Hit and Miss Probability with different threshold values, N = 16

Normalized threshold Yao’s Hit Yao’s Miss Xiayu’s Hit Xiayu’s Miss1.1 0.10 0.90 0.55 0.451.2 0.30 0.70 0.95 0.051.3 0.75 0.25 1.00 0.001.4 0.90 0.10 1.00 0.001.5 1.00 0.00 1.00 0.00

Table 2: Hit and Miss Probability with different threshold values, N = 32

Normalized threshold Yao’s Hit Yao’s Miss Xiayu’s Hit Xiayu’s Miss1.1 0.05 0.95 0.80 0.201.2 0.30 0.70 0.90 0.101.3 0.55 0.45 1.00 0.001.4 0.85 0.15 1.00 0.001.5 1.00 0.00 1.00 0.00

Table 3: Hit and Miss Probability with different threshold values, N = 64

Normalized threshold Yao’s Hit Yao’s Miss Xiayu’s Hit Xiayu’s Miss1.1 0.00 1.00 0.05 0.951.2 0.15 0.85 0.40 0.601.3 0.30 0.70 0.90 0.101.4 0.50 0.50 1.00 0.001.5 0.85 0.15 1.00 0.001.6 0.95 0.05 1.00 0.001.7 1.00 0.00 1.00 0.00


[Figure 1 block diagram. Training part: User Enrollment → Digital Speech Pre-processing → Feature Extraction → Feature Reduction → Feature Compression → Codebook, plus a Threshold Generation chain (Digital Speech Pre-processing → Feature Extraction → Feature Reduction → Feature Comparison → Threshold). Speaker testing part: Speaker Verification, with Digital Speech Pre-processing → Feature Extraction → Feature Reduction → Feature Comparison → Decision.]

Figure 1: Configuration of our speaker verification system.

[Figure 2 pipeline: filtered digital speech signals → Framing (overlapped blocks) → Windowing (Hamming) → FFT (spectral coefficients) → Mel-frequency transform (Mel-spectral coefficients) → DCT → sets of Mel-cepstral coefficients.]

Figure 2: The feature extraction module using MFCC.

[Figure 3: sets of Mel-cepstral coefficient vectors (12-D or reduced-dimension version) → Linde-Buzo-Gray (LBG) clustering algorithm → Codebook.]

Figure 3: The feature compression module.

[Two panels of waveforms: amplitude versus time (seconds).]

Figure 4: The raw speech signals: (a) Yao's, (b) Xiayu's.


[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 5: Normalized distortion versus test times with codebook length N = 16, when (a) Yao claimed to be Yao, and (b) Xiayu claimed to be Xiayu.

[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 6: Normalized distortion versus test times with codebook length N = 32, when (a) Yao claimed to be Yao, and (b) Xiayu claimed to be Xiayu.


[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 7: Normalized distortion versus test times with codebook length N = 64, when (a) Yao claimed to be Yao, and (b) Xiayu claimed to be Xiayu.

[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 8: Normalized distortion versus test times with codebook length N = 16, when (a) Yao pretended to be Xiayu, and (b) Xiayu pretended to be Yao.


[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 9: Normalized distortion versus test times with codebook length N = 32, when (a) Yao pretended to be Xiayu, and (b) Xiayu pretended to be Yao.

[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 10: Normalized distortion versus test times with codebook length N = 64, when (a) Yao pretended to be Xiayu, and (b) Xiayu pretended to be Yao.


[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 11: Normalized distortion versus test times with codebook length N = 64, using Fisher's linear discriminant for feature reduction, when (a) Yao claimed to be Yao, and (b) Xiayu claimed to be Xiayu.

[Two panels; each shows normalized distortion curves against Codebook Yao and Codebook Xiayu.]

Figure 12: Normalized distortion versus test times with codebook length N = 64, using Fisher's linear discriminant for feature reduction, when (a) Yao pretended to be Xiayu, and (b) Xiayu pretended to be Yao.
