
Research Article

Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and Cosine Distance Scoring

Lei Lei and She Kun

Laboratory of Cyberspace, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China

Correspondence should be addressed to She Kun; kun@uestc.edu.cn

Received 16 February 2017; Revised 17 April 2017; Accepted 26 April 2017; Published 14 May 2017

Academic Editor: Lei Zhang

Copyright © 2017 Lei Lei and She Kun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Journal of Electrical and Computer Engineering, Volume 2017, Article ID 1735698, 9 pages. https://doi.org/10.1155/2017/1735698

Today, more and more people have benefited from speaker recognition. However, the accuracy of speaker recognition often drops off rapidly because of low-quality speech and noise. This paper proposes a new speaker recognition model based on wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). In the proposed model, WPE transforms the speeches into short-term spectrum feature vectors (short vectors) and resists the noise. The i-vector is generated from those short vectors and characterizes the speech to improve the recognition accuracy. CDS quickly compares the difference between two i-vectors to give out the recognition result. The proposed model is evaluated on the TIMIT speech database. The results of the experiments show that the proposed model can obtain good performance in clean and noisy environments and is insensitive to low-quality speech, but the time cost of the model is high. To reduce the time cost, parallel computation is used.

1. Introduction

Speaker recognition refers to recognizing unknown persons from their voices. With the use of speech as a biometric in access systems, more and more ordinary persons have benefited from this technology [1]. An example is the automatic speech-based access system. Compared with the conventional password-based system, this system is more suitable for old people whose eyes cannot see clearly and whose fingers are clumsy.

With the development of phone-based services, the speech used for recognition is usually recorded by phone. However, the quality of phone speech is low for recognition because the sampling rate of the phone speech is only 8 kHz. Moreover, the ambient noise and channel noise cannot be completely removed. Therefore, it is necessary to find a speaker recognition model that is not sensitive to those factors such as noise and low-quality speech.

In a speaker recognition model, the speech is firstly transformed into one or many feature vectors that represent unique information for a particular speaker irrespective of the speech content [2]. The most widely used feature vector is the short vector, because it is easy to compute and yields good performance [3]. Usually, the short vector is extracted by the Mel frequency cepstral coefficient (MFCC) method [4]. This method can represent the speech spectrum in compact form, but the extracted short vector represents only the static information of the speech. To represent the dynamic information, the Fused MFCC (FMFCC) method [5] is proposed. This method calculates not only the cepstral coefficients but also the delta derivatives, so the short vector extracted by this method can represent both the static and dynamic information.

Both of the two methods use the discrete Fourier transform (DFT) to obtain the frequency spectrum. The DFT decomposes the signal into a global frequency domain: if a part of the frequency content is destroyed by noise, the whole spectrum will be strongly interfered [6]. In other words, the DFT-based extraction methods, such as MFCC and FMFCC, are sensitive to the noise. The wavelet packet transform (WPT) [7] is another type of tool used to obtain the frequency spectrum. Compared with the DFT, the WPT decomposes the speech into many small frequency bands that are independent of each other. Because of those independent bands, the ill effect of noise cannot be transmitted over the whole spectrum. In other words, WPT has antinoise ability. Based on WPT, the wavelet packet entropy (WPE) method [8] is proposed to extract the short vector.



References [8–11] have shown that the short vector extracted by WPE is insensitive to noise.

The i-vector is another type of feature vector. It is a robust way to represent a speech using a single high-dimension vector, and it is generated from the short vectors. The i-vector considers both the speaker-dependent and background information, so it usually leads to good accuracy. References [12–14] have used it to enhance the performance of speaker recognition models. Specifically, [15] uses the i-vector to improve the discrimination of low-quality speech. Usually, the i-vector is generated from the short vectors extracted by the MFCC or FMFCC methods, but we employ the WPE to extract those short vectors because the WPE can resist the ill effect of noise.

Once the speeches are transformed into the feature vectors, a classifier is used to recognize the identity of the speaker based on those feature vectors. The Gaussian mixture model (GMM) is a conventional classifier. Because it is fast and simple, the GMM has been widely used for speaker recognition [4, 16]. However, if the dimension of the feature vector is high, the curse of dimensionality will degrade this classifier. Unfortunately, the i-vector is a high-dimensional vector compared with the short vector. Cosine distance scoring (CDS) is another type of classifier used for speaker recognition [17]. This classifier uses a kernel function to deal with the problem of high-dimension vectors, so it is suitable for the i-vector. In this paper, we employ the CDS for speaker classification.

The main work of this paper is to propose a new speaker recognition model by using the wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). WPE is used to extract the short vectors from speeches because it is robust against the noise. The i-vector is generated from those short vectors; it is used to characterize the speeches used for recognition and to improve the discrimination of low-quality speech. CDS is very suitable for a high-dimension vector such as the i-vector because it uses a kernel function to deal with the curse of dimensionality. To improve the discrimination of the i-vector, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are added to the CDS. Our proposed model is evaluated on the TIMIT database. The results of the experiments show that the proposed model can deal with the low-quality speech problem and resist the ill effect of noise. However, the time cost of the new model is high because extracting WPE is time-consuming. This paper calculates the WPE in a parallel way to reduce the time cost.

The rest of this paper is organized as follows. In Section 2, we describe the conventional speaker recognition model. In Section 3, the speaker recognition model based on i-vector is described. We propose a new speaker recognition model in Section 4, and the performance of the proposed model is reported in Section 5. Finally, we give a conclusion in Section 6.

2. The Conventional Speaker Recognition Model

The conventional speaker recognition model can be divided into two parts: short vector extraction and speaker classification. The short vector extraction transforms the speech into the short vectors, and the speaker classification uses a classifier to give out the recognition result based on the short vectors.

2.1. Short Vector Extraction. The Mel frequency cepstral coefficient (MFCC) method is the conventional short vector extraction algorithm. This method firstly decomposes the speech into 20–30 ms speech frames. For each frame, the cepstral coefficients can be calculated as follows [18]:

(1) Take the DFT of the frame to obtain the frequency spectrum.

(2) Map the power of the spectrum onto the Mel scale using the Mel filter bank.

(3) Calculate the logarithm of the power spectrum mapped on the Mel scale.

(4) Take the DCT of the logarithmic power spectrum to obtain the cepstral coefficients.
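As a concrete illustration, the four steps above translate into a few lines of code. The following Python sketch is our own minimal reading, not the paper's implementation: the 16 kHz sampling rate matches the TIMIT data used later, while the 320-sample (20 ms) frame, the 26-filter Mel bank, and the 13 retained coefficients are illustrative choices.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame)) ** 2                          # step (1): DFT power spectrum
    mel_power = mel_filter_bank(n_filters, len(frame), sr) @ power   # step (2): Mel mapping
    log_mel = np.log(mel_power + 1e-10)                              # step (3): logarithm
    return dct(log_mel, norm='ortho')[:n_ceps]                       # step (4): DCT, lower coefficients

frame = np.random.randn(320)       # stand-in for one 20 ms frame at 16 kHz
short_vector = mfcc(frame)         # 13-dimensional MFCC short vector
```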

Usually, the lower 13-14 coefficients are used to form the short vector. The Fused MFCC (FMFCC) method is an extension of MFCC. Compared with MFCC, it further calculates the delta derivatives to represent the dynamic information of speech. The derivatives are defined as follows [5]:

$$d_i = \frac{\sum_{p=1}^{2} p\,(cc_{i+p} - cc_{i-p})}{2\sum_{p=1}^{2} p^2}, \qquad dd_i = \frac{\sum_{p=1}^{2} p\,(d_{i+p} - d_{i-p})}{2\sum_{p=1}^{2} p^2}, \qquad i = 1, 2, 3, \ldots, \tag{1}$$

where $cc_i$ is the $i$th cepstral coefficient obtained by the MFCC method and $p$ is the offset; $d_i$ is the $i$th delta coefficient, and $dd_i$ is the $i$th delta-delta coefficient. If the short vector extracted by MFCC is denoted as $[cc_1, cc_2, cc_3, \ldots, cc_I]^T$, then the short vector extracted by FMFCC is denoted as $[cc_1, cc_2, cc_3, \ldots, cc_I, d_1, d_2, \ldots, d_I, dd_1, dd_2, \ldots, dd_I]^T$.
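Equation (1) is a sliding weighted difference over the frame index. A minimal NumPy sketch of it, applied to a whole sequence of cepstral vectors, might look as follows; padding the sequence by edge replication is our assumption, since the text does not specify how frame boundaries are handled.

```python
import numpy as np

def deltas(coeffs, P=2):
    """Delta derivatives per (1); coeffs has one row per frame."""
    denom = 2 * sum(p * p for p in range(1, P + 1))
    n = len(coeffs)
    padded = np.pad(coeffs, ((P, P), (0, 0)), mode='edge')   # replicate edge frames
    return sum(p * (padded[P + p:n + P + p] - padded[P - p:n + P - p])
               for p in range(1, P + 1)) / denom

cc = np.random.randn(100, 13)      # stand-in MFCC short vectors, 100 frames
d = deltas(cc)                     # delta coefficients
dd = deltas(d)                     # delta-delta coefficients
fmfcc = np.hstack([cc, d, dd])     # FMFCC short vectors: static + dynamic blocks
```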

2.2. Speaker Classification. The Gaussian mixture model (GMM) is a conventional classifier. It is defined as

$$p(\mathbf{x}) = \sum_{m=1}^{M} \alpha_m\, G(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m), \tag{2}$$

where $\mathbf{x}$ is a short vector extracted from an unknown speech; $G(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$ is the $m$th Gaussian function in the GMM, where $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ are its mean vector and covariance matrix, respectively; $\alpha_m$ is the combination weight of the Gaussian function and satisfies $\sum_{m=1}^{M} \alpha_m = 1$; and $M$ is the mixture number of the GMM. All of the parameters, such as the weights, mean vectors, and covariance matrices, are estimated by the famous EM algorithm [19] using the speech samples of a known speaker. In other words, $p(\mathbf{x})$ represents the characteristic of the known speaker's voice, so we use $p(\mathbf{x})$ to recognize the author of the unknown speeches.

[Figure 1: The structure of the speaker recognition model using i-vector: background, known, and unknown speeches are transformed into short vectors; the background short vectors train the background model; the known and unknown short vectors are turned into known and unknown i-vectors, which the speaker classification compares to give the result.]

Assume that an unknown speech is denoted by $\mathbf{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}$, where $\mathbf{y}_i$ represents the $i$th short vector extracted from $\mathbf{Y}$. Also assume that the parameters of $p(\mathbf{x})$ are estimated using the speech samples of a known speaker $s$. The result of recognition is defined as

$$r = \frac{1}{N} \sum_{i=1}^{N} \log\left[p(\mathbf{y}_i)\right] + \theta, \tag{3}$$

where $\theta > 0$ is the decision threshold and should be adjusted beforehand to obtain the best recognition performance. If $r \le 0$, then the GMM decides that the author of the unknown speech is not the known speaker $s$; if $r > 0$, then the GMM decides that the unknown speech is spoken by the speaker $s$.
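For orientation, the training and scoring of (2) and (3) can be sketched with an off-the-shelf GMM. The sketch below uses scikit-learn's GaussianMixture as a stand-in for the paper's EM training; the threshold value and the random stand-in data are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train one GMM per known speaker on his/her short vectors (EM runs inside fit()).
train_vectors = np.random.randn(2000, 13)          # stand-in for speaker s's data
gmm = GaussianMixture(n_components=8, covariance_type='diag', max_iter=100)
gmm.fit(train_vectors)

# Score an unknown speech as in (3): mean frame log-likelihood plus threshold.
theta = 55.0                                       # decision threshold, tuned beforehand
unknown = np.random.randn(500, 13)                 # short vectors of the unknown speech
r = gmm.score(unknown) + theta                     # score() returns the mean log-likelihood
decision = 'accept' if r > 0 else 'reject'
```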

3. The Speaker Recognition Model Using I-Vector

The speaker recognition model using i-vector can be decomposed into three parts: short vector extraction, i-vector extraction, and speaker classification. Figure 1 shows the structure of the model.

There are three types of speeches used in this model. The background speeches contain thousands of speeches spoken by lots of people; the known speeches are the speech samples of known speakers; and the unknown speeches are spoken by the speaker to be recognized. In the short vector extraction, all of the speeches are transformed into short vectors by a feature extraction method. In the i-vector extraction, the background short vectors are used to train the background model. The background model is usually represented by a GMM with 2048 mixtures, and all covariance matrices of the GMM are assumed the same for easy computation. Based on the background model, the known and unknown short vectors are used to extract the known and unknown i-vectors, respectively. Note that one i-vector refers to only one speech. In the speaker classification, a classifier is used to match the known i-vectors with the unknown i-vectors and give out the recognition result.

4. The Proposed Speaker Recognition Model

The accuracy of a recognition system usually drops off rapidly because of low-quality speech and noise. To deal with this problem, we propose a new speaker recognition model based on wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). In Section 4.1, we describe the WPE method and use it to extract the short vector. Section 4.2 describes how to extract the i-vector using the above short vectors. Finally, the details of CDS are described in Section 4.3.

4.1. Short Vector Extraction. This paper uses WPE to extract the short vector. The WPE is based on the wavelet packet transform (WPT) [20], so the WPT is firstly described. WPT is a local signal processing approach that is used to obtain the frequency spectrum. It decomposes the speech into many local frequency bands at multiple levels and obtains the frequency spectrum based on those bands. For a discrete signal such as digital speech, WPT is usually implemented by the famous Mallat fast algorithm [21]. In the algorithm, WPT is realized by a low-pass filter and a high-pass filter, which are generated by the mother wavelet and the corresponding scale function, respectively. Through the two filters, the speech is iteratively decomposed into a low-frequency and a high-frequency component. We can use a full binary tree to describe the process of WPT. The tree structure is shown in Figure 2.

In Figure 2, the root is the speech to be analyzed, and each nonroot node represents a component. The left child is the low-frequency component of its parent, and the right child is the high-frequency component of its parent. The left branch and the right branch are the low-pass and high-pass filtering processes, each followed by 2:1 downsampling.

[Figure 2: The wavelet packet transform described as a full binary tree: the root $W_0^0$ is iteratively split into low- and high-frequency children, down to the nodes $W_3^0, \ldots, W_3^7$ at level 3.]

[Figure 3: The flow chart of wavelet packet entropy: the speech is framed, each frame is normalized and decomposed by the WPT into the bands $W_4^0, \ldots, W_4^{15}$, and a Shannon entropy calculation on each band yields $[s_0, s_1, \ldots, s_{15}]$.]

The filtering processes are defined as

$$W_{j+1}^{2k}(n) = (W_j^k * \mathbf{h})(2n),$$
$$W_{j+1}^{2k+1}(n) = (W_j^k * \mathbf{g})(2n), \tag{4}$$
$$0 \le k \le 2^j - 1, \quad 0 \le j \le J, \quad 0 \le n \le N_j,$$

where $\mathbf{h}$ and $\mathbf{g}$ are the low-pass and high-pass filters, respectively; $N_j$ is the length of the frequency component at level $j$; $*$ is the convolution operation; and $J$ is the total number of decomposition levels. Because the WPT satisfies the conservation of energy, each leaf node denotes the spectrum of one of the frequency bands obtained by WPT. Based on the WPT, the wavelet packet entropy (WPE) method is proposed to extract the short vector, and in this paper we add a normalization step to the method to reduce the ill effect of the volume. The flow chart of the WPE used in this paper is shown in Figure 3.

Assume that there is a digital speech signal that has finite energy and length. It is firstly decomposed into 20 ms frames, and then each frame is normalized. The normalization process is defined as

$$\hat{f}[m] = \frac{f[m] - \mu}{\sigma}, \quad m = 1, 2, 3, \ldots, M, \tag{5}$$

where $f$ is a signal frame and $M$ is its length; $\mu$ is the mean value of the frame, and $\sigma$ is its standard deviation; $\hat{f}$ is the normalized frame. After the normalization process, the WPT decomposes the frame at 4 levels using (4). Therefore, we finally obtain 16 frequency bands, and the frequency spectrums in those bands are denoted as $W_4^0, W_4^1, \ldots, W_4^{15}$, respectively. For each spectrum, the Shannon entropy is calculated. The Shannon entropy is denoted as

$$s_k = -\sum_{n=1}^{N} p_{kn} \log(p_{kn}), \quad k = 0, 1, 2, \ldots, 15, \tag{6}$$

with

$$p_{kn} = \frac{\left|W_4^k(n)\right|^2}{e_k}, \qquad e_k = \sum_{n=1}^{N} \left|W_4^k(n)\right|^2, \tag{7}$$

where $e_k$ is the energy of the $k$th spectrum, $p_{kn}$ is the energy distribution of the $k$th spectrum, and $N$ is the length of each frequency spectrum. Finally, the Shannon entropies of all spectrums are calculated and collected to form a feature vector that is denoted as $[s_0, s_1, \ldots, s_{15}]^T$.
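Putting (5), (6), and (7) together, the WPE extraction for one frame can be sketched as below. This is a minimal reading on our part: PyWavelets stands in for the Mallat implementation described above, and the db4 mother wavelet anticipates the selection made in Section 5.2.

```python
import numpy as np
import pywt

def wpe_short_vector(frame, wavelet='db4', level=4):
    """16-dimensional wavelet packet entropy short vector, per (5)-(7)."""
    f = (frame - frame.mean()) / frame.std()                  # (5) normalization
    wp = pywt.WaveletPacket(f, wavelet=wavelet, maxlevel=level)
    bands = [node.data for node in wp.get_level(level, order='freq')]
    entropies = []
    for W in bands:                                           # 2^4 = 16 bands
        e = np.sum(np.abs(W) ** 2)                            # band energy e_k
        p = np.abs(W) ** 2 / e                                # (7) energy distribution
        entropies.append(-np.sum(p * np.log(p + 1e-12)))      # (6) Shannon entropy
    return np.array(entropies)                                # [s_0, ..., s_15]

frame = np.random.randn(320)       # stand-in for one 20 ms frame at 16 kHz
s = wpe_short_vector(frame)        # 16-dimensional short vector
```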

4.2. I-Vector Extraction. The i-vector is a robust feature vector that represents a speech using a single high-dimension vector. Because it considers the background information, the i-vector usually improves the accuracy of recognition [22]. Assume that there is a set of speeches supplied by different speakers and that all of the speeches are transformed into short vectors. In the i-vector theory, the speaker- and channel-dependent feature vector is assumed as

$$\mathbf{m} = \bar{\mathbf{m}} + \mathbf{T}\,\mathbf{w}(U), \tag{8}$$

where $\mathbf{m}$ is the speaker- and channel-dependent feature vector and $\bar{\mathbf{m}}$ is the background factor. Usually, $\bar{\mathbf{m}}$ is generated by stacking the mean vectors of a background model: if the mean vectors of the background model are denoted by $\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T, \ldots, \boldsymbol{\mu}_N^T$, where each mean vector is a row vector, then $\bar{\mathbf{m}}$ is denoted by $[\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_N]^T$. $\mathbf{T}$ is named the total variability matrix and represents a space that contains the speaker- and channel-dependent information. $\mathbf{w}(U)$ is a random vector having the standard normal distribution $N(\mathbf{0}, \mathbf{I})$, and the i-vector is the expectation of $\mathbf{w}(U)$; $U$ is a set of speeches, all of which are transformed into short vectors. Assume that a background model is given and that $\boldsymbol{\Sigma}$ is initialized by the covariance matrix of the background model, while $\mathbf{T}$ and $\mathbf{w}(U)$ are initialized randomly. $\mathbf{T}$ and $\mathbf{w}(U)$ are estimated by an iterative process described as follows:


(1) E-step: for each speech in the set $U$, calculate the parameters of the posterior distribution of $\mathbf{w}(U)$ using the current estimates of $\mathbf{T}$, $\boldsymbol{\Sigma}$, and $\bar{\mathbf{m}}$.

(2) M-step: update $\mathbf{T}$ and $\boldsymbol{\Sigma}$ by a linear regression in which the $\mathbf{w}(U)$s play the role of explanatory variables.

(3) Iterate until the expectation of $\mathbf{w}(U)$ is stable.

The details of the estimation processes of $\mathbf{T}$ and $\mathbf{w}(U)$ are described in [23].
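For orientation, the E-step's posterior has a well-known closed form (see [23]): given the zero-order statistics and the centered first-order statistics of a speech collected against the background model, the i-vector is the posterior mean of $\mathbf{w}$. The following is a minimal sketch under a diagonal-covariance assumption; all dimensions and inputs are illustrative stand-ins, not values from the paper.

```python
import numpy as np

def extract_ivector(T, sigma_inv, n_stats, f_centered):
    """Posterior mean of w: (I + T' S^-1 N T)^-1 T' S^-1 F~.
    T: (C*F, R) total variability matrix; sigma_inv: (C*F,) diagonal inverse covariance;
    n_stats: (C,) zero-order stats; f_centered: (C*F,) centered first-order stats."""
    CF, R = T.shape
    F = CF // len(n_stats)
    NT = np.repeat(n_stats, F)[:, None] * (sigma_inv[:, None] * T)  # S^-1 N T
    L = np.eye(R) + T.T @ NT                                        # posterior precision
    b = T.T @ (sigma_inv * f_centered)
    return np.linalg.solve(L, b)

# Toy sizes: C = 8 mixtures, F = 16 features, R = 40-dimensional i-vector
C, F, R = 8, 16, 40
T = 0.1 * np.random.randn(C * F, R)
sigma_inv = np.ones(C * F)                 # diagonal covariance assumption
n_stats = 100.0 * np.random.rand(C)
f_centered = np.random.randn(C * F)
w = extract_ivector(T, sigma_inv, n_stats, f_centered)   # the i-vector
```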

4.3. Speaker Classification. Cosine distance scoring (CDS) is used as the classifier in our proposed model. It uses a kernel function to deal with the curse of dimensionality, so CDS is very suitable for the i-vector. To describe this classifier easily, we take a two-class task as an example. Assume that there are two speakers, denoted as $s_1$ and $s_2$, who respectively speak $N_1$ and $N_2$ speeches. All speeches are represented by i-vectors and are denoted by $X_1 = \{\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1N_1}\}$ and $X_2 = \{\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2N_2}\}$, where $\mathbf{x}_{ij}$ is the i-vector representing the $j$th speech sample of the speaker $s_i$. We also assume there is an unknown speech represented by the i-vector $\mathbf{y}$. The purpose of the classifier is to match the unknown i-vector with the known i-vectors and determine which speaker speaks the unknown speech. The result of the recognition is defined as

$$D_i(\mathbf{y}) = \frac{1}{N_i} \sum_{n=1}^{N_i} K(\mathbf{x}_{in}, \mathbf{y}) + \theta, \quad i = 1, 2, \tag{9}$$

where $N_i$ is the total number of speeches supplied by the speaker $s_i$ and $\theta$ is the decision threshold. If $D_i(\mathbf{y}) \le 0$, the unknown speeches are not spoken by the known speaker $s_i$; if $D_i(\mathbf{y}) > 0$, then the author of the unknown speeches is the speaker $s_i$. $K(\cdot, \cdot)$ is the cosine kernel and is defined as

$$K(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T\mathbf{y}}{\sqrt{\mathbf{x}^T\mathbf{x}}\,\sqrt{\mathbf{y}^T\mathbf{y}}}, \tag{10}$$

where $\mathbf{x}$ is the known i-vector and $\mathbf{y}$ is the unknown i-vector. Usually, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are used to improve the discrimination of the i-vector. Therefore, the kernel function is rewritten as

$$K(\mathbf{x}, \mathbf{y}) = \frac{(\mathbf{A}^T\mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{y})}{\sqrt{(\mathbf{A}^T\mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{x})}\, \sqrt{(\mathbf{A}^T\mathbf{y})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{y})}}, \tag{11}$$

where $\mathbf{A}$ is the LDA projection matrix and $\mathbf{W}$ is the WCCN matrix. $\mathbf{A}$ and $\mathbf{W}$ are estimated by using all of the i-vectors, and the details of LDA and WCCN are described in [24].
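Equations (9), (10), and (11) amount to a handful of vector operations. A minimal sketch, assuming the LDA projection $\mathbf{A}$ and the WCCN matrix $\mathbf{W}$ have already been estimated (all sizes and stand-in data are illustrative):

```python
import numpy as np

def cds_kernel(x, y, A, W_inv):
    """Cosine kernel (11) after LDA projection and WCCN normalization."""
    px, py = A.T @ x, A.T @ y
    return (px @ W_inv @ py) / (np.sqrt(px @ W_inv @ px) * np.sqrt(py @ W_inv @ py))

def cds_decision(known_ivectors, y, A, W_inv, theta=0.0):
    """Averaged kernel score plus threshold, as in (9)."""
    return np.mean([cds_kernel(x, y, A, W_inv) for x in known_ivectors]) + theta

rng = np.random.default_rng(0)
A = rng.standard_normal((400, 200))        # LDA projection matrix (assumed given)
W_inv = np.eye(200)                        # inverse WCCN matrix (assumed given)
known = [rng.standard_normal(400) for _ in range(5)]   # speaker s_i's i-vectors
unknown = rng.standard_normal(400)
accepted = cds_decision(known, unknown, A, W_inv, theta=0.1) > 0
```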

5. Experiment and Results

In this section, we report the outcome of our experiments. In Section 5.1, we describe the experimental dataset. In Section 5.2, we carry out an experiment to select the optimal mother wavelet for the WPE algorithm. In Sections 5.3 and 5.4, we evaluate the recognition accuracy of our model in clean and noisy environments, respectively. In Section 5.5, we evaluate the performance of the proposed model. Finally, the time cost of the model is counted in Section 5.6.

5.1. Experimental Dataset. Our experiments are performed on the TIMIT speech database [25]. This database contains 630 speakers (192 females and 438 males) who come from 8 different English dialect regions. Each speaker supplies ten speech samples that are sampled at 16 kHz and last 5 seconds. All female speeches are used to obtain a background model that represents the common characteristic of the female voice; likewise, all male speeches are used to generate another background model characterizing the male voice. 384 speakers (192 females and 192 males) are randomly selected, and their speeches are used as the known and unknown speeches. The test results presented in our experiments are collected on a computer with a 2.5 GHz Intel Core i5 CPU and 8 GB of memory, and the experimental platform is MATLAB R2012b.

5.2. Optimal Mother Wavelet. A good mother wavelet can improve the performance of the WPE algorithm. The performance of a mother wavelet rests on two important elements: the support size and the number of vanishing moments. If a mother wavelet has a large number of vanishing moments, the WPE would ignore much of the unimportant information; if the mother wavelet has a small support size, the WPE would accurately locate important information [26]. Therefore, an optimal mother wavelet should have a large number of vanishing moments and a small support size. In this view, the Daubechies and Symlet wavelets are good candidates because they have the largest number of vanishing moments for a given support size. Moreover, those wavelets are orthogonal and are suitable for the Mallat fast algorithm.

In this paper, we use the Energy-to-Shannon-Entropy Ratio (ESER) to evaluate those Daubechies and Symlet wavelets and find out the best one. ESER is a way to analyze the performance of a mother wavelet and has been employed to select the best mother wavelet in [27]. The ESER is defined as

$$r = \frac{E}{S}, \tag{12}$$

where $S$ is the Shannon entropy of the spectrum obtained by WPT and $E$ is the energy of the spectrum. High energy means the spectrum obtained by WPT contains much of the information of the speech; low entropy means that the information in the spectrum is stable. Therefore, the optimal mother wavelet should maximize the energy and meanwhile minimize the entropy.
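The text does not spell out exactly how $E$ and $S$ are aggregated for Table 1, so the sketch below is one plausible reading, using PyWavelets: the energy and the Shannon entropy of the pooled level-4 coefficients give the ratio (12) for each candidate wavelet.

```python
import numpy as np
import pywt

def eser(signal, wavelet, level=4):
    """Energy-to-Shannon-entropy ratio (12) of a WPT decomposition."""
    wp = pywt.WaveletPacket(signal, wavelet=wavelet, maxlevel=level)
    coeffs = np.concatenate([n.data for n in wp.get_level(level, order='freq')])
    E = np.sum(coeffs ** 2)                        # spectrum energy
    p = coeffs ** 2 / E
    S = -np.sum(p * np.log(p + 1e-12))             # spectrum Shannon entropy
    return E / S

speech = np.random.randn(16000)                    # stand-in for a TIMIT utterance
candidates = [f'db{i}' for i in range(1, 9)] + [f'sym{i}' for i in range(1, 9)]
ranking = sorted(candidates, key=lambda w: eser(speech, w), reverse=True)
```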

In this experiment, 8 Daubechies and 8 Symlet wavelets, respectively denoted as db1–8 and sym1–8, are employed to decompose speeches that are randomly selected from the TIMIT database. We run the experiment 100 times and record the average ESER of those mother wavelets in Table 1.

In Table 1, we find that db4 and sym6 obtain the highest ESER; in other words, db4 and sym6 are the best mother wavelets for the speech data. Reference [28] suggests that sym6 can improve the performance of the speaker recognition model. However, the Symlet wavelets produce complex coefficients whose imaginary parts are redundant for a real signal such as digital speech, so we abandon sym6 and choose db4.

Table 1: The average ESER of the mother wavelets.

Mother wavelet | ESER
db1  | 8.8837
db2  | 8.9032
db3  | 8.9744
db4  | 9.0849
db5  | 9.0141
db6  | 8.9653
db7  | 8.9169
db8  | 8.9084
sym1 | 8.8835
sym2 | 8.9036
sym3 | 8.9493
sym4 | 8.9975
sym5 | 9.0382
sym6 | 9.0859
sym7 | 9.0244
sym8 | 8.9837

5.3. The Accuracy of the Speaker Recognition Model in Clean Environment. This experiment evaluates the accuracy of the speaker recognition model. We randomly select 384 speakers (192 females and 192 males). For each speaker, half of the speeches are used as the unknown speeches, and the other half are used as the known speeches. For each speaker, the speaker recognition model matches his/her unknown speeches with all of the known speeches of the 384 speakers and determines who speaks the unknown speeches. If the result is right, the model obtains one score; if the result is wrong, the model gets zero score. Finally, we count the score and calculate the mean accuracy that is defined as

$$\text{accuracy} = \frac{\text{score}}{384} \times 100\%. \tag{13}$$

In this experiment, we use four types of speaker recognition models for comparison. The first one is the MFCC-GMM model [4]. This model uses the MFCC method to extract 14D short vectors and uses the GMM with 8 mixtures to recognize the speaker based on those short vectors. The second one is the FMFCC-GMM model [16]. This model is very similar to the MFCC-GMM model, but it uses the FMFCC method to extract the 52D short vectors. The third one is the WPE-GMM model [10]. This model firstly uses WPE to transform the speeches into 16D short vectors and then uses GMM for speaker classification. The last one is the WPE-I-CDS model proposed in this paper. Compared with the WPE-GMM model, our model uses the 16D short vectors to generate a 400D i-vector and uses CDS to recognize the speaker based on the i-vector. We carry out each experiment in this section 25 times to obtain the mean accuracy. The mean accuracy of the above 4 models is shown in Figure 4.

In Figure 4, we find that MFCC-GMM obtains the lowest accuracy of 88.46%. The result of [4] shows that the MFCC-GMM model can obtain an accuracy higher than 90%. This is because we use the GMM with 8 mixtures as the classifier, whereas [4] uses the GMM with 32 mixtures. A large mixture number can improve the performance of the GMM, but it also causes a very high computational expense.

[Figure 4: The mean accuracy of the 4 models (MFCC-GMM, FMFCC-GMM, WPE-GMM, and WPE-I-CDS) in clean environment, for 16 kHz and 8 kHz speech.]

WPE-I-CDS obtains the highest accuracy of 94.36%. This reflects the achievements of the i-vector theory. On the other hand, when the 8 kHz speeches (low-quality speeches) are used, the accuracy of all speaker recognition models decreases. The accuracy of MFCC-GMM, FMFCC-GMM, and WPE-GMM decreases by about 6%. Comparatively, the accuracy of WPE-I-CDS decreases by about 1%. This is because the i-vector considers the background information to improve the accuracy of the speaker recognition model, and the CDS uses the LDA and WCCN to improve the discrimination of the i-vector. Reference [29] also reports that the combination of the i-vector and the CDS can enhance the performance of speaker recognition models used for low-quality speeches such as phone speeches.

5.4. The Accuracy of the Speaker Recognition Model in Noisy Environment. It is hard to find a clean speech in real applications because the noise in the transmission channel and environment cannot be controlled. In this experiment, we add 30 dB, 20 dB, and 10 dB Gaussian white noise to the speeches to simulate noisy speeches. All noises are generated by MATLAB's Gaussian white noise function.
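The paper generates the noise in MATLAB; a NumPy equivalent of this corruption step, assuming the stated decibel levels refer to the signal-to-noise ratio, might look like the following.

```python
import numpy as np

def add_white_noise(speech, snr_db):
    """Add zero-mean Gaussian white noise at a target SNR in dB."""
    p_signal = np.mean(speech ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return speech + np.sqrt(p_noise) * np.random.randn(len(speech))

clean = np.random.randn(80000)          # stand-in for a 5 s speech at 16 kHz
noisy_30 = add_white_noise(clean, 30)   # weak-noise condition
noisy_10 = add_white_noise(clean, 10)   # strong-noise condition
```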

For comparison, this experiment employs three i-vector-based models: MFCC-I-CDS [30], FMFCC-I-CDS [31], and our WPE-I-CDS. The first two models are very similar to our proposed model, but they use the MFCC and FMFCC methods to extract the short vectors, respectively. The accuracy of the 3 models in noisy environment is shown in Figure 5.

In Figure 5, the three models obtain high accuracy in clean environment. This also shows that the i-vector can improve the recognition accuracy effectively. However, when we use the noisy speeches to test the 3 models, their accuracies decrease. When 30 dB noise is added to the speeches, the accuracy of the three models decreases by about 4%.

[Figure 5: The accuracy of the 3 models (MFCC-I-CDS, FMFCC-I-CDS, and WPE-I-CDS) in clean, 30 dB, 20 dB, and 10 dB noise conditions.]

This shows that all of the models can resist weak noise. However, when we enhance the power of the noise, the accuracy of MFCC-I-CDS and FMFCC-I-CDS drops off rapidly. In particular, when the noise increases to 10 dB, the accuracy of the above two models decreases by more than 30%. Comparatively, the WPE-I-CDS's accuracy decreases by less than 12%. This shows that the WPE-I-CDS is robust in noisy environment compared with MFCC-I-CDS and FMFCC-I-CDS. This is because the WPE uses the WPT to obtain the frequency spectrum, but MFCC and FMFCC use the DFT to do that. The WPT decomposes the speech into many local frequency bands that can limit the ill effect of noise, but the DFT decomposes the speech into a global frequency domain that is sensitive to the noise.

5.5. The Performance of the Speaker Recognition Model. Usually, the speaker recognition model is used in an access control system. Therefore, a good speaker recognition model should have the ability to accept the login of the correct people and meanwhile reject the access of the imposters, as a gatekeeper does. In this experiment, we use the receiver operating characteristic (ROC) curve to evaluate this ability of our model. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the decision threshold and has been employed in [2].

In this experiment, we randomly select 384 speakers (192 males and 192 females) to calculate the ROC curve. Half of those speakers are used as the correct people, and the other half are used as the imposters. We firstly use the speeches of the correct people to test the speaker recognition model to calculate the TPR, and then we use the speeches of the imposters to attack the speaker recognition model to calculate the FPR.
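The TPR/FPR sweep over decision thresholds is exactly what standard ROC routines compute. A sketch with synthetic scores follows; in the real experiment the scores would come from each model's decision function, and the labels would mark correct-speaker versus imposter trials.

```python
import numpy as np
from sklearn.metrics import roc_curve

# One score per trial: higher means "more likely the correct speaker".
scores = np.concatenate([np.random.normal(1.0, 0.5, 192),    # correct-speaker trials
                         np.random.normal(0.0, 0.5, 192)])   # imposter trials
labels = np.concatenate([np.ones(192), np.zeros(192)])
fpr, tpr, thresholds = roc_curve(labels, scores)   # sweeps the decision threshold
```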

[Figure 6: The ROC curves of TPR versus FPR for MFCC-GMM, FMFCC-GMM, WPE-GMM, and WPE-I-CDS.]

The 4 models MFCC-GMM, FMFCC-GMM, WPE-GMM, and our WPE-I-CDS are used for comparison. To plot the ROC curve, we adjust the decision thresholds to obtain different ROC points. The ROC curves of those 4 models are shown in Figure 6.

A low FPR shows that the speaker recognition model can effectively resist the attacks coming from the imposters, and a high TPR shows that the speaker recognition model can accurately accept the correct speakers' logins. In other words, a speaker recognition model is useful if its TPR is high for a low FPR. In Figure 6, when the FPR is higher than 0.45, all models obtain a high TPR, but WPE-I-CDS obtains a higher TPR than the other 3 models for any given FPR less than 0.45. This shows that the WPE-I-CDS can achieve the access control task more effectively than the other models.

5.6. Time Cost. This section tests the time cost of the fast MFCC-GMM, the conventional MFCC-I-CDS, and our WPE-I-CDS. We use 200 5-second-long speeches to test each model and calculate the average time cost. The result of this experiment is shown in Table 2.

In Table 2, MFCC-GMM does not employ the i-vector for speech representation, so it does not spend time extracting the i-vector; comparatively, the WPE-I-CDS must. The WPE-I-CDS costs the most time to extract the short vector compared with the MFCC-GMM, because the WPT used by WPE is more complex than the DFT used by the MFCC. On the other hand, the parameters of the GMM should be estimated beforehand, so MFCC-GMM spends time training the classifier. CDS need not spend time estimating parameters, but it must estimate the matrices of the LDA and WCCN in the classifier training step. In all, the i-vector can improve the recognition accuracy at the cost of increased time consumption, and calculating the WPE costs much more time than calculating the MFCC.

Table 2: The time cost of the different speaker recognition models (seconds per speech).

Speaker recognition model | Short vector extraction | I-vector extraction | Training classifier | Recognition
MFCC-GMM | 0.46 | — | 1.92 | 0.64
MFCC-I-CDS | 0.45 | 2.37 | 1.81 | 0.73
WPE-I-CDS | 2.81 | 2.24 | 1.82 | 0.75
WPE-I-CDS (parallel computation) | 0.78 | 2.23 | 1.82 | 0.74

Therefore, it is very important to find a way to reduce the time cost of the WPE.

Parallel computation is an effective way to reduce the time cost because the loops in the linear computation can be finished at once using a parallel algorithm. For example, consider a signal of length $N$ decomposed by WPT at $M$ levels. In the conventional linear algorithm of WPT, we have to run a filtering process whose time complexity is $O(\log N)$ a total of $M \times N$ times, so the total time cost of WPT is $O(MN \log N)$. If we use $N$ independent computational cores to implement the WPT with a parallel algorithm, the time complexity of WPT can be reduced to $O(M \log N)$. This paper uses 16 independent computational cores to implement the WPE parallel algorithm, and the last line of Table 2 shows that the time cost of WPE is greatly reduced.
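One simple way to realize this in practice is to distribute the frames over independent cores. This is a coarser granularity than the within-transform parallelism analyzed above, but it is easy to sketch; the per-frame extractor repeats the one outlined in Section 4.1.

```python
import numpy as np
import pywt
from multiprocessing import Pool

def wpe_short_vector(frame, wavelet='db4', level=4):
    """Per-frame WPE extractor, as sketched in Section 4.1."""
    f = (frame - frame.mean()) / frame.std()
    wp = pywt.WaveletPacket(f, wavelet=wavelet, maxlevel=level)
    out = []
    for node in wp.get_level(level, order='freq'):
        e = np.sum(np.abs(node.data) ** 2)
        p = np.abs(node.data) ** 2 / e
        out.append(-np.sum(p * np.log(p + 1e-12)))
    return np.array(out)

if __name__ == '__main__':
    frames = [np.random.randn(320) for _ in range(250)]   # ~5 s of 20 ms frames
    with Pool(processes=16) as pool:                      # 16 cores, as in the text
        short_vectors = np.vstack(pool.map(wpe_short_vector, frames))
```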

6. Conclusions

With the development of computer techniques, speaker recognition has been widely used in speech-based access systems. In the real environment, the quality of the speech may be low, and the noise in the transmission channel cannot be controlled. Therefore, it is necessary to find a speaker recognition model that is not sensitive to those factors such as noise and low-quality speech.

This paper proposes a new speaker recognition model that employs wavelet packet entropy (WPE), i-vector, and CDS, and we name the model WPE-I-CDS. WPE uses a local analysis tool named WPT rather than the DFT to decompose the signal. Because WPT decomposes the signal into many independent frequency bands that limit the ill effect of noise, the WPE is robust in the noisy environment. The i-vector is a type of robust feature vector; because it considers the background information, the i-vector can improve the accuracy of recognition. CDS uses a kernel function to deal with the curse of dimensionality, so it is suitable for a high-dimension feature vector such as the i-vector. The results of the experiments in this paper show that the proposed speaker recognition model can improve the performance of recognition compared with the conventional models, such as MFCC-GMM, FMFCC-GMM, and WPE-GMM, in clean environment. Moreover, the WPE-I-CDS obtains higher accuracy than the other i-vector-based models, such as MFCC-I-CDS and FMFCC-I-CDS, in noisy environment. However, the time cost of the proposed model is very high. To reduce the time cost, we employ a parallel algorithm to implement the WPE and i-vector extraction methods.

In the future, we will combine audio and visual features to improve the performance of the speaker recognition system.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors also thank Professor Kun She for his assistance in the preparation of the manuscript. This paper is supported by the Project of Sichuan Provincial Science Plan (M112016GZ0073) and the National Nature Foundation (Grant no. 61672136).

References

[1] H. Perez, J. Martinez, and I. Espinosa, "Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elder users," International Journal of Human-Computer Studies, vol. 98, pp. 1–13, 2017.

[2] N. Almaadeed, A. Aggoun, and A. Amira, "Speaker identification using multimodal neural networks and wavelet analysis," IET Biometrics, vol. 4, no. 1, pp. 18–28, 2015.

[3] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.

[4] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[5] K. S. Ahmad, A. S. Thosar, J. H. Nirmal, and V. S. Pande, "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network," in Proceedings of the 8th International Conference on Advances in Pattern Recognition (ICAPR '15), Kolkata, India, 2015.

[6] X.-Y. Zhang, J. Bai, and W.-Z. Liang, "The speech recognition system based on bark wavelet MFCC," in Proceedings of the 8th International Conference on Signal Processing (ICSP '06), pp. 16–20, Beijing, China, 2006.

[7] A. Biswas, P. K. Sahu, and M. Chandra, "Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition," Computer & Communication Technology, vol. 40, pp. 1111–1122, 2014.

[8] K. Daqrouq and T. A. Tutunji, "Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers," Applied Soft Computing, vol. 27, pp. 231–239, 2015.


[9] D. Avci, "An expert system for speaker identification using adaptive wavelet sure entropy," Expert Systems with Applications, vol. 36, no. 3, pp. 6295–6300, 2009.

[10] K. Daqrouq, "Wavelet entropy and neural network for text-independent speaker identification," Engineering Applications of Artificial Intelligence, vol. 24, no. 5, pp. 796–802, 2011.

[11] M. K. Elyaberani, S. H. Mahmoodian, and G. Sheikhi, "Wavelet packet entropy in speaker-identification emotional state detection from speech signal," Journal of Intelligent Procedures in Electrical Technology, vol. 20, pp. 67–74, 2015.

[12] M. Li, A. Tsiartas, M. Van Segbroeck, and S. S. Narayanan, "Speaker verification using simplified and supervised i-vector modeling," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 7199–7203, Vancouver, Canada, 2013.

[13] D. Garcia-Romero and A. McCree, "Supervised domain adaptation for i-vector based speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4047–4051, Florence, Italy, 2014.

[14] A. Kanagasundaram, D. Dean, and S. Sridharan, "I-vector based speaker recognition using advanced channel compensation techniques," Computer Speech & Language, vol. 28, pp. 121–140, 2014.

[15] M. H. Bahari, R. Saeidi, H. Van Hamme, and D. Van Leeuwen, "Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7344–7348, IEEE, Vancouver, Canada, 2013.

[16] B. Saha and K. Kamarauslas, "Evaluation of effectiveness of different methods in speaker recognition," Elektronika ir Elektrotechnika, vol. 98, pp. 67–70, 2015.

[17] K. K. George, C. S. Kumar, K. I. Ramachandran, and A. Panda, "Cosine distance features for robust speaker verification," in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH '15), pp. 234–238, Dresden, Germany, 2015.

[18] B. G. Nagaraja and H. S. Jayanna, "Efficient window for monolingual and crosslingual speaker identification using MFCC," in Proceedings of the International Conference on Advanced Computing and Communication Systems (ICACCS '13), pp. 1–4, Coimbatore, India, 2013.

[19] M. Medeiros, G. Araujo, H. Macedo, M. Chella, and L. Matos, "Multi-kernel approach to parallelization of EM algorithm for GMM training," in Proceedings of the 3rd Brazilian Conference on Intelligent Systems (BRACIS '14), pp. 158–165, Sao Paulo, Brazil, 2014.

[20] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark wavelet & MFCC for robust speech recognition tasks," in Proceedings of the 2nd International Conference on Industrial Mechatronics and Automation (ICIMA '10), pp. 329–332, Wuhan, China, 2010.

[21] X.-F. Lv and P. Tao, "Mallat algorithm of wavelet for time-varying system parametric identification," in Proceedings of the 25th Chinese Control and Decision Conference (CCDC '13), pp. 1554–1556, Guiyang, China, 2013.

[22] D. Martinez, O. Plchot, and L. Burget, "Language recognition in i-vector space," in Proceedings of Interspeech, pp. 861–864, Florence, Italy, 2011.

[23] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345–354, 2005.

[24] M. McLaren and D. Van Leeuwen, "Improved speaker recognition when using i-vectors from multiple speech sources," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 5460–5463, Prague, Czech Republic, 2011.

[25] A. Biswas, P. K. Sahu, A. Bhowmick, and M. Chandra, "Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition," International Journal of Speech Technology, vol. 17, no. 4, pp. 389–399, 2014.

[26] S. G. Mallat, A Wavelet Tour of Signal Processing, Elsevier, Amsterdam, Netherlands, 2012.

[27] Q. Yang and J. Wang, "Multi-level wavelet Shannon entropy-based method for single-sensor fault location," Entropy, vol. 17, no. 10, pp. 7101–7117, 2015.

[28] T. Ganchev, M. Siafarikas, I. Mporas, and T. Stoyanova, "Wavelet basis selection for enhanced speech parametrization in speaker verification," International Journal of Speech Technology, vol. 17, no. 1, pp. 27–36, 2014.

[29] M. Seboussaoui, P. Kenny, and N. Dehak, "An i-vector extractor suitable for speaker recognition with both microphone and telephone speech," Odyssey, vol. 6, pp. 1–6, 2011.

[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 1559–1562, Brighton, United Kingdom, 2009.

[31] M. I. Mandasari, M. McLaren, and D. Van Leeuwen, "Evaluation of i-vector speaker recognition systems for forensic application," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 21–24, Florence, Italy, 2011.


Page 2: Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and …downloads.hindawi.com/journals/jece/2017/1735698.pdf · 2019. 7. 30. · ResearchArticle Speaker Recognition Using

2 Journal of Electrical and Computer Engineering

(WPE) [8] method is proposed to extract the short vectorReferences [8ndash11] have shown that the short vector extractedby WPE is insensitive to noise

I-vector is another type of feature vector It is a robust wayto represent a speech using a single high-dimension vectorand it is generated by the short vectors I-vector considersboth of the speaker-dependent and background informationso it usually leads to good accuracy References [12ndash14] haveused it to enhance the performance of speaker recognitionmodel Specially [15] uses the i-vector to improve the dis-crimination of the low-quality speech Usually the i-vectoris generated from the short vectors extracted by the MFCCor FMFCCmethods but we employ theWPE to extract thoseshort vectors because theWPEcan resist the ill effect of noise

Once the speeches are transformed into the featurevectors a classifier is used to recognize the identity of speakerbased on those feature vectors Gaussian mixture model(GMM) is a conventional classifier Because it is fast andsimple GMM has been widely used for speaker recognition[4 16] However if the dimension of the feature vector ishigh the curse of dimensionwill destroy this classifierUnfor-tunately i-vector is high-dimensional vector compared withthe short vector Cosine distance scoring (CDS) is anothertype of classifier used for the speaker recognition [17] Thisclassifier uses a kernel function to deal with the problem ofhigh-dimension vector so it is suitable for the i-vector In thispaper we employ the CDS for speaker classification

The main work of this paper is to propose a new speakerrecognition model by using the wavelet packet entropy(WPE) i-vector and cosine distance scoring (CDS) WPE isused to extract the short vectors from speeches because itis robust against the noise I-vector is generated from thoseshort vectors It is used to characterize the speeches used forrecognition to improve the discrimination of the low-qualityspeech CDS is very suitable for high-dimension vector suchas i-vector because it uses a kernel function to deal with thecurse of dimension To improve the discrimination of the i-vector linear discriminant analysis (LDA) and the covariancenormalization (WCNN) are added to the CDS Our proposedmodel is evaluated by TIMIT database The result of theexperiments show that the proposed model can deal with thelow-quality speech problem and resist the ill effect of noiseHowever the time cost of the new model is high becauseextractingWPE is time-consumingThis paper calculates theWPE in a parallel way to reduce the time cost

The rest of this paper is organized as follows In Section 2we describe the conventional speaker recognition model InSection 3 the speaker recognition model based on i-vectoris described We propose a new speaker recognition modelin Section 4 and the performance of the proposed model isreported in Section 5 Finally we give out a conclusion inSection 6

2 The Conventional SpeakerRecognition Model

Conventional speaker recognition model can be dividedinto two parts such as short vector extraction and speaker

classification The short vector extraction transforms thespeech into the short vectors and the speaker classificationuses a classifier to give out the recognition result based on theshort vectors

21 Short Vector Extraction Mel frequency cepstral coef-ficient (MFCC) method is the conventional short vectorextraction algorithm This method firstly decomposes thespeech into 20ndash30ms speech frames For each frame thecepstral coefficient can be calculated as follows [18]

(1) Take DFT of the frame to obtain the frequencyspectrum

(2) Map the power of the spectrum onto Mel scale usingthe Mel filter bank

(3) Calculate the logarithm value of the power spectrummapped on the Mel scale

(4) Take DCT of logarithmic power spectrum to obtainthe cepstral coefficient

Usually the lower 13-14 coefficients are used to form the shortvector Fused MFCC (FMFCC) method is the extension ofMFCC Compared with MFCC it further calculates the deltaderivatives to represent the dynamic information of speechThe derivatives are defined as follows [5]

119889119894 =sum2119901=1 119901 (119888119888119894minus119901 + 119888119888119894+119901)

2 sum2119901=1 1199012

119889119889119894 =sum2119901=1 119901 (119889119894minus119901 + 119889119894+119901)

2 sum2119901=1 1199012

119894 = 1 2 3

(1)

where 119888119888119894 is the 119894th cepstral coefficient obtained by theMFCCmethod and 119901 is the offset 119889119894 is the 119894th delta coefficientand 119889119889119894 is the 119894th delta-delta coefficient If the short vectorextracted byMFCC is denoted as [1198881198881 1198881198882 1198881198883 119888119888119868]119879 thenthe short vector extracted by FMFCC is denoted as [1198881198881 11988811988821198881198883 119888119888119868 1198891 1198892 119889119868 1198891198891 1198891198892 119889119889119868]119879

22 Speaker Classification Gaussian mixture model (GMM)is a conventional classifier It is defined as

119901 (x) =119872

sum119898=1

120572119898119866 (x120583119898Σ119898) (2)

where x is a short vector extracted from an unknown speech119866(x120583119898Σ119898) is the 119898th Gaussian function in GMM where120583119898Σ119898 are its mean vector and variance matrix respectively120572119898 is the combination weight of the Gaussian function andsatisfiessum119872119898=1 120572119898 = 1119872 is the mixture number of the GMMAll of the parameters such as weights mean vectors andvariancematrices are estimated by the famous EM algorithm[19] using the speech samples of a known speaker In otherwords 119901(x) represents the characteristic of the knownspeakerrsquos voice so we use 119901(x) to recognize the author of

Journal of Electrical and Computer Engineering 3

Background speeches

Background model

Known speeches

Short vector extraction

Known i-vectors

Unknown speeches

Unknown i-vectors

result

Backgroundshort vectors

Known short vectors

Unknown short vectors

Speaker classification

I-vector extraction

Figure 1 The structure of the speaker recognition using i-vector

the unknown speeches Assume that an unknown speech isdenoted by Y = y1 y2 y119873 where y119894 represents the119894th short vector extracted from Y Also assume that theparameters of 119901(x) are estimated using the speech samplesof a known speaker 119904 The result of recognition is defined as

119903 = 1119873119873

sum119894=1

log [119901 (y119894)] + 120579 (3)

where 120579 gt 0 is the decision threshold and should be adjustedbeforehand to obtain the best recognition performance If119903 le 0 then the GMMdecides that the author of the unknownspeech is not the known speaker 119904 if the 119903 gt 0 then theGMMdecides that the unknown speech is spoken by the speaker 119904

3 The Speaker Recognition ModelUsing I-Vector

The speaker recognition model using i-vector can be decom-posed into three parts such as short vector extraction i-vector extraction and speaker classification Figure 1 showsthe structure of the model

There are three types of speeches used for this modelBackground speeches contains thousands of speeches spokenby lots of people the known speeches are the speech samplesof known speakers and the unknown speeches are spoken bythe speaker to be recognized In the short vector extractionall of the speeches are transformed into the short vectors bya feature extraction method In the i-vector extraction thebackground short vectors are used to train the backgroundmodel The background model is usually represented by aGMMwith 2048 mixtures and all covariance matrices of theGMM are assumed the same for easy computation Basedon the background model the known and unknown shortvectors are used to extract the known and unknown i-vectorsrespectively Note that one i-vector refers to only one speechIn the speaker classification a classifier is used to match the

known i-vector with the unknown i-vector and give out therecognition result

4. The Proposed Speaker Recognition Model

The accuracy of a recognition system usually drops off rapidly because of low-quality speech and noise. To deal with this problem, we propose a new speaker recognition model based on wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). In Section 4.1 we describe the WPE method and use it to extract the short vector. Section 4.2 describes how to extract the i-vector using the above short vectors. Finally, the details of CDS are described in Section 4.3.

4.1. Short Vector Extraction. This paper uses WPE to extract the short vector. The WPE is based on the wavelet packet transform (WPT) [20], so the WPT is described first. WPT is a local signal processing approach that is used to obtain the frequency spectrum. It decomposes the speech into many local frequency bands at multiple levels and obtains the frequency spectrum based on those bands. For a discrete signal such as digital speech, WPT is usually implemented by the famous Mallat fast algorithm [21]. In this algorithm, WPT is realized by a low-pass filter and a high-pass filter, which are generated by the mother wavelet and the corresponding scale function, respectively. Through the two filters, the speech is iteratively decomposed into a low-frequency and a high-frequency component. We can use a full binary tree to describe the process of WPT; the tree structure is shown in Figure 2.

In Figure 2, the root is the speech to be analyzed, and each nonroot node represents a component: the left child is the low-frequency component of its parent, and the right child is the high-frequency component of its parent. The left branch and the right branch are the low-pass and high-pass filtering processes, followed by 2:1 downsampling, respectively.

Figure 2: The wavelet packet transform, drawn as a full binary tree whose nodes $W_j^k$ denote the $k$th frequency component at decomposition level $j$.

Figure 3: The flow chart of wavelet packet entropy. The speech is framed, each frame is normalized and decomposed by the WPT, the Shannon entropy of each band $W_4^0, W_4^1, \ldots, W_4^{15}$ is calculated, and the entropies are collected into the vector $[s_0, s_1, \ldots, s_{15}]$.

The filtering processes are defined as

$$W_{j+1}^{2k}(n) = W_j^k * \mathbf{h}(2n),$$
$$W_{j+1}^{2k+1}(n) = W_j^k * \mathbf{g}(2n),$$
$$0 \le k \le 2^j,\ 0 \le j \le J,\ 0 \le n \le N_j, \quad (4)$$

where $\mathbf{h}$ and $\mathbf{g}$ are the low-pass and high-pass filters, respectively, $N_j$ is the length of the frequency component at level $j$, $*$ is the convolution operation, and $J$ is the total number of decomposition levels. Because the WPT satisfies the conservation of energy, each leaf node denotes the spectrum of a frequency band obtained by the WPT. Based on the WPT, the wavelet packet entropy (WPE) method is proposed to extract the short vector, and in this paper we add a normalization step into the method to reduce the ill effect of the volume. The flow chart of the WPE used in this paper is shown in Figure 3.
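As a concrete illustration of (4), the following is a minimal numpy sketch of one decomposition step; the Haar (db1) filter taps are an assumption made only to keep the example short, while the actual model selects db4 in Section 5.2:

```python
import numpy as np

# Haar (db1) analysis filter pair, chosen only to keep the sketch short.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass filter

def wpt_step(w):
    """One level of (4): convolve with h and g, then 2:1 downsample."""
    low = np.convolve(w, h)[::2]   # W_{j+1}^{2k}, low-frequency child
    high = np.convolve(w, g)[::2]  # W_{j+1}^{2k+1}, high-frequency child
    return low, high

frame = np.random.randn(320)  # one 20 ms frame at 16 kHz
low, high = wpt_step(frame)   # applying this recursively builds the full tree
```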

Assume that there is a digital speech signal that has finite energy and length. It is firstly decomposed into 20 ms frames, and then each frame is normalized. The normalization process is defined as

$$\widehat{f}[m] = \frac{f[m] - \mu}{\sigma}, \quad m = 1, 2, 3, \ldots, M, \quad (5)$$

where $f$ is a signal frame and $M$ is its length, $\mu$ is the mean value of the frame and $\sigma$ is its standard deviation, and $\widehat{f}$ is the normalized frame. After the normalization process, the WPT decomposes the frame at 4 levels using (4). Therefore, we finally obtain 16 frequency bands, and the frequency spectrums in those bands are denoted as $W_4^0, W_4^1, \ldots, W_4^{15}$, respectively. For each spectrum, the Shannon entropy is calculated. The Shannon entropy is denoted as

$$s_k = -\sum_{n=1}^{N} p_{kn} \log(p_{kn}), \quad k = 0, 1, 2, \ldots, 15, \quad (6)$$

with

$$p_{kn} = \frac{|W_4^k(n)|^2}{e_k}, \qquad e_k = \sum_{n=1}^{N} |W_4^k(n)|^2, \quad (7)$$

where $e_k$ is the energy of the $k$th spectrum, $p_{kn}$ is the energy distribution of the $k$th spectrum, and $N$ is the length of each frequency spectrum. Finally, the Shannon entropies of all 16 spectrums are calculated and collected to form a feature vector that is denoted as $[s_0, s_1, \ldots, s_{15}]^T$.
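A compact sketch of the whole WPE extraction for one frame follows, using the PyWavelets package for the 4-level WPT; the package choice is an assumption, and any WPT implementation would serve equally well:

```python
import numpy as np
import pywt  # PyWavelets; an assumed implementation of the WPT

def wpe_short_vector(frame, wavelet='db4', level=4):
    """Equations (5)-(7): normalize the frame, run a 4-level WPT,
    and return the Shannon entropy of each of the 16 leaf bands."""
    f = (frame - frame.mean()) / frame.std()  # normalization, (5)
    wp = pywt.WaveletPacket(data=f, wavelet=wavelet, maxlevel=level)
    entropies = []
    for node in wp.get_level(level, order='freq'):  # leaves W_4^0 .. W_4^15
        power = np.abs(node.data) ** 2
        e_k = power.sum()                 # band energy e_k, (7)
        p = power / e_k                   # energy distribution p_kn, (7)
        entropies.append(-np.sum(p * np.log(p + 1e-12)))  # s_k, (6)
    return np.array(entropies)            # the 16-D short vector

vec = wpe_short_vector(np.random.randn(320))
```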

4.2. I-Vector Extraction. I-vector is a robust feature vector that represents a speech using a single high-dimension vector. Because it considers the background information, the i-vector usually improves the accuracy of recognition [22]. Assume that there is a set of speeches. Those speeches are supplied by different speakers, and all of the speeches are transformed into short vectors. In the i-vector theory, the speaker- and channel-dependent feature vector is assumed to be

$$\mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w}(U), \quad (8)$$

where $\mathbf{M}$ is the speaker- and channel-dependent feature vector and $\mathbf{m}$ is the background factor. Usually, $\mathbf{m}$ is generated by stacking the mean vectors of a background model: if the mean vectors of the background model are denoted by $\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T, \ldots, \boldsymbol{\mu}_N^T$, where each mean vector is a row vector, then $\mathbf{m}$ is denoted by $[\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_N]^T$. $\mathbf{T}$ is named the total variability matrix and represents a space that contains the speaker- and channel-dependent information. $\mathbf{w}(U)$ is a random vector having the standard normal distribution $N(\mathbf{0}, \mathbf{I})$, and the i-vector is the expectation of $\mathbf{w}(U)$. $U$ is a set of speeches, and all of the speeches are transformed into short vectors. Assume that a background model is given and $\boldsymbol{\Sigma}$ is initialized by the covariance matrix of the background model; $\mathbf{T}$ and $\mathbf{w}(U)$ are initialized randomly and are then estimated by an iterative process described as follows.

(1) E-step: for each speech in the set $U$, calculate the parameters of the posterior distribution of $\mathbf{w}(U)$ using the current estimates of $\mathbf{T}$, $\boldsymbol{\Sigma}$, and $\mathbf{m}$.

(2) M-step: update $\mathbf{T}$ and $\boldsymbol{\Sigma}$ by a linear regression in which the $\mathbf{w}(U)$s play the role of explanatory variables.

(3) Iterate until the expectation of $\mathbf{w}(U)$ is stable.

The details of the estimation processes of $\mathbf{T}$ and $\mathbf{w}(U)$ are described in [23].
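The central computation of the E-step is the posterior of $\mathbf{w}$ for one utterance given its zeroth- and first-order Baum-Welch statistics. The toy numpy sketch below follows the standard formulation in [23]; all sizes and the random statistics are illustrative assumptions:

```python
import numpy as np

# Illustrative sizes (assumptions): C mixtures, F-dim short vectors, R-dim i-vector.
C, F, R = 8, 16, 40
rng = np.random.default_rng(0)

T = 0.1 * rng.standard_normal((C * F, R))  # total variability matrix
Sigma_inv = np.eye(C * F)                  # inverse of the shared covariance
N0 = rng.random(C) * 100                   # zeroth-order Baum-Welch statistics
F1 = rng.standard_normal(C * F)            # centered first-order statistics, stacked

# Repeat each mixture's occupancy count over its F feature dimensions.
N_diag = np.repeat(N0, F)

# Posterior of w for one utterance: precision L and mean w (the i-vector).
L = np.eye(R) + T.T @ (Sigma_inv * N_diag) @ T
w = np.linalg.solve(L, T.T @ (Sigma_inv @ F1))
```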

4.3. Speaker Classification. Cosine distance scoring (CDS) is used as the classifier in our proposed model. It uses a kernel function to deal with the curse of dimension, so CDS is very suitable for the i-vector. To describe this classifier easily, we take a two-class task as an example. Assume that there are two speakers, denoted as $s_1$ and $s_2$, who respectively speak $N_1$ and $N_2$ speeches. All speeches are represented by i-vectors and are denoted by $X_1 = \{\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1N_1}\}$ and $X_2 = \{\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2N_2}\}$, where $\mathbf{x}_{ij}$ is the i-vector representing the $j$th speech sample of the speaker $s_i$. We also assume that there is an unknown speech represented by the i-vector $\mathbf{y}$. The purpose of the classifier is to match the unknown i-vector with the known i-vectors and determine who speaks the unknown speech. The result of the recognition is defined as

$$D_i(\mathbf{y}) = \frac{1}{N_i} \sum_{n=1}^{N_i} K(\mathbf{x}_{in}, \mathbf{y}) + \theta, \quad i = 1, 2, \quad (9)$$

where $N_i$ is the total number of speeches supplied by the speaker $s_i$ and $\theta$ is the decision threshold. If $D_i(\mathbf{y}) \le 0$, the unknown speeches are not spoken by the known speaker $s_i$; if $D_i(\mathbf{y}) > 0$, then the author of the unknown speeches is the speaker $s_i$. $K(\cdot, \cdot)$ is the cosine kernel and is defined as

$$K(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}\mathbf{y}^T}{\sqrt{\mathbf{x}\mathbf{x}^T}\sqrt{\mathbf{y}\mathbf{y}^T}}, \quad (10)$$

where $\mathbf{x}$ is the known i-vector and $\mathbf{y}$ is the unknown i-vector. Usually, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are used to improve the discrimination of the i-vector. Therefore, the kernel function is rewritten as

$$K(\mathbf{x}, \mathbf{y}) = \frac{(\mathbf{A}^T\mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{y})}{\sqrt{(\mathbf{A}^T\mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{x})} \sqrt{(\mathbf{A}^T\mathbf{y})^T \mathbf{W}^{-1} (\mathbf{A}^T\mathbf{y})}}, \quad (11)$$

where $\mathbf{A}$ is the LDA projection matrix and $\mathbf{W}$ is the WCCN matrix. $\mathbf{A}$ and $\mathbf{W}$ are estimated by using all of the i-vectors, and the details of LDA and WCCN are described in [24].
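A minimal numpy sketch of the CDS decision in (9)-(11) follows; the LDA and WCCN matrices here are random placeholders (estimating them from data follows [24] and is omitted), and the threshold is a hypothetical value:

```python
import numpy as np

def cds_kernel(x, y, A, W_inv):
    """Kernel (11): cosine similarity after LDA projection and WCCN."""
    px, py = A.T @ x, A.T @ y
    return (px @ W_inv @ py) / np.sqrt((px @ W_inv @ px) * (py @ W_inv @ py))

def decision(known_ivectors, y, A, W_inv, theta=-0.5):
    """Decision function (9): mean kernel score plus the threshold theta."""
    k = np.mean([cds_kernel(x, y, A, W_inv) for x in known_ivectors])
    return k + theta  # > 0 accepts the speaker, <= 0 rejects

rng = np.random.default_rng(0)
A = rng.standard_normal((400, 200))   # hypothetical LDA projection, 400-D -> 200-D
W_inv = np.eye(200)                   # hypothetical (inverted) WCCN matrix
known = rng.standard_normal((5, 400)) # a known speaker's i-vectors
y = rng.standard_normal(400)          # the unknown i-vector
print(decision(known, y, A, W_inv))
```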

5. Experiment and Results

In this section, we report the outcomes of our experiments. In Section 5.1 we describe the experimental dataset. In Section 5.2 we carry out an experiment to select the optimal mother wavelet for the WPE algorithm. In Sections 5.3 and 5.4 we evaluate the recognition accuracy of our model in clean and noisy environments. In Section 5.5 we evaluate the performance of the proposed model. Finally, the time cost of the model is counted in Section 5.6.

5.1. Experimental Dataset. Our experiments are performed on the TIMIT speech database [25]. This database contains 630 speakers (192 females and 438 males) who come from 8 different English dialect regions. Each speaker supplies ten speech samples that are sampled at 16 kHz and last 5 seconds. All female speeches are used to obtain a background model that represents the common characteristics of the female voice; likewise, all male speeches are used to generate another background model characterizing the male voice. 384 speakers (192 females and 192 males) are randomly selected, and their speeches are used as the known and unknown speeches. The test results presented in our experiments are collected on a computer with a 2.5 GHz Intel Core i5 CPU and 8 GB of memory, and the experimental platform is MATLAB R2012b.

5.2. Optimal Mother Wavelet. A good mother wavelet can improve the performance of the WPE algorithm. The performance of a mother wavelet depends on two important elements: the support size and the number of vanishing moments. If a mother wavelet has a large number of vanishing moments, the WPE can ignore much of the unimportant information; if the mother wavelet has a small support size, the WPE can accurately locate important information [26]. Therefore, an optimal mother wavelet should have a large number of vanishing moments and a small support size. In this view, the Daubechies and Symlet wavelets are good candidates because they have the largest number of vanishing moments for a given support size. Moreover, those wavelets are orthogonal and are suitable for the Mallat fast algorithm.

In this paper, we use the Energy-to-Shannon-Entropy Ratio (ESER) to evaluate the Daubechies and Symlet wavelets and find out the best one. ESER is a way to analyze the performance of a mother wavelet and has been employed to select the best mother wavelet in [27]. The ESER is defined as

$$r = \frac{E}{S}, \quad (12)$$

where $S$ is the Shannon entropy of the spectrum obtained by the WPT and $E$ is the energy of the spectrum. High energy means that the spectrum obtained by the WPT contains much of the information of the speech, and low entropy means that the information in the spectrum is stable. Therefore, the optimal mother wavelet should maximize the energy and meanwhile minimize the entropy.
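A sketch of this selection procedure with PyWavelets (the package choice is assumed, as before): compute the total band energy and the summed Shannon entropy of the level-4 WPT leaves for each candidate wavelet and rank by the ratio in (12). A real run would average the ratio over many TIMIT frames rather than one random frame:

```python
import numpy as np
import pywt

def eser(frame, wavelet, level=4):
    """Ratio (12): total band energy over summed Shannon entropy."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    E, S = 0.0, 0.0
    for node in wp.get_level(level, order='freq'):
        power = np.abs(node.data) ** 2
        e_k = power.sum()
        p = power / e_k
        E += e_k
        S -= np.sum(p * np.log(p + 1e-12))
    return E / S

# db1-db8 and sym2-sym8 (sym1 coincides with db1 and is not listed by pywt).
candidates = [f'db{i}' for i in range(1, 9)] + [f'sym{i}' for i in range(2, 9)]
frame = np.random.randn(320)  # stand-in for a randomly selected TIMIT frame
print(max(candidates, key=lambda w: eser(frame, w)))
```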

In this experiment, 8 Daubechies and 8 Symlet wavelets, respectively denoted as db1-8 and sym1-8, are employed to decompose speeches that are randomly selected from the TIMIT database. We run the experiment 100 times and record the average ESER of those mother wavelets in Table 1.

In Table 1, we find that db4 and sym6 obtain the highest ESER; in other words, db4 and sym6 are the best mother wavelets for the speech data. Reference [28] suggests that sym6 can improve the performance of the speaker recognition model. However, the Symlet wavelets produce complex coefficients whose imaginary parts are redundant for a real signal such as digital speech, so we abandon sym6 and choose db4.

Table 1: The average ESER of the mother wavelets.

Mother wavelet    ESER
db1               8.8837
db2               8.9032
db3               8.9744
db4               9.0849
db5               9.0141
db6               8.9653
db7               8.9169
db8               8.9084
sym1              8.8835
sym2              8.9036
sym3              8.9493
sym4              8.9975
sym5              9.0382
sym6              9.0859
sym7              9.0244
sym8              8.9837

5.3. The Accuracy of the Speaker Recognition Model in a Clean Environment. This experiment evaluates the accuracy of the speaker recognition model. We randomly select 384 speakers (192 females and 192 males). For each speaker, half of the speeches are used as the unknown speeches and the other half are used as the known speeches. For each speaker, the speaker recognition model matches his/her unknown speeches with all of the known speeches of the 384 speakers and determines who speaks the unknown speeches. If the result is right, the model obtains one score; if the result is wrong, the model gets zero score. Finally, we count the scores and calculate the mean accuracy, which is defined as

$$\text{accuracy} = \frac{\text{score}}{384} \times 100\%. \quad (13)$$

In this experiment, we use four types of speaker recognition models for comparison. The first one is the MFCC-GMM model [4]; it uses the MFCC method to extract 14-D short vectors and uses a GMM with 8 mixtures to recognize the speaker based on those short vectors. The second one is the FMFCC-GMM model [16]; it is very similar to the MFCC-GMM model, but it uses the FMFCC method to extract 52-D short vectors. The third one is the WPE-GMM model [10], which firstly uses WPE to transform the speeches into 16-D short vectors and then uses a GMM for speaker classification. The last one is the WPE-I-CDS model proposed in this paper; compared with the WPE-GMM model, our model uses the 16-D short vectors to generate a 400-D i-vector and uses CDS to recognize the speaker based on the i-vector. We carry out each experiment in this section 25 times to obtain the mean accuracy. The mean accuracies of the above 4 models are shown in Figure 4.

In Figure 4, we find that MFCC-GMM obtains the lowest accuracy of 88.46%. The result of [4] shows that the MFCC-GMM model can obtain an accuracy higher than 90%; this is because we use a GMM with 8 mixtures as the classifier, whereas [4] uses a GMM with 32 mixtures.

Figure 4: The mean accuracy (%) of the 4 models (MFCC-GMM, FMFCC-GMM, WPE-GMM, and WPE-I-CDS) in the clean environment, for 8 kHz and 16 kHz speech.

A large mixture number can improve the performance of the GMM, but it also causes very high computational expense. WPE-I-CDS obtains the highest accuracy of 94.36%, which demonstrates the benefit of the i-vector theory. On the other hand, when the 8 kHz (low-quality) speeches are used, the accuracy of every speaker recognition model decreases. The accuracies of MFCC-GMM, FMFCC-GMM, and WPE-GMM decrease by about 6%; comparatively, the accuracy of WPE-I-CDS decreases by only about 1%. This is because the i-vector considers the background information to improve the accuracy of the speaker recognition model, and the CDS uses the LDA and WCCN to improve the discrimination of the i-vector. Reference [29] also reports that the combination of the i-vector and the CDS can enhance the performance of a speaker recognition model used for low-quality speeches such as phone speeches.

5.4. The Accuracy of the Speaker Recognition Model in a Noisy Environment. It is hard to find clean speech in real applications because the noise in the transmission channel and the environment cannot be controlled. In this experiment, we add 30 dB, 20 dB, and 10 dB Gaussian white noise to the speeches to simulate noisy speeches. All noises are generated by MATLAB's Gaussian white noise function.

For comparison, this experiment employs three i-vector-based models: MFCC-I-CDS [30], FMFCC-I-CDS [31], and our WPE-I-CDS. The first two models are very similar to our proposed model, but they use the MFCC and FMFCC methods to extract the short vectors, respectively. The accuracies of the 3 models in the noisy environment are shown in Figure 5.

In Figure 5, the three models obtain high accuracy in the clean environment; this also shows that the i-vector can improve the recognition accuracy effectively. However, when we use the noisy speeches to test the 3 models, their accuracies decrease.

Figure 5: The accuracy (%) of the 3 models (MFCC-I-CDS, FMFCC-I-CDS, and WPE-I-CDS) in the noisy environment, for clean speech and for 30 dB, 20 dB, and 10 dB noise.

When 30 dB noise is added to the speeches, the accuracy of the three models decreases by about 4%, which shows that all of the models can resist weak noise. However, when we enhance the power of the noise, the accuracies of MFCC-I-CDS and FMFCC-I-CDS drop off rapidly. In particular, when the noise increases to 10 dB, the accuracy of the above two models decreases by more than 30%; comparatively, the accuracy of WPE-I-CDS decreases by less than 12%. This shows that WPE-I-CDS is robust in the noisy environment compared with MFCC-I-CDS and FMFCC-I-CDS. The reason is that the WPE uses the WPT to obtain the frequency spectrum, whereas MFCC and FMFCC use the DFT: the WPT decomposes the speech into many local frequency bands that limit the ill effect of noise, but the DFT decomposes the speech into a global frequency domain that is sensitive to the noise.

5.5. The Performance of the Speaker Recognition Model. Usually, the speaker recognition model is used in an access control system. Therefore, a good speaker recognition model should be able to accept the login of the correct people and meanwhile reject the access of imposters, as a gatekeeper does. In this experiment, we use the receiver operating characteristic (ROC) curve to evaluate this ability of our model. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the decision threshold and has been employed in [2].

In this experiment, we randomly select 384 speakers (192 males and 192 females) to calculate the ROC curve. Half of those speakers are used as the correct people, and the other half are used as the imposters. We firstly use the speeches of the correct people to test the speaker recognition model to calculate the TPR, and then we use the speeches of the imposters to attack the speaker recognition model to calculate the FPR.

Figure 6: The ROC curves (TPR versus FPR) of MFCC-GMM, FMFCC-GMM, WPE-GMM, and WPE-I-CDS.

The 4 models, MFCC-GMM, FMFCC-GMM, WPE-GMM, and our WPE-I-CDS, are used for comparison. To plot the ROC curves, we adjust the decision thresholds to obtain different ROC points. The ROC curves of those 4 models are shown in Figure 6.

A low FPR shows that the speaker recognition model can effectively resist the attacks coming from the imposters, and a high TPR shows that the speaker recognition model can accurately accept the correct speakers' logins. In other words, a speaker recognition model can be useful only if its TPR is high for a low FPR. In Figure 6, when the FPR is higher than 0.45, all models obtain a high TPR, but WPE-I-CDS obtains a higher TPR than the other 3 models for any given FPR less than 0.45. This shows that WPE-I-CDS can achieve the access control task more effectively than the other models.
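As a brief sketch of how such a curve can be computed from model scores, the following uses scikit-learn (an assumed tooling choice; the paper's experiments sweep the decision threshold manually), with hypothetical score distributions for the genuine and imposter trials:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Hypothetical decision scores D_i(y) before thresholding.
genuine = rng.normal(0.6, 0.2, 192)    # correct-speaker trials
imposter = rng.normal(0.1, 0.2, 192)   # imposter trials

scores = np.concatenate([genuine, imposter])
labels = np.concatenate([np.ones(192), np.zeros(192)])  # 1 = correct speaker

# Each (FPR, TPR) point corresponds to one value of the decision threshold.
fpr, tpr, thresholds = roc_curve(labels, scores)
```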

5.6. Time Cost. This section tests the time costs of the fast MFCC-GMM, the conventional MFCC-I-CDS, and our WPE-I-CDS. We use 200 5-second-long speeches to test each model and calculate the average time cost. The results of this experiment are shown in Table 2.

In Table 2, MFCC-GMM does not employ the i-vector for speech representation, so it does not cost time to extract the i-vector; comparatively, WPE-I-CDS must spend time on i-vector extraction. WPE-I-CDS also costs the most time to extract the short vectors, because the WPT used by WPE is more complex than the DFT used by MFCC. On the other hand, the parameters of the GMM should be estimated beforehand, so MFCC-GMM costs time to train the classifier; CDS needs no such parameter estimation, but it must spend time estimating the matrices of the LDA and WCCN in the classifier training step. In all, the i-vector improves the recognition accuracy at the cost of increased time consumption, and calculating the WPE costs much more time than calculating the MFCC.

Table 2: The time cost of the different speaker recognition models (seconds per speech). The first two columns are the feature extraction stages; the last two are the speaker classification stages.

Speaker recognition model           Short vector extraction   I-vector extraction   Training classifier   Recognition
MFCC-GMM                            0.46                      —                     1.92                  0.64
MFCC-I-CDS                          0.45                      2.37                  1.81                  0.73
WPE-I-CDS                           2.81                      2.24                  1.82                  0.75
WPE-I-CDS (parallel computation)    0.78                      2.23                  1.82                  0.74

Therefore, it is very important to find a way to reduce the time cost of the WPE.

Parallel computation is an effective way to reduce the time cost, because the loops in the linear computation can be finished at once using a parallel algorithm. For example, consider a signal of length $N$ decomposed by the WPT at $M$ levels. In the conventional linear algorithm of the WPT, we have to run a filtering process of time complexity $O(\log N)$ up to $M \times N$ times across the decomposition levels, so the total time cost of the WPT is $O(MN\log N)$. If we use $N$ independent computational cores to implement the WPT with a parallel algorithm, the time complexity of the WPT can be reduced to $O(M\log N)$. This paper uses 16 independent computational cores to implement the WPE parallel algorithm, and the last line of Table 2 shows that the time cost of the WPE is thereby greatly reduced.
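As a rough sketch of the same idea at a coarser grain, the frames of a speech are independent and can be mapped onto worker processes in parallel. The sketch below assumes the wpe_short_vector function from the Section 4.1 sketch lives in a hypothetical module named wpe:

```python
import numpy as np
from multiprocessing import Pool

from wpe import wpe_short_vector  # hypothetical module holding the Section 4.1 sketch

def frames_of(signal, frame_len=320):
    """Split a speech signal into non-overlapping 20 ms frames at 16 kHz."""
    n = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n)]

if __name__ == '__main__':
    speech = np.random.randn(16000 * 5)  # one 5-second speech, as in the tests
    # Frames are independent, so all of them can be processed at once on 16 cores.
    with Pool(processes=16) as pool:
        short_vectors = pool.map(wpe_short_vector, frames_of(speech))
```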

6. Conclusions

With the development of computer techniques, speaker recognition has been widely used in speech-based access systems. In the real environment, the quality of the speech may be low, and the noise in the transmission channel cannot be controlled. Therefore, it is necessary to find a speaker recognition model that is not sensitive to factors such as noise and low-quality speech.

This paper proposes a new speaker recognition model by employing wavelet packet entropy (WPE), i-vector, and CDS, and we name the model WPE-I-CDS. WPE uses a local analysis tool named WPT, rather than the DFT, to decompose the signal. Because the WPT decomposes the signal into many independent frequency bands that limit the ill effect of noise, the WPE is robust in the noisy environment. I-vector is a type of robust feature vector; because it considers the background information, the i-vector can improve the accuracy of recognition. CDS uses a kernel function to deal with the curse of dimension, so it is suitable for a high-dimension feature vector such as the i-vector. The results of the experiments in this paper show that the proposed speaker recognition model improves the recognition performance compared with conventional models, such as MFCC-GMM, FMFCC-GMM, and WPE-GMM, in the clean environment. Moreover, WPE-I-CDS obtains higher accuracy than the other i-vector-based models, MFCC-I-CDS and FMFCC-I-CDS, in the noisy environment. However, the time cost of the proposed model is considerably higher. To reduce the time cost, we employ a parallel algorithm to implement the WPE and i-vector extraction methods.

In the future, we will combine audio and visual features to improve the performance of the speaker recognition system.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Professor Kun She for assistance in the preparation of the manuscript. This paper is supported by the Project of the Sichuan Provincial Science Plan (M112016GZ0073) and the National Natural Science Foundation of China (Grant no. 61672136).

References

[1] H. Perez, J. Martinez, and I. Espinosa, "Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elder users," International Journal of Human-Computer Studies, vol. 98, pp. 1-13, 2017.

[2] N. Almaadeed, A. Aggoun, and A. Amira, "Speaker identification using multimodal neural networks and wavelet analysis," IET Biometrics, vol. 4, no. 1, pp. 18-28, 2015.

[3] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.

[4] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.

[5] K. S. Ahmad, A. S. Thosar, J. H. Nirmal, and V. S. Pande, "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network," in Proceedings of the 8th International Conference on Advances in Pattern Recognition (ICAPR '15), Kolkata, India, 2015.

[6] X.-Y. Zhang, J. Bai, and W.-Z. Liang, "The speech recognition system based on bark wavelet MFCC," in Proceedings of the 8th International Conference on Signal Processing (ICSP '06), pp. 16-20, Beijing, China, 2006.

[7] A. Biswas, P. K. Sahu, and M. Chandra, "Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition," Computer & Communication Technology, vol. 40, pp. 1111-1122, 2014.

[8] K. Daqrouq and T. A. Tutunji, "Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers," Applied Soft Computing, vol. 27, pp. 231-239, 2015.

[9] D. Avci, "An expert system for speaker identification using adaptive wavelet sure entropy," Expert Systems with Applications, vol. 36, no. 3, pp. 6295-6300, 2009.

[10] K. Daqrouq, "Wavelet entropy and neural network for text-independent speaker identification," Engineering Applications of Artificial Intelligence, vol. 24, no. 5, pp. 796-802, 2011.

[11] M. K. Elyaberani, S. H. Mahmoodian, and G. Sheikhi, "Wavelet packet entropy in speaker-identification emotional state detection from speech signal," Journal of Intelligent Procedures in Electrical Technology, vol. 20, pp. 67-74, 2015.

[12] M. Li, A. Tsiartas, M. Van Segbroeck, and S. S. Narayanan, "Speaker verification using simplified and supervised i-vector modeling," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 7199-7203, Vancouver, Canada, 2013.

[13] D. Garcia-Romero and A. McCree, "Supervised domain adaptation for i-vector based speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4047-4051, Florence, Italy, 2014.

[14] A. Kanagasundaram, D. Dean, and S. Sridharan, "I-vector based speaker recognition using advanced channel compensation techniques," Computer Speech & Language, vol. 28, pp. 121-140, 2014.

[15] M. H. Bahari, R. Saeidi, H. Van Hamme, and D. Van Leeuwen, "Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7344-7348, Vancouver, Canada, 2013.

[16] B. Saha and K. Kamarauslas, "Evaluation of effectiveness of different methods in speaker recognition," Elektronika ir Elektrotechnika, vol. 98, pp. 67-70, 2015.

[17] K. K. George, C. S. Kumar, K. I. Ramachandran, and A. Panda, "Cosine distance features for robust speaker verification," in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH '15), pp. 234-238, Dresden, Germany, 2015.

[18] B. G. Nagaraja and H. S. Jayanna, "Efficient window for monolingual and crosslingual speaker identification using MFCC," in Proceedings of the International Conference on Advanced Computing and Communication Systems (ICACCS '13), pp. 1-4, Coimbatore, India, 2013.

[19] M. Medeiros, G. Araujo, H. Macedo, M. Chella, and L. Matos, "Multi-kernel approach to parallelization of EM algorithm for GMM training," in Proceedings of the 3rd Brazilian Conference on Intelligent Systems (BRACIS '14), pp. 158-165, Sao Paulo, Brazil, 2014.

[20] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark wavelet & MFCC for robust speech recognition tasks," in Proceedings of the 2nd International Conference on Industrial Mechatronics and Automation (ICIMA '10), pp. 329-332, Wuhan, China, 2010.

[21] X.-F. Lv and P. Tao, "Mallat algorithm of wavelet for time-varying system parametric identification," in Proceedings of the 25th Chinese Control and Decision Conference (CCDC '13), pp. 1554-1556, Guiyang, China, 2013.

[22] D. Martinez, O. Plchot, and L. Burget, "Language recognition in i-vector space," in Proceedings of Interspeech, pp. 861-864, Florence, Italy, 2011.

[23] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345-354, 2005.

[24] M. McLaren and D. Van Leeuwen, "Improved speaker recognition when using i-vectors from multiple speech sources," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 5460-5463, Prague, Czech Republic, 2011.

[25] A. Biswas, P. K. Sahu, A. Bhowmick, and M. Chandra, "Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition," International Journal of Speech Technology, vol. 17, no. 4, pp. 389-399, 2014.

[26] S. G. Mallat, A Wavelet Tour of Signal Processing, Elsevier, Amsterdam, Netherlands, 2012.

[27] Q. Yang and J. Wang, "Multi-level wavelet Shannon entropy-based method for single-sensor fault location," Entropy, vol. 17, no. 10, pp. 7101-7117, 2015.

[28] T. Ganchev, M. Siafarikas, I. Mporas, and T. Stoyanova, "Wavelet basis selection for enhanced speech parametrization in speaker verification," International Journal of Speech Technology, vol. 17, no. 1, pp. 27-36, 2014.

[29] M. Seboussaoui, P. Kenny, and N. Dehak, "An i-vector extractor suitable for speaker recognition with both microphone and telephone speech," in Odyssey, vol. 6, pp. 1-6, 2011.

[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 1559-1562, Brighton, United Kingdom, 2009.

[31] M. I. Mandasari, M. McLaren, and D. Van Leeuwen, "Evaluation of i-vector speaker recognition systems for forensic application," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 21-24, Florence, Italy, 2011.

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 3: Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and …downloads.hindawi.com/journals/jece/2017/1735698.pdf · 2019. 7. 30. · ResearchArticle Speaker Recognition Using

Journal of Electrical and Computer Engineering 3

Background speeches

Background model

Known speeches

Short vector extraction

Known i-vectors

Unknown speeches

Unknown i-vectors

result

Backgroundshort vectors

Known short vectors

Unknown short vectors

Speaker classification

I-vector extraction

Figure 1 The structure of the speaker recognition using i-vector

the unknown speeches Assume that an unknown speech isdenoted by Y = y1 y2 y119873 where y119894 represents the119894th short vector extracted from Y Also assume that theparameters of 119901(x) are estimated using the speech samplesof a known speaker 119904 The result of recognition is defined as

119903 = 1119873119873

sum119894=1

log [119901 (y119894)] + 120579 (3)

where 120579 gt 0 is the decision threshold and should be adjustedbeforehand to obtain the best recognition performance If119903 le 0 then the GMMdecides that the author of the unknownspeech is not the known speaker 119904 if the 119903 gt 0 then theGMMdecides that the unknown speech is spoken by the speaker 119904

3 The Speaker Recognition ModelUsing I-Vector

The speaker recognition model using i-vector can be decom-posed into three parts such as short vector extraction i-vector extraction and speaker classification Figure 1 showsthe structure of the model

There are three types of speeches used for this modelBackground speeches contains thousands of speeches spokenby lots of people the known speeches are the speech samplesof known speakers and the unknown speeches are spoken bythe speaker to be recognized In the short vector extractionall of the speeches are transformed into the short vectors bya feature extraction method In the i-vector extraction thebackground short vectors are used to train the backgroundmodel The background model is usually represented by aGMMwith 2048 mixtures and all covariance matrices of theGMM are assumed the same for easy computation Basedon the background model the known and unknown shortvectors are used to extract the known and unknown i-vectorsrespectively Note that one i-vector refers to only one speechIn the speaker classification a classifier is used to match the

known i-vector with the unknown i-vector and give out therecognition result

4 The Proposed Speaker Recognition Model

The accuracy of recognition system usually drops off rapidlybecause of the low-quality speech and noise To deal withthe problem we propose a new speaker recognition modelbased on wavelet packet entropy (WPE) i-vector and cosinedistance scoring (CDS) In Section 41 we describe the WPEmethod and use it to extract the short vector Section 42describes how to extract the i-vector using the above shortvectors Finally the details of CDS are described in Sec-tion 43

41 Short Vector Extraction This paper uses WPE to extractthe short vector The WPE is based on the wavelet packettransform (WPT) [20] so the WPT is firstly described WPTis a local signal processing approach that is used to obtainthe frequency spectrum It decomposes the speech intomanylocal frequency bands at multiple levels and obtains thefrequency spectrum based on the bands For the discretesignal such as digital speech WPT is usually implemented bythe famousMallat fast algorithm [21] In the algorithmWPTis realized by a low-pass filter and a high-pass filter which aregenerated by the mother wavelet and the corresponding scalefunction respectively Through the two filters the speechis iteratively decomposed into a low-frequency and a high-frequency components We can use a full binary tree todescribe the process of WPT The three structures are shownin Figure 2

In Figure 2 root is the speech to be analyzed Eachnonroot node represents a component The left child is thelow-frequency component of its parent and the right child isthe high-frequency component of its parent The left branchand the right branch are the low-pass and high-pass filtering

4 Journal of Electrical and Computer Engineering

W0

0

W1

1W0

1

W0

2 W1

2 W2

2 W3

2

W0

3 W1

3 W2

3 W3

3W

5

3 W6

3 W7

3W4

3

Figure 2 The wavelet packet transform at 2 levels

Framing

WPT

Shannon entropy

calculation

Normalization

Shannon entropy

calculation

Shannon entropy

calculation

Speech

middot middot middot

[s0 s1 s15]

W1

3W0

4 W15

4

Figure 3 The flow chart of wavelet packet entropy

processes followed by 2 1 downsampling respectively Thefiltering processes are defined as

W2119896119895+1 (119899) = W119896119895 lowast h (2119899)

W2119896+1119895+1 (119899) = W119896119895 lowast g (2119899)

0 le 119896 le 2119895 0 le 119895 le 119869 0 le 119899 le 119873119895

(4)

where h and g are the low-pass and high-pass filter respec-tively 119873119895 is the length of the frequency component at level119895 lowast is the convolution operation 119869 is the total numberof the decomposition levels Because the WPT satisfies theconservation of energy each leaf node denotes the spectrumof the frequency bands obtained byWPT Based on theWPTthe wavelet packet entropy (WPE) method is proposed toextract the short vector and we add a normalization step intothemethod to reduce the ill effect of the volume in this paperThe flow chart ofWPEused in this paper is shown in Figure 3

Assume that there is a digital speech signal that has finiteenergy and length It is firstly decomposed into 20ms framesand then each frame is normalized The normalizationprocess is defined as

119891 [119898] = 119891 [119898] minus 120583120590 119894 = 1 2 3 119872 (5)

where 119891 is a signal frame and 119872 is its length 120583 is themean value of the frame and 120590 is its standard variance 119891 isthe normalized frame After the normalization process theWPT decomposes the frame at 4 levels using (4) Thereforewe finally obtain 16 frequency bands and the frequencyspectrums in those bands are denoted as 11988204 11988214 119882154 respectively For each spectrum the Shannon entropy iscalculated The Shannon entropy is denoted as

119904119896 = minus119873

sum119899=1

119901119896119899 log (119901119896119899) 119896 = 0 2 3 7 (6)

with

119901119896119899 =100381610038161003816100381610038161198821198964 (119899)100381610038161003816100381610038162

119890119896

119890119896 =119873

sum119899=1

100381610038161003816100381610038161198821198964 (119899)100381610038161003816100381610038162

(7)

where 119890119896 is the energy of the 119896th spectrum 119901119896119899 is the energydistribution of the 119896th spectrum 119873 is the length of eachfrequency spectrum Finally all of Shannon entropies of allspectrums are calculated and are collected to form a featurevector that is denoted as [1199040 1199041 1199047]119879

42 I-Vector Extraction I-vector is a robust feature vectorthat represents a speech using a single high-dimension vectorBecause it considers the background information i-vectorusually improves the accuracy of recognition [22] Assumethat there is a set of speeches Those speeches are suppliedby different speakers and the all speeches are transformedinto the short vectors In the i-vector theory the speaker- andchannel-dependent feature vector is assumed as

m = m + Tw (U) (8)

where m is the speaker- and channel-dependent featurevectorm is the background factor Usually it is generated bystacking the mean vectors of a background model Assumethat the mean vectors of the background model are denotedby 1205831198791 1205831198792 120583119879119873 where each mean vector is a row vectorm is denoted by [12058311205832 120583119873]119879 T is named the totalvariability matrix and represents a space that contains thespeaker- and channel-dependent information w(U) is arandom vector having standard normal distribution 119873(0 1)The i-vector is the expectation of the w(U) U is a set ofspeeches and all of speeches are transformed into the shortvectors Assume that a background model is given and Σ isinitialized by covariance matrix of the background model Tand w(U) are initialized randomly T and w(U) are estimatedby an iteratively process described as follows

Journal of Electrical and Computer Engineering 5

(1) E-step for each speech in the set U calculate theparameters of the posterior distribution ofw(U)usingthe current estimates of T Σ andm

(2) M-step updateT andΣ by a linear regression inwhichw(U)s play the role of explanatory variables

(3) Iterate until the expectation of the w(U) is stableThe details of the estimation processes of T and w(U) aredescribed in [23]

43 Speaker Classification Cosine distance scoring (CDS)is used as the classifier in our proposed model It usesa kernel function to deal with the curse of dimensionso CDS is very suitable for the i-vector To describe thisclassifier easily we take a two-classification task for exampleAssume that there are two speakers denoted as 1199041 and 1199042The two speakers respectively speak 1198731 and 1198732 speechesAll speeches are represented by i-vectors and are denoted by1198831 = x11 x12 x11198731 and 1198832 = x21 x22 x21198732 where x

119894119895

is i-vector representing the 119895th speech sample of the speakersi We also assume there is an unknown speech representedby i-vector y The purpose of the classifier is to match theunknown i-vector with the known i-vectors and determinewhich one speaks the unknown speech the result of therecognition is defined as

119863119894 (y) = 1119873119894

119873119894

sum119899=1

119870 (x119894119899 y) + 120579 119894 = 1 2 (9)

where Ni is the total number of speeches supported by thespeaker si 120579 is the decision threshold If Di(y) le 0 theunknown speeches are not spoken by the known speaker siif Di(y) gt 0 then author of the unknown speeches is thespeaker si 119870(sdot sdot) is the cosine kernel and is defined as

119870 (x y) = xy119879

radicxx119879radicyy119879 (10)

where x is the known i-vector and y is the unknown i-vector Usually the linear discriminant analysis (LDA) andwithin class covariance normalization (WCCN) are used toimplement the discrimination of the i-vector Therefore thekernel function is rewritten as

119870 (x y) =(A119879x)Wminus1 (A119879y)

radic(A119879x)Wminus1 (A119879x)radic(A119879y)Wminus1 (A119879y) (11)

where A is the LDA projection matrix and W is WCCNmatrix A and W are estimated by using all of the i-vectorsand the details of LAD and WCCN are described in [24]

5 Experiment and Results

In this section we report the outcome of our experimentsIn Section 51 we describe the experimental dataset InSection 52 we carry on an experiment to select the optimalmother wavelet for the WPE algorithm In Section 53 weevaluate the recognition accuracy of our model In Sec-tion 54 we evaluate the performance of the proposedmodelFinally the time cost of the model is count in Section 55

51 Experimental Dataset The results of our experimentsare performed on the TIMIT speech database [25] Thisdatabase contains 630 speakers (192 females and 438 males)who come from 8 different English dialect regions Eachspeaker supplies ten speech samples that are sampled at16 KHz and last 5 seconds All female speeches are usedto obtain background models that represent the commoncharacteristic of the female voice Also all male speeches areused to generate another background model characterizingthe male voice 384 speakers (192 females and 192 males) arerandomly selected and their speeches are used as the knownand unknown speeches The test results presented in ourexperiments are collected on a computer with 25GHz IntelCore i5 CPU and 8GM of memory and the experimentalplatform is MATLAB R2012b

52 Optimal Mother Wavelet A good mother wavelet canimprove the performance of the WPE algorithmThe perfor-mance of a mother wavelet is based on two important ele-ments such as the support size and the number of vanishingmoments If a mother wavelet has large number of vanishmoments the WPE would ignore much of unimportantinformation if the mother wavelet has small support sizetheWPEwould accurately locate important information [26]Therefore an optimal mother wavelet should have a largenumber of vanishing moments and a small support sizeIn this view the Daubechies and Symlet wavelets are goodwavelets because they have the largest number of vanishingmoments for a given support size Moreover those waveletsare orthogonal and are suitable for the Mallat fast algorithm

In is paper we use the Energy-to-Shannon Entropy Ratio(ESER) to evaluate those Daubechies and Symlet waveletsto find out the best one ESER is a way to analyze theperformance of mother wavelet and has been employed toselect the best mother wavelet in [27] The ESER is definedas

119903 = 119864119878 (12)

where 119878 is the Shannon entropy of the spectrum obtained byWPT and 119864 was the energy of the spectrumThe high energymeans the spectrum obtained by WPT contained muchenough information of the speech The low entropy meansthat the information in the spectrum is stable Thereforethe optimal mother wavelet should maximize the energy andmeanwhile minimize the entropy

In this experiment 8 Daubechies and 8 Symlet waveletswhich are respectively denoted as db1ndash8 and sym1ndash8 areemployed to decompose speeches that are randomly selectedfrom the TIMIT database We run the experiment 100 timesand record the average WSER of those mother wavelets inTable 1

In Table 1 We find that db4 and sym6 obtain the highestESER In other words the db4 and sysm6 are the bestmother wavelets for the speech data Reference [28] suggeststhat the sym6 can improve the performance of the speakerrecognitionmodelHowever the Symletwavelets produce thecomplex coefficients whose imaginary parts are redundantfor the real signal such as digital speech so we abandon thesym6 and choose the db4

6 Journal of Electrical and Computer Engineering

Table 1 The average WSER of the mother wavelets

Mother wavelet ESERdb1 88837db2 89032db3 89744db4 90849db5 90141db6 89653db7 89169db8 89084sym1 88835sym2 89036sym3 89493sym4 89975sym5 90382sym6 90859sym7 90244sym8 89837

53 The Accuracy of Speaker Recognition Model in ClearEnvironment This experiment evaluates the accuracy of thespeaker recognition model We randomly select 384 speakers(192 females and 192males) For each speaker half of speechesare used as the unknown speech and the other half of speechesare used as the known speeches For each speaker the speakerrecognition model matches the hisher unknown speecheswith all of the known speeches of the 384 speakers anddetermines who speaks the unknown speeches If the resultis right the model obtains one score if the result is wrongthe model gets zero score Finally we count the score andcalculate the mean accuracy that is defined as

accuray = score384 times 100 (13)

In this experiment we use four types of speaker recog-nition models for comparison The first one is the MFCC-GMM model [4] This model uses MFCC method to extract14D short vectors and uses the GMM with 8 mixtures torecognize speaker based on those short vectors The secondone is FMFCC-GMMmodel [16] This model is very similarto the MFCC-GMM model but it uses the FMFCC methodto extract the 52D short vectors The third one is the WPE-GMM model [10] This model firstly uses WPE to transformthe speeches into 16D short vectors and then uses GMMfor speaker classification The last one was the WPE-I-CDSmodel proposed in this paper Compared with WPE-GMMmodel ourmodel uses the 16D short vectors to generate 400Di-vector and uses CDS to recognize speaker based on the i-vector We carry on each experiment in this section 25 timesto obtain the mean accuracyThemean accuracy of the above4 models is shown in Figure 4

In Figure 4 we find that MFCC-GMM obtains the lowestaccuracy of 8846 The result of [4] shows the MFCC-GMM model can obtain accuracy of higher than 90 Thisis because we use the GMM with 8 mixtures as the classifierbut [4] uses the GMMwith 32mixtures as the classifier Large

8KHz speech16KHz speech

WPE

-I-CD

S

WPE

-GM

M

MFC

C-G

MM

FMFC

C-G

MM

Speaker recognition models

0102030405060708090

100

Acc

urac

y (

)Figure 4 The mean accuracy of 4 models in clean environment

mixture number can improve the performance of the GMMbut it also causes the very high computational expenseWPE-I-CDS obtain the highest accuracy of 9436This interpretsthe achievements of i-vector theory On the other handwhen the 8KHz speeches (low-quality speeches) are usedall accuracy of speaker recognition models is decreased Theaccuracy of MFCC-GMM FMFCC-GMM andWPE-GMMdecrease by about 6 Comparatively the accuracy of WPE-I-CDS decreases by about 1 This is because the i-vectorconsiders the i-vector to improve the accuracy of the speakerrecognition model and the CDS used the LDA and WCCNto improve the discrimination of the i-vector Reference [29]also reports that the combination of the i-vector and the CDScan enhance the performance of speaker recognition modelused for low-quality speeches such as phone speeches

54 The Accuracy of Speaker Recognition Model in NoisyEnvironment It is hard to find a clean speech in the realapplications because the noise in the transmission channeland environment cannot be controlled In this experimentwe add 30 dB 20 dB and 10 dB Gaussian white noise intothe speeches to simulate the noisy speeches All noises aregenerated by the MATLABrsquos Gaussian white noise function

For comparison this experiment employed three i-vectorbasedmodels such asMFCC-I-CDS [30] FMFCC-I-CD [31]and our WPE-I-CDS The two models are very similar toour proposed model but they use the MFCC and FMFCCto extract the short vectors respectively The accuracy of the3 models in noisy environment is shown in Figure 5

In Figure 5 the three models obtained high accuracyin clean environment This also shows that the i-vector canimprove the recognition accuracy effectively However whenweuse the noisy speeches to test the 3models their accuraciesdecrease When 30 dB noise is added to the speeches the

Journal of Electrical and Computer Engineering 7

MFCC-I-CDSFMFCC-I-CDSWPE-I-CDS

55

60

65

70

75

80

85

90

95A

ccur

acy

()

30 20 10CleanNoise (dB)

Figure 5 The accuracy of the 3 models in noisy environment

accuracy of the three models decreases by about 4 Thisshows that all of the models can resist weak noise Howeverwhen we enhance the power of noise the accuracy ofMFCC-I-CDS and FMFCC-I-CDS drops off rapidly In particularwhen the noise increases into 10 dB the accuracy of the abovetwo models decreases by more than 30 Comparatively theWPE-I-CDSrsquos accuracy decreases by less than 12 Thoseshow that the WPE-I-CDS is robust in noisy environmentcompared with MFCC-I-CDS and FMFCC-I-CDS This isbecause the WPE uses the WPT to obtain the frequencyspectrum but MFCC and FMFCC use the DFT to do thatThe WPT decomposes the speech into many local frequencybands that can limit the ill effect of noise but the DFTdecomposes the speech into a global frequency domain thatis sensitive to the noise

55 The Performance of the Speaker Recognition ModelUsually the speaker recognition model is used in the accesscontrol systemTherefore a good speaker recognition modelshould have ability to accept the login of the correct peopleand meanwhile to reject the access of the imposter as agatekeeper does In this experiment we use the receiveroperating characteristic (ROC) curve to evaluate the ability ofourmodelTheROC curve shows the true positive rate (TPR)as a function of the false positive rate (FPR) for differentvalues of the decision threshold and has been employed in[2]

In this experiment we randomly select 384 speakers (192males and 192 females) to calculate the ROC curve Half ofthose speakers are used as the correct people and anotherhalf of the speakers are used as the imposters We firstlyuse the speeches of the correct people to test the speakerrecognition model to calculate the TPR and then we use thespeeches of the imposters to attack the speaker recognition

MFCC-GMMFMFCC-GMM

WPE-GMMWPE-I-CDS

0

01

02

03

04

05

06

07

08

09

TPR

02 03 04 0501FPR

Figure 6 The ROC curves of TPR versus FPR

model to calculate the FPR The 4 models such as MFCC-GMM FMFCC-GMM WPE-GMM and our WPE-I-CDSare used for comparison To plot the ROC curve we adjustedthe decision thresholds to obtain different ROC points TheROC curves of those 4 models were shown in Figure 6

Low FPR shows that the speaker recognition model caneffectively resist the attack coming from the imposters andhigh TPR shows that the speaker recognition model canaccurately accept the correct speakersrsquo login In other wordsa speaker recognition model can be useful if its TPR is highfor a low FPR In Figure 6 when FPR is higher than 045 allmodels obtain the high TPR but WPE-I-CDS obtain higherTPR than other 3 models for a given FPR that is less than 45This shows that theWPE-I-CDS can more effectively achievethe access control task than other models

56 Time Cost This section tests the time cost of thefast MFCC-GMM the conventional MFCC-I-CDS and ourWPE-I-CDS We used 200 5-second-long speeches to testeach model and calculated the average time cost The resultof this experiment was shown in Table 2

In Table 2MFCC-GMMdoes not employ the i-vector forspeech representation so it does not cost time to extract thei-vector Comparatively the WPE-I-CDS should cost time toextract the i-vector The WPE-I-CDS cost the most time toextract the short vector compared with the MFCC-GMMThis is because the WPT used by WPE is more complexthan the DFT used by the MFCC On the other hand theparameters of GMM should be estimated beforehand asMFCC-GMM cost time to train the classifier CDS needs notcost time to estimate the parameters but it should cost time toestimate the matrices of the LDA andWCNN in the trainingclassifier step In all the i-vector can improve the recognitionaccuracy at cost of increasing the time consumption andcalculating the WPE costs too much time compared with

8 Journal of Electrical and Computer Engineering

Table 2 The time cost of the different speaker recognition models

Speaker recognition model Feature extraction (sspeech) Speaker classification (sspeech)Short vector extraction I-vector extraction Training classifier Recognition

MFCC-GMM 046 mdash 192 064MFCC-I-CDS 045 237 181 073WPE-I-CDS 281 224 182 075WPE-I-CDS(parallel computation) 078 223 182 074

calculating the MFCCTherefore it is very important to findway to reduce the time cost of the WPE

Parallel computation is an effective way to reduce thetime cost because the loops in the linear computation canbe finished at once using a parallel algorithm For examplea signal whose length is 119873 is decomposed by WPT at119872 levels In the conventional linear algorithm of WPT wehave to run a filtering process whose time complexity was119874(log119873) 119872 times 119873 times for each decomposition level sothe total time cost of WPT is 119874(119872119873 log119873) If we used 119873independent computational cores to implant theWPTusing aparallel algorithm the time complexity ofWPT can reduce to119874(119872 log119873) This paper uses 16 independent computationalcores to implement the WPE parallel algorithm and the lastline of Table 2 shows that the time cost of WPE is reducedvery much

6 Conclusions

With the development of the computer technique the speakerrecognition has been widely used for speech-based accesssystem In the real environment the quality of the speechmay be low and noise in the transformation channel cannotbe controlled Therefore it is necessary to find a speakerrecognition model that is not sensitive to those factors suchas noise and low-quality speech

This paper proposes a new speaker recognition modelby employing wavelet packet entropy (WPE) i-vector andCDS and we name themodelWPE-I-CDSWPE used a localanalysis tool namedWPT rather than the DFT to decomposethe signal Because WPT decomposes the signal into manyindependent frequency bands that limit the ill effect of noisetheWPE is robust in the noisy environment I-vector is a typeof robust feature vector Because it considers the backgroundinformation i-vector can improve the accuracy of recogni-tion CDS uses a kernel function to deal with the curse ofdimension so it is suitable for the high-dimension featurevector such as i-vector The result of the experiments in thispaper shows that the proposed speaker recognition modelscan improve the performance of recognition compared withthe conventional models such as MFCC-GMM FMFCC-GMM and WPE-GMM in clean environment MoreovertheWPE-I-CDS obtains higher accuracy than other i-vector-based models such as MFCC-I-CDS and FMFCC-I-CDS innoisy environment However the time cost of the proposedmodel is very higher To reduce the time cost we employthe parallel algorithm to implement the WPE and i-vectorextraction methods

In the future we will combine audio and visual feature toimprove the performance of the speaker recognition system

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors also thank Professor Kun She for his assistance in the preparation of the manuscript. This paper is supported by the Project of Sichuan Provincial Science Plan (M112016GZ0073) and the National Natural Science Foundation (Grant no. 61672136).

References

[1] H. Perez, J. Martinez, and I. Espinosa, "Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elder users," International Journal of Human-Computer Studies, vol. 98, pp. 1–13, 2017.

[2] N. Almaadeed, A. Aggoun, and A. Amira, "Speaker identification using multimodal neural networks and wavelet analysis," IET Biometrics, vol. 4, no. 1, pp. 18–28, 2015.

[3] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.

[4] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[5] K. S. Ahmad, A. S. Thosar, J. H. Nirmal, and V. S. Pande, "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network," in Proceedings of the 8th International Conference on Advances in Pattern Recognition (ICAPR '15), Kolkata, India, 2015.

[6] X.-Y. Zhang, J. Bai, and W.-Z. Liang, "The speech recognition system based on bark wavelet MFCC," in Proceedings of the 8th International Conference on Signal Processing (ICSP '06), pp. 16–20, Beijing, China, 2006.

[7] A. Biswas, P. K. Sahu, and M. Chandra, "Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition," Computer & Communication Technology, vol. 40, pp. 1111–1122, 2014.

[8] K. Daqrouq and T. A. Tutunji, "Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers," Applied Soft Computing, vol. 27, pp. 231–239, 2015.

[9] D. Avci, "An expert system for speaker identification using adaptive wavelet sure entropy," Expert Systems with Applications, vol. 36, no. 3, pp. 6295–6300, 2009.

[10] K. Daqrouq, "Wavelet entropy and neural network for text-independent speaker identification," Engineering Applications of Artificial Intelligence, vol. 24, no. 5, pp. 796–802, 2011.

[11] M. K. Elyaberani, S. H. Mahmoodian, and G. Sheikhi, "Wavelet packet entropy in speaker-identification emotional state detection from speech signal," Journal of Intelligent Procedures in Electrical Technology, vol. 20, pp. 67–74, 2015.

[12] M. Li, A. Tsiartas, M. Van Segbroeck, and S. S. Narayanan, "Speaker verification using simplified and supervised i-vector modeling," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 7199–7203, Vancouver, Canada, 2013.

[13] D. Garcia-Romero and A. McCree, "Supervised domain adaptation for i-vector based speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4047–4051, Florence, Italy, 2014.

[14] A. Kanagasundaram, D. Dean, and S. Sridharan, "I-vector based speaker recognition using advanced channel compensation techniques," Computer Speech & Language, vol. 28, pp. 121–140, 2014.

[15] M. H. Bahari, R. Saeidi, H. Van Hamme, and D. Van Leeuwen, "Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7344–7348, Vancouver, Canada, 2013.

[16] B. Saha and K. Kamarauslas, "Evaluation of effectiveness of different methods in speaker recognition," Elektronika ir Elektrotechnika, vol. 98, pp. 67–70, 2015.

[17] K. K. George, C. S. Kumar, K. I. Ramachandran, and A. Panda, "Cosine distance features for robust speaker verification," in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH '15), pp. 234–238, Dresden, Germany, 2015.

[18] B. G. Nagaraja and H. S. Jayanna, "Efficient window for monolingual and crosslingual speaker identification using MFCC," in Proceedings of the International Conference on Advanced Computing and Communication Systems (ICACCS '13), pp. 1–4, Coimbatore, India, 2013.

[19] M. Medeiros, G. Araujo, H. Macedo, M. Chella, and L. Matos, "Multi-kernel approach to parallelization of EM algorithm for GMM training," in Proceedings of the 3rd Brazilian Conference on Intelligent Systems (BRACIS '14), pp. 158–165, Sao Paulo, Brazil, 2014.

[20] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark wavelet & MFCC for robust speech recognition tasks," in Proceedings of the 2nd International Conference on Industrial Mechatronics and Automation (ICIMA '10), pp. 329–332, Wuhan, China, 2010.

[21] X.-F. Lv and P. Tao, "Mallat algorithm of wavelet for time-varying system parametric identification," in Proceedings of the 25th Chinese Control and Decision Conference (CCDC '13), pp. 1554–1556, Guiyang, China, 2013.

[22] D. Martinez, O. Plchot, and L. Burget, "Language recognition in i-vector space," in Proceedings of Interspeech, pp. 861–864, Florence, Italy, 2011.

[23] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345–354, 2005.

[24] M. McLaren and D. Van Leeuwen, "Improved speaker recognition when using i-vectors from multiple speech sources," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 5460–5463, Prague, Czech Republic, 2011.

[25] A. Biswas, P. K. Sahu, A. Bhowmick, and M. Chandra, "Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition," International Journal of Speech Technology, vol. 17, no. 4, pp. 389–399, 2014.

[26] S. G. Mallat, A Wavelet Tour of Signal Processing, Elsevier, Amsterdam, Netherlands, 2012.

[27] Q. Yang and J. Wang, "Multi-level wavelet Shannon entropy-based method for single-sensor fault location," Entropy, vol. 17, no. 10, pp. 7101–7117, 2015.

[28] T. Ganchev, M. Siafarikas, I. Mporas, and T. Stoyanova, "Wavelet basis selection for enhanced speech parametrization in speaker verification," International Journal of Speech Technology, vol. 17, no. 1, pp. 27–36, 2014.

[29] M. Seboussaoui, P. Kenny, and N. Dehak, "An i-vector extractor suitable for speaker recognition with both microphone and telephone speech," Odyssey, vol. 6, pp. 1–6, 2011.

[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 1559–1562, Brighton, United Kingdom, 2009.

[31] M. I. Mandasari, M. McLaren, and D. Van Leeuwen, "Evaluation of i-vector speaker recognition systems for forensic application," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 21–24, Florence, Italy, 2011.




[17] K K George C S Kumar K I Ramachandran and A PandaldquoCosine distance features for robust speaker verificationrdquo inProceedings of the 16th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo15) pp234ndash238 Dresden Germany 2015

[18] B G Nagaraja and H S Jayanna ldquoEfficient window for mono-lingual and crosslingual speaker identification usingMFCCrdquo inProceedings of the International Conference on Advanced Com-puting and Communication Systems (ICACCS rsquo13) pp 1ndash4Coimbatore India 2013

[19] M Medeiros G Araujo H Macedo M Chella and L MatosldquoMulti-kernel approach to parallelization of EM algorithm forGMM trainingrdquo in Proceedings of the 3rd Brazilian Conferenceon Intelligent Systems (BRACIS rsquo14) pp 158ndash165 Sao PauloBrazil 2014

[20] H R Tohidypour S A Seyyedsalehi and H Behbood ldquoCom-parison between wavelet packet transform Bark Wavelet ampMFCC for robust speech recognition tasksrdquo in Proceedings ofthe 2nd International Conference on Industrial Mechatronics andAutomation (ICIMA rsquo10) pp 329ndash332 Wuhan China 2010

[21] X-F Lv and P Tao ldquoMallat algorithm of wavelet for time-varying system parametric identificationrdquo in Proceedings of the25th Chinese Control and Decision Conference (CCDC rsquo13) pp1554ndash1556 Guiyang China 2013

[22] D Marthnez O Pichot and L Burget ldquoLanguage recognitionin I-vector spacerdquo in Proceedongs of the Interspeech pp 861ndash864Florence Italy 2011

[23] P Kenny G Boulianne and P Dumouchel ldquoEigenvoice model-ing with sparse training datardquo IEEE Transactions on Speech andAudio Processing vol 13 no 3 pp 345ndash354 2005

[24] M McLaren and D Van Leeuwen ldquoImproved speaker recog-nition when using i-vectors from multiple speech sourcesrdquoin Proceedings of the 36th IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP rsquo11) pp 5460ndash5463 Prague Czech Republic 2011

[25] A Biswas P K Sahu A Bhowmick and M Chandra ldquoFeatureextraction technique using ERB like wavelet sub-band periodicand aperiodic decomposition for TIMITphoneme recognitionrdquoInternational Journal of SpeechTechnology vol 17 no 4 pp 389ndash399 2014

[26] S G Mallat A Wavelet Tour of Signal Processing ELsevierAmsterdam Netherlands 2012

[27] Q Yang and J Wang ldquoMulti-level wavelet shannon entropy-based method for single-sensor fault locationrdquo Entropy vol 17no 10 pp 7101ndash7117 2015

[28] TGanchevM Siafarikas IMporas andT Stoyanova ldquoWaveletbasis selection for enhanced speech parametrization in speakerverificationrdquo International Journal of Speech Technology vol 17no 1 pp 27ndash36 2014

[29] M Seboussaoui P Kenny and N Dehak ldquoAn i-vector extractorsuitable for speaker recognition with both microphone andtelephone speechrdquo Odyssey vol 6 pp 1ndash6 2011

[30] N Dehak R Dehak P Kenny N Brummer P Ouellet and PDumouchel ldquoSupport vectormachines versus fast scoring in thelow-dimensional total variability space for speaker verificationrdquoin Proceedings of the 10th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo09) pp1559ndash1562 Brighton United Kingdom 2009

[31] M IMandasariMMcLaren andDVanLeeuwen ldquoEvaluationof i-vector speaker recognition systems for forensic applicationrdquoin 12th Annual Conference of the International Speech Commu-nication Association (INTERSPEECH rsquo11) pp 21ndash24 FlorenceItaly 2011

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 6: Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and …downloads.hindawi.com/journals/jece/2017/1735698.pdf · 2019. 7. 30. · ResearchArticle Speaker Recognition Using

6 Journal of Electrical and Computer Engineering

Table 1: The average WSER of the mother wavelets.

Mother wavelet | WSER
db1  | 88.837
db2  | 89.032
db3  | 89.744
db4  | 90.849
db5  | 90.141
db6  | 89.653
db7  | 89.169
db8  | 89.084
sym1 | 88.835
sym2 | 89.036
sym3 | 89.493
sym4 | 89.975
sym5 | 90.382
sym6 | 90.859
sym7 | 90.244
sym8 | 89.837

5.3. The Accuracy of the Speaker Recognition Model in Clean Environment. This experiment evaluates the accuracy of the speaker recognition model. We randomly select 384 speakers (192 females and 192 males). For each speaker, half of the speeches are used as the unknown speeches and the other half are used as the known speeches. For each speaker, the speaker recognition model matches his/her unknown speeches against the known speeches of all 384 speakers and determines who speaks the unknown speeches. If the result is right, the model obtains one score; if the result is wrong, the model gets zero score. Finally, we count the score and calculate the mean accuracy, which is defined as

accuracy = (score / 384) × 100%. (13)

In this experiment, we use four types of speaker recognition models for comparison. The first one is the MFCC-GMM model [4]. This model uses the MFCC method to extract 14-D short vectors and uses a GMM with 8 mixtures to recognize the speaker based on those short vectors. The second one is the FMFCC-GMM model [16]. This model is very similar to the MFCC-GMM model, but it uses the FMFCC method to extract 52-D short vectors. The third one is the WPE-GMM model [10]. This model first uses WPE to transform the speeches into 16-D short vectors and then uses a GMM for speaker classification. The last one is the WPE-I-CDS model proposed in this paper. Compared with the WPE-GMM model, our model uses the 16-D short vectors to generate a 400-D i-vector and uses CDS to recognize the speaker based on the i-vector. We repeat each experiment in this section 25 times to obtain the mean accuracy. The mean accuracy of the above 4 models is shown in Figure 4.
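As a minimal sketch of the scoring protocol behind (13), assuming a hypothetical `recognize` function that returns a speaker label (the function and data layout are illustrative placeholders, not the paper's code):

```python
# Minimal sketch of the identification-accuracy protocol of (13).
# `recognize` and the data layout are hypothetical placeholders.

def mean_identification_accuracy(recognize, unknown_speeches):
    """unknown_speeches: list of (speech, true_speaker_id) pairs,
    covering the 384 speakers in the paper's setup."""
    score = 0
    for speech, true_id in unknown_speeches:
        if recognize(speech) == true_id:   # right result: one score
            score += 1                     # wrong result: zero score
    return score / len(unknown_speeches) * 100.0  # equation (13)
```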

In Figure 4, we find that MFCC-GMM obtains the lowest accuracy of 88.46%. The result of [4] shows that the MFCC-GMM model can obtain an accuracy higher than 90%. This is because we use a GMM with 8 mixtures as the classifier, whereas [4] uses a GMM with 32 mixtures.

[Figure 4: The mean accuracy (%) of the 4 models in clean environment, for 16 KHz and 8 KHz speeches.]

A large mixture number can improve the performance of the GMM, but it also causes very high computational expense. WPE-I-CDS obtains the highest accuracy of 94.36%, which reflects the achievements of the i-vector theory. On the other hand, when the 8 KHz speeches (low-quality speeches) are used, the accuracy of all speaker recognition models decreases. The accuracy of MFCC-GMM, FMFCC-GMM, and WPE-GMM decreases by about 6%. Comparatively, the accuracy of WPE-I-CDS decreases by only about 1%. This is because the i-vector considers the background information to improve the accuracy of the speaker recognition model, and the CDS uses the LDA and WCCN to improve the discrimination of the i-vector. Reference [29] also reports that the combination of the i-vector and the CDS can enhance the performance of speaker recognition models used for low-quality speeches such as phone speeches.
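To make the scoring step concrete, the following is a minimal sketch of cosine distance scoring between two i-vectors after an LDA projection and WCCN whitening. The matrices `lda_matrix` and `wccn_cholesky` are assumed to have been estimated beforehand on training i-vectors; the names are illustrative, not the authors' implementation:

```python
import numpy as np

def cds_score(w1, w2, lda_matrix, wccn_cholesky):
    """Cosine distance score between two i-vectors w1, w2 (shape (D,)).
    lda_matrix: (D, d) LDA projection; wccn_cholesky: (d, d) Cholesky
    factor of the inverse within-class covariance (WCCN). Both are
    assumed to be estimated beforehand on training data."""
    a = wccn_cholesky.T @ (lda_matrix.T @ w1)  # project, then whiten
    b = wccn_cholesky.T @ (lda_matrix.T @ w2)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The score is then compared against a decision threshold:
# accept the claimed identity if the score exceeds the threshold.
```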

5.4. The Accuracy of the Speaker Recognition Model in Noisy Environment. It is hard to find a clean speech in real applications, because the noise in the transmission channel and environment cannot be controlled. In this experiment, we add 30 dB, 20 dB, and 10 dB Gaussian white noise into the speeches to simulate the noisy speeches. All noises are generated by MATLAB's Gaussian white noise function.
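A small sketch of this corruption step follows, assuming a NumPy equivalent of MATLAB's measured-power awgn behavior (not the authors' script):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise to `signal` at a target SNR (dB),
    scaling the noise power from the measured signal power."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: simulate the paper's three noisy conditions.
# noisy_versions = [add_white_noise(speech, snr) for snr in (30, 20, 10)]
```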

For comparison, this experiment employs three i-vector-based models: MFCC-I-CDS [30], FMFCC-I-CDS [31], and our WPE-I-CDS. The first two models are very similar to our proposed model, but they use the MFCC and FMFCC methods, respectively, to extract the short vectors. The accuracy of the 3 models in noisy environment is shown in Figure 5.

In Figure 5, the three models obtain high accuracy in clean environment. This also shows that the i-vector can improve the recognition accuracy effectively. However, when we use the noisy speeches to test the 3 models, their accuracies decrease.

[Figure 5: The accuracy (%) of the 3 models (MFCC-I-CDS, FMFCC-I-CDS, WPE-I-CDS) in noisy environment, for clean, 30 dB, 20 dB, and 10 dB noise.]

When 30 dB noise is added to the speeches, the accuracy of the three models decreases by about 4%. This shows that all of the models can resist weak noise. However, when we enhance the power of the noise, the accuracy of MFCC-I-CDS and FMFCC-I-CDS drops off rapidly. In particular, when the noise increases to 10 dB, the accuracy of these two models decreases by more than 30%. Comparatively, the accuracy of WPE-I-CDS decreases by less than 12%. These results show that the WPE-I-CDS is robust in noisy environment compared with MFCC-I-CDS and FMFCC-I-CDS. This is because the WPE uses the WPT to obtain the frequency spectrum, but MFCC and FMFCC use the DFT to do that. The WPT decomposes the speech into many local frequency bands that can limit the ill effect of noise, whereas the DFT decomposes the speech into a global frequency domain that is sensitive to the noise.

5.5. The Performance of the Speaker Recognition Model. Usually, the speaker recognition model is used in an access control system. Therefore, a good speaker recognition model should have the ability to accept the logins of the correct people and, meanwhile, to reject the access of the imposters, as a gatekeeper does. In this experiment, we use the receiver operating characteristic (ROC) curve to evaluate this ability of our model. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the decision threshold, and it has been employed in [2].

In this experiment, we randomly select 384 speakers (192 males and 192 females) to calculate the ROC curve. Half of those speakers are used as the correct people and the other half are used as the imposters. We first use the speeches of the correct people to test the speaker recognition model to calculate the TPR, and then we use the speeches of the imposters to attack the speaker recognition model to calculate the FPR.

[Figure 6: The ROC curves of TPR versus FPR for MFCC-GMM, FMFCC-GMM, WPE-GMM, and WPE-I-CDS.]

The 4 models, MFCC-GMM, FMFCC-GMM, WPE-GMM, and our WPE-I-CDS, are used for comparison. To plot the ROC curve, we adjust the decision thresholds to obtain different ROC points. The ROC curves of those 4 models are shown in Figure 6.

A low FPR shows that the speaker recognition model can effectively resist the attacks coming from the imposters, and a high TPR shows that the speaker recognition model can accurately accept the correct speakers' logins. In other words, a speaker recognition model is useful only if its TPR is high at a low FPR. In Figure 6, when the FPR is higher than 0.45, all models obtain a high TPR, but WPE-I-CDS obtains a higher TPR than the other 3 models for any given FPR below 0.45. This shows that the WPE-I-CDS can achieve the access control task more effectively than the other models.
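The ROC points themselves can be computed by sweeping the decision threshold over the match scores. The sketch below assumes two hypothetical NumPy arrays of scores, `genuine_scores` (correct people) and `imposter_scores` (attacks):

```python
import numpy as np

def roc_points(genuine_scores, imposter_scores, num_thresholds=100):
    """Compute (FPR, TPR) pairs by sweeping the decision threshold."""
    scores = np.concatenate([genuine_scores, imposter_scores])
    points = []
    for t in np.linspace(scores.min(), scores.max(), num_thresholds):
        tpr = np.mean(genuine_scores >= t)   # accepted correct logins
        fpr = np.mean(imposter_scores >= t)  # accepted imposter attacks
        points.append((fpr, tpr))
    return points
```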

5.6. Time Cost. This section tests the time cost of the fast MFCC-GMM, the conventional MFCC-I-CDS, and our WPE-I-CDS. We use 200 5-second-long speeches to test each model and calculate the average time cost. The result of this experiment is shown in Table 2.
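As a trivial sketch of how such per-speech averages can be measured (the `process` callable is a hypothetical stand-in for any stage in Table 2):

```python
import time

def average_time(process, speeches):
    """Mean wall-clock time (s/speech) of `process` over `speeches`."""
    start = time.perf_counter()
    for s in speeches:
        process(s)
    return (time.perf_counter() - start) / len(speeches)
```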

Table 2: The time cost of the different speaker recognition models (seconds per speech). The first two data columns are feature extraction; the last two are speaker classification.

Speaker recognition model        | Short vector extraction | I-vector extraction | Training classifier | Recognition
MFCC-GMM                         | 0.46 | —    | 1.92 | 0.64
MFCC-I-CDS                       | 0.45 | 2.37 | 1.81 | 0.73
WPE-I-CDS                        | 2.81 | 2.24 | 1.82 | 0.75
WPE-I-CDS (parallel computation) | 0.78 | 2.23 | 1.82 | 0.74

In Table 2, MFCC-GMM does not employ the i-vector for speech representation, so it does not cost time to extract the i-vector. Comparatively, the WPE-I-CDS must spend time extracting the i-vector. The WPE-I-CDS costs the most time to extract the short vectors compared with the MFCC-GMM. This is because the WPT used by WPE is more complex than the DFT used by the MFCC. On the other hand, the parameters of the GMM should be estimated beforehand, so MFCC-GMM costs time to train the classifier. CDS need not cost time to estimate such parameters, but it must spend time estimating the matrices of the LDA and WCCN in the training-classifier step. In all, the i-vector can improve the recognition accuracy at the cost of increased time consumption, and calculating the WPE costs much more time than calculating the MFCC. Therefore, it is very important to find a way to reduce the time cost of the WPE.

Parallel computation is an effective way to reduce the time cost, because the loops in a linear computation can be finished at once using a parallel algorithm. For example, consider a signal of length N decomposed by WPT at M levels. In the conventional linear algorithm of WPT, we have to run a filtering process of time complexity O(log N) a total of M × N times (N times for each decomposition level), so the total time cost of WPT is O(MN log N). If we use N independent computational cores to implement the WPT with a parallel algorithm, the time complexity of WPT can be reduced to O(M log N). This paper uses 16 independent computational cores to implement the WPE in parallel, and the last row of Table 2 shows that the time cost of WPE is greatly reduced.
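A rough sketch of this idea follows, assuming PyWavelets (`pywt.dwt`) for the single-level filtering and a 16-worker process pool mirroring the setup above; the entropy formula is one common wavelet Shannon entropy variant, not necessarily the paper's exact definition:

```python
import numpy as np
import pywt
from concurrent.futures import ProcessPoolExecutor

def _split_band(band, wavelet):
    # One filtering step: split a band into low/high frequency halves.
    low, high = pywt.dwt(band, wavelet)
    return [low, high]

def parallel_wpt(signal, levels, wavelet="db4", workers=16):
    """Wavelet packet decomposition where, at each level, all current
    bands are filtered concurrently instead of one by one."""
    bands = [np.asarray(signal, dtype=float)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for _ in range(levels):
            pairs = pool.map(_split_band, bands, [wavelet] * len(bands))
            bands = [b for pair in pairs for b in pair]
    return bands  # 2**levels frequency bands

def wpe_features(bands):
    """Shannon entropy of each band's normalized energy distribution."""
    feats = []
    for b in bands:
        p = b ** 2 / np.sum(b ** 2)
        feats.append(-np.sum(p * np.log(p + 1e-12)))
    return np.array(feats)  # e.g., 16-D for a 4-level decomposition
```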

6. Conclusions

With the development of computer techniques, speaker recognition has been widely used in speech-based access systems. In real environments, the quality of the speech may be low, and the noise in the transmission channel cannot be controlled. Therefore, it is necessary to find a speaker recognition model that is not sensitive to factors such as noise and low-quality speech.

This paper proposes a new speaker recognition model by employing wavelet packet entropy (WPE), i-vector, and CDS, and we name the model WPE-I-CDS. WPE uses a local analysis tool named WPT, rather than the DFT, to decompose the signal. Because the WPT decomposes the signal into many independent frequency bands that limit the ill effect of noise, the WPE is robust in noisy environment. The i-vector is a type of robust feature vector: because it considers the background information, the i-vector can improve the accuracy of recognition. CDS uses a kernel function to deal with the curse of dimensionality, so it is suitable for high-dimension feature vectors such as the i-vector. The results of the experiments in this paper show that the proposed speaker recognition model improves the recognition performance compared with the conventional models, such as MFCC-GMM, FMFCC-GMM, and WPE-GMM, in clean environment. Moreover, the WPE-I-CDS obtains higher accuracy than the other i-vector-based models, such as MFCC-I-CDS and FMFCC-I-CDS, in noisy environment. However, the time cost of the proposed model is very high. To reduce the time cost, we employ a parallel algorithm to implement the WPE and i-vector extraction methods.

In the future, we will combine audio and visual features to improve the performance of the speaker recognition system.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Professor Kun She for his assistance in the preparation of the manuscript. This paper is supported by the Project of Sichuan Provincial Science Plan (M112016GZ0073) and the National Nature Foundation (Grant no. 61672136).

References

[1] H. Perez, J. Martinez, and I. Espinosa, "Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elder users," International Journal of Human-Computer Studies, vol. 98, pp. 1–13, 2017.

[2] N. Almaadeed, A. Aggoun, and A. Amira, "Speaker identification using multimodal neural networks and wavelet analysis," IET Biometrics, vol. 4, no. 1, pp. 18–28, 2015.

[3] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.

[4] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[5] K. S. Ahmad, A. S. Thosar, J. H. Nirmal, and V. S. Pande, "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network," in Proceedings of the 8th International Conference on Advances in Pattern Recognition (ICAPR '15), Kolkata, India, 2015.

[6] X.-Y. Zhang, J. Bai, and W.-Z. Liang, "The speech recognition system based on bark wavelet MFCC," in Proceedings of the 8th International Conference on Signal Processing (ICSP '06), pp. 16–20, Beijing, China, 2006.

[7] A. Biswas, P. K. Sahu, and M. Chandra, "Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition," Computer & Communication Technology, vol. 40, pp. 1111–1122, 2014.

[8] K. Daqrouq and T. A. Tutunji, "Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers," Applied Soft Computing, vol. 27, pp. 231–239, 2015.

[9] D. Avci, "An expert system for speaker identification using adaptive wavelet sure entropy," Expert Systems with Applications, vol. 36, no. 3, pp. 6295–6300, 2009.

[10] K. Daqrouq, "Wavelet entropy and neural network for text-independent speaker identification," Engineering Applications of Artificial Intelligence, vol. 24, no. 5, pp. 796–802, 2011.

[11] M. K. Elyaberani, S. H. Mahmoodian, and G. Sheikhi, "Wavelet packet entropy in speaker-identification emotional state detection from speech signal," Journal of Intelligent Procedures in Electrical Technology, vol. 20, pp. 67–74, 2015.

[12] M. Li, A. Tsiartas, M. Van Segbroeck, and S. S. Narayanan, "Speaker verification using simplified and supervised i-vector modeling," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 7199–7203, Vancouver, Canada, 2013.

[13] D. Garcia-Romero and A. McCree, "Supervised domain adaptation for i-vector based speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4047–4051, Florence, Italy, 2014.

[14] A. Kanagasundaram, D. Dean, and S. Sridharan, "I-vector based speaker recognition using advanced channel compensation techniques," Computer Speech & Language, vol. 28, pp. 121–140, 2014.

[15] M. H. Bahari, R. Saeidi, H. Van Hamme, and D. Van Leeuwen, "Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7344–7348, Vancouver, Canada, 2013.

[16] B. Saha and K. Kamarauslas, "Evaluation of effectiveness of different methods in speaker recognition," Elektronika ir Elektrotechnika, vol. 98, pp. 67–70, 2015.

[17] K. K. George, C. S. Kumar, K. I. Ramachandran, and A. Panda, "Cosine distance features for robust speaker verification," in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH '15), pp. 234–238, Dresden, Germany, 2015.

[18] B. G. Nagaraja and H. S. Jayanna, "Efficient window for monolingual and crosslingual speaker identification using MFCC," in Proceedings of the International Conference on Advanced Computing and Communication Systems (ICACCS '13), pp. 1–4, Coimbatore, India, 2013.

[19] M. Medeiros, G. Araujo, H. Macedo, M. Chella, and L. Matos, "Multi-kernel approach to parallelization of EM algorithm for GMM training," in Proceedings of the 3rd Brazilian Conference on Intelligent Systems (BRACIS '14), pp. 158–165, Sao Paulo, Brazil, 2014.

[20] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark wavelet & MFCC for robust speech recognition tasks," in Proceedings of the 2nd International Conference on Industrial Mechatronics and Automation (ICIMA '10), pp. 329–332, Wuhan, China, 2010.

[21] X.-F. Lv and P. Tao, "Mallat algorithm of wavelet for time-varying system parametric identification," in Proceedings of the 25th Chinese Control and Decision Conference (CCDC '13), pp. 1554–1556, Guiyang, China, 2013.

[22] D. Martinez, O. Plchot, and L. Burget, "Language recognition in i-vector space," in Proceedings of Interspeech, pp. 861–864, Florence, Italy, 2011.

[23] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345–354, 2005.

[24] M. McLaren and D. Van Leeuwen, "Improved speaker recognition when using i-vectors from multiple speech sources," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 5460–5463, Prague, Czech Republic, 2011.

[25] A. Biswas, P. K. Sahu, A. Bhowmick, and M. Chandra, "Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition," International Journal of Speech Technology, vol. 17, no. 4, pp. 389–399, 2014.

[26] S. G. Mallat, A Wavelet Tour of Signal Processing, Elsevier, Amsterdam, Netherlands, 2012.

[27] Q. Yang and J. Wang, "Multi-level wavelet Shannon entropy-based method for single-sensor fault location," Entropy, vol. 17, no. 10, pp. 7101–7117, 2015.

[28] T. Ganchev, M. Siafarikas, I. Mporas, and T. Stoyanova, "Wavelet basis selection for enhanced speech parametrization in speaker verification," International Journal of Speech Technology, vol. 17, no. 1, pp. 27–36, 2014.

[29] M. Seboussaoui, P. Kenny, and N. Dehak, "An i-vector extractor suitable for speaker recognition with both microphone and telephone speech," Odyssey, vol. 6, pp. 1–6, 2011.

[30] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 1559–1562, Brighton, United Kingdom, 2009.

[31] M. I. Mandasari, M. McLaren, and D. Van Leeuwen, "Evaluation of i-vector speaker recognition systems for forensic application," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 21–24, Florence, Italy, 2011.

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 7: Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and …downloads.hindawi.com/journals/jece/2017/1735698.pdf · 2019. 7. 30. · ResearchArticle Speaker Recognition Using

Journal of Electrical and Computer Engineering 7

MFCC-I-CDSFMFCC-I-CDSWPE-I-CDS

55

60

65

70

75

80

85

90

95A

ccur

acy

()

30 20 10CleanNoise (dB)

Figure 5 The accuracy of the 3 models in noisy environment

accuracy of the three models decreases by about 4 Thisshows that all of the models can resist weak noise Howeverwhen we enhance the power of noise the accuracy ofMFCC-I-CDS and FMFCC-I-CDS drops off rapidly In particularwhen the noise increases into 10 dB the accuracy of the abovetwo models decreases by more than 30 Comparatively theWPE-I-CDSrsquos accuracy decreases by less than 12 Thoseshow that the WPE-I-CDS is robust in noisy environmentcompared with MFCC-I-CDS and FMFCC-I-CDS This isbecause the WPE uses the WPT to obtain the frequencyspectrum but MFCC and FMFCC use the DFT to do thatThe WPT decomposes the speech into many local frequencybands that can limit the ill effect of noise but the DFTdecomposes the speech into a global frequency domain thatis sensitive to the noise

55 The Performance of the Speaker Recognition ModelUsually the speaker recognition model is used in the accesscontrol systemTherefore a good speaker recognition modelshould have ability to accept the login of the correct peopleand meanwhile to reject the access of the imposter as agatekeeper does In this experiment we use the receiveroperating characteristic (ROC) curve to evaluate the ability ofourmodelTheROC curve shows the true positive rate (TPR)as a function of the false positive rate (FPR) for differentvalues of the decision threshold and has been employed in[2]

In this experiment we randomly select 384 speakers (192males and 192 females) to calculate the ROC curve Half ofthose speakers are used as the correct people and anotherhalf of the speakers are used as the imposters We firstlyuse the speeches of the correct people to test the speakerrecognition model to calculate the TPR and then we use thespeeches of the imposters to attack the speaker recognition

MFCC-GMMFMFCC-GMM

WPE-GMMWPE-I-CDS

0

01

02

03

04

05

06

07

08

09

TPR

02 03 04 0501FPR

Figure 6 The ROC curves of TPR versus FPR

model to calculate the FPR The 4 models such as MFCC-GMM FMFCC-GMM WPE-GMM and our WPE-I-CDSare used for comparison To plot the ROC curve we adjustedthe decision thresholds to obtain different ROC points TheROC curves of those 4 models were shown in Figure 6

Low FPR shows that the speaker recognition model caneffectively resist the attack coming from the imposters andhigh TPR shows that the speaker recognition model canaccurately accept the correct speakersrsquo login In other wordsa speaker recognition model can be useful if its TPR is highfor a low FPR In Figure 6 when FPR is higher than 045 allmodels obtain the high TPR but WPE-I-CDS obtain higherTPR than other 3 models for a given FPR that is less than 45This shows that theWPE-I-CDS can more effectively achievethe access control task than other models

56 Time Cost This section tests the time cost of thefast MFCC-GMM the conventional MFCC-I-CDS and ourWPE-I-CDS We used 200 5-second-long speeches to testeach model and calculated the average time cost The resultof this experiment was shown in Table 2

In Table 2MFCC-GMMdoes not employ the i-vector forspeech representation so it does not cost time to extract thei-vector Comparatively the WPE-I-CDS should cost time toextract the i-vector The WPE-I-CDS cost the most time toextract the short vector compared with the MFCC-GMMThis is because the WPT used by WPE is more complexthan the DFT used by the MFCC On the other hand theparameters of GMM should be estimated beforehand asMFCC-GMM cost time to train the classifier CDS needs notcost time to estimate the parameters but it should cost time toestimate the matrices of the LDA andWCNN in the trainingclassifier step In all the i-vector can improve the recognitionaccuracy at cost of increasing the time consumption andcalculating the WPE costs too much time compared with

8 Journal of Electrical and Computer Engineering

Table 2 The time cost of the different speaker recognition models

Speaker recognition model Feature extraction (sspeech) Speaker classification (sspeech)Short vector extraction I-vector extraction Training classifier Recognition

MFCC-GMM 046 mdash 192 064MFCC-I-CDS 045 237 181 073WPE-I-CDS 281 224 182 075WPE-I-CDS(parallel computation) 078 223 182 074

calculating the MFCCTherefore it is very important to findway to reduce the time cost of the WPE

Parallel computation is an effective way to reduce thetime cost because the loops in the linear computation canbe finished at once using a parallel algorithm For examplea signal whose length is 119873 is decomposed by WPT at119872 levels In the conventional linear algorithm of WPT wehave to run a filtering process whose time complexity was119874(log119873) 119872 times 119873 times for each decomposition level sothe total time cost of WPT is 119874(119872119873 log119873) If we used 119873independent computational cores to implant theWPTusing aparallel algorithm the time complexity ofWPT can reduce to119874(119872 log119873) This paper uses 16 independent computationalcores to implement the WPE parallel algorithm and the lastline of Table 2 shows that the time cost of WPE is reducedvery much

6 Conclusions

With the development of the computer technique the speakerrecognition has been widely used for speech-based accesssystem In the real environment the quality of the speechmay be low and noise in the transformation channel cannotbe controlled Therefore it is necessary to find a speakerrecognition model that is not sensitive to those factors suchas noise and low-quality speech

This paper proposes a new speaker recognition modelby employing wavelet packet entropy (WPE) i-vector andCDS and we name themodelWPE-I-CDSWPE used a localanalysis tool namedWPT rather than the DFT to decomposethe signal Because WPT decomposes the signal into manyindependent frequency bands that limit the ill effect of noisetheWPE is robust in the noisy environment I-vector is a typeof robust feature vector Because it considers the backgroundinformation i-vector can improve the accuracy of recogni-tion CDS uses a kernel function to deal with the curse ofdimension so it is suitable for the high-dimension featurevector such as i-vector The result of the experiments in thispaper shows that the proposed speaker recognition modelscan improve the performance of recognition compared withthe conventional models such as MFCC-GMM FMFCC-GMM and WPE-GMM in clean environment MoreovertheWPE-I-CDS obtains higher accuracy than other i-vector-based models such as MFCC-I-CDS and FMFCC-I-CDS innoisy environment However the time cost of the proposedmodel is very higher To reduce the time cost we employthe parallel algorithm to implement the WPE and i-vectorextraction methods

In the future we will combine audio and visual feature toimprove the performance of the speaker recognition system

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The authors also thank Professor Kun She for their assis-tance in preparation of the manuscript This paper is sup-ported by the Project of Sichuan Provincial Science Plan(M112016GZ0073) and National Nature Foundation (Grantno 61672136)

References

[1] H Perez J Martinez and I Espinosa ldquoUsing acoustic paraligu-istic information to assess the interaction quality in speech-based system for elder usersrdquo International Jounrnal of Human-Computer Studies vol 98 pp 1ndash13 2017

[2] N Almaadeed A Aggoun and A Amira ldquoSpeaker identifica-tion using multimodal neural networks and wavelet analysisrdquoIET Biometrics vol 4 no 1 pp 18ndash28 2015

[3] T Kinnunen and H Li ldquoAn overview of text-independentspeaker recognition from features to supervectorsrdquo SpeechCommunication vol 52 no 1 pp 12ndash40 2010

[4] D A Reynolds and R C Rose ldquoRobust text-independentspeaker identification using gaussian mixture speaker modelsrdquoIEEE Transactions on Speech and Audio Processing vol 3 no 1pp 72ndash83 1995

[5] K S Ahmad A S Thosar J H Nirmal and V S PandeldquoA unique approach in text-independent speaker recognitionusing mfcc feature sets and probabilistic neural networkrdquo inProceedings of the 8th International Conference on Advances inPattern Recognition (ICAPR rsquo15) Kolkata India 2015

[6] X-Y Zhang J Bai and W-Z Liang ldquoThe speech recognitionsystem based on bark wavelet MFCCrdquo in Proceedings of the 8thInternational Conference on Signal Processing (ICSP rsquo06) pp 16ndash20 Beijing China 2006

[7] A Biswas P K Sahu and M Chandra ldquoAdmissible waveletpacket features based on human inner ear frequency responsefor Hindi consonant recongnitionrdquo Computer amp Communica-tion Technology vol 40 pp 1111ndash1122 2014

[8] K Daqrouq and T A Tutunji ldquoSpeaker identification usingvowels features through a combined method of formantswavelets and neural network classifiersrdquo Applied Soft Comput-ing vol 27 pp 231ndash239 2015

Journal of Electrical and Computer Engineering 9

[9] D Avci ldquoAn expert system for speaker identification usingadaptive wavelet sure entropyrdquo Expert Systems with Applica-tions vol 36 no 3 pp 6295ndash6300 2009

[10] K Daqrouq ldquoWavelet entropy and neural network for text-independent speaker identificationrdquo Engineering Applications ofArtificial Intelligence vol 24 no 5 pp 796ndash802 2011

[11] M K Elyaberani S H Mahmoodian and G Sheikhi ldquoWaveletpacket entropy in speaker-identification emotional state detec-tion from speech signalrdquo Journal of Intelligent Procedures inElectrical Technology vol 20 pp 67ndash74 2015

[12] M Li A Tsiartas M Van Segbroeck and S S NarayananldquoSpeaker verification using simplified and supervised i-vectormodelingrdquo in Proceedings of the 38th IEEE International Con-ference on Acoustics Speech and Signal Processing (ICASSP rsquo13)pp 7199ndash7203 Vancouver Canada 2013

[13] D Garcia-Romero and A McCree ldquoSupervised domain adap-tation for i-vector based speaker recognitionrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing pp 4047ndash4051 Florence Italy 2014

[14] A KanagasundaramDDean and S Sridharan ldquoI-vector basedspeaker recognition using advanced channel compensationtechniquesrdquo Computer Speech amp Labguage vol 28 pp 121ndash1402014

[15] M H Bahari R Saeidi H Van Hamme and D Van LeeuwenldquoAccent recognition using i-vector gaussian mean supervectorand gaussian posterior probability supervector for spontaneoustelephone speechrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing pp 7344ndash7348 IEEE Vancouver Canada 2013

[16] B Saha and K Kamarauslas ldquoEvaluation of effectiveness ofdifferent methods in speaker recognitionrdquo Elektronika ir Elek-trochnika vol 98 pp 67ndash70 2015

[17] K K George C S Kumar K I Ramachandran and A PandaldquoCosine distance features for robust speaker verificationrdquo inProceedings of the 16th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo15) pp234ndash238 Dresden Germany 2015

[18] B G Nagaraja and H S Jayanna ldquoEfficient window for mono-lingual and crosslingual speaker identification usingMFCCrdquo inProceedings of the International Conference on Advanced Com-puting and Communication Systems (ICACCS rsquo13) pp 1ndash4Coimbatore India 2013

[19] M Medeiros G Araujo H Macedo M Chella and L MatosldquoMulti-kernel approach to parallelization of EM algorithm forGMM trainingrdquo in Proceedings of the 3rd Brazilian Conferenceon Intelligent Systems (BRACIS rsquo14) pp 158ndash165 Sao PauloBrazil 2014

[20] H R Tohidypour S A Seyyedsalehi and H Behbood ldquoCom-parison between wavelet packet transform Bark Wavelet ampMFCC for robust speech recognition tasksrdquo in Proceedings ofthe 2nd International Conference on Industrial Mechatronics andAutomation (ICIMA rsquo10) pp 329ndash332 Wuhan China 2010

[21] X-F Lv and P Tao ldquoMallat algorithm of wavelet for time-varying system parametric identificationrdquo in Proceedings of the25th Chinese Control and Decision Conference (CCDC rsquo13) pp1554ndash1556 Guiyang China 2013

[22] D Marthnez O Pichot and L Burget ldquoLanguage recognitionin I-vector spacerdquo in Proceedongs of the Interspeech pp 861ndash864Florence Italy 2011

[23] P Kenny G Boulianne and P Dumouchel ldquoEigenvoice model-ing with sparse training datardquo IEEE Transactions on Speech andAudio Processing vol 13 no 3 pp 345ndash354 2005

[24] M McLaren and D Van Leeuwen ldquoImproved speaker recog-nition when using i-vectors from multiple speech sourcesrdquoin Proceedings of the 36th IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP rsquo11) pp 5460ndash5463 Prague Czech Republic 2011

[25] A Biswas P K Sahu A Bhowmick and M Chandra ldquoFeatureextraction technique using ERB like wavelet sub-band periodicand aperiodic decomposition for TIMITphoneme recognitionrdquoInternational Journal of SpeechTechnology vol 17 no 4 pp 389ndash399 2014

[26] S G Mallat A Wavelet Tour of Signal Processing ELsevierAmsterdam Netherlands 2012

[27] Q Yang and J Wang ldquoMulti-level wavelet shannon entropy-based method for single-sensor fault locationrdquo Entropy vol 17no 10 pp 7101ndash7117 2015

[28] TGanchevM Siafarikas IMporas andT Stoyanova ldquoWaveletbasis selection for enhanced speech parametrization in speakerverificationrdquo International Journal of Speech Technology vol 17no 1 pp 27ndash36 2014

[29] M Seboussaoui P Kenny and N Dehak ldquoAn i-vector extractorsuitable for speaker recognition with both microphone andtelephone speechrdquo Odyssey vol 6 pp 1ndash6 2011

[30] N Dehak R Dehak P Kenny N Brummer P Ouellet and PDumouchel ldquoSupport vectormachines versus fast scoring in thelow-dimensional total variability space for speaker verificationrdquoin Proceedings of the 10th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo09) pp1559ndash1562 Brighton United Kingdom 2009

[31] M IMandasariMMcLaren andDVanLeeuwen ldquoEvaluationof i-vector speaker recognition systems for forensic applicationrdquoin 12th Annual Conference of the International Speech Commu-nication Association (INTERSPEECH rsquo11) pp 21ndash24 FlorenceItaly 2011

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 8: Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and …downloads.hindawi.com/journals/jece/2017/1735698.pdf · 2019. 7. 30. · ResearchArticle Speaker Recognition Using

8 Journal of Electrical and Computer Engineering

Table 2 The time cost of the different speaker recognition models

Speaker recognition model Feature extraction (sspeech) Speaker classification (sspeech)Short vector extraction I-vector extraction Training classifier Recognition

MFCC-GMM 046 mdash 192 064MFCC-I-CDS 045 237 181 073WPE-I-CDS 281 224 182 075WPE-I-CDS(parallel computation) 078 223 182 074

calculating the MFCCTherefore it is very important to findway to reduce the time cost of the WPE

Parallel computation is an effective way to reduce thetime cost because the loops in the linear computation canbe finished at once using a parallel algorithm For examplea signal whose length is 119873 is decomposed by WPT at119872 levels In the conventional linear algorithm of WPT wehave to run a filtering process whose time complexity was119874(log119873) 119872 times 119873 times for each decomposition level sothe total time cost of WPT is 119874(119872119873 log119873) If we used 119873independent computational cores to implant theWPTusing aparallel algorithm the time complexity ofWPT can reduce to119874(119872 log119873) This paper uses 16 independent computationalcores to implement the WPE parallel algorithm and the lastline of Table 2 shows that the time cost of WPE is reducedvery much

6 Conclusions

With the development of the computer technique the speakerrecognition has been widely used for speech-based accesssystem In the real environment the quality of the speechmay be low and noise in the transformation channel cannotbe controlled Therefore it is necessary to find a speakerrecognition model that is not sensitive to those factors suchas noise and low-quality speech

This paper proposes a new speaker recognition modelby employing wavelet packet entropy (WPE) i-vector andCDS and we name themodelWPE-I-CDSWPE used a localanalysis tool namedWPT rather than the DFT to decomposethe signal Because WPT decomposes the signal into manyindependent frequency bands that limit the ill effect of noisetheWPE is robust in the noisy environment I-vector is a typeof robust feature vector Because it considers the backgroundinformation i-vector can improve the accuracy of recogni-tion CDS uses a kernel function to deal with the curse ofdimension so it is suitable for the high-dimension featurevector such as i-vector The result of the experiments in thispaper shows that the proposed speaker recognition modelscan improve the performance of recognition compared withthe conventional models such as MFCC-GMM FMFCC-GMM and WPE-GMM in clean environment MoreovertheWPE-I-CDS obtains higher accuracy than other i-vector-based models such as MFCC-I-CDS and FMFCC-I-CDS innoisy environment However the time cost of the proposedmodel is very higher To reduce the time cost we employthe parallel algorithm to implement the WPE and i-vectorextraction methods

In the future we will combine audio and visual feature toimprove the performance of the speaker recognition system

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The authors also thank Professor Kun She for their assis-tance in preparation of the manuscript This paper is sup-ported by the Project of Sichuan Provincial Science Plan(M112016GZ0073) and National Nature Foundation (Grantno 61672136)

References

[1] H Perez J Martinez and I Espinosa ldquoUsing acoustic paraligu-istic information to assess the interaction quality in speech-based system for elder usersrdquo International Jounrnal of Human-Computer Studies vol 98 pp 1ndash13 2017

[2] N Almaadeed A Aggoun and A Amira ldquoSpeaker identifica-tion using multimodal neural networks and wavelet analysisrdquoIET Biometrics vol 4 no 1 pp 18ndash28 2015

[3] T Kinnunen and H Li ldquoAn overview of text-independentspeaker recognition from features to supervectorsrdquo SpeechCommunication vol 52 no 1 pp 12ndash40 2010

[4] D A Reynolds and R C Rose ldquoRobust text-independentspeaker identification using gaussian mixture speaker modelsrdquoIEEE Transactions on Speech and Audio Processing vol 3 no 1pp 72ndash83 1995

[5] K S Ahmad A S Thosar J H Nirmal and V S PandeldquoA unique approach in text-independent speaker recognitionusing mfcc feature sets and probabilistic neural networkrdquo inProceedings of the 8th International Conference on Advances inPattern Recognition (ICAPR rsquo15) Kolkata India 2015

[6] X-Y Zhang J Bai and W-Z Liang ldquoThe speech recognitionsystem based on bark wavelet MFCCrdquo in Proceedings of the 8thInternational Conference on Signal Processing (ICSP rsquo06) pp 16ndash20 Beijing China 2006

[7] A Biswas P K Sahu and M Chandra ldquoAdmissible waveletpacket features based on human inner ear frequency responsefor Hindi consonant recongnitionrdquo Computer amp Communica-tion Technology vol 40 pp 1111ndash1122 2014

[8] K Daqrouq and T A Tutunji ldquoSpeaker identification usingvowels features through a combined method of formantswavelets and neural network classifiersrdquo Applied Soft Comput-ing vol 27 pp 231ndash239 2015

Journal of Electrical and Computer Engineering 9

[9] D Avci ldquoAn expert system for speaker identification usingadaptive wavelet sure entropyrdquo Expert Systems with Applica-tions vol 36 no 3 pp 6295ndash6300 2009

[10] K Daqrouq ldquoWavelet entropy and neural network for text-independent speaker identificationrdquo Engineering Applications ofArtificial Intelligence vol 24 no 5 pp 796ndash802 2011

[11] M K Elyaberani S H Mahmoodian and G Sheikhi ldquoWaveletpacket entropy in speaker-identification emotional state detec-tion from speech signalrdquo Journal of Intelligent Procedures inElectrical Technology vol 20 pp 67ndash74 2015

[12] M Li A Tsiartas M Van Segbroeck and S S NarayananldquoSpeaker verification using simplified and supervised i-vectormodelingrdquo in Proceedings of the 38th IEEE International Con-ference on Acoustics Speech and Signal Processing (ICASSP rsquo13)pp 7199ndash7203 Vancouver Canada 2013

[13] D Garcia-Romero and A McCree ldquoSupervised domain adap-tation for i-vector based speaker recognitionrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing pp 4047ndash4051 Florence Italy 2014

[14] A KanagasundaramDDean and S Sridharan ldquoI-vector basedspeaker recognition using advanced channel compensationtechniquesrdquo Computer Speech amp Labguage vol 28 pp 121ndash1402014

[15] M H Bahari R Saeidi H Van Hamme and D Van LeeuwenldquoAccent recognition using i-vector gaussian mean supervectorand gaussian posterior probability supervector for spontaneoustelephone speechrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing pp 7344ndash7348 IEEE Vancouver Canada 2013

[16] B Saha and K Kamarauslas ldquoEvaluation of effectiveness ofdifferent methods in speaker recognitionrdquo Elektronika ir Elek-trochnika vol 98 pp 67ndash70 2015

[17] K K George C S Kumar K I Ramachandran and A PandaldquoCosine distance features for robust speaker verificationrdquo inProceedings of the 16th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo15) pp234ndash238 Dresden Germany 2015

[18] B G Nagaraja and H S Jayanna ldquoEfficient window for mono-lingual and crosslingual speaker identification usingMFCCrdquo inProceedings of the International Conference on Advanced Com-puting and Communication Systems (ICACCS rsquo13) pp 1ndash4Coimbatore India 2013

[19] M Medeiros G Araujo H Macedo M Chella and L MatosldquoMulti-kernel approach to parallelization of EM algorithm forGMM trainingrdquo in Proceedings of the 3rd Brazilian Conferenceon Intelligent Systems (BRACIS rsquo14) pp 158ndash165 Sao PauloBrazil 2014

[20] H R Tohidypour S A Seyyedsalehi and H Behbood ldquoCom-parison between wavelet packet transform Bark Wavelet ampMFCC for robust speech recognition tasksrdquo in Proceedings ofthe 2nd International Conference on Industrial Mechatronics andAutomation (ICIMA rsquo10) pp 329ndash332 Wuhan China 2010

[21] X-F Lv and P Tao ldquoMallat algorithm of wavelet for time-varying system parametric identificationrdquo in Proceedings of the25th Chinese Control and Decision Conference (CCDC rsquo13) pp1554ndash1556 Guiyang China 2013

[22] D Marthnez O Pichot and L Burget ldquoLanguage recognitionin I-vector spacerdquo in Proceedongs of the Interspeech pp 861ndash864Florence Italy 2011

[23] P Kenny G Boulianne and P Dumouchel ldquoEigenvoice model-ing with sparse training datardquo IEEE Transactions on Speech andAudio Processing vol 13 no 3 pp 345ndash354 2005

[24] M McLaren and D Van Leeuwen ldquoImproved speaker recog-nition when using i-vectors from multiple speech sourcesrdquoin Proceedings of the 36th IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP rsquo11) pp 5460ndash5463 Prague Czech Republic 2011

[25] A Biswas P K Sahu A Bhowmick and M Chandra ldquoFeatureextraction technique using ERB like wavelet sub-band periodicand aperiodic decomposition for TIMITphoneme recognitionrdquoInternational Journal of SpeechTechnology vol 17 no 4 pp 389ndash399 2014

[26] S G Mallat A Wavelet Tour of Signal Processing ELsevierAmsterdam Netherlands 2012

[27] Q Yang and J Wang ldquoMulti-level wavelet shannon entropy-based method for single-sensor fault locationrdquo Entropy vol 17no 10 pp 7101ndash7117 2015

[28] TGanchevM Siafarikas IMporas andT Stoyanova ldquoWaveletbasis selection for enhanced speech parametrization in speakerverificationrdquo International Journal of Speech Technology vol 17no 1 pp 27ndash36 2014

[29] M Seboussaoui P Kenny and N Dehak ldquoAn i-vector extractorsuitable for speaker recognition with both microphone andtelephone speechrdquo Odyssey vol 6 pp 1ndash6 2011

[30] N Dehak R Dehak P Kenny N Brummer P Ouellet and PDumouchel ldquoSupport vectormachines versus fast scoring in thelow-dimensional total variability space for speaker verificationrdquoin Proceedings of the 10th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo09) pp1559ndash1562 Brighton United Kingdom 2009

[31] M IMandasariMMcLaren andDVanLeeuwen ldquoEvaluationof i-vector speaker recognition systems for forensic applicationrdquoin 12th Annual Conference of the International Speech Commu-nication Association (INTERSPEECH rsquo11) pp 21ndash24 FlorenceItaly 2011
