International Conference on Intelligent and Advanced Systems 2007
Chee-Ming Ting Sh-Hussain Salleh Tian-Swee Tan A. K. Ariff.
Jain-De,Lee
INTRODUCTION
GMM SPEAKER IDENTIFICATION SYSTEM
EXPERIMENTAL EVALUATION
CONCLUSION
Speaker recognition is generally divided into two tasks
◦ Speaker Verification (SV)
◦ Speaker Identification (SI)
Speaker model
◦ Text-dependent (TD)
◦ Text-independent (TI)
Many approaches have been proposed for TI speaker recognition
◦ VQ-based method
◦ Hidden Markov Models
◦ Gaussian Mixture Model
VQ-based method
Hidden Markov Models
◦ State Probability
◦ Transition Probability
Classify acoustic events corresponding to HMM states to characterize each speaker in TI task
TI performance is unaffected by discarding transition probabilities in HMM models
Gaussian Mixture Model
◦ Corresponds to a single state continuous ergodic HMM
◦ Discarding the transition probabilities in the HMM models
The use of GMM for speaker identity modeling
◦ The Gaussian components represent some general speaker-dependent spectral shapes
◦ The capability of Gaussian mixture to model arbitrary densities
The GMM speaker identification system consists of the following elements
◦ Speech processing
◦ Gaussian mixture model
◦ Parameter estimation
◦ Identification
The Mel-scale frequency cepstral coefficients (MFCC) extraction is used in front-end processing
Mel-scale cepstral feature analysis: Input Speech Signal → Pre-Emphasis → Frame → Hamming Window → FFT → Triangular band-pass filter → Logarithm → DCT
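The front-end stages listed above can be sketched in plain Python as follows. This is only an illustration of the stage order, not the presenters' implementation: the pre-emphasis coefficient (0.97), frame length (256), frame step (128), number of filters (20), the mel-scale formula, and all function names are assumptions chosen for the example, and a naive DFT stands in for a real FFT.

```python
import cmath
import math


def hz_to_mel(f):
    # A common mel-scale formula (assumed here; the slides do not state one).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value.
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def split_frames(signal, length, step):
    return [signal[s:s + length]
            for s in range(0, len(signal) - length + 1, step)]


def hamming(frame):
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
            for k, x in enumerate(frame)]


def power_spectrum(frame):
    # Naive O(N^2) DFT, used only to keep the sketch dependency-free.
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular band-pass filters with centres equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct(values, n_coeffs):
    # Unnormalised DCT-II, keeping the first n_coeffs coefficients.
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (m + 0.5) / n)
                for m, v in enumerate(values))
            for c in range(n_coeffs)]


def mfcc(signal, sample_rate, frame_len=256, step=128,
         n_filters=20, n_coeffs=12):
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in split_frames(pre_emphasis(signal), frame_len, step):
        spectrum = power_spectrum(hamming(frame))
        energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        log_energies = [math.log(e + 1e-10) for e in energies]
        features.append(dct(log_energies, n_coeffs))
    return features
```

Calling `mfcc(samples, 8000)` on a list of samples returns one list of 12 coefficients per frame. In practice a library FFT would replace `power_spectrum`.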
The Gaussian mixture model is a weighted linear combination of M uni-modal Gaussian component densities

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(x)$$

Where $x$ is a D-dimensional vector, $b_i(x)$, $i = 1,\dots,M$ are the component densities, and $w_i$, $i = 1,\dots,M$ are the mixture weights

The mixture weights satisfy the constraint that

$$\sum_{i=1}^{M} w_i = 1$$
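Each component density is a D-variate Gaussian function of the form

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right\}$$

Where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix

The Gaussian mixture density model is denoted as

$$\lambda = (w_i, \mu_i, \Sigma_i), \quad i = 1,\dots,M$$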
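As a rough illustration of how the mixture density is evaluated, the sketch below computes the component densities and their weighted sum for one feature vector. It assumes diagonal covariance matrices (each given as a list of per-dimension variances), which is a simplification; the slides define the density for a general covariance matrix. The function names are hypothetical.

```python
import math


def log_component_density(x, mean, var):
    """log b_i(x) for a D-variate Gaussian with diagonal covariance (assumption)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)                      # log |Sigma_i|
    maha = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def mixture_density(x, weights, means, variances):
    """p(x | lambda) = sum_i w_i * b_i(x)."""
    return sum(w * math.exp(log_component_density(x, m, v))
               for w, m, v in zip(weights, means, variances))
```

For a single standard-normal component in one dimension, `mixture_density([0.0], [1.0], [[0.0]], [[1.0]])` equals 1/sqrt(2π).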
Conventional GMM training process
Input training vector → LBG algorithm → EM algorithm → Convergence?
◦ Y: End
◦ N: repeat the EM algorithm
Input training vector → Overall average → Split → Clustering → Cluster's average → Calculate Distortion → (D-D')/D < δ?
◦ N: D' = D, return to Clustering
◦ Y: m < M?
  ◦ Y: return to Split
  ◦ N: End
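The flow above reads like the usual LBG binary-splitting procedure, which can be sketched as follows. This is a minimal sketch under stated assumptions: squared Euclidean distance as the distortion measure, an additive split offset `eps`, a relative threshold `delta`, and a target codebook size M that is a power of two. The convergence test compares the previous and current distortion. All names and default values are hypothetical.

```python
def mean_vector(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def lbg(data, n_codewords, eps=0.01, delta=1e-3):
    """Grow a codebook by splitting until it holds n_codewords entries."""
    codebook = [mean_vector(data)]                      # overall average
    while len(codebook) < n_codewords:                  # m < M ?
        # Split: perturb every codeword in two opposite directions.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        previous = float("inf")                         # D'
        while True:
            # Clustering: assign each vector to its nearest codeword.
            clusters = [[] for _ in codebook]
            total = 0.0
            for x in data:
                dists = [squared_distance(x, cw) for cw in codebook]
                j = dists.index(min(dists))
                clusters[j].append(x)
                total += dists[j]
            # Cluster's average (an empty cluster keeps its old codeword).
            codebook = [mean_vector(c) if c else cw
                        for c, cw in zip(clusters, codebook)]
            distortion = total / len(data)              # D
            if distortion == 0 or (previous - distortion) / distortion < delta:
                break
            previous = distortion                       # D' = D
    return codebook
```

The resulting codewords would typically serve as initial component means before the EM stage.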
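Speaker model training estimates the GMM parameters via maximum likelihood (ML) estimation

$$p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)$$

Expectation-maximization (EM) algorithm

$$\bar{w}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)$$

$$\bar{\mu}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}$$

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{\mu}_i^2$$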
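One EM iteration for these re-estimation formulas might look like the sketch below. It assumes diagonal covariances, computes the posterior as p(i | x_t, λ) = w_i b_i(x_t) / Σ_k w_k b_k(x_t) (the standard definition, not spelled out on the slide), and applies a small variance floor; the floor value and the function names are assumptions made for the example.

```python
import math


def log_gauss_diag(x, mean, var):
    # log b_i(x) with a diagonal covariance matrix (assumption).
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xj - m) ** 2 / v
                      for xj, m, v in zip(x, mean, var))


def em_step(data, weights, means, variances, var_floor=1e-6):
    """Return re-estimated (weights, means, variances) after one EM iteration."""
    n_frames, n_comp, dim = len(data), len(weights), len(data[0])
    # E-step: posterior p(i | x_t, lambda) for every frame and component.
    posteriors = []
    for x in data:
        joint = [w * math.exp(log_gauss_diag(x, m, v))
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)          # assumed non-zero for this sketch
        posteriors.append([j / total for j in joint])
    # M-step: weighted averages, following the three formulas above.
    new_w, new_m, new_v = [], [], []
    for i in range(n_comp):
        occupancy = sum(p[i] for p in posteriors)
        mu = [sum(p[i] * x[d] for p, x in zip(posteriors, data)) / occupancy
              for d in range(dim)]
        var = [max(sum(p[i] * x[d] ** 2 for p, x in zip(posteriors, data))
                   / occupancy - mu[d] ** 2, var_floor)
               for d in range(dim)]
        new_w.append(occupancy / n_frames)
        new_m.append(mu)
        new_v.append(var)
    return new_w, new_m, new_v
```

Repeating `em_step` until the likelihood stops changing gives the conventional training loop. A production version would work in the log domain to avoid underflow on long utterances.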
This paper proposes an algorithm that consists of two steps
Cluster the training vectors to the mixture component with the highest likelihood

$$C = \arg\max_{1 \le i \le M} b_i(x)$$

Re-estimate parameters of each component
◦ $w_i$ = number of vectors classified in cluster i / total number of training vectors
◦ $\mu_i$ = sample mean of vectors classified in cluster i
◦ $\Sigma_i$ = sample covariance matrix of vectors classified in cluster i
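The two steps can be sketched as a single hard-assignment pass. This is an illustrative reading of the slide, not the authors' code. It assumes diagonal covariances, normalises the sample variance by the cluster size, leaves an empty cluster's mean and variance unchanged, and floors small variances; these choices and the function names are assumptions.

```python
import math


def log_gauss_diag(x, mean, var):
    # log b_i(x) with a diagonal covariance matrix (assumption).
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xj - m) ** 2 / v
                      for xj, m, v in zip(x, mean, var))


def highest_likelihood_step(data, means, variances, var_floor=1e-6):
    """One pass: cluster by argmax_i b_i(x), then re-estimate each component."""
    dim = len(data[0])
    # Step 1: assign each vector to the component with the highest likelihood.
    clusters = [[] for _ in means]
    for x in data:
        scores = [log_gauss_diag(x, m, v) for m, v in zip(means, variances)]
        clusters[scores.index(max(scores))].append(x)
    # Step 2: re-estimate weight, mean, and variance from cluster members.
    new_w, new_m, new_v = [], [], []
    for members, old_m, old_v in zip(clusters, means, variances):
        new_w.append(len(members) / len(data))
        if not members:             # assumption: keep old parameters
            new_m.append(list(old_m))
            new_v.append(list(old_v))
            continue
        mu = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        var = [max(sum((x[d] - mu[d]) ** 2 for x in members) / len(members),
                   var_floor)
               for d in range(dim)]
        new_m.append(mu)
        new_v.append(var)
    return new_w, new_m, new_v
```

Unlike the EM update, each vector contributes to exactly one component, so no posterior probabilities need to be accumulated.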
The feature is classified to the speaker whose model likelihood is the highest

$$\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k)$$

The above can be formulated in logarithmic terms

$$\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k)$$
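The log-domain decision rule can be sketched as below. The sketch assumes diagonal-covariance models stored in a dictionary that maps a speaker label to a `(weights, means, variances)` tuple; that layout, the labels, and the function names are hypothetical. A log-sum-exp is used so that the per-frame mixture likelihood does not underflow.

```python
import math


def log_gauss_diag(x, mean, var):
    # log b_i(x) with a diagonal covariance matrix (assumption).
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xj - m) ** 2 / v
                      for xj, m, v in zip(x, mean, var))


def log_mixture(x, weights, means, variances):
    """log p(x_t | lambda_k), computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def identify(frames, models):
    """Return the label k maximising sum_t log p(x_t | lambda_k)."""
    scores = {label: sum(log_mixture(x, *params) for x in frames)
              for label, params in models.items()}
    return max(scores, key=scores.get)
```

For example, with `models = {"spk_a": ([1.0], [[0.0]], [[1.0]]), "spk_b": ([1.0], [[5.0]], [[1.0]])}`, frames clustered near 5 are assigned to `"spk_b"`.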
Database and Experiment Conditions
◦ 7 male and 3 female speakers
◦ The same 40 sentence utterances with different text
◦ The average sentence duration is approximately 3.5 s
Performance Comparison between EM and Highest Mixture Likelihood Clustering Training
◦ The number of Gaussian components is 16
◦ 16-dimensional MFCCs
◦ 20 utterances are used for training
Convergence condition

$$\left| p(X \mid \lambda^{(k+1)}) - p(X \mid \lambda^{(k)}) \right| \le 0.03$$
The comparison between EM and highest likelihood clustering training on identification rate
◦ 10 sentences were used for training
◦ 25 sentences were used for testing
◦ 4 Gaussian components
◦ 8 iterations
Effect of Different Number of Gaussian Mixture Components and Amount of Training Data
◦ The MFCC feature dimension is fixed to 12
◦ 25 sentences are used for testing
Effect of Feature Set on Performance for Different Number of Gaussian Mixture Components
◦ Combination with first and second order difference coefficients was tested
◦ 10 sentences are used for training
◦ 30 sentences are used for testing
The proposed training performs comparably to conventional EM training but with less computational time
First order difference coefficients are sufficient to capture the transitional information with reasonable dimensional complexity
The 12-dimensional, 16-order GMM trained with 5 sentences achieved a 98.4% identification rate