ICASSP 2004 1
Speech Discrimination Based on Multiscale Spectro–Temporal Modulations
Nima Mesgarani, Shihab Shamma,
University of Maryland
Malcolm Slaney
IBM
Reporter : Chen, Hung-Bin
Outline
• Introduction: VAD (Voice Activity Detection and Speech Segmentation)
  – discriminate speech from non-speech, which consists of noise sounds
  – multiscale spectro-temporal modulation features extracted using a model of auditory cortex
• Two state-of-the-art systems
  – Robust Multifeature Speech/Music Discriminator
  – Robust Speech Recognition in Noisy Environments
• Auditory model
• Experimental results
• Summary and Conclusions
Introduction - VAD
• Significance
  – For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial step.
• Advantage
  – Speech discrimination can also be used for coding or telecommunication applications.
• Proposed system
  – a feature set inspired by investigations of various stages of the auditory system
Two state-of-the-art systems
• Multi–feature System
  – Features
    • Thirteen features in the time, frequency, and cepstrum domains are used to model speech and music (noise).
  – Classification
    • A Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space.
• Reference:
  – [1] E. Scheirer, M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP'97, 1997.
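The GMM classification step above can be sketched as follows. This is only an illustration, not the system from [1]: the diagonal covariances, the two-dimensional features, and the hand-picked mixture parameters in MODELS are all assumptions made for the example.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(x, class_models):
    """Return the label whose mixture assigns x the highest log-likelihood."""
    return max(class_models, key=lambda label: gmm_log_likelihood(x, class_models[label]))

# Hypothetical, hand-picked mixtures (not trained on any data).
MODELS = {
    "speech": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (2.0, 2.0), (1.0, 1.0))],
    "music": [(0.5, (8.0, 8.0), (1.0, 1.0)), (0.5, (10.0, 6.0), (1.0, 1.0))],
}
```

Calling classify((0.5, 0.5), MODELS) picks the class whose clusters lie closest to the point. In a real system the mixture parameters would be estimated from training data rather than written by hand.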
Two state-of-the-art systems (cont)
• Voicing–energy System
  – Features
    • Frame-by-frame maximum autocorrelation and log-energy features are used to make the speech/non-speech decision.
    • PLP
    • LDA+MLLT
  – Segmentation
    • An HMM-based segmentation procedure uses two models, one for speech segments and one for non-speech segments.
• Reference:
  – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
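The two frame-level features named above can be sketched as follows. This is a minimal version under stated assumptions: the frame length, hop size, lag range, and the normalization by frame energy are illustrative choices, not the settings used in [2].

```python
import math

def log_energy(frame, floor=1e-10):
    """Log of the mean squared amplitude of one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + floor)  # floor avoids log(0) on silent frames

def max_autocorrelation(frame, min_lag, max_lag):
    """Largest energy-normalized autocorrelation over lags in [min_lag, max_lag]."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    best = -1.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best

def frame_features(signal, frame_len, hop, min_lag, max_lag):
    """Return one (max autocorrelation, log-energy) pair per frame."""
    return [
        (max_autocorrelation(signal[start:start + frame_len], min_lag, max_lag),
         log_energy(signal[start:start + frame_len]))
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

Strongly periodic (voiced) frames give a maximum autocorrelation near 1, while noise-like frames give values near 0, which is what makes the pair useful for a speech/non-speech decision.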
Auditory model
• The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations at various stages of the auditory system.
• transformation of the acoustic signal into an internal neural representation (auditory spectrogram)
Auditory model (cont)
• The acoustic signal produces a complex spatiotemporal pattern
  – vibrations along the basilar membrane of the cochlea
• 3-step process
  1) highpass filter, followed by an instantaneous nonlinear compression
  2) lowpass filter (hair cell membrane leakage)
  3) detection of discontinuities in the responses across the tonotopic axis of the auditory nerve array
• The spectro-temporal modulation analysis is performed computationally via a bank of modulation-selective filters centered at each frequency along the tonotopic axis.
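The three-step process listed above can be sketched as follows. This is a toy version, not the authors' model: the first-difference highpass, the tanh compression, the one-pole lowpass with coefficient alpha, and the rectified neighbour difference used to detect discontinuities across channels are all stand-ins chosen for illustration.

```python
import math

def hair_cell_stage(channel, alpha=0.9, gain=1.0):
    """Highpass, compress, then lowpass the output of one cochlear channel."""
    out, prev_in, state = [], 0.0, 0.0
    for sample in channel:
        high = sample - prev_in              # first difference as a crude highpass
        prev_in = sample
        compressed = math.tanh(gain * high)  # instantaneous saturating nonlinearity
        state = alpha * state + (1.0 - alpha) * compressed  # leaky-integrator lowpass
        out.append(state)
    return out

def lateral_inhibition(channels):
    """Difference between neighbouring channels, half-wave rectified."""
    return [
        [max(hi - lo, 0.0) for lo, hi in zip(channels[k - 1], channels[k])]
        for k in range(1, len(channels))
    ]

def early_auditory_stage(cochlear_outputs):
    """Apply the hair-cell stage per channel, then compare adjacent channels."""
    processed = [hair_cell_stage(ch) for ch in cochlear_outputs]
    return lateral_inhibition(processed)
```

Here cochlear_outputs is assumed to be a list of per-channel sample lists ordered along the tonotopic axis. Identical neighbouring channels produce an all-zero output, so only changes across frequency survive.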
Auditory model (cont)
• Sound is analyzed by a model of the cochlea (depicted on the left) consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis.
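Equal spacing on a logarithmic axis means that adjacent center frequencies differ by a constant ratio. A minimal sketch follows; the lowest frequency of 180 Hz and the density of 24 channels per octave are assumed values for illustration, since the slide gives only the channel count.

```python
def center_frequencies(num_channels=128, lowest_hz=180.0, per_octave=24):
    """Center frequencies equally spaced on a log2 frequency axis."""
    return [lowest_hz * 2.0 ** (k / per_octave) for k in range(num_channels)]
```

With a constant-Q design each filter's bandwidth is proportional to its center frequency, so the filters also look equally wide on the same logarithmic axis.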
Multilinear Analysis Of Cortical Representation
• The output of the auditory model is a multidimensional array.
• The time dimension is averaged over a given time window, which results in a three-mode tensor for each time window, with each element representing the overall modulation at the corresponding frequency, rate, and scale: 128 (frequency channels) × 26 (rates) × 6 (scales).
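The time-averaging step can be sketched as follows. The sketch assumes the model output for one window is available as a list of real-valued frequency × rate × scale arrays, one per time step; that layout is an assumption of the example.

```python
def average_over_window(frames):
    """Average a list of frequency x rate x scale arrays over time.

    `frames` holds one nested list per time step; the result has the same
    frequency x rate x scale shape with the time dimension removed.
    """
    count = len(frames)
    n_freq = len(frames[0])
    n_rate = len(frames[0][0])
    n_scale = len(frames[0][0][0])
    return [
        [
            [sum(frame[f][r][s] for frame in frames) / count for s in range(n_scale)]
            for r in range(n_rate)
        ]
        for f in range(n_freq)
    ]
```

For the dimensions on the slide, each window yields 128 × 26 × 6 = 19968 values.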
Multilinear Analysis Of Cortical Representation (cont)
• Multi-dimensional PCA is used to tailor the amount of reduction in each subspace independently.
• To handle multidimensional tensors, we consider a generalization of SVD (Singular Value Decomposition) to tensors.
• $D = S \times_1 U_{frequency} \times_2 U_{rate} \times_3 U_{scale} \times_4 U_{samples}$
  – $D$: the data tensor
  – $S$: the core tensor, of size $I_1 \times I_2 \times \dots \times I_N$
• Original: 128 (frequency channels) × 26 (rates) × 6 (scales)
• The resulting tensor, built from the retained singular vectors in each mode (7 for frequency, 5 for rate, and 3 for scale), is used for classification.
• Classification was performed using a Support Vector Machine (SVM).
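The decomposition and the reduction it allows can be written out as below. The mode-n product is the standard definition. The projection step is an assumed reading of how the retained singular vectors are applied, and the symbols $A$ (one window's tensor), $Z$ (its reduced form), and $\tilde{U}$ (a matrix of retained singular vectors) are introduced here for illustration only.

```latex
% Mode-n product of a tensor A (I_1 x ... x I_N) with a matrix U (J_n x I_n)
(A \times_n U)_{i_1 \cdots i_{n-1}\, j_n\, i_{n+1} \cdots i_N}
  = \sum_{i_n} a_{i_1 \cdots i_n \cdots i_N}\, u_{j_n i_n}

% Decomposition of the data tensor, as on the slide
D = S \times_1 U_{frequency} \times_2 U_{rate} \times_3 U_{scale} \times_4 U_{samples}

% Assumed reduction: project one 128 x 26 x 6 window tensor A onto the
% retained singular vectors (7, 5, and 3 columns respectively)
Z = A \times_1 \tilde{U}_{frequency}^{\top}
      \times_2 \tilde{U}_{rate}^{\top}
      \times_3 \tilde{U}_{scale}^{\top},
\qquad Z \in \mathbb{R}^{7 \times 5 \times 3}

% Feature count per window
128 \times 26 \times 6 = 19968 \;\longrightarrow\; 7 \times 5 \times 3 = 105
```

Under this reading, the classifier sees 105 values per window instead of 19968.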
Experimental Results
• Speech data from the TIMIT database
  – Training data: 300 samples
  – Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female)
  – The training and test sets were different.
• Non-speech class
  – Sounds from the BBC Sound Effects audio CD, the RWC Genre Database, and the Noisex and Aurora databases were assembled together.
• Training set
  – 300 speech and 740 non-speech samples
• Testing set
  – 150 speech and 450 non-speech samples
• All samples have the same audio length.
Experimental Results (cont)
• Speech detection/discrimination
  – Tables 1 and 2 show the effect
Experimental Results (cont)
• In these tests, white and pink noise were added to speech at specified signal-to-noise ratios (SNR).
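Adding noise at a specified SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target in decibels. A minimal sketch follows, assuming equal-length sample lists; the helper names mix_at_snr and white_noise are hypothetical, and only white Gaussian noise is generated here (pink noise would need an additional shaping filter).

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]

def white_noise(length, seed=0):
    """Gaussian white noise from a seeded generator, for reproducibility."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]
```

For example, mix_at_snr(speech, white_noise(len(speech)), 10.0) returns the speech with noise whose power is one tenth of the speech power.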
Experimental Results (cont)
• Effect of different levels of reverberation on the performance
Summary and Conclusions
• This work is but one in a series of efforts at incorporating multi–scale cortical representations (and more broadly, perceptual insights) in a variety of audio and speech processing applications.
• Applications such as
  – automatic classification
  – segmentation of animal sounds
  – an efficient encoding of speech and music
Reference
• Two state-of-the-art systems
  – [1] E. Scheirer, M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP'97, 1997.
  – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
• Central Auditory System
  – [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system", IEEE Trans. Speech Audio Proc., vol. 3 (5), pp. 382–395, 1995.
  – [6] M. Elhilali, T. Chi and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility", Speech Comm., vol. 41, pp. 331–348, 2003.
  – Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method
• SHIHAB A. SHAMMA– http://www.isr.umd.edu/People/faculty/Shamma.html