speech discrimination based on multiscale spectro–temporal modulations

ICASSP 2004 1

Speech Discrimination Based on Multiscale Spectro–Temporal Modulations

Nima Mesgarani, Shihab Shamma

University of Maryland

Malcolm Slaney

IBM

Reporter : Chen, Hung-Bin


Outline

• Introduction: VAD (Voice Activity Detection and Speech Segmentation)
  – discriminate speech from non-speech, which consists of noise sounds
  – multiscale spectro-temporal modulation features extracted using a model of auditory cortex

• Two state-of-the-art systems
  – Robust Multifeature Speech/Music Discriminator
  – Robust Speech Recognition in Noisy Environments

• Auditory model

• Experimental results

• Summary and Conclusions


Introduction - VAD

• Significance
  – For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial step.

• Advantage
  – Speech discrimination can also be used for coding or telecommunication applications.

• Proposed system
  – a feature set inspired by investigations of various stages of the auditory system


Two state-of-the-art systems

• Multi–feature System
  – Features
    • Thirteen features in the time, frequency, and cepstrum domains are used to model speech and music (noise).
  – Classification
    • A Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space.

• Reference:
  – [1] E. Scheirer, M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator”, ICASSP’97, 1997.
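The GMM classification step can be written out as a short formula. This is the standard mixture-model form, not a transcription from [1]; the symbols (feature vector $x$, class $c$, number of clusters $K$, weights $w_{c,k}$, means $\mu_{c,k}$, covariances $\Sigma_{c,k}$) are notation assumed here for illustration.

```latex
% Class-conditional density: a weighted sum of K Gaussian clusters
p(x \mid c) = \sum_{k=1}^{K} w_{c,k}\, \mathcal{N}\!\left(x;\ \mu_{c,k},\ \Sigma_{c,k}\right),
\qquad \sum_{k=1}^{K} w_{c,k} = 1, \quad w_{c,k} \ge 0

% Decision rule: pick the class (speech or music/noise) with the larger posterior
\hat{c} = \arg\max_{c \in \{\text{speech},\ \text{music}\}} \ p(x \mid c)\, P(c)
```

With a 13-dimensional feature vector, each $\mu_{c,k}$ would have 13 entries and each $\Sigma_{c,k}$ would be a 13 × 13 matrix.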


Two state-of-the-art systems (cont)

• Voicing–energy System
  – Features
    • Frame-by-frame maximum autocorrelation and log-energy features are used to make the speech/non-speech decision.
    • PLP
    • LDA+MLLT
  – Segmentation
    • An HMM-based segmentation procedure uses two models, one for speech segments and one for non-speech segments.

• Reference:
  – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, “Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system”, ICASSP 2002, vol. I, pp. 53–56, 2002.
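As a rough illustration of the two frame-level features named above, the sketch below computes a normalized maximum autocorrelation and a log-energy per frame. The function names, frame length, hop size and lag range are assumptions chosen for the example and are not taken from [2]; the PLP, LDA+MLLT and HMM stages are not covered.

```python
import math

def split_frames(signal, frame_len=400, hop=160):
    """Cut a signal into overlapping frames (sizes are illustrative assumptions)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def frame_features(frame, min_lag=20, max_lag=160):
    """Return (max normalized autocorrelation, log-energy) for one frame."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-12)  # small offset avoids log(0)
    if energy == 0.0:
        return 0.0, log_energy
    best = 0.0
    # Search only lags in an assumed pitch range; periodic (voiced) frames
    # give a peak close to 1, noise-like frames stay near 0.
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        best = max(best, r / energy)
    return best, log_energy

# A periodic test frame (period of 40 samples) gives a strong autocorrelation peak.
sine = [math.sin(2 * math.pi * i / 40) for i in range(400)]
voicing, log_e = frame_features(sine)
```

A two-dimensional feature like this would then be passed, frame by frame, to whatever speech/non-speech model is used downstream.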


Auditory model

• The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations at various stages of the auditory system.

• transformation of the acoustic signal into an internal neural representation (auditory spectrogram)


Auditory model (cont)

• a complex spatiotemporal pattern
  – vibrations along the basilar membrane of the cochlea

• 3–step process
  1) highpass filter, followed by an instantaneous nonlinear compression
  2) lowpass filter (hair cell membrane leakage)
  3) detection of discontinuities in the responses across the tonotopic axis of the auditory nerve array

  – computationally, via a bank of modulation-selective filters centered at each frequency along the tonotopic axis
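The three steps listed above can be sketched per channel as follows. This is a toy version under stated assumptions: one-pole filters stand in for the highpass and lowpass stages, `tanh` stands in for the nonlinear compression, and a rectified first difference between neighbouring channels stands in for the discontinuity detection. The function name and all coefficient values are hypothetical, not the authors' parameters.

```python
import math

def early_auditory_stage(channels, hp_coeff=0.95, lp_coeff=0.9, gain=2.0):
    """channels: list of per-channel sample lists, ordered low to high frequency."""
    processed = []
    for x in channels:
        out, prev_x, prev_hp, lp = [], 0.0, 0.0, 0.0
        for s in x:
            # 1) one-pole highpass, then instantaneous compression (assumed tanh)
            hp = hp_coeff * (prev_hp + s - prev_x)
            prev_x, prev_hp = s, hp
            comp = math.tanh(gain * hp)
            # 2) leaky-integrator lowpass (stands in for membrane leakage)
            lp = lp_coeff * lp + (1.0 - lp_coeff) * comp
            out.append(lp)
        processed.append(out)
    # 3) difference across the channel (tonotopic) axis, half-wave rectified,
    #    so only changes between neighbouring channels survive
    result = []
    for k, row in enumerate(processed):
        below = processed[k - 1] if k > 0 else [0.0] * len(row)
        result.append([max(a - b, 0.0) for a, b in zip(row, below)])
    return result

tone = [math.sin(0.3 * i) for i in range(50)]
# Two identical channels followed by a silent one
pattern = early_auditory_stage([tone, tone, [0.0] * 50])
```

Because the second channel is identical to the first, its across-channel difference is zero everywhere, which shows how step 3 keeps only the edges of the spectral profile.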


Auditory model (cont)

• Sound is analyzed by a model of the cochlea (depicted on the left) consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis.
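To make "equally spaced on a logarithmic frequency axis" and "constant-Q" concrete, the sketch below lists center frequencies and bandwidths for a 128-channel bank. Only the channel count comes from the slide; the lowest frequency (180 Hz), the density (24 channels per octave), the Q value and the function name are assumptions for illustration.

```python
def constant_q_bank(n_channels=128, f_min=180.0, channels_per_octave=24, q=4.0):
    """Return (center_frequency_hz, bandwidth_hz) pairs for a constant-Q bank."""
    bank = []
    for k in range(n_channels):
        # Equal steps in log2(frequency): each channel is a fixed ratio above the last
        fc = f_min * 2.0 ** (k / channels_per_octave)
        # Constant Q: bandwidth grows in proportion to the center frequency
        bank.append((fc, fc / q))
    return bank

bank = constant_q_bank()
```

With these assumed values the ratio between adjacent center frequencies is always 2^(1/24), so channel 24 sits exactly one octave above channel 0.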


Multilinear Analysis Of Cortical Representation

• The output of the auditory model is a multidimensional array.

• The time dimension is averaged over a given time window, which results in a three-mode tensor for each time window, with each element representing the overall modulation at the corresponding frequency, rate and scale: 128 (frequency channels) × 26 (rates) × 6 (scales).


Multilinear Analysis Of Cortical Representation (cont)

• Multi-dimensional PCA is used to tailor the amount of reduction in each subspace independently.

• To handle multidimensional tensors, we consider a generalization of SVD (Singular Value Decomposition) to tensors.

• $D = S \times_1 U_{frequency} \times_2 U_{rate} \times_3 U_{scale} \times_4 U_{samples}$
  – $D$ : the data tensor
  – $S$ : the core tensor, of size $I_1 \times I_2 \times \dots \times I_N$

• Original: 128 (frequency channels) × 26 (rates) × 6 (scales)

• The resulting tensor, which retains only the leading singular vectors in each mode (7 for frequency, 5 for rate and 3 for scale dimensions), is used for classification.

• Classification was performed using a Support Vector Machine (SVM).
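The per-mode reduction described above can be sketched with NumPy as a truncated higher-order SVD: unfold the data tensor along each mode, keep the leading left singular vectors, and project each sample onto them. The data here is random, the sample count of 20 is arbitrary, and the helper names (`unfold`, `mode_bases`, `project`) are hypothetical; only the tensor sizes and the retained ranks (7, 5, 3) come from the slide. The SVM stage is not shown.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: rows index the chosen mode, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_bases(data, ranks):
    """Leading left singular vectors of each mode unfolding."""
    bases = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(data, mode), full_matrices=False)
        bases.append(u[:, :r])
    return bases

def project(sample, bases):
    """Project one frequency x rate x scale tensor onto the retained bases."""
    out = sample
    for mode, u in enumerate(bases):
        # Contract u^T with the given mode, then put the new axis back in place
        out = np.moveaxis(np.tensordot(u.T, out, axes=(1, mode)), 0, mode)
    return out

rng = np.random.default_rng(0)
data = rng.standard_normal((128, 26, 6, 20))   # synthetic stand-in, 20 samples
bases = mode_bases(data, ranks=(7, 5, 3))      # frequency, rate, scale
features = project(data[..., 0], bases)
print(features.shape)
```

Each sample is reduced from 128 × 26 × 6 = 19968 values to 7 × 5 × 3 = 105, and the rank in each mode can be chosen independently, which is the point made in the first bullet.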


Experimental Results

• Audio database from TIMIT
  – Training data: 300 samples
  – Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female)
  – The training and test sets were different.

• Non-speech class
  – assembled from the BBC Sound Effects audio CD, the RWC Genre Database, and the Noisex and Aurora databases

• Training set
  – 300 speech and 740 non-speech samples

• Testing set
  – 150 speech and 450 non-speech samples

• All audio samples have equal length.


Experimental Results (cont)

• Speech detection/discrimination
  – Tables 1 and 2 show the effect



• In these tests, white and pink noise were added to speech at a specified signal-to-noise ratio (SNR).
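Mixing noise into speech at a target SNR amounts to scaling the noise so that the power ratio matches the requested value in dB. The sketch below shows this for white Gaussian noise; the function name and the sinusoidal stand-in for speech are assumptions, and pink noise generation is not shown.

```python
import math
import random

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR_dB = 10 * log10(p_speech / (scale^2 * p_noise)), solved for scale
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [math.sin(0.05 * i) for i in range(1000)]   # stand-in for a speech signal
white = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # white Gaussian noise
mixed = add_noise_at_snr(speech, white, snr_db=10.0)
```

The same function works for any noise signal of matching length, since only its average power enters the scaling.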



• Effect of different levels of reverberation on performance


Summary and Conclusions

• This work is but one in a series of efforts at incorporating multi–scale cortical representations (and more broadly, perceptual insights) in a variety of audio and speech processing applications.

• Applications such as
  – automatic classification
  – segmentation of animal sounds
  – efficient encoding of speech and music


Reference

• Two state-of-the-art systems
  – [1] E. Scheirer, M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator”, ICASSP’97, 1997.
  – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, “Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system”, ICASSP 2002, vol. I, pp. 53–56, 2002.

• Central Auditory System
  – [4] K. Wang and S. A. Shamma, “Spectral shape analysis in the central auditory system”, IEEE Trans. Speech Audio Proc., vol. 3 (5), pp. 382–395, 1995.
  – [6] M. Elhilali, T. Chi and S. A. Shamma, “A spectro-temporal modulation index (STMI) for assessment of speech intelligibility”, Speech Comm., vol. 41, pp. 331–348, 2003.
  – Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method

• SHIHAB A. SHAMMA
  – http://www.isr.umd.edu/People/faculty/Shamma.html
