
Speech Communication 48 (2006) 819–837

www.elsevier.com/locate/specom

Continuous energy demodulation methods and application to speech analysis

Dimitrios Dimitriadis, Petros Maragos

National Technical University of Athens, School of Electrical and Computer Engineering, Iroon Polytexneiou str., Athens 15773, Greece

Received 5 April 2004; received in revised form 10 March 2005; accepted 31 August 2005

Abstract

Speech resonance signals appear to contain significant amplitude and frequency modulations. An efficient demodulation approach is based on energy operators. In this paper, we develop two new robust methods for energy-based speech demodulation and compare their performance on both test and actual speech signals. The first method uses smoothing splines for discrete-to-continuous signal approximation. The second (and best) method uses time-derivatives of Gabor filters. Further, we apply the best demodulation method to explore the statistical distribution of speech modulation features and study their properties with regard to speech classification and recognition applications. Finally, we present some preliminary recognition results and highlight their improvements over the corresponding MFCC results.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Nonstationary speech analysis; Energy operators; AM–FM modulations; Demodulation; Gabor filterbanks; Feature distributions; ASR; Robust features; Nonlinear speech analysis

1. Introduction

Nonlinear, time-varying signal models of the AM–FM type are nowadays receiving significant attention in nonstationary signal and speech processing schemes. The estimation of the signal instantaneous frequencies and amplitude envelopes is referred to as the 'Demodulation Problem'. Demodulating AM–FM signals, i.e., nonstationary sines


that have a combined amplitude modulation (AM) and frequency modulation (FM),

$$x(t) = a(t)\,\cos\!\left(\int_0^t \omega(s)\,\mathrm{d}s\right) \qquad (1)$$

has been a major research problem with many applications in communication systems, speech processing and, in general, nonstationary signal analysis. To solve it, a new approach was developed in the 1990s based on nonlinear differential operators that can track the instantaneous energy (or its derivatives) of a source producing an oscillation (Kaiser, 1983, 1990; Maragos and Potamianos, 1995). The main representative of this class of operators is the continuous-time Teager–Kaiser energy operator (TEO) $\Psi[x(t)] \triangleq [\dot{x}(t)]^2 - x(t)\ddot{x}(t)$, where $\dot{x}(t) = \mathrm{d}x(t)/\mathrm{d}t$.


Applied to the AM–FM signal (1), Ψ yields the instantaneous source energy, i.e., Ψ[x(t)] ≈ a²(t)ω²(t), where the approximation error becomes negligible (Maragos et al., 1993) if the instantaneous amplitude a(t) and instantaneous frequency ω(t) do not vary too fast or too much with respect to the average value of ω(t). Thus, AM–FM demodulation can be achieved by separating the instantaneous energy into its amplitude and frequency components. Ψ is the main ingredient of the first energy separation algorithm (ESA),

$$\sqrt{\frac{\Psi[\dot{x}(t)]}{\Psi[x(t)]}} \approx \omega(t), \qquad \frac{\Psi[x(t)]}{\sqrt{\Psi[\dot{x}(t)]}} \approx |a(t)|, \qquad (2)$$

developed in (Maragos et al., 1993) and used for signal and speech AM–FM demodulation.
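As a quick numerical illustration of the approximation Ψ[x(t)] ≈ a²(t)ω²(t) (our own sketch, not part of the original paper; the derivatives are replaced by central differences and the modulation rates and frequencies below are arbitrary illustrative values):

```python
import numpy as np

# Synthetic AM-FM signal with slow modulations (illustrative values only)
fs = 16000.0                                  # samples per second
t = np.arange(0, 0.25, 1.0 / fs)
a = 1.0 + 0.2 * np.cos(2 * np.pi * 4 * t)     # instantaneous amplitude a(t)
omega = 2 * np.pi * 300 * (1 + 0.05 * np.cos(2 * np.pi * 4 * t))   # instantaneous frequency (rad/s)
x = a * np.cos(np.cumsum(omega) / fs)         # discrete approximation of the phase integral

# Teager-Kaiser operator with central-difference derivative approximations
xd = np.gradient(x, 1.0 / fs)
xdd = np.gradient(xd, 1.0 / fs)
psi = xd ** 2 - x * xdd

rel_dev = np.abs(psi - a ** 2 * omega ** 2) / (a ** 2 * omega ** 2)
print("median relative deviation from a^2 * omega^2:", np.median(rel_dev))
```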

Motivated by the strong evidence of the existence of amplitude and frequency modulations (AM–FM) in speech resonance signals, which make their amplitudes and frequencies vary instantaneously within a pitch period (Maragos et al., 1993), we propose to model each speech resonance with an AM–FM signal

$$r(t) = A\,\exp\!\left(-\int_0^t b(s)\,\mathrm{d}s\right)\cos\!\left(\int_0^t \omega(s)\,\mathrm{d}s\right) \qquad (3)$$

and the total speech signal as a superposition of a small number of such AM–FM signals. The smooth estimation of their instantaneous frequencies ω(t) and amplitude envelopes |a(t)| is of significant importance for speech analysis and feature extraction applications.

The instantaneous energy separation methodology has led to several classes of algorithms for demodulating discrete-time AM–FM signals

$$r[n] = r(nT) = A\,\exp\!\left(-\int_0^n b[k]\,\mathrm{d}k\right)\cos\!\left(\int_0^n \omega[k]\,\mathrm{d}k\right), \qquad (4)$$

where a[n] = a(nT) and ω[n] = Tω(nT). Note that in Eq. (4) the integral notation ∫(·) is used symbolically instead of the summation notation Σ(·) for the discrete signals involved; i.e., in the FM part it is used to underline the differential relationship between ω and phase. A direct approach to this objective is to apply the discrete-time Teager–Kaiser operator $\Psi_d[x_n] \triangleq x_n^2 - x_{n-1}x_{n+1}$, where $x_n \triangleq x[n]$, to the discrete-time AM–FM signal (4) and thus derive discrete energy equations of the form

$$\Psi_d[r_n] \approx a^2[n]\,\sin^2(\omega[n]). \qquad (5)$$

This yields the following algorithm, called the Discrete ESA or DESA (Maragos et al., 1993):

$$\arccos\!\left(1 - \frac{\Psi_d[r_n - r_{n-1}] + \Psi_d[r_{n+1} - r_n]}{4\,\Psi_d[r_n]}\right) \approx \omega[n] = 2\pi f[n], \qquad (6)$$

$$\sqrt{\frac{\Psi_d[r_n]}{\sin^2(\omega[n])}} \approx |a[n]|. \qquad (7)$$
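For concreteness, a minimal sketch of Eqs. (5)–(7) in the discrete domain (our own illustration, not the authors' code; border samples and divisions by zero are handled crudely):

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser operator, Psi_d[x_n] = x_n^2 - x_{n-1} x_{n+1} (zero at the borders)."""
    psi = np.zeros_like(x, dtype=float)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def desa(r, eps=1e-12):
    """DESA demodulation, Eqs. (6)-(7): instantaneous frequency (rad/sample) and amplitude envelope."""
    psi_r = teager(r)
    psi_y = teager(np.diff(r, prepend=r[0]))     # Psi_d of the backward difference y_n = r_n - r_{n-1}
    arg = 1.0 - (psi_y[:-1] + psi_y[1:]) / (4.0 * psi_r[:-1] + eps)
    omega = np.arccos(np.clip(arg, -1.0, 1.0))   # Eq. (6)
    amp = np.sqrt(np.abs(psi_r[:-1]) / (np.sin(omega) ** 2 + eps))   # Eq. (7)
    return omega, amp
```

Each estimate depends only on a five-sample neighbourhood of r_n, which is what gives the DESA its fine time resolution.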

Another approach involves estimating the instantaneous frequency by modeling the discrete-time signal r_n via the exact Prony method, as shown in (Fertig and McClellan, 1996; Ramalingam, 1996). This yields algorithms that also contain the discrete energy operator as their main ingredient. The advantages of the ESAs are efficiency, low computational complexity and excellent time resolution (5-sample window). Their main disadvantage, though, is a moderate sensitivity to noise (Potamianos and Maragos, 1994).

In this paper, two more systematic demodulation approaches are developed in which the discrete-time signal is first expanded into the continuous-time domain and then the continuous-time ESA of Eq. (2) is applied to it. The first approach approximates the signal with smoothing splines and then differentiates the approximating continuous-time polynomials (Dimitriadis and Maragos, 2001). The second combines time-differentiation and filtering of the signal with a Gabor filterbank into convolutions of the signal with the time-derivatives of the Gabor filter's impulse response. The advantages of such approaches are that we avoid the noisy one-sample discrete-time approximations of the derivatives and obtain smoother estimates of the signal's time-derivatives in the presence of noise.

Another contribution of this paper is the application of these newly proposed algorithms to speech analysis and feature extraction. Significant conclusions have been drawn from the instantaneous signals and their distributions. Furthermore, features inspired by the speech resonance model, Eq. (3), are extracted. The motivation for such a feature analysis scheme stems from improvements in ASR recognition rates which we have observed in previous experimental work (Dimitriadis and Maragos, 2003; Dimitriadis et al., 2002). In this paper, we study their distributions and dependencies on various aspects of the speech signal. Finally, we present some recognition results indicating the improvement of ASR rates in clean and noisy speech when the AM–FM feature vectors augment the MFCCs.


The paper is organized as follows: Section 2 describes the smoothing spline approach, presenting at the beginning a brief background on splines. Section 3 presents the differentiated Gabor filtering approach. In Section 4 the experimental results, comparing these two new approaches with the standard DESA, are presented and the best algorithm is proposed. In Section 5, we apply this algorithm to speech analysis and feature extraction; we examine the statistical distributions and classification properties of the resulting features and apply them to clean and noisy speech recognition tasks. Finally, in Section 6 some overall conclusions are presented.

2. Spline ESA

In this section we interpolate¹ discrete-time signals with splines. Spline functions are piecewise continuous polynomials assembled as linear combinations of B-splines. A spline function of order n has continuous (and smooth) derivatives up to order n − 1, a very important property when using the Ψ-operator. This was our principal motivation for introducing the use of splines. In addition, smoothing splines provide even smoother time-derivatives than exact splines, without losing the property of continuity.

2.1. Exact splines

Given the initial signal samples x[n], where n = 1, ..., N, the interpolating spline function is given by

$$s_m(t) = \sum_{n=-\infty}^{+\infty} c[n]\,\beta^m(t - n), \qquad (8)$$

where β^m(t) is the B-spline of order m and the coefficients c[n] depend only on the data x[n] and the analytic expression of the B-spline. The B-spline of order m can be formed as the (m + 1)-fold convolution of the zeroth-order B-spline with itself,

$$\beta^m(t) \triangleq \underbrace{\beta^0(t) * \beta^0(t) * \cdots * \beta^0(t)}_{(m+1)\ \text{times}},$$

where the zeroth-order B-spline is defined by

$$\beta^0(t) = \begin{cases} 1 & \text{if } |t| < 1/2, \\ 1/2 & \text{if } |t| = 1/2, \\ 0 & \text{otherwise.} \end{cases}$$

Using the discrete B-spline $b^m[n] \triangleq \beta^m(n)$, Eq. (8) becomes

$$s_m(n) = (c * b^m)[n]. \qquad (9)$$

For the exact interpolation problem, s_m(n) = x[n]. By transforming Eq. (9) into the Z domain, we obtain

$$C(z) = \frac{X(z)}{B^m(z)}. \qquad (10)$$

Thus, the spline coefficients c[n] can be determined recursively from the above equation. This approach can also be viewed as a filter with frequency response H(z) = 1/B^m(z), called the spline filter. Spline filters of odd order are proved to be always stable (Unser et al., 1991). Each original sample s_m(n) = x[n] is resynthesized by the contribution of m neighboring spline coefficients.

¹ We have also experimented with least-squares polynomials for the interpolation process. However, the main problem with such polynomials is that there is no theoretical guarantee that the time-derivatives of the interpolant are smooth.

2.2. Smoothing splines

We have used exact splines to improve the performance of the ESA and tested them on noisy AM–FM signals with different levels of SNR. The results were, however, disappointing, as the exact fitting of the curve was the source of large estimation errors due to the presence of noise. The problem of noise led us to the need for approximating the signal samples with smoothing splines. The main advantage of smoothing splines is that the approximating polynomial does not pass precisely through the signal samples but "close enough".

The "smoothing spline approximating function" is defined as the function s_m of order m = 2r − 1 that minimizes the mean square error criterion (Unser et al., 1993)

$$\varepsilon = \underbrace{\sum_{n=-\infty}^{+\infty} \big(x[n] - s_m(n)\big)^2}_{\varepsilon_d} \;+\; \lambda\,\underbrace{\int_{-\infty}^{+\infty} \left(\frac{\partial^r s_m(x)}{\partial x^r}\right)^{\!2} \mathrm{d}x}_{\varepsilon_s},$$

where ε_d is the mean square error of the approximation function and ε_s is the mean square error introduced by the need for a smoothed curve. This criterion is a compromise between the need for an approximation curve closer to the data points and the need for a smoothed curve. The positive parameter λ quantifies the approximating curve's smoothness, criterion ε_s, and how close to the data points


this curve passes, criterion ε_d. For λ = 0, no smoothing takes place and the approximation curve fits the signal samples exactly. If λ ≠ 0, the deviation from the data samples increases with λ in a nonlinear way, while the smoothing performance increases, too. r is the order of the polynomial's time-derivative that regulates its smoothness.

As shown in (Unser et al., 1993), the approximating polynomial s_m(t) minimizing the mean square error ε is a linear combination of splines β^m, as in Eq. (8), though the coefficients c[n] are computed as the output of an IIR filter $H_m^\lambda(z)$, much different from the filter in (10):

$$C(z) = H_m^\lambda(z)\,X(z) = \frac{X(z)}{P_m^\lambda(z)}, \qquad (11)$$

where $P_m^\lambda(z)$ is given by

$$P_m^\lambda(z) = B^m(z) + \lambda\,(-z + 2 - z^{-1})^r \qquad (12)$$

and r is defined as above. The IIR filter $H_m^\lambda(z)$ has a symmetric impulse response and all its poles lie inside the unit circle of the complex plane (Aldroubi et al., 1992). So, the spline coefficients c[n] can be stably determined via a few recursive equations (Unser et al., 1993). Henceforth, smoothing splines with λ ≠ 0 are applied in every case, unless stated otherwise.
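For the cubic case (m = 3), SciPy ships a direct implementation of this spline filtering; a minimal sketch follows (our own illustration; note that the paper uses fifth-order splines, and SciPy's `lamb` smoothing parameter need not follow exactly the same normalization as λ in Eq. (12)):

```python
import numpy as np
from scipy.signal import cspline1d, cspline1d_eval

rng = np.random.default_rng(0)
x = np.cos(2 * np.pi * 0.05 * np.arange(200)) + 0.1 * rng.standard_normal(200)

c_exact = cspline1d(x, lamb=0.0)     # exact cubic-spline coefficients (lambda = 0)
c_smooth = cspline1d(x, lamb=0.5)    # smoothing cubic-spline coefficients (lambda != 0)

# Evaluate the continuous-time approximations s_3(t) on a finer grid (knots at the integers)
t = np.linspace(0, 199, 1991)
s_exact = cspline1d_eval(c_exact, t)
s_smooth = cspline1d_eval(c_smooth, t)
```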

2.3. Spline ESA

The previous discussion leads us to approximate a discrete-time signal x[n] using smoothing splines of mth order and thus create a continuous-time signal

$$s_m(t) = \sum_{n=-\infty}^{+\infty} c[n]\,\beta^m(t - n). \qquad (13)$$

The basic idea of the new approach for ESA-based demodulation is to apply the continuous-time energy operator Ψ and the continuous ESA to the continuous-time signal s_m(t), instead of using the discrete energy operator Ψ_d and the DESA on the discrete signal x[n]:

$$\Psi[s_m(t)] = \left(\frac{\partial s_m(t)}{\partial t}\right)^{\!2} - s_m(t)\,\frac{\partial^2 s_m(t)}{\partial t^2}. \qquad (14)$$

The use of the continuous ESA requires the estimation of the signal's first-, second- and third-order derivatives. A basic property of the B-splines is that their time-derivatives can be obtained from a recursive equation:

$$\frac{\mathrm{d}\beta^m(t)}{\mathrm{d}t} = \beta^{m-1}\!\left(t + \tfrac{1}{2}\right) - \beta^{m-1}\!\left(t - \tfrac{1}{2}\right). \qquad (15)$$

The time derivative of the polynomial (13) is given by

$$\frac{\partial s_m(t)}{\partial t} = \sum_{n=-\infty}^{+\infty} c[n]\,\frac{\partial \beta^m(t - n)}{\partial t}. \qquad (16)$$

Given the coefficients c[n] of the spline approximation (13) and taking (15) into consideration, after some algebra we can derive the following closed-form expressions for these derivatives, involving only the coefficients c[n] and the B-spline analytic expressions:

$$\frac{\partial s_m(t)}{\partial t} = \sum_{n} \big(c[n] - c[n-1]\big)\,\beta^{m-1}(t - n + 1/2), \qquad (17)$$

$$\frac{\partial^2 s_m(t)}{\partial t^2} = \sum_{n} \big(c[n+1] - 2c[n] + c[n-1]\big)\,\beta^{m-2}(t - n), \qquad (18)$$

$$\frac{\partial^3 s_m(t)}{\partial t^3} = \sum_{n} \big(c[n+1] - 3c[n] + 3c[n-1] - c[n-2]\big)\,\beta^{m-3}(t - n + 1/2). \qquad (19)$$

By applying these signal derivatives in the continuous ESA, we can estimate the instantaneous amplitude a(t) and frequency ω(t) of the continuous signal s_m(t). Finally, we obtain the sampled estimates of the instantaneous amplitude and frequency signals (a[n] = a(nT), ω[n] = Tω(nT)) of the original discrete signal x[n]. The whole approach presented above is called the Spline ESA.
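The following sketch (our own illustration, with simplistic border handling) makes Eqs. (17)–(19) and the subsequent application of the continuous ESA of Eq. (2) concrete; it assumes the coefficient sequence c[n] has already been computed, which is the topic discussed next:

```python
import numpy as np
from math import comb, factorial

def bspline(t, m):
    """Centered B-spline of order m, evaluated via the truncated-power formula."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for k in range(m + 2):
        out += (-1) ** k * comb(m + 1, k) * np.maximum(t + (m + 1) / 2.0 - k, 0.0) ** m
    return out / factorial(m)

def spline_esa(c, m=5):
    """Continuous ESA applied to s_m(t) = sum_n c[n] beta^m(t - n), sampled at t = n.

    The derivatives are the closed forms of Eqs. (17)-(19); c[n] are spline coefficients
    computed beforehand (exact or smoothing splines)."""
    N = len(c)
    cp = np.pad(c, 6, mode='edge')                    # crude border handling
    ks = np.arange(-4, 5)                             # neighbors covering the B-spline supports
    s = np.zeros(N); d1 = np.zeros(N); d2 = np.zeros(N); d3 = np.zeros(N)
    for j in range(N):
        n = j + 6
        for k in n + ks:
            t = float(n - k)
            s[j]  += cp[k] * bspline(t, m)
            d1[j] += (cp[k] - cp[k - 1]) * bspline(t + 0.5, m - 1)                         # Eq. (17)
            d2[j] += (cp[k + 1] - 2 * cp[k] + cp[k - 1]) * bspline(t, m - 2)               # Eq. (18)
            d3[j] += (cp[k + 1] - 3 * cp[k] + 3 * cp[k - 1] - cp[k - 2]) * bspline(t + 0.5, m - 3)  # Eq. (19)
    psi_s = d1 ** 2 - s * d2                          # Psi[s_m(t)]
    psi_ds = d2 ** 2 - d1 * d3                        # Psi[ds_m/dt]
    omega = np.sqrt(np.abs(psi_ds) / (np.abs(psi_s) + 1e-12))       # Eq. (2), rad/sample
    amp = np.abs(psi_s) / (np.sqrt(np.abs(psi_ds)) + 1e-12)
    return omega, amp
```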

An important part of the Spline ESA is the computation of the spline coefficients c[n]. In the sequel, the details of this algorithm are discussed. First, the zeros of the denominator polynomial $P_m^\lambda(z)$ in Eq. (11) are estimated. Due to the symmetric form of this polynomial the zeros come in pairs $(z_i, z_i^{-1})$, i = 1, ..., r. Thus, the transfer function in Eq. (11) can be formulated as

$$H_m(z) = c_0 \prod_{i=1}^{r} \frac{-z_i}{(1 - z_i z^{-1})(1 - z_i z)}, \qquad (20)$$

where c_0 is a normalizing constant depending on the spline order m. From Eqs. (11) and (20) (Unser et al., 1993) the recursive equations are:

$$\begin{aligned} y_i^{+}[n] &= y_{i-1}[n] + z_i\,y_i^{+}[n-1], & n &= 2, \ldots, N, \\ y_i[n] &= z_i\,\big(y_i[n+1] - y_i^{+}[n]\big), & n &= N-1, \ldots, 1, \\ y_i[N] &= a_i\,\big(2y_i^{+}[N] - y_{i-1}[N]\big), \end{aligned} \qquad (21)$$

where $a_i = -z_i/(1 - z_i^2)$, $y_{i-1}[n]$ is the input and $y_i[n]$ is the output of a digital filter with transfer function

$$T_i(z) = \frac{-z_i}{(1 - z_i z^{-1})(1 - z_i z)}$$

and $y_0[n] = x[n]$. If the above step is repeated as many times as the number of pole pairs $(z_i, z_i^{-1})$, the final output sequence $y_r[n]$ equals $c[n]$. The boundary conditions can be set as

$$y_i^{+}[1] = \sum_{k=1}^{k_0} z_i^{|k-1|}\, y_{i-1}[k],$$

where k_0 is an integer that ensures a certain level of precision, defined a priori.

Fig. 1. Mean absolute frequency estimation error of the Spline ESA (averaged over 100 AM–FM modulation depths) as a function of the smoothing parameter λ when SNR = 25 dB, for cubic, fifth-order and seventh-order smoothing splines.

An example is presented for two different values of λ, λ = 0 and λ = 0.5, to clarify the estimation process of c[n] for splines of order m = 5 (r = 3). First, when λ = 0, we interpolate the input signal using exact splines of order m = 5. The denominator of the transfer function H_5(z) is P_5(z) ≡ B^5(z) and the poles are

$$z_1 = -0.04309, \quad z_2 = -0.43057, \quad z_3 = z_1^{-1}, \quad z_4 = z_2^{-1}.$$

Setting λ = 0.5 (when m = 5) in Eq. (12), the poles of H_5(z) = 1/P_5(z) are

$$z_1 = 0.32548, \quad z_2 = 0.32154 - 0.47128i, \quad z_3 = 0.32154 + 0.47128i, \quad z_4 = z_1^{-1}, \quad z_5 = z_2^{-1}, \quad z_6 = z_3^{-1}.$$

In both cases, we find c[n] using the algorithm of Eq. (21), even though the number and values of the poles are quite different. Having computed c[n], the coefficient sequence is convolved with the B-spline β^m to yield the approximating signal s_m. In general, the evaluation of the spline coefficients by the filtering approach presented above is less computationally intensive than the standard numerical analysis approach using sparse Toeplitz matrices.
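The pole values quoted above can be reproduced numerically from Eq. (12); a small sketch (our own check, assuming the integer samples of the quintic B-spline, b^5 = [1, 26, 66, 26, 1]/120, and using the identity z³(−z + 2 − z⁻¹)³ = −(z − 1)⁶):

```python
import numpy as np

def smoothing_spline_poles(lamb):
    """Roots of P_5^lambda(z) = B^5(z) + lambda*(-z + 2 - z^{-1})^3 lying inside the unit circle."""
    # Multiply P_5^lambda(z) by z^3 to obtain an ordinary degree-6 polynomial in z.
    b5 = np.array([0.0, 1.0, 26.0, 66.0, 26.0, 1.0, 0.0]) / 120.0   # z^3 * B^5(z)
    d6 = -np.poly([1.0] * 6)                                        # z^3 * (-z + 2 - z^{-1})^3 = -(z - 1)^6
    roots = np.roots(b5 + lamb * d6)
    inside = roots[(np.abs(roots) < 1.0) & (np.abs(roots) > 1e-8)]  # drop the trivial root at z = 0
    return np.sort_complex(inside)

print(smoothing_spline_poles(0.0))   # exact quintic spline: two real poles (cf. the values above)
print(smoothing_spline_poles(0.5))   # smoothing spline, lambda = 0.5: one real pole and one complex pair
```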

The choice of spline order m = 5 is the result of both experimentation and theoretical analysis. The use of the Spline ESA requires the estimation of the time-derivatives of the signal (or rather of its continuous-time expansion) up to third order. It becomes obvious that the smallest possible B-spline order with continuous derivatives up to that order is m = 5. Besides the theoretical need for continuity of the derivatives (for the use of the continuous-time ESA), several experiments were undertaken for different B-spline orders and the best results were again obtained for the fifth-order splines, as presented in Fig. 1. Finally, only five samples are needed for the estimation of the spline coefficients c[n] when selecting m = 5, and the time-resolution remains the same as that of the DESA. For larger values of m the time-resolution becomes successively poorer.

The choice of the optimal value of λ is not completely arbitrary either. We have attempted to experimentally determine a suitable range of values for the λ-parameter for different SNRs. In Fig. 1, the mean absolute errors of the cubic, fifth- and seventh-order smoothing spline polynomials are presented for various values of the λ-parameter when SNR = 25 dB. For a given AM–FM signal, Eq. (25), the Spline ESA is applied to estimate the demodulation error performance as a function of the λ-parameter and the spline order. In these experiments the corresponding mean absolute error curves show global minima for a particular range of λ-values around 0.75. More specifically, the global minima occur for λ ∈ [0.1, 1] independently of the SNR values. The mean absolute errors of the seventh-order smoothing spline approximation are always smaller than the corresponding ones of the cubic and fifth-order spline polynomials. Noticing that the global minimum values are quite similar for these two approximation curves (the fifth- and seventh-order spline polynomials), we propose the use of the fifth-order smoothing spline approximation polynomial, since it yields almost the optimal minimum error rates and has better time-resolution (a five-sample window instead of seven). The optimal value of λ is not known a priori and can be determined only through experimentation, because the estimation errors implicitly depend on the SNR, the signal, and the application.

3. Gabor ESA

The ESA cannot handle wideband signals, such as speech signals, due to inherent limitations of the algorithm. One efficient way to deal with such limitations is bandpass filtering of the signal. For this process, Gabor filters are chosen for several reasons listed in (Maragos et al., 1993), such as their optimal time–frequency discriminability. In Section 2, smoothing splines are used to approximate the discrete-time signals x[n], which are then differentiated using closed formulas. The continuous-time TEO Ψ, combined with bandpass filtering and sampled at time instances t = nT, is given by

$$\Psi[s(t)] = \dot{s}^2(t) - s(t)\,\ddot{s}(t)\Big|_{t=nT}, \qquad s(t) = x(t) * g(t), \qquad (22)$$

where x(t) is the continuous-time signal and g(t) is the Gabor filter impulse response

$$g(t) = \exp(-b^2 t^2)\cos(\omega_c t). \qquad (23)$$

The constants b and ω_c are the filter parameters.

Since convolution commutes with time-differentiation (Papoulis, 1962),

$$\frac{\mathrm{d}^m}{\mathrm{d}t^m}\big(x(t) * g(t)\big) = x(t) * \frac{\mathrm{d}^m g(t)}{\mathrm{d}t^m}, \qquad m = 1, 2, 3, \ldots \qquad (24)$$

the Gabor time-derivatives are given by the closed formulae

$$\frac{\mathrm{d}g(t)}{\mathrm{d}t} = -\big[2b^2 t\cos(\omega_c t) + \omega_c\sin(\omega_c t)\big]\exp(-b^2 t^2),$$

$$\frac{\mathrm{d}^2 g(t)}{\mathrm{d}t^2} = \big[4b^2\omega_c t\sin(\omega_c t) + (4b^4 t^2 - 2b^2 - \omega_c^2)\cos(\omega_c t)\big]\exp(-b^2 t^2),$$

$$\frac{\mathrm{d}^3 g(t)}{\mathrm{d}t^3} = \big[(12b^4 t - 8b^6 t^3 + 6b^2\omega_c^2 t)\cos(\omega_c t) + (6b^2\omega_c - 12b^4\omega_c t^2 + \omega_c^3)\sin(\omega_c t)\big]\exp(-b^2 t^2).$$

Using the above equations in Eq. (22), the output of Ψ acting on the bandpass filtered signal is given by

$$\Psi[s(t)] = \Psi[x(t) * g(t)] = \left(\frac{\mathrm{d}}{\mathrm{d}t}\big(x(t) * g(t)\big)\right)^{\!2} - \big(x(t) * g(t)\big)\,\frac{\mathrm{d}^2}{\mathrm{d}t^2}\big(x(t) * g(t)\big) = \left(x(t) * \frac{\mathrm{d}g(t)}{\mathrm{d}t}\right)^{\!2} - \big(x(t) * g(t)\big)\left(x(t) * \frac{\mathrm{d}^2 g(t)}{\mathrm{d}t^2}\right).$$

Through this approach, the necessary processes of bandpass filtering and the subsequent differentiations are combined into a single convolution with derivatives of the Gabor filter's impulse response.

Since the output of the continuous-time TEO Ψ will be sampled at time instances t = nT, to implement the Gabor TEO we must essentially convolve the discrete-time speech signal x[n] with the sampled derivatives of the Gabor function

$$g^{(m)}[n] = \frac{\mathrm{d}^m}{\mathrm{d}t^m}\, g(t)\Big|_{t=nT}.$$

The next step is to incorporate the Gabor TEO into the continuous-time ESA formulae. The resulting demodulation algorithm is called the Gabor ESA. This algorithm exhibits some advantages compared to the Spline ESA or to the original discrete demodulation algorithm DESA. First, bandpass filtering of noisy signals increases the SNR of the filtered signals. Second, fewer parameters are required compared to the Spline ESA, where the λ-parameter is important; the only parameters needed are those concerning the filterbank specifications. Finally, the differentiation is applied to the filters and not to the speech signal itself, which leads to smoother results.
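A minimal sketch of the Gabor ESA for a single filter follows (our own illustration; ω_c and b are given in radians per sample, i.e., T = 1, and the kernel truncation length is arbitrary):

```python
import numpy as np

def gabor_derivatives(wc, b, half_len=64):
    """Sampled Gabor impulse response and its first three time-derivatives (closed forms, T = 1)."""
    n = np.arange(-half_len, half_len + 1, dtype=float)
    e, c, s = np.exp(-b**2 * n**2), np.cos(wc * n), np.sin(wc * n)
    g = e * c
    g1 = -(2*b**2*n*c + wc*s) * e
    g2 = (4*b**2*wc*n*s + (4*b**4*n**2 - 2*b**2 - wc**2)*c) * e
    g3 = ((12*b**4*n - 8*b**6*n**3 + 6*b**2*wc**2*n)*c
          + (6*b**2*wc - 12*b**4*wc*n**2 + wc**3)*s) * e
    return g, g1, g2, g3

def gabor_esa(x, wc, b):
    """Gabor ESA: bandpass filtering and differentiation fused into convolutions with g and its
    derivatives, followed by the continuous ESA of Eq. (2)."""
    g, g1, g2, g3 = gabor_derivatives(wc, b)
    s0, s1, s2, s3 = (np.convolve(x, h, mode='same') for h in (g, g1, g2, g3))
    psi_s = s1**2 - s0 * s2                      # Psi[s(t)] at t = n
    psi_ds = s2**2 - s1 * s3                     # Psi[ds/dt] at t = n
    omega = np.sqrt(np.abs(psi_ds) / (np.abs(psi_s) + 1e-12))   # rad/sample
    amp = np.abs(psi_s) / (np.sqrt(np.abs(psi_ds)) + 1e-12)
    return omega, amp
```

Convolving with the analytically differentiated kernels is what avoids differencing the (possibly noisy) signal itself.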

The Gabor ESA is computationally more intensive than the original DESA or the Spline ESA when those are applied to bandpass filtered signals. On the other hand, as shown in Fig. 2, the Gabor ESA provides smoother estimates of the instantaneous frequency compared to the corresponding ones of the DESA, especially on noisy signals.

In Fig. 2, a phoneme /aa/ extracted from the TIMIT database is filtered by a Gabor filter with center frequency f_c = 1285 Hz and bandwidth b = 400 Hz. The filter is manually placed according to the phoneme's energy spectrum so as to filter just one resonance. White Gaussian noise with SNR = 10 dB is added to examine the algorithms' robustness to noise. In general, both the Gabor ESA and the DESA are robust to noise, but the Gabor ESA yields somewhat smoother estimates, as shown in Fig. 2. This is due to the fact that the differentiation is performed on the filters and not on the signal; thus, the differentiation does not introduce additional noise such as that introduced by the DESA's approximation of the signal derivatives.

Fig. 2. Instantaneous frequency signal of a bandpass filtered phoneme /aa/ using the Gabor ESA and the DESA (the dotted line indicates the filter's center frequency).

4. Experimental results

4.1. AM–FM test signals

The first series of experiments is performed with known input AM–FM signals. The test signals have different AM and FM modulation depths and are the same as those used in (Maragos et al., 1993). This family of AM–FM test signals is

$$s[n] = \left[1 + \kappa\cos\!\left(\frac{\pi n}{100}\right)\right]\cos\!\left(\frac{\pi n}{5} + \frac{\ell}{\pi/100}\,\sin\!\left(\frac{\pi n}{100}\right)\right) + W[n], \qquad (25)$$

where n = 1, ..., 400, and (κ, ℓ) = (0.05i, 0.05j), i, j = 1, ..., 10, are, respectively, the AM and FM modulation depths (indexes). Also, zero-mean Gaussian noise W[n] at different SNR levels is added to examine the algorithms' robustness to noise. The estimation errors, for each SNR level, are averaged over the 100 different AM–FM depth combinations. The process followed in this experiment is the same for all three algorithms. First, the input AM–FM signal for a given SNR level and certain AM and FM modulation indexes is generated and then bandpass filtering is

introduced. The Gabor ESA is based on such a filtering process, so, to keep the comparison of the corresponding error rates fair (the input signals should have the same SNR), the filtering process is applied for the DESA and the Spline ESA, too. This process increases the SNR of the filtered signals and consequently improves the algorithms' robustness to noise. The Gabor filter is placed at the mean instantaneous frequency value Ω_c = π/5 and its bandwidth parameter is b = π/12, so that the signal falls within the filter's passband. Then, the demodulated instantaneous signals, a[n] and f[n], are estimated and compared to the reference instantaneous amplitude and frequency signals

$$a_\kappa[n] = 1 + \kappa\cos\!\left(\frac{\pi}{100}\,n\right) \qquad \text{and} \qquad f_\ell[n] = \frac{1}{2\pi}\left(\frac{\pi}{5} + \ell\cos\!\left(\frac{\pi}{100}\,n\right)\right).$$

The gain of the instantaneous amplitude signal a_κ[n] is altered by the filtering process, and a filter compensation algorithm is applied, as proposed in (Bovik et al., 1993).
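For reference, the test family of Eq. (25) and its ground-truth modulating signals can be generated as follows (a self-contained sketch of the experimental setup, not the authors' code; the filter-gain compensation of Bovik et al. (1993) is not shown):

```python
import numpy as np

def test_signal(kappa, ell, N=400, snr_db=None, seed=0):
    """AM-FM test signal of Eq. (25) together with its reference modulating signals."""
    n = np.arange(1, N + 1, dtype=float)
    a_ref = 1.0 + kappa * np.cos(np.pi * n / 100)                         # a_kappa[n]
    f_ref = (np.pi / 5 + ell * np.cos(np.pi * n / 100)) / (2 * np.pi)     # f_ell[n], cycles/sample
    s = a_ref * np.cos(np.pi * n / 5 + (ell / (np.pi / 100)) * np.sin(np.pi * n / 100))
    if snr_db is not None:                                                # add zero-mean Gaussian noise W[n]
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(N)
        w *= np.sqrt(np.mean(s ** 2) / (10 ** (snr_db / 10.0) * np.mean(w ** 2)))
        s = s + w
    return s, a_ref, f_ref

# The 100 AM-FM depth combinations used for the averaged error curves, at one SNR level
signals = [test_signal(0.05 * i, 0.05 * j, snr_db=25) for i in range(1, 11) for j in range(1, 11)]
```

Each generated signal can then be bandpass filtered and demodulated by the DESA, the Spline ESA or the Gabor ESA, and the resulting estimates compared against a_ref and f_ref.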

As shown in Fig. 3, the instantaneous frequency estimation error rates are quite small, and the filtering process adds robustness to the demodulation algorithms, since the error rates appear to be invariant to the SNR levels.

The Spline ESA's performance is strongly dependent on the λ-parameter. However, to the best of our knowledge, there is no efficient way to optimally adjust this parameter so as to minimize the demodulation error. Thus, the use of the Gabor ESA is proposed for the demodulation scheme, since it is simpler and yields smaller errors than both the Spline ESA and the DESA. This choice is also supported by the next series of experiments with speech signals, where the Gabor ESA clearly outperforms the other two algorithms.

4.2. Speech test signals

In this section, the proposed demodulation algorithms are applied to speech signals. Many experiments have been conducted with different phoneme classes and the results, in terms of error rates, proved to be quite similar. Thus, similar results can be reproduced for different kinds of phonemes without affecting the general conclusions, although we herein present figures only for a few phonemes. All phonemes are extracted from the TIMIT speech database. Their sampling frequency is Fs = 16 kHz. The phoneme segmentation is based on the given transcription files of the database.

Fig. 3. Comparison of the different ESAs on instantaneous frequency signals: mean absolute error of the estimated frequency, averaged over the 100 AM–FM modulation depths, versus SNR (Gabor ESA, Spline ESA, smoothed DESA).

Table 1. Mel-spaced Gabor filterbank parameters.

Index | Center frequency (Hz) | Bandwidth (Hz)
1 | 303 | 606
2 | 738 | 870
3 | 1361 | 1246
4 | 2254 | 1786
5 | 3535 | 2561
6 | 5370 | 3670
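A sketch of how such a filterbank can be instantiated from the parameters of Table 1 (our own illustration; the paper does not state how the tabulated bandwidth maps to the Gaussian parameter b, so the mapping b = π·BW/Fs radians per sample used below is an assumption, as is the truncation length):

```python
import numpy as np

FS = 16000.0
# Table 1: (center frequency, bandwidth) in Hz
MEL_GABOR_BANK = [(303, 606), (738, 870), (1361, 1246), (2254, 1786), (3535, 2561), (5370, 3670)]

def gabor_impulse_response(fc_hz, bw_hz, fs=FS, half_len=128):
    """Sampled Gabor impulse response for one filter of the bank.

    Assumption (not specified in the paper): the Gaussian parameter is taken as
    b = pi * bw_hz / fs radians per sample."""
    n = np.arange(-half_len, half_len + 1, dtype=float)
    wc = 2 * np.pi * fc_hz / fs
    b = np.pi * bw_hz / fs
    return np.exp(-(b * n) ** 2) * np.cos(wc * n)

filterbank = [gabor_impulse_response(fc, bw) for fc, bw in MEL_GABOR_BANK]
```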

There is no known method to estimate precisely and uniquely the instantaneous modulating signals of a speech signal, in general. As a consequence, it is impossible to compare the estimation errors in a quantitative way. The algorithms' performance can either be examined in a qualitative way, e.g., in terms of smoothness and the existence of spikes or singularities, or be compared against some reference signals, as in Fig. 7.

At first, a Gabor filter is placed manually, according to the phoneme spectrum, to demodulate only one of its formants. The demodulation estimates are examined in terms of their smoothness (whether or not spikes appear in the instantaneous modulating signals) and their numerical singularities, Fig. 2. Then, in a second experiment, a Gabor filterbank is used to concurrently test the algorithm performance using various filters. The filterbank is mel-spaced, spanning the interval [0, Fs/2] Hz; six filters in total are used, with a frequency overlap of 50% (Table 1). In Fig. 4, the aforementioned filterbank is superimposed over the spectrum of a phoneme /aa/. The filterbank parameters are the same as those used for the ASR experiments. The bandpass speech outputs are demodulated using all three algorithms and their corresponding estimates are presented in Fig. 5. The continuous-time algorithms yield quite similar instantaneous estimates; thus, for the next experiment, where a phoneme /aa/ is demodulated, only one of these algorithms is used. In Fig. 6, the output of the fifth filter for a phoneme /aa/ is demodulated and the instantaneous frequency estimates are presented. The continuous algorithm provides smoother estimates, with smaller fluctuations in the instantaneous frequency signals.

The DESA is always outperformed by both the Gabor ESA and the Spline ESA (when the λ-parameter is chosen properly), independently of the input speech signal. The parameter λ regulates the spline filter bandwidth, Eqs. (11) and (12): smoother approximating curves correspond to filtering out the higher-frequency part of the signal spectrum. This spline filter is a lowpass filter and its passband is regulated by the λ-parameter. Choosing too large a value for λ causes oversmoothing, eliminating a significant part of the input signal's spectrum, i.e., losing some part of the modulation signals.

Fig. 4. Mel-spaced Gabor filterbank superimposed over the speech spectrum of a phoneme /aa/.

Fig. 5. Comparison of the different ESAs (Gabor ESA, Spline ESA, DESA) on the instantaneous frequency estimates of a bandpass filtered speech signal, fifth filter output (the dotted line indicates the filter's center frequency).

Finally, one more set of experiments is conducted to examine the performance of the demodulation algorithms in different parts of the speech spectrum, using a Gabor filterbank with varying parameters. The corresponding instantaneous signals of speech cannot be determined exactly, so the clean bandpass speech signal estimates $a_i^0[n]$ and $f_i^0[n]$ (where i = 1, ..., N and N is the number of filters) are used as reference signals. White Gaussian noise is added to the speech signals and the resulting noisy estimates $a_i^j[n]$ and $f_i^j[n]$ (j is the index of the SNR values; here j = 1, ..., 10) are compared to the reference signals $a_i^0[n]$ and $f_i^0[n]$. The RMS errors are averaged over all the different noise levels. A linearly spaced Gabor filterbank with 25 filters (N = 25) and a filter overlap equal to 50% is used. This experiment provides useful insight into the algorithm performance as a function of the frequency bins (indexes of the filter center frequencies) and different SNR values. The results appear to be quite similar for different phoneme classes; for this reason only one experiment, for the phoneme /aa/, is presented herein. In Fig. 7, the Gabor ESA appears more robust across the different filter indexes and exhibits an almost flat error-rate curve, independent of the filter center frequencies. On the other hand, both the DESA and the Spline ESA behave well in the lower part of the spectrum but become unstable in the higher part, as their estimation errors increase significantly in direct proportion to the filter indexes.

Fig. 6. Comparison of the Gabor ESA and the DESA on the instantaneous frequency estimates of a bandpass filtered phoneme /aa/, fifth filter output (the dotted line indicates the filter's center frequency).

The Gabor ESA is proven to be the best demodulation algorithm among all three candidates. The experimental results show that it is more robust to noise, exhibits a uniform performance when used in different parts of the speech spectrum and, finally, yields the smoothest instantaneous estimates. It also requires fewer parameters than the Spline ESA does. Henceforth, the Gabor ESA will be used throughout, unless otherwise stated.

5. Distributions of speech modulation signals

The AM–FM model of speech is especially important as it provides useful insight into the formant structure of different phoneme classes, like the vowels, at time scales at which the linear source-filter model considers it fixed (Rabiner and Schafer, 1978). In (Potamianos and Maragos, 1996, 1999) some first preliminary results were presented, indicating that the instantaneous modulation signals appear to have different patterns depending on the phoneme class and the speaker articulation. Herein, these dependencies are investigated in a more detailed and thorough manner. Different phoneme classes have quite different formant structure; therefore, the demodulated instantaneous estimates a_i[n] and f_i[n] (where i is the filter index) of these classes should have different distributions, too. As a consequence, nonlinear speech features based on these instantaneous signals exhibit different distributions and, therefore, can be useful for recognition tasks. Herein, the histograms of both the raw instantaneous signals |a_i[n]| and f_i[n] and the novel AM–FM features are presented for different phoneme classes. These histograms reveal some part of the AM–FM formant structure of speech, potentially leading to successful pattern classification and ASR applications.

Fig. 7. Comparison of the instantaneous frequency estimates of the three demodulation algorithms on noisy versions of a phoneme /aa/ against the clean case, across a linearly spaced 25-filter filterbank (RMS estimation error averaged over 10 SNR levels, as a function of the filter center-frequency index).

The phoneme signals are extracted from the TIMIT database according to the given transcriptions. The phoneme steady-state part is extracted (the middle one third) and the proposed features are estimated over the whole span of this segment (i.e., the steady-state part of the phoneme is treated as one large speech frame). In this case, some of the transient phenomena present in speech are not taken into consideration and are smoothed out. For the demodulation process, a mel-spaced Gabor filterbank is used, as in Section 4.2. The histograms of the raw instantaneous signals are estimated directly from their corresponding values without any further postprocessing. The histogram bins span [0, 1] and [0, 8] kHz for the normalized instantaneous amplitude and frequency signals, respectively. For the modulation features, the histograms span their corresponding dynamic range. The distributions for both cases are estimated as histograms of 30 fixed, linearly spaced bins.
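A short sketch of this bookkeeping (our own illustration; `f_inst` stands for the instantaneous frequency samples, in Hz, of one filter over a phoneme segment):

```python
import numpy as np

def steady_state(x):
    """Middle one third of a phoneme segment, treated as a single large frame."""
    n = len(x)
    return x[n // 3: 2 * n // 3]

def frequency_histogram(f_inst, f_max=8000.0, n_bins=30):
    """Histogram over 30 fixed, linearly spaced bins spanning [0, 8] kHz."""
    return np.histogram(steady_state(f_inst), bins=n_bins, range=(0.0, f_max))
```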

5.1. Distributions of instantaneous modulating signals

In this section, the raw instantaneous modulating signal distributions are examined. First, different histograms for the same class of phonemes, depending on the speaker's gender, are estimated to examine whether or not the gender is an important factor affecting the structure of the instantaneous estimates. This experiment is carried out for several different phoneme classes and the estimated histograms for both genders appear quite similar. According to these experimental results, it seems that the speaker gender does not affect the raw instantaneous signal distributions.

Further, the distributions of different phoneme classes are estimated to examine whether or not they exhibit significant differences. The vertical axis of the frequency-related histograms is plotted on a log-scale to examine whether the instantaneous frequency estimates exhibit exponential-type distributions, e.g., Laplacian or Gaussian. The speech signals are randomly chosen from the TIMIT database to obtain unbiased distributions, independent of the speaker's accent and gender. Their histograms are superimposed to investigate the existence of patterns in their distributions.

In Figs. 8 and 9 the distributions of the instantaneous signals |a_i[n]| and f_i[n] (for all six filters) are plotted for five randomly chosen instances of two different phoneme classes (phonemes /aa/ and /sh/). The log-frequency distributions are quite linear, which leads to the conjecture that they follow an exponential distribution. Their peak is close to the center frequency f_c of the filters, as dictated by the AM–FM resonance model and shown in Fig. 2. Also, the |a_i[n]|-distributions appear to have different patterns. For the phoneme /aa/ case, the histograms are concentrated in the lower-index bins and exhibit small variance. On the other hand, for the phoneme /sh/ case, the distributions present larger variance and span a larger number of bins.

Fig. 8. (a) Five histograms of the instantaneous amplitude signals of phoneme /aa/; (b) five histograms of the instantaneous amplitude signals of phoneme /sh/.

Fig. 9. (a) Five histograms of the instantaneous frequency signals of phoneme /aa/; (b) five histograms of the instantaneous frequency signals of phoneme /sh/.

5.2. Distributions of modulation-related speech features

Besides the raw instantaneous signal distributions, the distributions of the proposed nonlinear modulation features are examined. These novel features consist of either the mean values of the instantaneous amplitude (IA-Mean), frequency (IF-Mean) and AM-modulating signals b[n] (BandW-Mean) (4), or the FM-modulation percentages (FMP), estimated over the bandpass speech signals, as above.

The IA-Mean and IF-Mean features are the short-time means of the normalized instantaneous amplitude and frequency signals a_i[n], f_i[n]. The instantaneous frequency signal values have been trimmed to within ±5% of their mean value so that isolated spikes are ignored. The FMP features are defined as FMP_i = B_i/F_i for each filter i, where B_i is the mean bandwidth (a weighted version of the f_i[n]-signal deviation (Potamianos and Maragos, 1996)) and F_i is the (amplitude-)weighted mean frequency value. F_i and B_i are estimated from the information signals a_i[n] and f_i[n] as follows:

$$F_i = \frac{\sum_{k=0}^{T} f_i[k]\,a_i^2[k]}{\sum_{k=0}^{T} a_i^2[k]}, \qquad B_i = \frac{\sum_{k=0}^{T}\big[\dot{a}_i^2[k] + (f_i[k] - F_i)^2\,a_i^2[k]\big]}{\sum_{k=0}^{T} a_i^2[k]}, \qquad (26)$$

where i = 1, ..., 6 is the filter index and T is the time window length. Finally, the AM-modulating signal b(t) is defined as the ratio of the first time-derivative of the instantaneous amplitude over that signal,

$$b_i(t) = -\dot{a}_i(t)/a_i(t), \qquad (27)$$

and b[n] = b(nT) is the sampled version of this signal (4). The proposed features provide information about the speech formant fine-structure, taking advantage of the excellent time-resolution of the ESA. Thus, some of the transitional phenomena and the instantaneous formant variations are mapped onto these nonlinear AM–FM features.
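A sketch of the per-channel feature computation of Eqs. (26) and (27) (our own illustration; `a` and `f` are the instantaneous amplitude and frequency signals of one filter, with f in Hz, and the ±5% trimming is implemented here by clipping, which is one possible reading of the text):

```python
import numpy as np

def modulation_features(a, f, fs=16000.0, trim=0.05):
    """IA-Mean, IF-Mean, BandW-Mean and FMP for one bandpass channel (Eqs. (26)-(27))."""
    a = np.asarray(a, dtype=float)
    f = np.asarray(f, dtype=float)
    f_mean = np.mean(f)
    f = np.clip(f, (1 - trim) * f_mean, (1 + trim) * f_mean)   # trim isolated frequency spikes

    ia_mean = np.mean(a)                                    # IA-Mean (a assumed already normalized)
    if_mean = np.mean(f)                                    # IF-Mean
    a_dot = np.gradient(a) * fs                             # da/dt
    b = -a_dot / (a + 1e-12)                                # AM-modulating signal b(t), Eq. (27)
    bw_mean = np.mean(b)                                    # BandW-Mean

    w = a ** 2
    F = np.sum(f * w) / np.sum(w)                           # weighted mean frequency, Eq. (26)
    B = np.sum(a_dot ** 2 + (f - F) ** 2 * w) / np.sum(w)   # mean bandwidth, Eq. (26)
    fmp = B / F                                             # FM-modulation percentage
    return ia_mean, if_mean, bw_mean, fmp
```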

The features are estimated over 1000 instances of different phonemes and their histograms are obtained. This experiment reveals different patterns for different phoneme classes. In Figs. 10–13 the distributions of the IA-Mean, IF-Mean, BandW-Mean and FMP features for the phoneme classes /aa/, /ae/, /sh/, /f/, /p/ and /b/ are presented. In those figures, the vowel distributions appear different from the corresponding ones of the fricatives and the plosives. Each of the three classes (vowels, fricatives and plosives) shows similar distributions with small intra-class variance but, on the contrary, exhibits significant inter-class variance. This, however, is expected, since the selected sound classes are characterized by different formant (resonance) structure. These differences are mapped onto the modulation signals. The fricative, plosive and vowel signals exhibit very different amplitude and frequency structure, as their instantaneous signals clearly indicate. Moreover, the instantaneous frequency mean values are clustered around the filter center frequencies, Fig. 11.

Figs. 10–13 indicate that there are observable differences among the collective distributions of the features for different phoneme classes. Another conclusion is that different phonemes of the same class have similar distributions over some filters, while they differentiate over the rest of their filter distributions. The classification information of the phonemes must therefore be considered collectively over the whole filterbank of instantaneous signals. Vowels appear to have distributions with mean values well-centered in the middle bins and large variance. Fricative distributions show similar mean values but smaller variance. Finally, the distributions of plosives have small mean values but exhibit large variance and span several bins.

The next step is to estimate the statistical moments of the distributions and to examine whether they differ sufficiently, depending on the phoneme classes analyzed. Towards this goal, the following clustering experiments are performed.

5.3. PCA plots of modulation-related speech features

Fig. 10. Histograms of IA-Mean features for 1000 instances of phonemes /aa/, /ae/, /sh/, /f/, /p/ and /b/.

Using a feature reduction technique, the feature sets are projected onto a 3D space. The feature dimensionality is reduced to one half of the original set using the DCT; thus, only the three most important coefficients of each feature vector are kept. This reduction is used to study the data clustering properties in a computationally more efficient way. After this process, we observe that the smoothed features (only the steady-state part of the phonemes is taken into consideration) form clusters with small overlap, as shown in Fig. 14. The 3D ellipsoids depicted, corresponding to the Gaussian dispersion characteristics of the respective feature sets, are estimated by

$$(\mathbf{x} - \boldsymbol{\mu})^{T}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x} - \boldsymbol{\mu}) = \mathrm{const}, \qquad (28)$$

where μ and Σ are the mean vectors and the covariance matrices of the individual phoneme classes presented, and const is the probability level corresponding to the percentage of samples included in each volume (Duda et al., 2001). Here, the ellipsoids are estimated by setting their principal axes' lengths equal to two times the corresponding standard deviations; i.e., their x-axis length is set equal to 2σ_x. In Fig. 14, the 3D projections of the proposed nonlinear feature distributions, when examined in pairs, exhibit some separability and thus these features

could be useful in speech recognition tasks, following some well-established pattern classification methodology.
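A sketch of the projection and of the dispersion ellipsoids (our own illustration; `X` holds the 6-dimensional feature vectors of one phoneme class, one row per instance):

```python
import numpy as np
from scipy.fft import dct

def project_3d(X):
    """Keep the three lowest DCT coefficients of each 6-dimensional feature vector."""
    return dct(np.asarray(X, dtype=float), axis=1, norm='ortho')[:, :3]

def dispersion_ellipsoid(X3):
    """Axis-aligned ellipsoid of one class cloud: centre = mean, semi-axis lengths = 2 standard deviations."""
    return X3.mean(axis=0), 2.0 * X3.std(axis=0)
```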

5.4. Speech recognition using modulation-based features

The proposed features are applied to clean and noisy recognition tasks, after studying their distributions and classification properties. The goal is to examine whether the correct phoneme recognition rates can be improved when the standard MFCC feature set is augmented by the proposed nonlinear features, and whether they show additional robustness to noise. The speech databases used were obtained from the TIMIT database after adding two different kinds of noise and fixing the SNR level at 10 dB. Specifically, we created these TIMIT + Noise databases by adding white and pink noise only to the test set of the original TIMIT database, leaving the training set unaltered. We have used the HTK Toolkit (Young et al., 2002) as the HMM-based recognizer. The HMMs are 3-state, left–right, with 16 Gaussian mixtures per state. The grammar used in all cases is the all-pair, unweighted grammar. All HMM models are trained on the clean speech training set and tested on the noise-corrupted versions of the test set.

Fig. 11. Histograms of IF-Mean features for 1000 instances of phonemes /aa/, /ae/, /sh/, /f/, /p/ and /b/.

Fig. 12. Histograms of BandW-Mean features for 1000 instances of phonemes /aa/, /ae/, /sh/, /f/, /p/ and /b/.

Fig. 13. Histograms of FMP features for 1000 instances of phonemes /aa/, /ae/, /sh/, /f/, /p/ and /b/.

The input vectors are split into two different data streams, one for the standard MFCC features and the other for the modulation-based features. The data streams are assumed independent (Young et al., 2002). The augmented feature vector consists of 57 coefficients: 39 coefficients for the 'standard' features (normalized energy, MFCCs, first and second time-derivatives) and 18 for the modulation features (6 coefficients plus their first and second time-derivatives). The frame length is set equal to 30 ms with a frame period equal to 10 ms. The weights of the two independent data streams are set to s1 = 1.00 and s2 = 0.50 for the MFCCs and the modulation features, respectively. In Table 2 the recognition results are presented for the TIMIT-based ASR tasks. By combining the MFCCs with the proposed AM–FM features, a performance improvement is obtained for the clean task and especially for the noisy tasks (TIMIT + Noise), where the improvement is larger, leading us to the conclusion that the proposed nonlinear features exhibit additional robustness to noise. The mean relative improvement ranges from 18% to 25% and is achieved when the MFCC feature vectors are augmented with the proposed nonlinear modulation features.
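For completeness, a sketch of how the six modulation coefficients per frame can be augmented with first and second time-derivatives to form the 18-dimensional stream (our own illustration, using the standard regression-based delta formula; the regression window length is an assumption):

```python
import numpy as np

def deltas(feat, theta=2):
    """Regression-based deltas, d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2)."""
    feat = np.asarray(feat, dtype=float)          # shape: (n_frames, n_coeffs)
    padded = np.pad(feat, ((theta, theta), (0, 0)), mode='edge')
    num = sum(k * (padded[theta + k: len(feat) + theta + k] - padded[theta - k: len(feat) + theta - k])
              for k in range(1, theta + 1))
    return num / (2 * sum(k ** 2 for k in range(1, theta + 1)))

def modulation_stream(feat6):
    """6 static modulation coefficients + deltas + delta-deltas -> 18-dimensional stream."""
    d1 = deltas(feat6)
    d2 = deltas(d1)
    return np.hstack([feat6, d1, d2])
```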

6. Conclusions and discussion

In this paper, two new continuous-time methods are proposed for the standard energy-based speech demodulation approach (ESA). It is shown that these methods exhibit better demodulation performance, especially where noisy input signals are concerned. The best of these algorithms (the Gabor ESA) is then employed for modulation-based speech analysis and feature extraction.

Several useful conclusions are deduced from the proposed speech analysis scheme. First, our experiments have shown that the instantaneous amplitude and frequency signals do not seem to depend on the speaker gender, since the corresponding distributions appear quite similar. These signals seem to depend only on the phoneme classes, as presented in Figs. 8 and 9, where the estimated histograms appear different for each of the phoneme classes. Similar distributions have also been estimated for other phoneme classes. The inter-class differences appear to be quite significant, while only minor ones appear for phonemes of the same class.

Fig. 14. Instances of 3D DCT projections for phonemes /aa/, /sh/ and /b/ of the vectors (a) IA-Mean, (b) IF-Mean, (c) BandW-Mean and (d) FMP. (The markers 'o', 'x' and '+' represent the projections of the phoneme classes /aa/, /sh/ and /b/, respectively.)

Table 2. Correct phoneme accuracies (%) and mean relative improvement for modulation features on the TIMIT and TIMIT + Noise tasks (SNR = 10 dB for the noisy tasks).

Features | TIMIT | TIMIT + white | TIMIT + pink | Mean rel. improv. (%)
MFCC | 58.40 | 17.72 | 18.60 | –
MFCC + IA-Mean | 59.61 | 26.03 | 31.05 | 23.22
MFCC + IF-Mean | 59.34 | 25.38 | 30.92 | 22.11
MFCC + BandW-Mean | 59.23 | 24.78 | 28.55 | 18.85
MFCC + FMP | 59.92 | 26.15 | 32.84 | 25.56

The distributions of the instantaneous frequency signals appear to follow an exponential law. Their linear parts (when the y-axes are log-scaled) are clear indications of such a type of distribution.

In addition, nonlinear features based on the AM–FM resonance model of speech are proposed. Similarly, differences appear in their distributions, as shown in Figs. 10–13. The corresponding mean values and variances appear different and highly dependent on the phoneme classes. Similar conclusions are drawn by studying the 3D ellipsoids, which are based on their lower-order statistical moments.

The proposed features are projected onto the 3D feature space and their corresponding equal-likelihood ellipsoids are estimated, taking their statistical properties into consideration. The corresponding ellipsoid volumes appear to have a relatively small overlap when considering different pairs of phoneme classes. We thus provide some experimental evidence that the AM–FM features exhibit different and reasonably separable distributions for different phoneme classes.

These experiments show that the proposed features could be used efficiently for speech classification and recognition tasks as well. They appear to differentiate efficiently across phoneme classes and thus could be useful for adequately discriminating these classes. Motivated by this feature analysis, we have also applied these AM–FM features to clean and noisy speech recognition tasks, with a clear improvement of the correct accuracy results when compared to the corresponding results of the MFCCs.

References

Aldroubi, A., Unser, M., Eden, M., 1992. Cardinal spline filters: stability and convergence to the ideal sinc interpolator. Signal Process. 28, 127–138.
Bovik, A.C., Maragos, P., Quatieri, T.F., 1993. AM–FM energy detection and separation in noise using multiband energy operators. IEEE Trans. Signal Process. 41 (Dec.).
Dimitriadis, D., Maragos, P., 2001. An improved energy demodulation algorithm using splines. In: Proc. ICASSP-01, Salt Lake City, UT, May.
Dimitriadis, D., Maragos, P., 2003. Robust energy demodulation based on continuous models with application to speech recognition. In: Proc. Eurospeech-03, Geneva, September.
Dimitriadis, D., Maragos, P., Potamianos, A., 2002. Modulation features for speech recognition. In: Proc. ICASSP-02, Orlando, FL, May.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification and Scene Analysis. Wiley, New York.
Fertig, L.B., McClellan, J.H., 1996. Instantaneous frequency estimation using linear prediction with comparisons to the DESAs. IEEE Signal Process. Lett. 3, 54–56.
Kaiser, J.F., 1983. Some observations on vocal tract operation from a fluid flow point of view. In: Titze, I.R., Scherer, R.C. (Eds.), Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control. Denver Center for Performing Arts, Denver, CO, pp. 358–386.
Kaiser, J.F., 1990. On a simple algorithm to calculate the 'energy' of a signal. In: Proc. ICASSP-90, Albuquerque, NM, April, pp. 381–384.
Maragos, P., Kaiser, J.F., Quatieri, T.F., 1993. Energy separation in signal modulations with application to speech analysis. IEEE Trans. Signal Process. 41, 3024–3051.
Maragos, P., Potamianos, A., 1995. Higher order differential energy operators. IEEE Signal Process. Lett. 2 (Aug.), 152–154.
Papoulis, A., 1962. The Fourier Integral and its Applications. McGraw-Hill, New York, NY.
Potamianos, A., Maragos, P., 1994. A comparison of the energy operator and the Hilbert transform approach to signal and speech demodulation. Signal Process. 37 (May), 95–120.
Potamianos, A., Maragos, P., 1996. Speech formant frequency and bandwidth tracking using multiband energy demodulation. J. Acoust. Soc. Amer. 99 (6), 3795–3806.
Potamianos, A., Maragos, P., 1999. Speech analysis and synthesis using an AM–FM modulation model. Speech Communication 28, 195–209.
Rabiner, L.R., Schafer, R.W., 1978. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ.
Ramalingam, C.S., 1996. On the equivalence of DESA-1a and Prony's method when the signal is a sinusoid. IEEE Signal Process. Lett. 3 (May), 141–143.
Unser, M., Aldroubi, A., Eden, M., 1991. Fast B-spline transforms for continuous image representation and interpolation. IEEE Trans. Pattern Anal. Machine Intell. 13 (March), 277–285.
Unser, M., Aldroubi, A., Eden, M., 1993. B-spline signal processing: Part I - Theory; Part II - Efficient design and applications. IEEE Trans. Signal Process. 41 (Feb.), 821–848.
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2002. The HTK Book (for HTK Version 3.2). Cambridge Research Lab: Entropics, Cambridge, England.