

Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2007, Article ID 57470, 13 pages
doi:10.1155/2007/57470

Research Article
Interface for Barge-in Free Spoken Dialogue System Based on Sound Field Reproduction and Microphone Array

Shigeki Miyabe,1 Yoichi Hinamoto,2 Hiroshi Saruwatari,1 Kiyohiro Shikano,1 and Yosuke Tatekura3

1Graduate School of Information Science, Nara Institute of Science and Technology, Takayama-Cho 8916-5, Ikoma-Shi, Nara 630-0192, Japan

2Department of Control Engineering, Takuma National College of Technology, Takuma-Cho Koda 551, Mitoyo-Shi, Kagawa 769-1192, Japan

3Faculty of Engineering, Shizuoka University, Johoku 3-5-1, Hamamatsu-Shi, Shizuoka 432-8561, Japan

Received 1 May 2006; Revised 17 October 2006; Accepted 29 October 2006

Recommended by Aki Harma

A barge-in free spoken dialogue interface using sound field control and a microphone array is proposed. In a conventional spoken dialogue system using an acoustic echo canceller, it is indispensable to estimate the room transfer function, especially when the transfer function is changed by various interferences. However, the estimation is difficult when the user and the system speak simultaneously. To resolve this problem, we propose a sound field control technique that prevents the response sound from being observed. Combined with a microphone array, the proposed method achieves high elimination performance with no adaptive process. The efficacy of the proposed interface is ascertained in experiments on response sound elimination and speech recognition.

Copyright © 2007 Shigeki Miyabe et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

For hands-free realization of smooth communication with a spoken dialogue system, it should be guaranteed that a user's command utterance reaches the system clearly. However, a user might interrupt the sound responses from the system to utter a command, or might start speaking before the termination of the sound responses. In such a situation, the sound given from the system to the user is observed as an acoustic echo return at the microphone used for acquisition of the user's speech input, and degrades the speech recognition performance in receiving the user's input command. Such a situation is referred to as barge-in [1]. Hereafter, the sound message output by the system is called the response sound.

As a solution to this problem, an acoustic echo canceller is commonly used [2]. Since the echo return of the response sound is a convolution of the known response sound signal and a transfer function from a loudspeaker to a microphone, the echo return can be eliminated by estimating the transfer function with an adaptive filter. Many types of acoustic echo canceller have been proposed, such as single-channel, stereophonic, beamformer-integrated, and wave-synthesis-integrated types [3–6]. The room transfer function is variable and fluctuates because of changes of room conditions, such as the movement of people in the room and changes in temperature [7]. Therefore, the adaptation must be continued even after its temporary convergence. However, in the state of barge-in (also called the "double-talk problem"), the user's speech is mixed into the observed signal, acts as noise to the estimation, and the estimation fails. In this case, the adaptation process should be stopped by some type of double-talk detection technique [8, 9]. Therefore, when the room transfer function changes in the barge-in state, the elimination performance degrades.

In order to achieve robustness, we propose a new interface for a barge-in free spoken dialogue system that combines multichannel sound field control and a microphone array. First, to prevent the response sound from being observed at the microphone elements, we utilize a sound field reproduction technique via multiple loudspeakers and an inverse filter of the room transfer functions [10]. Sound field reproduction is generally used in a transaural system [11], which presents a three-dimensional sound image to a user at a fixed position.


We apply this technique to response sound elimination by controlling the sound field around the microphone so that it is silent, alongside the transaural reproduction at the user's ears. In the next step, the user's speech is enhanced by microphone array signal processing. By increasing the numbers of loudspeakers and microphone elements, the control of the proposed method becomes robust against the fluctuation of the room transfer functions. With sufficient numbers of loudspeakers and microphones, the proposed method enables us to eliminate the response sound with enough robustness to sustain speech recognition accuracy.

Although the proposed method requires many loudspeakers and its hardware cost is higher than that of the conventional acoustic echo canceller, it uses a fixed filter designed in advance, so real-time adaptation is unnecessary and the computational cost can be reduced. In addition, the proposed method has the advantage that sound virtual reality [12] can be achieved with the transaural reproduction. Thus we can realize duplex telecommunication, for example, a video conference, with telepresence, as if the users shared the same space. The proposed method can also be applied to the spoken-dialogue control of a car navigation system: we can eliminate not only the response sound of the car navigation system but also the music of the car audio. Moreover, in this case the user's position is limited, and modern car interiors have many loudspeakers at fixed positions; therefore, the disadvantage of the proposed method, namely, the fixed positions of the loudspeakers and the user, is not problematic.

In Section 2, we describe the basic concept and problems of the conventional acoustic echo canceller. In Section 3, we describe the principle of the proposed interface. In Section 4, an experimental comparison of response sound elimination performance is carried out. In Section 5, the effectiveness of the proposed method is validated in a speech recognition experiment. In Section 6, we assess the quality of the response sound reproduced by the proposed method.

2. CONVENTIONAL ACOUSTIC ECHO CANCELLER

To eliminate the acoustic echo of the response sound, an acoustic echo canceller is generally used. In this section, we describe the basic principle of the acoustic echo canceller and indicate its weakness against the fluctuation of a room transfer function.

2.1. Principle and problem of conventional acoustic echo canceller

The configuration of an acoustic echo canceller using an adaptive filter is shown in Figure 1. Let the source signal of the response sound be x(ω), where ω denotes the angular frequency. The echo return of the response sound y_mic(ω) can be written as the product of x(ω) and the transfer function g_mic(ω) from the loudspeaker to the microphone,

$$ y_{\mathrm{mic}}(\omega) = g_{\mathrm{mic}}(\omega)\, x(\omega). \tag{1} $$

The acoustic echo canceller calculates an estimate of g_mic(ω), denoted ĝ_mic(ω).

[Figure 1: Configuration of acoustic echo canceller in spoken dialogue system.]

Then the estimated response sound ŷ_mic(ω) can be obtained as

$$ \hat{y}_{\mathrm{mic}}(\omega) = \hat{g}_{\mathrm{mic}}(\omega)\, x(\omega). \tag{2} $$

To estimate g_mic(ω), an adaptive filter is used, and the estimated transfer function ĝ_mic(ω) is updated iteratively to minimize the power of the error signal ε(ω),

$$ \varepsilon(\omega) = y_{\mathrm{mic}}(\omega) - \hat{y}_{\mathrm{mic}}(\omega). \tag{3} $$

Once the room transfer function is estimated, the acoustic echo canceller can eliminate the response sound sufficiently. However, whenever the transfer function changes, it must be reestimated. To follow the fluctuation of the transfer function in real time, online adaptation, for example, least mean squares [13] or recursive least squares, is used. However, these adaptation techniques are weak against noise. In the state of barge-in, since the user's input speech is mixed into the observed signal, an accurate estimation error cannot be obtained and the adaptation diverges. Therefore, the adaptation must be stopped using double-talk detection [8]. However, it is often difficult to decide whether the error is caused by fluctuation or by barge-in.
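To make the update loop described above concrete, here is a minimal time-domain NLMS sketch; the filter length, step size, and regularization constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nlms_echo_canceller(x, d, filt_len=256, mu=0.5, eps=1e-8):
    """Time-domain NLMS estimation of the echo path g_mic.

    x: known response sound driving the loudspeaker.
    d: microphone signal containing the echo return y_mic.
    Returns the error signal (echo-cancelled output), cf. (3).
    """
    g_hat = np.zeros(filt_len)            # estimate of g_mic
    e = np.zeros(len(d))
    for n in range(filt_len, len(d)):
        x_vec = x[n - filt_len:n][::-1]   # most recent samples first
        y_hat = g_hat @ x_vec             # estimated echo, cf. (2)
        e[n] = d[n] - y_hat               # error signal, cf. (3)
        # NLMS update; during barge-in (double-talk) this update
        # must be frozen, otherwise the estimate diverges.
        g_hat += mu * e[n] * x_vec / (x_vec @ x_vec + eps)
    return e
```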

2.2. Response sound elimination error of the acoustic echo canceller when fluctuation of the room transfer function occurs

The room transfer functions are easily changed by variations of the room's state, such as the movement of people. In this section, the response sound elimination error signal ε(ω) is examined in the case where the transfer function is changed. Suppose that the variation Δg_mic(ω) caused by the fluctuation of the room transfer function is added to the original transfer function g_mic(ω). In this case, the response sound is expressed as

$$ y_{\mathrm{mic}}(\omega) = \left[ g_{\mathrm{mic}}(\omega) + \Delta g_{\mathrm{mic}}(\omega) \right] x(\omega). \tag{4} $$

The elimination error signal ε(ω) of the response sound is written using the estimated filter ĝ_mic(ω) as

$$ \varepsilon(\omega) = \Delta g_{\mathrm{mic}}(\omega)\, x(\omega), \tag{5} $$

where we assume that the filter was exactly estimated, so that ĝ_mic(ω) = g_mic(ω) and g_mic(ω)x(ω) − ĝ_mic(ω)x(ω) = 0.


[Figure 2: Configuration of the proposed system.]

Since the acoustic echo canceller has no mechanism for improving the robustness of the elimination (unless it contains suitable post-processing for that case), the fluctuation of the transfer function directly affects its error. Therefore, if the fluctuation occurs while the adaptation is stopped because of barge-in, the elimination performance degrades.

3. PROPOSED METHOD: MULTIPLE-OUTPUT AND MULTIPLE-NO-INPUT METHOD

In this section, we propose a new response sound elimination technique that is robust against the fluctuation of the room transfer functions. The proposed method mainly consists of two steps. First, sound field control with multiple loudspeakers realizes silent zones at the microphone elements while the dialogue system gives the response sound to the user. Next, by delay-and-sum-type signal processing using a microphone array, the residual component of the response sound caused by the fluctuation of the transfer functions is suppressed and the user's utterance is emphasized. The response sound signal is output from the multiple loudspeakers and cancelled at multiple control points. With this mechanism, the response sound is prevented from being input to the speech recognition system; thus we call this technique the multiple-output/multiple-no-input (MOMNI) method. We discuss the relation between the robustness of the control and the number of transfer channels, and show that the robustness against the fluctuation of the transfer functions improves as the numbers of loudspeakers and microphone elements increase. With sufficient numbers of loudspeakers and microphones, the MOMNI method can eliminate the response sound with enough robustness using fixed filter coefficients. Needless to say, this processing requires no double-talk detection.

3.1. Sound field control

Here, we describe the sound field control used to eliminate the acoustic echo of the response sound from the system. The configuration of the proposed system is shown in Figure 2. Let M be the number of secondary sound sources S_1, ..., S_M and let N be the number of control points C_1, ..., C_N. The control points C_1, ..., C_K (K = N − 2) are arranged at the elements of a microphone array for acquisition of the user's speech, and C_{K+1} and C_{K+2} are set at both ears of the user. The signals to be reproduced at the control points C_1, ..., C_{K+2} are described by

$$ \mathbf{x}(\omega) = \left[ x_{\mathrm{mic}\,1}(\omega), \ldots, x_{\mathrm{mic}\,K}(\omega), x_R(\omega), x_L(\omega) \right]^T, \tag{6} $$

and similarly, the signals observed at these control points are represented by

$$ \mathbf{y}(\omega) = \left[ y_{\mathrm{mic}\,1}(\omega), \ldots, y_{\mathrm{mic}\,K}(\omega), y_R(\omega), y_L(\omega) \right]^T. \tag{7} $$

Using, for example, a chirp signal [14], we measure in advance all of the transfer functions from the secondary sound sources S_m to the control points C_n, denoted by g_nm(ω), where n = 1, ..., N and m = 1, ..., M. Here, to design an inverse filter of the transfer functions, which have nonminimum phases, the condition M > N must hold [10]. To use fixed filter coefficients for the inverse filter, the positions of the loudspeakers and the microphones should not be changed after the measurement. In addition, we specify the position at which the user listens to the response sound, for example, by setting a chair at that position. In the measurement phase, since it is a burden for the user to sit at the position wearing microphones at his/her ears, the user can be substituted by a head and torso simulator (HATS) with microphones at its ears to obtain the transfer functions to the user's ears. Let G(ω) be an N × M matrix consisting of g_nm(ω), and let H(ω) be an M × N inverse filter matrix with components h_mk(ω). The inverse filter H(ω) is then designed so that

$$ G(\omega)\, H(\omega) = I_N(\omega), \tag{8} $$

where I_N(ω) denotes an N × N identity matrix. Using the transfer function matrix G(ω) and the inverse filter matrix H(ω), the relation between the observed signals y(ω) and the reproduced signals x(ω) is written as

$$ \mathbf{y}(\omega) = G(\omega)\, H(\omega)\, \mathbf{x}(\omega). \tag{9} $$

In (9), we reproduce the response sounds of the dialogue system at both the user's ears (i.e., [y_R(ω), y_L(ω)] = [x_R(ω), x_L(ω)]) and reproduce silent signals with zero amplitude at the microphone elements (i.e., [y_mic 1(ω), ..., y_mic K(ω)] = [0, ..., 0]) by setting

$$ \mathbf{x}(\omega) = \bigl[\, \underbrace{0, \ldots, 0}_{K},\; x_R(\omega),\; x_L(\omega) \,\bigr]^T. \tag{10} $$

By this sound reproduction, we can realize a sound field in which the response sound is presented to the user while it is cancelled at the microphone elements.

To remove the redundant filtering of the zero signals, we truncate the matrix H(ω) into H′(ω), an M × 2 filter matrix composed of the components h_mk′(ω) (m = 1, ..., M, k′ = K + 1, K + 2) taken from H(ω). By inputting the response sound to this filter matrix, the following equation holds:

$$ \mathbf{y}(\omega) = G(\omega)\, H'(\omega) \left[ x_R(\omega), x_L(\omega) \right]^T = \bigl[\, \underbrace{0, \ldots, 0}_{K},\; x_R(\omega),\; x_L(\omega) \,\bigr]^T. \tag{11} $$


Therefore, the condition equivalent to (10) can be realized with an M × 2 filter matrix.

Since the proposed method uses an inverse filter of the room transfer functions, we can present the response sound to the user in the form of a transaural system, that is, three-dimensional sound image localization [11]. In a transaural system, a clear sound image of a primary sound source is presented to the user by reproducing a binaural signal [15], that is, a convolution of a signal with the transfer functions from the sound source to a person's ears. To provide a practical application of this property, we generate the response sound signals x_R(ω) and x_L(ω) by multiplying a monaural source of the response sound signal r_src(ω) by the room transfer functions g_pri(ω) = [g_priR(ω), g_priL(ω)]^T between a primary sound source and both the user's ears:

$$ \left[ x_R(\omega), x_L(\omega) \right]^T = \mathbf{g}_{\mathrm{pri}}(\omega)\, r_{\mathrm{src}}(\omega). \tag{12} $$

In the transaural reproduction described above, the sound image is degraded when the user is not at the prepared position, because the perceived response sound is no longer an accurate binaural sound. However, the sound quality away from the prepared position is still sufficient for the presentation of the response sound of the spoken dialogue system. We will justify this argument in the experiment in Section 6.
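A minimal numerical sketch of (8)–(12) at a single frequency bin follows; the random matrix G and the primary-path coefficients g_pri stand in for measured room transfer functions and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 8, 4                      # loudspeakers, microphone elements
N = K + 2                        # control points: K mics + 2 ears

# Stand-in for the measured transfer function matrix G(w) at one bin.
G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

H = np.linalg.pinv(G)            # inverse filter satisfying (8)
H_trunc = H[:, K:]               # M x 2 matrix H': the columns fed by x_R, x_L

# Binaural response sound, cf. (12): monaural source times the
# (hypothetical) primary-path transfer functions to the ears.
r_src = 1.0 + 0.5j
g_pri = np.array([0.9 + 0.1j, 0.8 - 0.2j])
x_RL = g_pri * r_src

y = G @ (H_trunc @ x_RL)         # observed control-point signals, cf. (11)
print(np.abs(y[:K]).max())       # ~0: silence at the microphone elements
print(np.allclose(y[K:], x_RL))  # True: ears receive the binaural signal
```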

3.2. Signal processing using microphone array

In this section, we focus our attention on array signal processing. In this study, we adopt delay-and-sum array signal processing [16] to emphasize the user's utterance. The filter of the kth element in the delay-and-sum array is denoted by w_k(ω) for k = 1, ..., K, and can be expressed as

$$ w_k(\omega) = \frac{1}{K}\, e^{-j\omega\tau_k}, \tag{13} $$

where τ_k stands for the arrival time difference of the user's utterance between a suitable standard point and the kth element position. We set τ_k to form a directivity toward the look direction of the user. The signal summed through the array filters is used as the signal for speech recognition; the response sound contained in it is expressed as

$$ y_{\mathrm{mic}}(\omega) = \sum_{k=1}^{K} w_k(\omega)\, y_{\mathrm{mic}\,k}(\omega). \tag{14} $$

When this delay-and-sum-type array is used, the system's response sounds arriving from directions other than the target direction are out of phase at the elements, while the user's speech coming from the target direction is in phase at each element and adds constructively. As a result, only the user's speech is emphasized in y_mic(ω). Thus we give this signal to the speech decoder to recognize the user's speech.

3.3. Inverse system design for sound field reproduction

In a multipoint control system which controls multiple control points with many loudspeakers, large amounts of calculation and memory are needed to design an inverse filter in the time domain. Therefore, we design the inverse filter matrix H(ω) using the least-norm solution (LNS) in the frequency domain [12]. This method has the advantages that the amount of calculation in the frequency domain is small and that the designed system is stable, because the output from each sound source is suppressed to the minimum. Here, we use the Moore-Penrose generalized inverse matrix, which gives the least-norm solution. We obtain a singular value decomposition of G(ω) as

$$ G(\omega) = U(\omega) \left[ \Gamma_N(\omega),\; O_{N,M-N}(\omega) \right] V^H(\omega), \qquad \Gamma_N(\omega) \triangleq \mathrm{diag}\left[ \mu_1(\omega), \mu_2(\omega), \ldots, \mu_N(\omega) \right], \tag{15} $$

where U(ω) and V(ω) are N × N and M × M unitary matrices, respectively; μ_n(ω) for n = 1, 2, ..., N are the singular values of G(ω), arranged in Γ_N(ω) so that μ_n(ω) ≥ μ_{n+1}(ω); O_{N,M−N}(ω) denotes an N × (M − N) null matrix; and (·)^H represents conjugate transposition.

Then the Moore-Penrose generalized inverse matrix G⁺(ω) (= H(ω)) of G(ω) is given by

$$ G^{+}(\omega) = V(\omega) \begin{bmatrix} \Lambda_N(\omega) \\ O_{M-N,N}(\omega) \end{bmatrix} U^H(\omega), \qquad \Lambda_N(\omega) \triangleq \mathrm{diag}\left[ \frac{1}{\mu_1(\omega)}, \frac{1}{\mu_2(\omega)}, \ldots, \frac{1}{\mu_N(\omega)} \right]. \tag{16} $$

We utilize this Moore-Penrose generalized inverse matrix as the inverse filter, H(ω) = G⁺(ω).
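In code, the construction of (15)–(16) is a few lines per frequency bin; the sketch below builds G⁺(ω) explicitly from the SVD and checks it against NumPy's pseudoinverse (the random G is only a stand-in for a measured transfer matrix).

```python
import numpy as np

def lns_inverse_filter(G):
    """Least-norm inverse filter via the Moore-Penrose inverse,
    built explicitly from the SVD as in (15)-(16).

    G: complex (N, M) transfer matrix for one frequency bin, M > N.
    """
    # Economy SVD: U is N x N, s holds the N singular values
    # mu_1 >= ... >= mu_N, Vh holds the first N rows of V^H.
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    # G+ = V diag(1/mu_n) U^H, cf. (16).
    return Vh.conj().T @ np.diag(1.0 / s) @ U.conj().T

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 12)) + 1j * rng.standard_normal((6, 12))
assert np.allclose(lns_inverse_filter(G), np.linalg.pinv(G))
```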

3.4. Response sound elimination error for fluctuation of room transfer functions

In an acoustic echo canceller, because the transfer function must be reestimated whenever it changes, the response sound elimination accuracy degrades during the estimation process. In contrast, the proposed technique is proved to be robust against the fluctuation of room transfer functions even when fixed filter coefficients are used. Here, we suppose that an inverse filter matrix computed before the fluctuation is used to control the sound field.

Supposing that the variation Δg_nm(ω) caused by the fluctuation of the transfer functions is added to the transfer function g_nm(ω), the transfer function matrix after the fluctuation becomes G(ω) + ΔG(ω), where ΔG(ω) is an N × M matrix composed of Δg_nm(ω). Then, using an inverse filter matrix H(ω) designed before the fluctuation of the transfer functions, the signals y(ω) observed at the control points are expressed as

$$ \mathbf{y}(\omega) = \left[ G(\omega) + \Delta G(\omega) \right] H(\omega)\, \mathbf{x}(\omega) = \left[ I_N(\omega) + \Delta G(\omega)\, H(\omega) \right] \mathbf{x}(\omega), \tag{17} $$

and the errors caused by the fluctuation are represented by ΔG(ω)H(ω)x(ω).


In this case, the error Δy_mic(ω) of the response sound elimination output y_mic(ω) in (14) is written as

$$ \Delta y_{\mathrm{mic}}(\omega) = \sum_{k=1}^{K} w_k(\omega) \left\{ \sum_{m=1}^{M} \Delta g_{km}(\omega) \left[ h_{m,K+1}(\omega)\, x_R(\omega) + h_{m,K+2}(\omega)\, x_L(\omega) \right] \right\}. \tag{18} $$

Since this system controls y_mic(ω) to be 0 before the fluctuation of the transfer functions, Δy_mic(ω) after the fluctuation equals the response sound elimination error signal ε(ω):

$$ \varepsilon(\omega) = y_{\mathrm{mic}}(\omega) + \Delta y_{\mathrm{mic}}(\omega) = \Delta y_{\mathrm{mic}}(\omega). \tag{19} $$

Next, let the singular values of G(ω) be μ_j(ω) for j = 1, 2, ..., N and let the eigenvalues of G^H(ω)G(ω) be λ_j(ω) for j = 1, 2, ..., N. Then the norm ‖G(ω)‖ is given by

$$ \| G(\omega) \| = \sqrt{ \max_j \lambda_j(\omega) } = \sqrt{ \max_j \{ \mu_j(\omega) \}^2 } = \left| \mu_1(\omega) \right|, \tag{20} $$

where max_j(a_j) denotes the largest element of a_j over all j, and the relation λ_j(ω) = {μ_j(ω)}² is used.

Alternatively, since the singular values of G⁺(ω) are given by 1/μ_j(ω), the norm ‖G⁺(ω)‖ is expressed as

$$ \| G^{+}(\omega) \| = \sqrt{ \max_j \frac{1}{\lambda_j(\omega)} } = \sqrt{ \max_j \frac{1}{\{ \mu_j(\omega) \}^2} } = \frac{1}{\left| \mu_N(\omega) \right|}. \tag{21} $$

Since the secondary sound sources are arranged at almost equal distances from each control point, the norm of G(ω) is directly proportional to the number of secondary sound sources M, that is, ‖G(ω)‖ ∝ M. Moreover, the condition number of G(ω), expressed by the ratio between the maximum and minimum singular values,

$$ \mathrm{cond}(G) = \frac{\mu_1}{\mu_N}, \tag{22} $$

is known to be close to unity when the number of secondary sound sources is much larger than that of the control points (this is experimentally proven in Section 4.3). Therefore, the following relation can be derived from (20) and (21):

$$ \| H(\omega) \| = \| G^{+}(\omega) \| = \frac{1}{\left| \mu_N(\omega) \right|} \approx \frac{1}{\left| \mu_1(\omega) \right|} = \frac{1}{\| G(\omega) \|} \propto \frac{1}{M}. \tag{23} $$

Substituting (13) into (18), we obtain

$$ \Delta y_{\mathrm{mic}}(\omega) = \| H(\omega) \| \frac{1}{K} \left\{ \sum_{k=1}^{K} \sum_{m=1}^{M} \Delta g_{km}(\omega) \left[ \bar{h}_{m,K+1}(\omega)\, x_R(\omega) + \bar{h}_{m,K+2}(\omega)\, x_L(\omega) \right] e^{-j\omega\tau_k} \right\}, \tag{24} $$

where h̄_mn(ω) = h_mn(ω)/‖H(ω)‖. We assume that Δg_nm(ω) for n = 1, 2, ..., N and m = 1, 2, ..., M are mutually independent and follow the same Gaussian distribution with zero mean and variance σ². Furthermore, since h̄_mn(ω) is normalized by ‖H(ω)‖ and is independent of M, the standard deviation of the term in braces in (24) can be represented by η√(MK)σ, where η is a suitable constant. Therefore, the elimination error ε(ω) of the response sound is obtained from (23) as

$$ \varepsilon(\omega) = \Delta y_{\mathrm{mic}}(\omega) \propto \frac{1}{M} \cdot \frac{1}{K} \cdot \sqrt{MK} = \frac{1}{\sqrt{MK}}. \tag{25} $$

In other words, (25) shows that the elimination error of the response sound under the fluctuation of the transfer functions is inversely proportional to √(MK). Thus, if the number of transfer channels from loudspeakers to microphones increases, the response sound elimination of the proposed method becomes more robust against the fluctuation of the transfer functions.

We remark that in a real environment it is difficult to prove whether or not the variations Δg_nm(ω) caused by the fluctuation of the room transfer functions are mutually independent for every channel from a loudspeaker to a microphone. However, in the next section, simulations using impulse responses measured in a real environment show that the error estimation in (25) is valid.
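The 1/√(MK) law in (25) can also be checked numerically. The sketch below perturbs a random stand-in transfer matrix with independent Gaussian errors, as assumed above, and prints the residual scaled by √(MK), which should stay roughly constant; this is our illustrative check, not the paper's simulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_residual(M, K, sigma=1e-2, trials=500):
    """Average magnitude of the delay-and-sum output error under
    random perturbations dG of the transfer matrix, cf. (17)-(19)."""
    N = K + 2
    x_RL = np.array([1.0, 1.0])              # response sound at the ears
    acc = 0.0
    for _ in range(trials):
        G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
        H_trunc = np.linalg.pinv(G)[:, K:]   # M x 2 truncated inverse filter
        dG = sigma * (rng.standard_normal((N, M))
                      + 1j * rng.standard_normal((N, M)))
        y = (G + dG) @ (H_trunc @ x_RL)      # observed signals, cf. (17)
        acc += abs(np.mean(y[:K]))           # delay-and-sum over the K mics
    return acc / trials

for M, K in [(8, 2), (16, 2), (16, 4), (32, 4)]:
    print(M, K, mean_residual(M, K) * np.sqrt(M * K))  # roughly constant
```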

4. EXPERIMENTAL COMPARISON OF RESPONSE SOUND ELIMINATION PERFORMANCE

To assess the robustness of the proposed method against the fluctuation of the room transfer functions, the response sound elimination performance of the proposed method is evaluated by simulations and compared with that of the conventional acoustic echo canceller.

4.1. Experimental conditions

The simulations are carried out using impulse responses measured in a real acoustic environment. Figure 3 shows the arrangement of the apparatuses. To imitate the user at the center of the room, we set a HATS. To cause fluctuations of the room transfer functions intentionally, we placed a life-size mannequin as an interference near the user, under the assumption that a person approaches the user. We measured a total of 13 patterns of room impulse responses: 12 patterns for states in which the interference is placed, and the remaining pattern for the state in which no interference exists.


The transfer functions before the fluctuation are used to design the filters for both the acoustic echo canceller and the proposed method, and we evaluated the performance under the static transfer functions after the fluctuations. To avoid changing the conditions for observing the user's utterance, we did not change the user's position in these fluctuations. A loudspeaker set in front of the user is used both by the acoustic echo canceller and as the primary sound source of the proposed method. The reverberation time is about 160 milliseconds. The room impulse responses are sampled at a frequency of 48 kHz and quantized to 16 bits. We used a circular array with 12 elements, and equally spaced elements were selected for use.

4.1.1. Conventional acoustic echo canceller

Our interest is focused on the robustness against the fluctuation of room transfer functions. Therefore, the experiment is carried out under the assumption that the filter coefficients of the acoustic echo canceller are first estimated precisely, and then the fluctuation occurs while the estimation is stopped because of barge-in. To imitate this situation, we used the transfer function before the fluctuation as the estimated transfer function of the acoustic echo canceller, and fixed its filter coefficients. The microphone element closest to the user is chosen as the microphone for acquisition of the user's speech.

4.1.2. Proposed method

The inverse filter in the proposed method is calculated using only the impulse responses in the case where there is no fluctuation. The design conditions of the inverse filters are as follows: the number of secondary sound sources M = 4 to 36, the number of control points N = 3 to 8, the filter length 16384, and the passband 150 to 4000 Hz.

4.2. Evaluation score

The response sound elimination performance is evaluated using the echo return loss enhancement (ERLE),

$$ \mathrm{ERLE}\;(\mathrm{dB}) = 10 \log_{10} \frac{ \sum_{\omega} \{ y_{\mathrm{mic\,ref}}(\omega) \}^2 }{ \sum_{\omega} \{ \varepsilon(\omega) \}^2 }, \tag{26} $$

where y_mic ref(ω) is the response sound reproduced at a standard microphone and ε(ω) is the response sound elimination error signal derived from (5) or (19).
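For reference, (26) amounts to a one-line power ratio; a minimal sketch:

```python
import numpy as np

def erle_db(y_ref, err):
    """ERLE in dB, cf. (26): power of the unprocessed response sound
    at the reference microphone over the power of the residual error."""
    return 10.0 * np.log10(np.sum(np.abs(y_ref) ** 2)
                           / np.sum(np.abs(err) ** 2))
```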

4.3. Experimental results and discussion

Figures 4–6 show the frequency characteristics of the response sound elimination error signal for the conventional acoustic echo canceller and the proposed method after the room transfer function has changed. In these evaluations, we used a female utterance selected from the ASJ database [17] as the response sound. These figures show that either technique can suppress the response sound in the passband independently of frequency.

[Figure 3: Layout of acoustic experiment room.]

The ERLE for each position of the interference, for typical numbers of loudspeakers and 2 microphone elements, is shown in Figure 7, and that for 24 loudspeakers and typical numbers of microphone elements is shown in Figure 8. In these evaluations, to remove the effect of the bias of the frequency characteristics, we used white noise as the response sound. It can be seen that increasing both the number of microphone elements and the number of loudspeakers improves the performance of the proposed method and makes the control robust against the fluctuation of the room transfer functions. Regardless of the position of the interference, the performance of the proposed method is superior to that of the conventional echo canceller. Hereafter, we discuss only the ERLE averaged over the 12 types of fluctuations.

In Figure 9, the ERLE is shown as a function of the number of transfer channels (= MK) from the loudspeakers to the microphone elements. The theoretical curve in the figure is drawn by plotting the ERLE derived from (25), which is given by

$$ \mathrm{ERLE}_{\mathrm{theory}}\;(\mathrm{dB}) = 10 \log_{10} \frac{ \sum_{\omega} \{ y_{\mathrm{mic\,ref}}(\omega) \}^2 }{ \sum_{\omega} \{ \varepsilon(\omega) \}^2 } = 10 \log_{10} \frac{ \sum_{\omega} \{ y_{\mathrm{mic}}(\omega) \}^2 }{ \sum_{\omega} \{ \Delta y_{\mathrm{mic}}(\omega) \}^2 } \approx \xi + 10 \log_{10} \frac{1}{1/(MK)} = \xi + 10 \log_{10}(MK), \tag{27} $$

where ξ is a suitable constant.

From this figure, we can see that the response sound elimination performance improves as the number of transfer channels increases. It also turns out that a deviation between the experimental and theoretical values arises when the number of microphone elements increases. The reasons are as follows.


[Figure 4: Example of frequency characteristics of the observed signal (with and without processing) obtained by the acoustic echo canceller. The signal is observed at the microphone near the user; the position of interference is number 1 in Figure 3.]

[Figure 5: Example of frequency characteristics of the observed signal (with and without processing) obtained by the proposed method with 36 loudspeakers and 1 microphone element. The signal is observed at the microphone near the user; the position of interference is number 1 in Figure 3.]

(A) The stability margin of the inverse filters becomes small when the number of control points is close to that of the secondary sound sources.

(B) When there exist too many transfer channels, the independence of each channel is no longer valid; consequently, the performance saturates.

To prove claim (A), we show the condition number of the transfer functions in Figure 10.

[Figure 6: Example of frequency characteristics of the observed signal (with and without processing) obtained by the proposed method with 36 loudspeakers and 6 microphone elements. The signal is observed at the microphone near the user; the position of interference is number 1 in Figure 3.]

[Figure 7: ERLE for each position of interference with 2 microphone elements, for the conventional acoustic echo canceller and the proposed method with 12, 24, and 36 loudspeakers. The horizontal axis represents the position of interference in Figure 3.]

The condition number, expressed as cond(G(ω)) in (22), represents the instability of the inverse filters. This figure shows that the condition number becomes close to 1 when the number of loudspeakers is much larger than the number of microphone elements (equal to the number of control points minus two), as argued in Section 3.4. However, when the number of microphone elements increases, the condition number increases, and this tendency becomes remarkable when the number of secondary sound sources is small. This causes an appreciable degradation in ERLE.


[Figure 8: ERLE for each position of interference with 24 loudspeakers, for the conventional acoustic echo canceller and the proposed method with 1, 2, 4, and 6 microphone elements. The horizontal axis represents the position of interference in Figure 3.]

[Figure 9: ERLE for different numbers of room transfer channels from loudspeakers to microphone elements: proposed method with 1, 2, 4, and 6 microphones, the theoretical curve, and the conventional acoustic echo canceller.]

Comparing the conventional acoustic echo canceller with the proposed method in Figure 9, we see that the proposed method becomes more robust against the fluctuation of the transfer functions as the number of transfer channels increases.

5. SPEECH RECOGNITION EXPERIMENT

An experiment involving large vocabulary speech recognition is carried out to investigate the efficacy of the proposed method, compared with that of the conventional acoustic echo canceller.

5.1. Experimental conditions

In the recognition experiment, we use the speech signal obtained by imposing the response sound elimination error signal ε(ω) on the user's input speech.

[Figure 10: Condition number averaged over the passband, for 1, 2, 4, and 6 microphone elements, as a function of the number of loudspeakers.]

A large vocabulary recognition engine, Julius ver. 3.4.2 [18], is used as the speech decoder. We used two kinds of speaker-independent phonetic tied-mixture models [19] as phoneme models. One is an ordinary clean model. The other is generated by a known-noise imposition technique [20] (see the appendix): we imposed a known noise of 30 dB on the observed signals to mask the residual response sound, and, to match the phoneme features, we imposed noise of 25 dB on the speech in the training data. A language model is made from newspaper dictation with a vocabulary of 20,000 words [21]. As the user's speech, 200 sentences uttered by 23 males and 23 females from the JNAS database [22] are used. As the response sound of the dialogue system, a sentence of female speech from the ASJ database is used. Experimental conditions, such as the interference arrangements that cause changes of the transfer functions, are the same as in the previous section.

5.2. Evaluation score

In order to evaluate the speech recognition performance, we adopt the word accuracy as an evaluation score, defined as

$$ \text{word accuracy}\;(\%) = \frac{W - S - D - I}{W} \times 100, \tag{28} $$

where W is the total number of words in the test speech, S is the number of substitution errors, D is the number of deletion errors, and I is the number of insertion errors. The resultant recognition score is the average over the 200 sentences.
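As a quick illustration of (28) with made-up counts:

```python
def word_accuracy(W, S, D, I):
    """Word accuracy in percent, cf. (28). W: reference word count;
    S, D, I: substitution, deletion, and insertion errors."""
    return 100.0 * (W - S - D - I) / W

# Hypothetical counts, for illustration only:
print(word_accuracy(20000, 3000, 1000, 500))  # 77.5
```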

5.3. Experimental results and discussions

The speech recognition results obtained by the proposed method are shown in Figure 11 for the clean model and in Figure 12 for the known-noise imposition.


[Figure 11: Word accuracy with the clean model, for the conventional acoustic echo canceller and the proposed method with 12, 24, and 36 loudspeakers, versus the number of microphone elements.]

[Figure 12: Word accuracy when the known-noise imposition technique is applied, for the conventional acoustic echo canceller and the proposed method with 12, 24, and 36 loudspeakers, versus the number of microphone elements.]

The results of the recognition experiment show that the word accuracy is −8.0% and −13.2% without any processing, and 47.1% and 64.6% when using the conventional acoustic echo canceller, for the clean model and the known-noise imposition, respectively. By masking the residual component of the response sound, all the results are improved compared with the results using the clean model. All the scores of the proposed method in the figures are superior to those of the conventional acoustic echo canceller. Note that neither system is adapted; that is, the optimal weights computed before the acoustic change are used. The results show that when the transfer functions are changed, the degradation of speech recognition accuracy can be prevented by increasing the number of transfer channels. From these results, the effectiveness of the proposed response sound elimination technique is ascertained.

[Figure 13: Layout of the experimental room in the sound quality assessment.]

6. SOUND QUALITY ASSESSMENT AT VARIOUS USER POSITIONS

The sound quality of the proposed method is guaranteed and a clear sound image is presented only when the user's ears are at the control points where the response sound is reproduced. However, even when the user moves away from the controlled area, the quality of the response sound remains sufficient for a spoken dialogue system. To prove this argument, we assess the quality of the response sound perceived by the user at various positions. The quality is assessed from two aspects: objective and subjective evaluations.

6.1. Objective evaluation

The objective evaluation is carried out via a simulation using impulse responses measured in a real acoustic environment. Figure 13 shows the arrangement of the apparatuses. The room is the same one used in the experiments of Sections 4 and 5. We measured four patterns of impulse responses, changing the position of the HATS from position 0 to position 3. The control points of the MOMNI method are two microphone elements in the microphone array and the ears of the HATS at position 0. The primary sound source of the response sound is the loudspeaker of the acoustic echo canceller.

As an evaluation score, we introduce the cepstral distance (CD) [23], which is often used in various speech processing tasks:

$$ \mathrm{CD}\;(\mathrm{dB}) = \frac{1}{F} \sum_{t=1}^{F} \frac{20}{\log 10} \sqrt{ \sum_{l=1}^{20} 2 \left\{ C_{\mathrm{obs}}(l, t) - C_{\mathrm{ref}}(l, t) \right\}^2 }, \tag{29} $$


[Figure 14: Cepstral distance at various user positions, for the acoustic echo canceller and the proposed method with 12 loudspeakers and 1 or 2 microphones.]

[Figure 15: Cepstral distance at various user positions, for the acoustic echo canceller and the proposed method with 24 loudspeakers and 1 or 2 microphones.]

where F denotes the number of speech frames, C_obs(l, t) is the lth FFT-based cepstrum coefficient of the observed signal at the tth frame, and C_ref(l, t) is the reference cepstrum for evaluating the distance. The number of liftering points is 20, and a lower CD value indicates better sound quality. We obtain C_ref(l, t) from the source signal of the response sound, and we average the CDs at both ears. Note that to express the CD in dB, the factor 20/log 10 multiplies the Euclidean distance between the cepstrum coefficients, which are obtained from the natural logarithm of the spectra. In addition, because of the symmetry of the cepstrum coefficients, the liftered distance is obtained by doubling the squared coefficient differences from l = 1 to l = 20.
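A sketch of (29) follows; the frame extraction and the floor constant inside the log are our assumptions, while the liftering to 20 coefficients, the factor 2 for the symmetric half, and the 20/ln 10 dB conversion follow the description above.

```python
import numpy as np

def fft_cepstrum(frame, n_lifter=20):
    """FFT-based cepstrum (natural log), keeping bins 1..n_lifter."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)  # avoid log(0)
    return np.fft.irfft(log_mag)[1:n_lifter + 1]

def cepstral_distance_db(obs_frames, ref_frames):
    """Mean cepstral distance over frames, cf. (29)."""
    dists = []
    for obs, ref in zip(obs_frames, ref_frames):
        diff = fft_cepstrum(obs) - fft_cepstrum(ref)
        # Factor 2: symmetric half of the cepstrum; 20/ln(10): dB.
        dists.append((20.0 / np.log(10.0)) * np.sqrt(np.sum(2.0 * diff ** 2)))
    return float(np.mean(dists))
```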

Figures 14 and 15 show the CDs of the proposed method compared with those of the acoustic echo canceller.

[Figure 16: Mean opinion score (with equivalent Q value) for the positions of the subjects, for the acoustic echo canceller and the proposed method. The bars show the means and the error bars show the 95% confidence intervals.]

Since the proposed method reproduces the output sound of the acoustic echo canceller at position 0, its CD there is similar to that of the acoustic echo canceller. When the HATS is not at position 0, the CDs increase; however, the difference is only within 1 dB. Thus, the sound quality degradation of the proposed method is not significant.

6.2. Subjective evaluation

To ascertain that the distortion caused by the proposed method does not cause discomfort, we conducted a subjective evaluation of the sound quality reproduced by the proposed method in a real environment. We changed the positions of the subjects and had them give mean opinion scores (MOS). The opinion score was set on a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad).

The room used in this experiment is the same one where the impulse responses were measured in the other experiments. We directed the positions of the subjects by setting chairs at positions 0, 1, and 2 in Figure 13. The filter of the MOMNI method was designed using impulse responses measured with the HATS set at position 0. The primary sound source of the response sound is the loudspeaker of the acoustic echo canceller. The number of secondary sound sources is 24, and two microphone elements are used for the silent reproduction.

We compared the MOSs of the proposed method and the acoustic echo canceller. In addition, to give the MOSs an objective meaning, we evaluated the opinion equivalent Q value [24]. To obtain it, we made three kinds of response sounds by imposing white noise at segmental SNRs of 25 dB, 35 dB, and 45 dB; these noise-added response sounds were output from the loudspeaker of the acoustic echo canceller. Therefore, five forms of reproduction were evaluated: the MOMNI method, the acoustic echo canceller, and the three noise-added response sounds. For each of these forms, we prepared 15 sentences of speech uttered by four males and three females. Then, at each of the three positions, we evaluated the MOSs in random order.


Figure 16 shows the MOSs for each of the subjects' positions. The scores of the acoustic echo canceller were rated at more than four at all positions. For the MOMNI method, the score at position 0 is similar to that of the acoustic echo canceller. Even at position 0, the binaural response sound is degraded by the differences in head shape and sitting height between the subjects and the HATS; however, this degradation does not influence the MOSs. Although the MOSs decrease as the subjects move away from position 0, the degradation of the score is within one point. In addition, even for the worst score, at position 2, the opinion equivalent Q value is over 45 dB. From these findings, it is ascertained that the proposed method can present the response sound with sufficient quality even when the user is away from the prepared position.

7. CONCLUSION

We have proposed a barge-in free spoken dialogue interface combining sound field reproduction and a microphone array. It was shown that the response sound elimination performance under the fluctuation of room transfer functions depends on the number of transfer channels. By using adequate numbers of loudspeakers and microphone elements, the performance of the proposed method exceeds that of the conventional acoustic echo canceller. In the experiment comparing the proposed method with the acoustic echo canceller under the condition that the filter coefficients are fixed, the efficacy of the proposed method was ascertained. Although the proposed method requires multichannel filtering and multiple loudspeakers, it can maintain high speech recognition performance in the barge-in situation without adaptation.

The remaining problem is that there is still room for improvement in beamforming, because the delay-and-sum beamformer is weak against reverberation. We are now addressing this problem via an unsupervised adaptive array [25].

APPENDIX

A. KNOWN-NOISE IMPOSITION

Even with the use of an effective noise suppression method, it is difficult to eliminate interfering noises completely. The proposed method is no exception to this issue: a residual component of the response sound remains in the processed signal because of the fluctuation of the transfer functions. To obtain optimum recognition performance, we generally need to develop matched phoneme models for the speech decoder. However, without a priori information on the signal-to-noise ratio, the accurate construction of such matched models is very difficult. To handle many different types of noise, known-noise imposition has been proposed [20]. This technique masks the residual unexpected component with a known noise. To prevent this noise from causing a mismatch in the phoneme features between the processed signal and the phoneme model, we generate in advance a phoneme model made of speech on which the same noise is imposed.

[Figure 17: Configuration of known-noise imposition: the response sound is removed by the signal processing, the known noise is added to the output, and the decoder uses a known-noise matched phonetic model.]

We apply this technique to the masking of the residual response sound as follows (a sketch of the imposition step is given after the list).

(1) We impose known noise on a speech database and train the corresponding matched model using an EM algorithm in advance.

(2) We impose known noise on the noise-reduced output from the delay-and-sum array in the proposed system (as sketched below).

(3) We perform speech recognition on the system output using the known-noise matched model.

Figure 17 shows a configuration of this process.
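A minimal sketch of the imposition step (2) is given below; setting the noise level from average signal powers is our simplification of the segmental-SNR recipe described in the experiments.

```python
import numpy as np

def impose_known_noise(speech, noise, snr_db):
    """Impose a known masking noise at a target SNR in dB.

    speech: processed array output; noise: the known noise, at
    least as long as speech. Illustrative sketch only.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise[:len(speech)]
```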

ACKNOWLEDGMENTS

We would like to thank Mr. Koichi Mino of NAIST and Dr. Shoji Makino of NTT CS Laboratories for their valuable discussions. This work was partly supported by the CREST Program "Advanced Media Technology for Everyday Living" of JST, Japan.

REFERENCES

[1] B. H. Juang and F. K. Soong, "Hands-free telecommunications," in Proceedings of International Workshop on Hands-Free Speech Communication, pp. 5–8, Kyoto, Japan, April 2001.

[2] E. Hansler, "Acoustic echo and noise control: where do we come from—where do we go?" in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC '01), pp. 1–4, Darmstadt, Germany, September 2001.

[3] S. Makino and S. Shimauchi, "Stereophonic acoustic echo cancellation—an overview and recent solutions," in Proceedings of 6th IEEE International Workshop on Acoustic Echo and Noise Control (IWAENC '99), pp. 12–19, Pocono Manor, Pa, USA, September 1999.

[4] Y.-W. Jung, J.-H. Lee, Y.-C. Park, and D.-H. Youn, "A new adaptive algorithm for stereophonic acoustic echo canceller," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 801–804, Istanbul, Turkey, June 2000.

[5] W. Herbordt and W. Kellermann, "Acoustic echo cancellation embedded into the generalized sidelobe canceller," in Proceedings of European Signal Processing Conference (EUSIPCO '00), vol. 3, pp. 1843–1846, Tampere, Finland, September 2000.

[6] H. Buchner, S. Spors, and W. Kellermann, "Wave-domain adaptive filtering: acoustic echo cancellation for full-duplex systems based on wave-field synthesis," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 4, pp. 117–120, Montreal, Que, Canada, May 2004.

[7] Y. Tatekura, H. Saruwatari, and K. Shikano, "Sound reproduction system including adaptive compensation of temperature fluctuation effect for broad-band sound control," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E85-A, no. 8, pp. 1851–1860, 2002.

[8] J. Benesty, D. R. Morgan, and J. H. Cho, "A family of doubletalk detectors based on cross-correlation," in Proceedings of 6th IEEE International Workshop on Acoustic Echo and Noise Control (IWAENC '99), pp. 108–111, Pocono Manor, Pa, USA, September 1999.

[9] K. Ochiai, T. Araseki, and T. Ogihara, "Echo canceler with two echo path models," IEEE Transactions on Communications, vol. 25, no. 6, pp. 589–595, 1977.

[10] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 2, pp. 145–152, 1988.

[11] J. Bauck and D. H. Cooper, "Generalized transaural stereo and applications," Journal of the Audio Engineering Society, vol. 44, no. 9, pp. 683–705, 1996.

[12] Y. Tatekura, H. Saruwatari, and K. Shikano, "An iterative inverse filter design method for the multichannel sound field reproduction system," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E84-A, no. 4, pp. 991–998, 2001.

[13] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 4th edition, 1991.

[14] Y. Suzuki, F. Asano, H.-Y. Kim, and T. Sone, "An optimum computer-generated pulse signal suitable for the measurement of very long impulse responses," Journal of the Acoustical Society of America, vol. 97, no. 2, pp. 1119–1123, 1995.

[15] J. Blauert, Spatial Hearing, MIT Press, Cambridge, Mass, USA, revised edition, 1997.

[16] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," Journal of the Acoustical Society of America, vol. 78, no. 5, pp. 1508–1518, 1985.

[17] S. Hayamizu, S. Itahashi, T. Kobayashi, and T. Takezawa, "Design and creation of speech and text corpora of dialogue," IEICE Transactions on Information and Systems, vol. E76-D, no. 1, pp. 17–22, 1993.

[18] A. Lee, T. Kawahara, and K. Shikano, "Julius—an open source real-time large vocabulary recognition engine," in Proceedings of 7th European Conference on Speech Communication and Technology (EUROSPEECH '01), pp. 1691–1694, Aalborg, Denmark, September 2001.

[19] A. Lee, T. Kawahara, K. Takeda, and K. Shikano, "A new phonetic tied-mixture model for efficient decoding," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), vol. 3, pp. 1269–1272, Istanbul, Turkey, June 2000.

[20] S. Yamade, A. Lee, H. Saruwatari, and K. Shikano, "Unsupervised speaker adaptation based on HMM sufficient statistics in various noisy environments," in Proceedings of 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), vol. 2, pp. 1493–1496, Geneva, Switzerland, September 2003.

[21] K. Itou, M. Yamamoto, K. Takeda, et al., "The design of the newspaper-based Japanese large vocabulary continuous speech recognition corpus," in Proceedings of 5th International Conference on Spoken Language Processing (ICSLP '98), vol. 7, pp. 3261–3264, Sydney, Australia, November-December 1998.

[22] K. Itou, M. Yamamoto, K. Takeda, et al., "JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research," Journal of the Acoustical Society of Japan (E), vol. 20, no. 3, pp. 199–206, 1999.

[23] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[24] J. R. Deller Jr., J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals, Macmillan, New York, NY, USA, 1993.

[25] S. Miyabe, T. Takatani, Y. Mori, H. Saruwatari, K. Shikano, and Y. Tatekura, "Double-talk free spoken dialogue interface combining sound field control with semi-blind source separation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 1, pp. 809–812, Toulouse, France, May 2006.

Shigeki Miyabe was born in Nara, Japan, on July 1, 1978. He received the B.E. degree in electrical and electronics engineering from Kobe University in 2003 and the M.E. degree in information science from Nara Institute of Science and Technology (NAIST) in 2005. He is now a Ph.D. student at the Graduate School of Information Science, NAIST. His research interests include sound field control and array signal processing. He is a Member of the Acoustical Society of Japan (ASJ).

Yoichi Hinamoto received the B.E. degree in electrical and electronic engineering from the University of Tokushima in 2001, the M.E. degree in information science from NAIST in 2003, and the Ph.D. degree in informatics from Kyoto University in 2006. He is currently a Research Associate at Takuma National College of Technology. His research interests include digital signal processing and adaptive filter algorithms. He is a Member of the Institute of Electronics, Information and Communication Engineers of Japan (IEICE) and the Institute of Electrical and Electronics Engineers (IEEE).

Hiroshi Saruwatari was born in Nagoya, Japan, on July 27, 1967. He received the B.E., M.E., and Ph.D. degrees in electrical engineering from Nagoya University, Nagoya, Japan, in 1991, 1993, and 2000, respectively. He joined the Intelligent Systems Laboratory, SECOM Co., Ltd., Mitaka, Tokyo, Japan, in 1993, where he was engaged in the research and development of an ultrasonic array system for acoustic imaging. He is currently an Associate Professor at the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include array signal processing, blind source separation, and sound field reproduction. He received Paper Awards from IEICE in 2000 and 2006. He is a Member of the IEEE, the VR Society of Japan, the IEICE, and the Acoustical Society of Japan.

Kiyohiro Shikano received the B.S., M.S., and Ph.D. degrees in electrical engineering from Nagoya University in 1970, 1972, and 1980, respectively. He is currently a Professor at Nara Institute of Science and Technology (NAIST), where he directs the Speech and Acoustics Laboratory. From 1972 to 1993, he worked at NTT Laboratories.


During 1986–1990, he was the Head of the Speech Processing Department at ATR Interpreting Telephony Research Laboratories, and during 1984–1986 he was a Visiting Scientist at Carnegie Mellon University. He received the Yonezawa Prize from IEICE in 1975, the IEEE Signal Processing Society 1990 Senior Award in 1991, the Technical Development Award from the ASJ in 1994, the IPSJ Yamashita SIG Research Award in 2000, the Paper Award from the Virtual Reality Society of Japan in 2001, the IEICE Paper Award in 2005 and 2006, and the Inose Award in 2005. He is a Fellow of the Institute of Electronics, Information and Communication Engineers of Japan (IEICE) and of the Information Processing Society of Japan, and a Member of the Acoustical Society of Japan (ASJ), the Virtual Reality Society of Japan, the Institute of Electrical and Electronics Engineers (IEEE), and the International Speech Communication Association (ISCA).

Yosuke Tatekura was born in Kyoto, Japan, on May 17, 1975. He received the B.E. degree in precision engineering from Osaka University in 1998, and the M.E. and Ph.D. degrees in information science from Nara Institute of Science and Technology (NAIST) in 2000 and 2002, respectively. He is currently a Research Associate at Shizuoka University. His research interests include sound field control and virtual sound source synthesis.