
Speech Communication 44 (2004) 83–96

www.elsevier.com/locate/specom

Audio-visual speech recognition using an infrared headset

Jing Huang *, Gerasimos Potamianos, Jonathan Connell, Chalapathy Neti

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, United States

Received 12 February 2004; received in revised form 17 August 2004; accepted 4 October 2004

Abstract

It is well known that frontal video of the speaker's mouth region contains significant speech information that, when combined with the acoustic signal, can improve accuracy and noise robustness of automatic speech recognition (ASR) systems. However, extraction of such visual speech information from full-face videos is computationally expensive, as it requires tracking faces and facial features. In addition, robust face detection remains challenging in practical human–computer interaction (HCI), where the subject's posture and environment (lighting, background) are hard to control, and thus successfully compensate for. In this paper, in order to bypass these hindrances to practical bimodal ASR, we consider the use of a specially designed, wearable audio-visual headset, a feasible solution in certain HCI scenarios. Such a headset can consistently focus on the speaker's mouth region, thus eliminating altogether the need for face tracking. In addition, it employs infrared illumination to provide robustness against severe lighting variations. We study the appropriateness of this novel device for audio-visual ASR by conducting both small- and large-vocabulary recognition experiments on data recorded using it under various lighting conditions. We benchmark the resulting ASR performance against bimodal data containing frontal, full-face videos collected at an ideal, studio-like environment, under uniform lighting. The experiments demonstrate that the infrared headset video contains comparable speech information to the studio, full-face video data, thus being a viable sensory device for audio-visual ASR.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Visual speech; Audio-visual speech recognition; Infrared headset; Feature fusion; Multi-stream HMM; Decision fusion

0167-6393/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2004.10.007

* Corresponding author. Tel.: +1 914 945 3226; fax: +1 914 945 4490.
E-mail addresses: [email protected] (J. Huang), [email protected] (G. Potamianos), [email protected] (J. Connell), [email protected] (C. Neti).

1. Introduction

Motivated by the bimodality (auditory and visual) of human speech perception (Sumby and Pollack, 1954) and by the need for robust automatic speech recognition (ASR) in noisy environments, significant research effort has recently been directed into the study of audio-visual automatic speech recognition (AVASR). In addition to the traditional acoustic input, AVASR utilizes visual speech information, present in the speaker's mouth region. Such information has been successfully extracted and demonstrated to improve the accuracy and noise robustness of ASR systems for both small- and large-vocabulary tasks by a number of authors, including Duchnowski et al. (1994), Hennecke et al. (1996), Chen and Rao (1998), Teissier et al. (1999), Dupont and Luettin (2000), Chen (2001), Chibelushi et al. (2002), Heckmann et al. (2002), and Nefian et al. (2002), among others. For example, Potamianos et al. (2003) report that bimodal ASR of a small-vocabulary task under speech babble noise at 0 dB signal-to-noise ratio (SNR) achieves the performance of an acoustic-only recognizer at 10 dB, i.e., the visual modality provides an "effective" SNR gain of 10 dB in the "usable" ASR range. The authors also report significant gains on the same task even in clean acoustic conditions, and furthermore, an 8 dB "effective" SNR gain for large-vocabulary continuous speech recognition (LVCSR).

However, almost all AVASR research work in the literature, including the results mentioned above, has concentrated on databases recorded under ideal visual conditions. Such sets contain high-resolution video of the subjects' frontal face, with very limited variation in head pose and subject-camera distance, rather uniform lighting, and, in most cases, constant background. This kind of recording is often unrealistic in practical human–computer interaction (HCI) scenarios, where the subject's posture and/or the environment are hard to control, as could be, for example, HCI inside automobiles, offices, and on trading floors, human interaction with intelligent kiosks, cellphones, and PDAs, or audio-visual speech transcription of lectures, meetings, and broadcast news. In such scenarios, robust extraction of the visual speech information given the full-face video becomes a challenging problem, since it requires accurate tracking of the speaker's face and facial features (e.g., mouth corners, possibly lip contours), as well as successful compensation for head pose and lighting variations in the final visual speech features. Not surprisingly, preliminary results by Potamianos and Neti (2003) show that AVASR performance in visually challenging domains (offices, automobiles) lags significantly compared to the performance achieved on data recorded in ideal conditions (studio-like environment), as measured by visual-only word error rate (WER) and relative WER reduction in bimodal vs. audio-only ASR in noise.

These facts motivated us to design a special audio-visual wearable headset as a means to robustly capture the visual speech information and thus allow practical AVASR deployment. The headset can consistently focus on the speaker's mouth region, therefore eliminating altogether the need for face tracking. Furthermore, by employing infrared illumination, it captures visual speech even in extreme lighting conditions. In addition to the added robustness, bypassing the face tracking step reduces computational requirements substantially. For example, in the real-time PC-based AVASR system of Connell et al. (2003), such tracking consumes about 72% of the visual front end processing. Eliminating this step could possibly allow future real-time AVASR on devices with less available resources, such as PDAs. The designed headset is discussed in more detail in Section 2 of this paper. Although not the first such device developed in the literature for AVASR (Adjoudani et al., 1997), the proposed design is significantly smaller and less intrusive.

Using a headset for bimodal speech capture in HCI may at first seem restrictive; however, in many cases, humans already wear head-mounted microphones while communicating with other humans. Examples include call centers, trading floors, and aircraft cockpits. All three constitute noisy environments where visual speech can play a critical role in improving ASR performance, thus enabling automatic transcription. Other possible scenarios include office dictation and lecture transcription. In the latter case, the lecturer should be tether-free, so the designed audio-visual headset allows wireless transmission of its bimodal input. To reinforce our point, recent work in the literature has focused on improving ASR robustness to noise by considering additional head-mounted microphone sensors. Indeed, Graciarena et al. (2003) consider a throat microphone, namely a skin-vibration transducer, whereas Zhang et al. (2004) utilize a bone-conductive microphone to capture speech-induced skull vibration. The recorded signals are correlated to the subject's speech, thus allowing background speech rejection and improved ASR by using enhancement techniques, for example. One possible advantage of our proposed audio-visual headset over the sensors mentioned above is that the visual signal, in addition to being correlated to the acoustic one (Jiang et al., 2002), also contains complementary information to it (Massaro and Stork, 1998). Such information contributes to significant ASR improvements even in clean acoustics.

The main focus of this paper is to study the appropriateness of the novel infrared headset for AVASR. For this purpose, we utilize our state-of-the-art bimodal recognizer (Potamianos et al., 2003) to conduct both small-vocabulary (connected digits) and large-vocabulary (LVCSR) recognition experiments on data recorded using the audio-visual headset. We benchmark these tasks against AVASR on frontal, full-face videos, collected at an ideal, "studio"-like environment. Of particular interest is the performance of the system visual front end, as well as of its audio-visual fusion module. The first is measured by visual-only ASR accuracy on the small-vocabulary task, whereas the latter by the relative WER reduction, compared to audio-only performance (for both recognition tasks). As our experiments demonstrate, the headset achieves grossly comparable performance with the "studio"-quality full-face videos, thus showcasing its appropriateness and robustness in capturing the visual speech information.

As we already mentioned, a crucial element in any AVASR system is the visual speech feature extraction. Our speechreading system utilizes an appearance-based approach that first estimates a region-of-interest (ROI), containing the area around the speaker's mouth appropriately normalized, and subsequently uses dimensionality reduction techniques to obtain the visual features. In the case of video captured by the headset, each entire input frame can be viewed as the ROI. However, due to unavoidable small variations in the headset placement among subjects, it turns out that some ROI normalization is desirable. In this paper, we propose such a scheme, based on tracking the subject's two mouth corners. Notice that this is a much simpler task compared to tracking in full-face videos, as the headset images are tightly constrained (no background objects), and thus the algorithm can utilize weak models. In addition, only two facial points need to be detected, compared to the hierarchical face and multiple facial feature scheme of Senior (1999), required in the case of full-face videos. A second crucial element in AVASR is the combination of the extracted visual features with the traditional acoustic ones. Here, we consider two main approaches for audio-visual integration, namely feature and decision fusion. Thus, compared to an earlier version of this work (Huang et al., 2003), this paper considers additional techniques for both visual processing and audio-visual integration, and, for the first time, it reports LVCSR on headset data.

The paper is structured as follows: Section 2 presents the novel audio-visual headset. Sections 3 and 4 review the main components of the AVASR system, namely the visual front end and audio-visual fusion, respectively. Section 5 is devoted to the experimental study and comparison of AVASR performance across two datasets of interest: full-face studio-quality videos and headset-captured data. Finally, Section 6 concludes the paper.

2. The audio-visual headset

As we mentioned in Section 1, recently there has been increased interest in the development of novel head-mounted sensors that supplement the traditional close-talking microphone input in order to improve ASR robustness and performance. For example, Graciarena et al. (2003) consider a throat microphone, whereas Zhang et al. (2004) propose a number of sensors, including a bone-conductive microphone, an in-ear microphone, and a device with an infrared light emitter and receiver pair. The latter aims at the speaker's mouth region, seeking to capture the amount of teeth visibility. Therefore, in contrast to the previous devices, it can be considered as a crude sensor for visual speech capture. Of course, significantly more visual speech information can be obtained by a regular head-mounted camera. To our knowledge, the first such audio-visual headset designed for AVASR has been proposed by Adjoudani et al. (1997). This device was quite bulky and intrusive, but nevertheless robustly captured the mouth ROI of the subject. A much more advanced system, also commercially available, is reported in (Gagne et al., 2001). This headset allows wireless transmission of the captured video to base stations, where it is displayed on small portable screens. The system is designed to be worn by lecturers in order to improve their intelligibility by hearing-impaired subjects in the audience, taking advantage of the human speechreading capability.

Fig. 1. (First row) Frontal and side view of the infrared headset; (middle row) frontal and side view of a subject wearing it; and (lower row) detail of ear-piece and boom.

In our audio-visual headset design, an important consideration was to house both microphone and camera in a single boom with minimal protrusion in front of the subject's mouth. Ideally, the designed headset should not feel much bulkier than a typical close-talking, head-mounted microphone headset. The current version of the derived audio-visual headset is depicted in Fig. 1. Frontal and side views of it are shown, as well as of a subject wearing it, and details of two of its components. The headset contains the audio and video components in a single adjustable boom. A microphone is placed in the side of the boom, offset from the mouth, in order to reduce audio distortion caused by breath noise. Located in the boom tip are a wide-view miniature camera (less than 2.5 cm in cube) and a number of infrared components: on either side of the camera are two diodes that illuminate the speaker's face with infrared light. Both camera and infrared diodes sit in a plastic housing, covered by a rectangular infrared filter, which blocks visible light and allows only infrared light to pass through, where it is picked up by the camera lens. The design reduces the environmental lighting effect on captured images, allowing good visibility of the mouth ROI even in a dark room. The camera and microphone outputs are connected to a 1.2 GHz wireless transmitter placed next to the headset ear-piece (see also Fig. 1), with the receiver attached to a base station. This allows tether-free operation of up to four headsets in scenarios such as trading floors and lecture rooms. Also located on the headset are a battery pack that provides power for it and an auxiliary wireless telephone, which is wired into the headset. This provides normal telephone conversation with a base station that is attached to a conventional telephone. The entire headset weighs approximately 480 g.

The wirelessly transmitted captured audio and S-video signals are fed into a laptop-based AVASR or audio-visual data collection system (Connell et al., 2003), the latter after conversion to digital video via the Firewire interface. The system records audio at 22 kHz and video at a 30 frames/s rate and a 720 × 480 pixel resolution. Example video frames captured by the headset are depicted in the upper row of Fig. 2.

Fig. 2. Example video frames from the audio-visual databases considered in this paper for AVASR (see Section 5): (upper row) data recorded using the audio-visual headset and (lower row) studio-quality, full-face data, recorded employing a teleprompter. The latter are used as an AVASR performance benchmark.

3. Visual front end processing

There exist two critical modules in AVASR systems: the visual front end design and the audio-visual integration strategy. The latter is often tied to the choice of speech recognition method employed, for example neural network based classifiers as used by Duchnowski et al. (1994) and Heckmann et al. (2002), or more conventional hidden Markov model (HMM) systems with Gaussian mixture emission probabilities, as for example in (Dupont and Luettin, 2000; Nefian et al., 2002; Potamianos et al., 2003). In this paper, we opt for HMM based recognizers, focusing our attention instead on the visual feature extraction and audio-visual fusion modules, which we describe in more detail in the current and next section, respectively.

In general, there exist three possibilities for visual speech representation given the speaker's video (Hennecke et al., 1996): appearance-based features that typically seek a suitable compressed representation of the pixel values within a visual region of interest (ROI) (Duchnowski et al., 1994; Potamianos et al., 2003), shape-based features that consist of a geometric or statistical representation of the lip contours and possibly of the lower face (Dupont and Luettin, 2000; Gurbuz et al., 2001; Heckmann et al., 2002), and a combination of the two strategies. The latter can just concatenate features from each category, as in (Dupont and Luettin, 2000), or seek to build a combined statistical model of face and appearance, as in (Matthews et al., 2001). Of course, the choice of visual features dictates the requirements on the visual front end steps that precede their extraction. Thus, appearance based features require only the mouth ROI, whereas a geometric representation of lip shape is dependent on tracking the speaker's lip contours within it. In turn, ROI extraction and its appropriate normalization require successful face and facial feature detection, and possibly an estimate of the subject's head pose within the recorded visual scene.

The AVASR system considered in this work employs appearance-type features based on the discrete cosine transform (DCT) of the visual ROI. Thus, it avoids computationally expensive lip-tracking and face shape estimation. In addition, using the headset for visual capture eliminates the need for the first module of the visual front end chain, namely face tracking and localization of most of the facial features, as compared to using full-face videos. Therefore, only the ROI estimation and the visual feature extraction steps are required, detailed next. For completeness, and since full-face videos are considered in our experiments as a benchmark of AVASR performance, we also briefly describe face tracking and ROI extraction for such data.

3.1. ROI extraction on headset captured videos

In our preliminary AVASR work using the audio-visual headset, reported by Huang et al. (2003), most of the captured video frame was considered as the ROI, with no normalization applied to it. In particular, only simple boundary truncation and subsampling was applied to the original 720 × 480 pixel frame, in order to obtain a 64 × 64 pixel ROI. However, due to the unavoidable variations in headset placement among subjects, it soon became apparent that some ROI normalization was desirable. In this paper, we propose such a scheme, based on tracking the two mouth corners of the recorded subject. This allows correcting slight positioning errors, boom rotation, and rescaling of the physical mouth size. Recognition experiments reported in Section 5 demonstrate the benefits of this approach over our earlier crude ROI extraction scheme.

Extracting the two mouth corners turns out to be fairly simple in the headset scenario, since it is assumed that the camera is already aimed nearly directly at the mouth. Because the types of features seen in the captured images are tightly constrained (i.e., no confusing background objects are expected in the scene), the algorithms can use very weak models and hence run quickly.

The first phase of the mouth-finding algorithm estimates the position of the head in the image. Usually the mouth region is highly illuminated by the infrared source and the image gets darker toward the cheeks (even with extra ambient lighting). The algorithm starts by histogram-equalizing the entire image to compensate for overall brightness changes. Next, only the pixels in a central vertical strip of the equalized image are histogrammed and a binarization threshold is picked. Finally, the entire image is binarized using this threshold and projected onto the vertical and horizontal axes to determine a rough bounding box for the facial region.

The second phase models the mouth as a dark slit and attempts to determine its extent. Notice that the mouth appears as a dark slit even when it is closed, due to the curvature of the lip surfaces. In detail, the image is first further enhanced using a horizontal center-surround barmask with a center band of 5 pixels in a 17-pixel square (see Fig. 3(d)). An area (yellow box in Fig. 3(d)) within the face bounding box from the first phase is then used to obtain another threshold, subsequently used to binarize the entire image (see Fig. 3(e)). Then the pixel values within a prospective mouth box (blue box in Fig. 3(e)) are projected on the horizontal axis to find the x coordinates of the mouth corners (see Fig. 3(f)). To find the y coordinates, two independent vertical strips are created around the lateral mouth bounds, and the binary image is projected on the vertical axis in each region to locate them (see Fig. 3(g,h)). As a sanity check, the tilt angle between the mouth corners is computed. If this angle is too steep, the higher of the two mouth corners is lowered to match the other one.

The final step is to output a normalized mouth image. Such an image is extracted after the mouth position, width, and angle have been smoothed over a small, non-causal temporal window. The full image is de-rotated to make the mouth horizontal, then a portion is extracted and rescaled so that the mouth spans about 60% of the image width. A second histogram equalization of just this region is performed to make the details clearly visible. The result is a 64 × 64 pixel ROI with an aspect ratio of about 1.7 covering the mouth (see Fig. 3(b)).

Fig. 3. ROI extraction for an example frame captured by the headset: (a) original frame, histogram equalized, with estimated mouth corners depicted; (b) extracted ROI, normalized on the basis of the estimated mouth corners, and subsequently used for DCT-based visual feature extraction; and (c) ROI reconstruction using visual speech features (100 DCT coefficients). Detailed steps for mouth corner estimation: (d) barmask-enhanced frame, with a superimposed rectangle over which the binarization threshold is computed; (e) binarized image after thresholding, with mouth search region depicted; (f) binary image projection within the mouth search region, used to estimate the mouth corner x-coordinates; (g) barmask-enhanced frame with mouth-corner search regions superimposed; and (h) image projection within the search regions, used to estimate the mouth corner y-coordinates.
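To make the two-phase procedure concrete, the following is a minimal sketch of the mouth-corner search, assuming an OpenCV/NumPy environment; the strip widths, barmask weights, and thresholds are illustrative stand-ins for the tuned values used in our system, and the function name is hypothetical.

import cv2
import numpy as np

def find_mouth_corners(gray):
    """Estimate left/right mouth corners in a greyscale headset frame (sketch)."""
    # Phase 1: equalize overall brightness, pick a threshold from a central
    # vertical strip, binarize, and project to bound the facial region.
    eq = cv2.equalizeHist(gray)
    h, w = eq.shape
    strip = eq[:, w // 3: 2 * w // 3]
    thr, _ = cv2.threshold(strip, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, face_bin = cv2.threshold(eq, thr, 255, cv2.THRESH_BINARY)
    rows, cols = face_bin.sum(axis=1), face_bin.sum(axis=0)
    y0, y1 = np.nonzero(rows > 0.2 * rows.max())[0][[0, -1]]
    x0, x1 = np.nonzero(cols > 0.2 * cols.max())[0][[0, -1]]

    # Phase 2: enhance the dark mouth slit with a zero-mean horizontal
    # center-surround barmask (5-pixel band in a 17 x 17 support), then
    # binarize the response and project to locate the corner coordinates.
    bar = np.full((17, 17), 1.0 / (12 * 17), np.float32)   # bright surround
    bar[6:11, :] = -1.0 / (5 * 17)                          # dark center band
    slit = cv2.filter2D(eq.astype(np.float32), -1, bar)
    box = slit[y0:y1, x0:x1]
    slit_bin = box > np.percentile(box, 95)                 # strongest slit responses
    xs = np.nonzero(slit_bin.sum(axis=0) > 0)[0]            # horizontal projection
    xl, xr = x0 + xs[0], x0 + xs[-1]
    yl = y0 + int(np.argmax(slit_bin[:, xs[0]]))            # vertical projections at
    yr = y0 + int(np.argmax(slit_bin[:, xs[-1]]))           # the lateral mouth bounds
    return (xl, yl), (xr, yr)

The resulting corner estimates would then be smoothed over time and used to de-rotate and rescale the frame into the 64 × 64 ROI described above.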

3.2. ROI extraction on full-face videos

In the case of full-face videos that possibly include confusing background, the estimation of the speaker's face location and of landmark facial features is required preceding the ROI extraction. In addition to ROI localization, features such as mouth corners, chin, nose, and eyes can provide head-pose information that can be useful in ROI normalization. In our work, we employ a statistical approach to this problem, reported in (Senior, 1999).

In more detail, given a video frame, face detection is first performed by searching for face candidates that contain a relatively high proportion of skin-tone pixels over an image "pyramid" of possible locations and scales. Each candidate is size-normalized to a chosen template size (here, an 11 × 11 square), and its greyscale pixel values are placed into a 121-dimensional face candidate vector. Every vector is given a score based on the combination of the two-class (face versus non-face) Fisher linear discriminant, as well as its "distance from face space" (DFFS), i.e., the face vector projection error onto a lower, 40-dimensional space, obtained by means of principal components analysis (PCA). Candidate regions exceeding a threshold score are considered as faces. Once a face has been detected, an ensemble of facial feature detectors is used to estimate the locations of 26 facial features. Each feature location is determined using a within-face restricted search, based on the feature linear discriminant and "distance from feature space" (similar to the DFFS discussed above) scores of 11 × 11 square candidate features. A training step is required to estimate the Fisher discriminant and PCA eigenvectors for both stages, utilizing a number of manually annotated video frames (Senior, 1999).
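As an illustration of the candidate scoring, the sketch below computes a Fisher-discriminant score and the DFFS term for one 11 × 11 candidate; the linear combination weight alpha and the function interface are illustrative assumptions, since the exact combination rule of Senior (1999) is not reproduced here.

import numpy as np

def face_candidate_score(patch, mean, eigvecs, fisher_w, alpha=1.0):
    """patch: 11 x 11 greyscale candidate; mean: 121-dim training mean;
    eigvecs: 121 x 40 PCA basis spanning the 'face space';
    fisher_w: 121-dim two-class (face vs. non-face) Fisher direction."""
    x = patch.astype(np.float64).reshape(-1) - mean
    coeffs = eigvecs.T @ x                          # projection onto face space
    dffs = np.linalg.norm(x - eigvecs @ coeffs)     # distance from face space
    fisher = float(fisher_w @ x)                    # linear discriminant score
    return fisher - alpha * dffs                    # higher score = more face-like

Candidates whose combined score exceeds a threshold are accepted; the same mechanism, with feature-specific statistics, would drive the 26 facial-feature detectors.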

Face and facial feature tracking provides mouth location, size, and orientation estimates. These are subsequently smoothed over time to improve robustness. Based on the result, a 64 × 64 pixel ROI is obtained for every video frame. This contains the lower face around the speaker's mouth, and is properly normalized to compensate for rotation, size, and lighting variations, the latter by using histogram equalization of the face.

3.3. Visual feature representation

For both scenarios (headset and full-face videos), an appearance-based representation of the ROI leads to the final visual features. In more detail, a two-dimensional, separable DCT is applied to the greyscale 4096-pixel ROI, and 100 DCT coefficients are retained. The appropriate locations of the kept coefficients are estimated based on highest energy over a training set. Notice that these coefficients retain significant information about the mouth shape, teeth and tongue visibility. This fact is demonstrated in Fig. 3(c), where the reconstructed ROI on the basis of the selected coefficients is depicted.

To reduce the dimensionality of the DCT representation and improve discrimination among the speech classes of interest, an intra-frame linear discriminant analysis (LDA) projection is applied, resulting in a 30-dimensional feature vector. This is followed by a maximum likelihood linear transformation (MLLT) that improves maximum likelihood based statistical data modeling (Potamianos et al., 2003). In order to facilitate audio-visual fusion, linear interpolation is employed that synchronizes the features to the 100 Hz rate of their audio counterpart (see next section), whereas feature mean normalization is used to further compensate for lighting variations, providing the visual-only static features. In order to capture visual speech dynamics, 15 consecutive such features are then concatenated, and subsequently projected/rotated by means of an inter-frame LDA/MLLT combination onto a lower-dimensional space. This final linear transform gives rise to dynamic visual features o_{v,t} of dimension 41. These processing blocks are schematically depicted in Fig. 4.
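The static part of this pipeline can be summarized by the following sketch, assuming SciPy/NumPy; the coefficient mask and the LDA and MLLT matrices are placeholders for transforms estimated on training data, and the function name is hypothetical.

import numpy as np
from scipy.fftpack import dct

def visual_static_feature(roi, keep_mask, lda, mllt):
    """roi: 64 x 64 greyscale ROI; keep_mask: boolean 64 x 64 mask of the 100
    highest-energy DCT positions; lda: 30 x 100 matrix; mllt: 30 x 30 matrix."""
    # Separable 2-D DCT: apply the 1-D transform along rows, then columns.
    coeffs = dct(dct(roi.astype(np.float64), axis=0, norm='ortho'),
                 axis=1, norm='ortho')
    x = coeffs[keep_mask]        # the 100 retained DCT coefficients
    return mllt @ (lda @ x)      # 30-dimensional static visual feature

The dynamic features then follow exactly as described above: the static vectors are interpolated to 100 Hz, mean-normalized, stacked over 15 consecutive frames, and projected by the inter-frame LDA/MLLT to 41 dimensions.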

Fig. 4. Block diagram of our AVASR system (Potamianos et al., 2003). Synchronous, 60-dimensional audio features, o_{a,t}, and 41-dimensional visual observations, o_{v,t}, are extracted, both at 100 Hz. A feature fusion strategy used for speech recognition by hidden Markov models is depicted.


4. Audio-visual integration

In parallel to the visual feature extraction, audio features are also obtained, time-synchronously, at 100 Hz. First, 24 mel-frequency cepstral coefficients of the speech signal are computed over a sliding window of 25 ms, and are mean normalized to provide static features. Then, nine consecutive such frames are concatenated and projected by means of LDA and MLLT onto a 60-dimensional space, producing dynamic audio features o_{a,t}, as schematically depicted in Fig. 4.
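A corresponding sketch of the audio front end is given below, assuming the librosa library for the MFCC computation; the 60 × 216 projection matrix stands in for the trained LDA/MLLT cascade, and the exact analysis parameters of our system may differ.

import numpy as np
import librosa

def audio_dynamic_features(wav, sr, proj_a):
    """wav: mono waveform; sr: sampling rate; proj_a: 60 x 216 LDA/MLLT matrix."""
    # 24 mel-frequency cepstral coefficients over 25 ms windows at a 100 Hz rate.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=24,
                                n_fft=int(0.025 * sr), hop_length=sr // 100)
    mfcc -= mfcc.mean(axis=1, keepdims=True)       # cepstral mean normalization
    feats = []
    for t in range(4, mfcc.shape[1] - 4):
        ctx = mfcc[:, t - 4: t + 5].reshape(-1)    # nine consecutive frames (216-dim)
        feats.append(proj_a @ ctx)                 # project to 60 dimensions
    return np.stack(feats)                         # dynamic audio features o_{a,t}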

Given the two single-modality feature streams, a number of audio-visual integration strategies can be envisioned for AVASR (Hennecke et al., 1996). Most such methods fall within the feature fusion or decision fusion framework. The former approach, also known as "direct identification" (Teissier et al., 1999), combines the bimodal speech information at the feature level and utilizes a single classifier for recognition, as in (Duchnowski et al., 1994; Chen, 2001; Potamianos et al., 2003), for example. Techniques that use the audio-visual data to enhance acoustic features, such as in (Girin et al., 2001; Deligne et al., 2002), also fit into this category. On the other hand, decision fusion, also termed "separate identification" in (Teissier et al., 1999), combines the two single-modality classification decisions, typically at the likelihood level, in order to obtain a bimodal recognition score (Heckmann et al., 2002; Dupont and Luettin, 2000; Nefian et al., 2002; Potamianos et al., 2003). Within the classifier fusion framework of Jain et al. (2000), the latter approach fits into "parallel, measurement-level" integration. Particularly for ASR, the likelihood combination can occur at various temporal levels, such as the utterance end (Adjoudani et al., 1997), intermediate word or phone boundaries (Nefian et al., 2002), or at the finer level of sub-phonetic units (Dupont and Luettin, 2000). In the latter case, and if HMMs are used for ASR, decision fusion typically employs the multi-stream HMM (MSHMM). In this paper, we report most results using a feature fusion approach, occasionally employing a simple MSHMM-based integration.

In more detail, our feature fusion approach first obtains 101-dimensional concatenated audio-visual vectors o_{av,t} = [o_{a,t}, o_{v,t}]. It then discriminatively reduces their size by means of an LDA/MLLT cascade in order to allow more accurate speech modeling (Potamianos et al., 2003). The final features o_{d,t} are 60-dimensional (see also Fig. 4). These are then fed to a suitable HMM-based decoder for speech recognition. The scheme is referred to in our experiments as "AVf".
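A minimal sketch of this "AVf" step follows; the 60 × 101 matrix is a placeholder for the trained LDA/MLLT cascade.

import numpy as np

def fuse_features(o_a, o_v, proj_av):
    """o_a: 60-dim audio feature; o_v: 41-dim visual feature; proj_av: 60 x 101."""
    o_av = np.concatenate([o_a, o_v])   # 101-dimensional concatenated vector
    return proj_av @ o_av               # 60-dimensional fused feature o_{d,t}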

In the MSHMM-based decision fusion approach, the single-modality observations are assumed to be generated by audio-only and visual-only HMMs of identical topologies with class-conditional emission probabilities P_a(o_{a,t}) and P_v(o_{v,t}), respectively. Both are modeled as mixtures of Gaussian densities. The class-conditional score of the concatenated audio-visual observation is then set to P_av(o_{av,t}) = P_a(o_{a,t})^k · P_v(o_{v,t})^(1−k) (Dupont and Luettin, 2000). Exponent k is used to appropriately weigh the contribution of each stream, depending on the "relative confidence" in each modality. In general, k can vary locally, and thus model audio-visual inputs in realistic scenarios, where the quality and speech information content of each stream change over time. Here, though, the exponent is assumed constant over a database, and is estimated by minimizing the WER on appropriate held-out data. This integration scheme is referred to in our experiments as "AVd".
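The state-level score combination can be sketched as follows in log-likelihood form, assuming diagonal-covariance Gaussian mixtures per stream; the mixture containers and the exponent value shown are illustrative (in our experiments the exponent is tuned on the check set).

import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
    comps = [np.log(w) + multivariate_normal.logpdf(x, m, np.diag(v))
             for w, m, v in zip(weights, means, variances)]
    return float(np.logaddexp.reduce(comps))

def mshmm_state_score(o_a, o_v, audio_gmm, visual_gmm, k=0.7):
    """log P_av = k * log P_a(o_a) + (1 - k) * log P_v(o_v) for one HMM state."""
    return k * gmm_loglik(o_a, *audio_gmm) + (1.0 - k) * gmm_loglik(o_v, *visual_gmm)

During decoding, this combined score simply takes the place of the single-stream emission log-likelihood at every state of the MSHMM.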

In both audio-visual fusion algorithms, HMM parameters are obtained by the traditional maximum likelihood approach, based on available data. For decision fusion in particular, the HMM stream component parameters are separately trained and subsequently joined for AVASR, with the MSHMM transition probabilities set equal to those of the audio-only HMM. This approach is simple to implement; however, separate stream training does not enforce proper audio-visual synchrony in the MSHMM, and thus it does not fully utilize the advantages of decision fusion. Indeed, the more complicated scheme of joint stream training, not considered in this work, is reported by Potamianos et al. (2003) to further improve AVASR.

5. Experiments

We now proceed to investigate the appropriateness of the introduced audio-visual headset for AVASR. For that purpose, we use as a comparison benchmark the recognition performance on full-face videos, recorded under ideal, "studio"-like conditions. The corresponding audio-visual databases are briefly discussed next.

5.1. The audio-visual databases

As mentioned in the Introduction, most AVASR research has concentrated on data collected under ideal visual conditions. One such corpus is the IBM ViaVoice™ audio-visual database, recorded in a quiet, studio-like environment, with uniform lighting and background. The subjects' head pose is ensured to remain frontal with little variation in the corpus, due to the use of a teleprompter that displays the dictation text. High-quality video of the subjects' full frontal face is captured, and is MPEG2-encoded at 30 Hz and a 704 × 480 pixel frame size. In addition to video, wideband audio is synchronously collected at a rate of 16 kHz using a desktop, omni-directional microphone. The acoustic SNR is at 20 dB. A 50-subject subset of this database containing connected-digit utterances (DIGIT), and a 257-subject dataset containing ViaVoice training scripts, i.e., continuous read speech with mostly verbalized punctuation (LVCSR), are used here as a reference for audio-visual ASR under ideal conditions. Four example video frames from these sets are depicted in the lower row of Fig. 2.

The headset data are recorded using a laptop-based audio-visual collection system (Connell et al., 2003), with the video data received via the Firewire interface after conversion into DV format at a base station. As discussed in Section 2, the bimodal data are wirelessly transmitted from the headset to the base station. The system records audio at 22 kHz and video with a 720 × 480 pixel resolution at 30 Hz. Prior to recording, the adjustable boom is used to fine-tune the headset placement with respect to the speaker's mouth. A preview window on the laptop is used to correct and verify proper placement. A total of 79 subjects uttering connected digit strings (DIGIT), and 113 subjects reading ViaVoice dictation scripts (LVCSR), are recorded as our experimental datasets. Four example video frames from these sets can be viewed in the upper row of Fig. 2.

Therefore, four audio-visual databases are available for AVASR experiments, spanning the two sensor configuration environments ("studio", "headset") and the two recognition tasks ("DIGIT", "LVCSR"). Their duration and other relevant information are summarized in Table 1. For the purpose of our experiments, each set is split into non-overlapping subsets: one for training HMMs, one for testing, and, in the case of connected digits, a "check" (held-out) set used for estimating the MSHMM exponent in decision fusion. In all cases, the training and test subjects are disjoint, thus our experimental framework is speaker-independent (see also Table 1).

Table 1
Four audio-visual databases used in this paper for speaker-independent AVASR

Task     Environ.   Set     Utter.   Dur.    Words     Sub.
LVCSR    Studio     Train   17111    34:55   219,470   239
                    Test      790     1:46    11,732    18
         Headset    Train   11466    28:30   192,273    95
                    Test      720     1:46    12,528    18
DIGIT    Studio     Train    4881     7:08    41,496    42
                    Check     729     1:04     6,162    42
                    Test      539     0:47     4,580     8
         Headset    Train    2970     4:22    25,176    71
                    Check     729     1:03     6,213    71
                    Test      312     0:26     2,658     8

Two recognition tasks (LVCSR and connected digits) are considered for each of the two audio-visual domains (full-face studio and headset-captured data). The corpora partitioning into training, test, and check sets (for the second task only) is depicted (number of utterances, duration in hours, total uttered words, and number of subjects are shown for each set). Within each set, italics signify identical subjects.

Table 2
Effects of absence ("No") or presence ("Yes") of mouth-based ROI normalization on the visual-only ("VI") and audio-visual ("AVf") recognition performance (reported in WER, %) on the headset DIGIT test set

                      Clean            Noisy
ROI norm.   VI        AU      AVf      AU      AVf
No          46.5      1.31    1.57     21.4    16.1
Yes         35.0      1.31    1.09     21.4    11.9

For AVASR, two acoustic conditions ("Clean", "Noisy") are considered, with the corresponding audio-only ("AU") WER also depicted.

Table 3
Audio-only, visual-only, and audio-visual (using both feature and decision fusion) WER, %, on the DIGIT test set of the studio and headset corpora in two acoustic conditions

                    Clean                     Noisy
Set       VI        AU      AVf     AVd       AU      AVf     AVd
Studio    47.1      0.83    1.13    0.81      21.4    13.6    14.7
Headset   35.0      1.31    1.09    0.87      21.4    11.9    12.1

5.2. Visual-only and audio-visual recognition

For both visual-only and audio-visual recognition, we utilize the standard IBM stack decoder, recently modified to also accommodate MSHMM-based decision fusion. In particular, for connected-digit ASR (11-word vocabulary) in both the studio and headset cases, we use HMMs with 166 context-dependent states (the context is cross-word, spanning up to five phones to either side) and 3.2k Gaussian mixture components. For both LVCSR sets, we use HMMs with 2991 context-dependent states and 58.8k Gaussian mixture components. For LVCSR decoding, we utilize an 18.7k-word vocabulary and a trigram language model. For all databases, in addition to their original acoustic signal, audio-only ASR and AVASR are also considered on audio artificially corrupted by additive "speech babble" noise, at around 7 dB SNR for DIGIT and 4 dB SNR for LVCSR. The HMMs used for decoding are still trained on the original (clean) data. All recognition results are depicted in Tables 2–4. Word error rates (WERs) of single-modality (visual-only (VI), audio-only (AU)) and audio-visual (AVf) ASR are presented for the DIGIT task. Decision fusion results (AVd) are also reported. For LVCSR, only audio-only (AU) and audio-visual feature fusion (AVf) performance is presented. Note that, in contrast to small-vocabulary tasks, visual-only recognition of large-vocabulary continuous speech is a highly confusable task, thus visual-only LVCSR results are meaningless.

We first investigate the performance of the ROI normalization algorithm for headset data, introduced in Section 3. For that purpose, we consider the connected-digit ASR task. As expected, mouth-based ROI normalization helps, with a 25% relative improvement (from 46.5% to 35.0%) in visual-only recognition observed (as depicted in Table 2). The algorithm also benefits AVASR, both in clean and noisy acoustic conditions. For example, in the former, ASR performance degrades if audio is combined with non-normalized visual features by means of feature fusion. However, there is a 17% relative improvement (1.31% to 1.09%) when normalized visual features are used. In addition, for noisy audio, the improvement is significantly larger in the latter case (44% vs. 25% relative).

It is expected that by using the infrared headset one would obtain a better and more accurate ROI than by face tracking, and thus better visual-only ASR performance, compared to the full-face studio video datasets. Indeed (see Table 3), the DIGIT visual-only WER improves by 26% relative when moving from studio to headset data (47.1% to 35.0%). This is so even though the headset data are collected under varying office lighting, whereas in the studio set, both lighting and background are kept rather uniform. As expected, the improved visual-only ASR on the headset data translates to an increased relative benefit of the visual modality to bimodal recognition, in both the clean and noisy acoustic conditions. Indeed, in the former, by using audio-visual decision fusion there is a significant ASR performance improvement in the headset domain (34% relative, from 1.31% to 0.87%), whereas a statistically insignificant improvement is observed on the studio data. For noisy audio, the relative ASR improvement due to the visual modality is slightly greater for the headset (44% vs. 36%). There, feature fusion slightly outperforms decision fusion, which we believe is due to the asynchronous (separate) training of the MSHMM components. Nevertheless, it is worth observing that the final AVASR performance on the clean studio data is superior to the headset, because of the better audio-only recognition (by about 37% relative). The latter is due to interference in the wireless transmission of the acoustic signal, as well as some degree of breathing noise in the headset. We are currently working to resolve these audio problems in future versions of the prototype.

The audio channel quality issues also become apparent by comparing the audio-only LVCSR performance on the headset and studio data. As depicted in Table 4, the audio-only WER on headset data is significantly higher than the studio data WER (20.6% vs. 14.5%), i.e., by about 30% relative. Nevertheless, the visual modality still helps improve ASR by about 9% (20.6% to 18.7%) and 28% (55.2% to 39.7%), for clean and noisy speech, respectively. These numbers are quite close but slightly inferior to the corresponding relative WER reductions for studio data (11% and 33%, respectively). We believe that this is due to the fact that, in contrast to the DIGIT sets, a substantially smaller number of LVCSR training subjects are available (see Table 1).

Table 4
Audio-only and audio-visual (using feature fusion) WER, %, on the LVCSR test set of the studio and headset corpora in two acoustic conditions

           Clean             Noisy
Set        AU      AVf       AU      AVf
Studio     14.5    12.9      55.8    37.2
Headset    20.6    18.7      55.2    39.7

Our experimental results clearly demonstrate that the speech information contained in the headset video data can contribute to ASR similarly well to the studio video data, which have been collected under unrealistically controlled conditions. This conclusion becomes particularly clear from the noisy ASR experiments reported above: there, the noise level has been adjusted to result in approximately the same audio-only performance across the two data types. Incorporating the visual information improves DIGIT ASR by 36% and 44% relative, and LVCSR by 33% and 28%, for the studio and headset data, respectively. Of course, the exact relative improvements depend heavily on the database specifics, including duration and subject population. Such factors are practically impossible to adjust in pre-existing databases.

6. Conclusions

In this paper, we introduced a specially designed audio-visual headset and investigated its feasibility for AVASR. The headset bypasses the need for face tracking, thus improving the robustness of visual speech feature extraction in realistic scenarios and reducing the AVASR computational requirements. We benchmarked its recognition performance against an ideal-case, "visually clean" data domain, where full-face videos are captured and hence face tracking is required. We reported both small- and large-vocabulary recognition experiments, observing overall that the headset video contributes significant speech information to its audio channel. Compared to the full-face video, its ASR contribution is superior in the small-vocabulary recognition experiments and somewhat inferior for LVCSR, possibly reflecting the disparate database population sizes. Nevertheless, the results demonstrate that the headset constitutes a viable sensory device for audio-visual ASR, unaffected by head pose and lighting variations.

Acknowledgment

The authors would like to thank colleagues Liam Comerford, Luis Elizalde, Tom Picunko, and Gabriel Taubin for the design and hardware implementation of the infrared headset, as well as Larry Sansone for data collection.

References

Adjoudani, A., Guiard-Marigny, T., Le Goff, B., Reveret, L., Benoît, C., 1997. A multimedia platform for audio-visual speech processing. In: Proc. Europ. Conf. on Speech Communication Technol., pp. 1671–1674.

Chen, T., 2001. Audiovisual speech processing. Lip reading and lip synchronization. IEEE Signal Process. Mag. 18 (1), 9–21.

Chen, T., Rao, R.R., 1998. Audio-visual integration in multimodal communication. Proc. IEEE 86 (5), 837–852.

Chibelushi, C.C., Deravi, F., Mason, J.S.D., 2002. A review of speech-based bimodal recognition. IEEE Trans. Multimedia 4 (1), 23–37.

Connell, J.H., Haas, N., Marcheret, E., Neti, C., Potamianos, G., Velipasalar, S., 2003. A real-time prototype for small-vocabulary audio-visual ASR. In: Proc. Internat. Conf. on Multimedia Expo, pp. 469–472.

Deligne, S., Potamianos, G., Neti, C., 2002. Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization). In: Proc. Internat. Conf. on Spoken Lang. Process., pp. 1449–1452.

Duchnowski, P., Meier, U., Waibel, A., 1994. See me, hear me: Integrating automatic speech recognition and lip-reading. In: Proc. Internat. Conf. on Spoken Language Process., pp. 547–550.

Dupont, S., Luettin, J., 2000. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2 (3), 141–151.

Gagne, J.-P., Laplante-Levesque, A., Labelle, M., Doucet, K., 2001. Evaluation of an audiovisual-FM system (AudiSee): The effects of visual distractions on speechreading performance. Technical Report, École d'orthophonie et d'audiologie, Faculté de médecine, Université de Montréal.

Girin, L., Schwartz, J.-L., Feng, G., 2001. Audio-visual enhancement of speech in noise. J. Acoust. Soc. Amer. 109 (6), 3007–3020.

Graciarena, M., Franco, H., Sonmez, K., Bratt, H., 2003. Combining standard and throat microphones for robust speech recognition. IEEE Signal Process. Lett. 10 (3), 72–74.

Gurbuz, S., Tufekci, Z., Patterson, E., Gowdy, J.N., 2001. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Proc. Internat. Conf. Acoust., Speech, Signal Process., pp. 177–180.

Heckmann, M., Berthommier, F., Kroschel, K., 2002. Noise adaptive stream weighting in audio-visual speech recognition. EURASIP J. Appl. Signal Process. 2002 (11), 1260–1273.

Hennecke, M.E., Stork, D.G., Prasad, K.V., 1996. Visionary speech: Looking ahead to practical speechreading systems. In: Stork, D.G., Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Springer, Berlin, pp. 331–349.

Huang, J., Potamianos, G., Neti, C., 2003. Improving audio-visual speech recognition with an infrared headset. In: Proc. Work. Audio-Visual Speech Process., pp. 175–178.

Jain, A.K., Duin, R.P.W., Mao, J., 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 4–37.

Jiang, J., Alwan, A., Keating, P.A., Chaney, B., Auer Jr., E.T., Bernstein, L.E., 2002. On the relationship between face movements, tongue movements, and speech acoustics. EURASIP J. Appl. Signal Process. 2002 (11), 1174–1188.

Massaro, D.W., Stork, D.G., 1998. Speech recognition and sensory integration. Amer. Scient. 86, 236–244.

Matthews, I., Potamianos, G., Neti, C., Luettin, J., 2001. A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proc. Internat. Conf. on Multimedia Expo.

Nefian, A.V., Liang, L., Pi, X., Liu, X., Murphy, K., 2002. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J. Appl. Signal Process. 2002 (11), 1274–1288.

Potamianos, G., Neti, C., 2003. Audio-visual speech recognition in challenging environments. In: Proc. Europ. Conf. on Speech Comm. Technol., pp. 1293–1296.

Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W., 2003. Recent advances in the automatic recognition of audio-visual speech. Proc. IEEE 91 (9), 1306–1326.

Senior, A.W., 1999. Face and feature finding for a face recognition system. In: Proc. Internat. Conf. on Audio Video-based Biometric Person Authentication, pp. 154–159.

Sumby, W.H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Amer. 26, 212–215.

Teissier, P., Robert-Ribes, J., Schwartz, J.L., 1999. Comparing models for audiovisual fusion in a noisy-vowel recognition task. IEEE Trans. Speech Audio Process. 7 (6), 629–642.

Zhang, Z., Liu, Z., Sinclair, M., Acero, A., Deng, L., Droppo, J., Huang, X., Zheng, Y., 2004. Multi-sensory microphones for robust speech detection, enhancement and recognition. In: Proc. IEEE Internat. Conf. on Acoust. Speech Signal Process., Vol. 3, pp. 781–784.