
Source: Carnegie Mellon School of Computer Science, tschaaf/MyPublications/2002-icmi-rogina-schaaf.pdf

Lecture and Presentation Tracking in an Intelligent Meeting Room

Ivica Rogina, Thomas Schaaf
Interactive Systems Labs
Universitaet Karlsruhe

E-mail: [email protected]

Abstract

Archiving, indexing, and later browsing through stored presentations and lectures is a task that can be observed with growing frequency. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer towards a lecture such that the recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We define a tracking accuracy measure which measures how well a system can automatically align recognized words with parts of a presentation, and show that by prior exploitation of the presented documents the tracking accuracy can be improved. The system described in this paper is part of an intelligent meeting room developed in the European-Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).

1. Introduction

While a decade ago presentations using more than simple black-and-white overhead foils were rare, today the standard way of presenting a lecture or a talk is with powerful tools that help in designing a multimedia presentation, which itself is presented in a room equipped with many supporting audio and video devices.

With both the contents of lectures and presentations and the equipment in lecture halls and meeting rooms becoming more and more complex, the task of reducing the workload for the people giving presentations and the people listening is becoming more important. Archiving, indexing,

[Figure 1 shows three example slides aligned with fragments of recognized text: "MY NAME IS *(IVICA) *(ROGINA) AND I WOULD LIKE TO INTRODUCE YOU TO THE *(INTERACTIVE) ...", "SPEAKING MAKING *(GESTURES) *(HANDWRITING) EXPRESSING *(EMOTIONS) GIVING ...", and "HORSE BEACH *(RECOGNITION) *(TOOLKIT) *(JANUS) +BREATH+ HAS BEEN *(BENCHMARKED) AT ...".]

Figure 1: Synchronizing slides with recognized text

and later browsing through stored presentations is another task that can be observed with growing frequency. To address these problems, we have investigated the special problems and advantages of lectures. In this paper we propose the design and adaptation of a speech recognizer towards a lecture such that the recognition accuracy can be significantly improved by analyzing the presented documents and adapting the recognizer's vocabulary and special class-based language model accordingly. We use a tracking accuracy measure which measures how well a system can automatically align recognized words with parts of a presentation, and show that by prior exploitation of the presented documents the tracking accuracy can be improved. The system described in this paper is part of an intelligent meeting room developed in the European-Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).


2. The FAME Room

Rather than an intelligent living room, the FAME room is more of an intelligent meeting room. In addition to the tasks performed by our previously presented meeting tracker systems [1][2][3][4], the FAME project foresees activities of the room during a meeting or lecture, namely to act as an information butler in the background. Meetings and lectures should be held as usual; only when the participants explicitly or implicitly require additional information does the information butler become active.

The most important differences for the system between tracking a meeting and tracking a lecture are the quality of the speech acoustics, the availability of prepared documents (the presentation documents), and the speaking style. We can expect much more planned speech in a lecture than in a meeting.

The speech acoustics are easier for the recognizer for two reasons: first, the speech is more planned than in a discussion, and secondly, it is more reasonable to expect a lecture speaker to wear a head-mounted microphone or at least a lapel microphone. Although we have experienced rather significant differences in the audio quality of speech recorded with lapel microphones, due to effects like the acoustic shadow made by the speaker's chin or the microphone rubbing against parts of the body or the clothing, the recognition accuracy is still better than with distant table-mounted microphones. Although experiments [10] have shown that microphone arrays can improve the recognition of distantly recorded speech, in the FAME room we will first focus on using microphone arrays only for the localization of sound sources.

For later indexing and browsing of lectures, and for transmitting the lecture via a video conference to an audience distributed over different rooms, the lecturer is monitored by an automatic intelligent camera. The room is equipped with one or more cameras that track the lecturer and the audience. The system can automatically decide which image is transmitted. For browsing purposes, in addition to the meeting browser [3], the lecture assistant will also synchronize the given talk with the presented slides (and, if available, with presented audio and video documents).

The two major tasks in assisting the lecturer consist of controlling the audio and video devices in the room and controlling the presentation. The former means turning devices on and off, dimming the lights, setting speaker volumes, etc.; the latter means automatically selecting and displaying slides and, optionally, audio or video documents. Both services can be performed implicitly, by the system "guessing" what is currently needed, or by having the speaker give explicit commands.

3. The Lecture Assistant

The tasks of the lecture assistant addressed in this paper are

- analysis of the presented documents

- related information retrieval from the internet

- adaptation of the vocabulary, language model, and pronunciation lexicon

- speech recognition and lecture tracking

We will now describe these tasks in greater detail (see Fig. 2).

3.1 Analysis of the Presentation

The analysis of the presentation documents has two goals. One is to extract the important content words; the other is to retrieve an ASCII representation of the document which can be used to correlate it with the recognizer output, such that the system always knows which part of the presentation the speaker is talking about.

The extracted words are compared with the recognizer's background vocabulary. A tf-idf value is computed for every content word. All out-of-vocabulary (OOV) words and the words with higher-than-average tf-idf values are considered to be important.
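The selection rule just described (keep all OOV words plus every word with above-average tf-idf) can be sketched as follows. The function and parameter names (`important_words`, `doc_freq`, `n_docs`) are illustrative; the paper gives no code, and the exact tf-idf variant it uses is not specified.

```python
import math
from collections import Counter

def important_words(slide_words, background_vocab, doc_freq, n_docs):
    """Select important content words from the presentation slides.

    A word is kept if it is OOV w.r.t. the recognizer's background
    vocabulary, or if its tf-idf value exceeds the average over all
    content words. doc_freq maps a word to the number of background
    documents containing it (an assumed input, not from the paper).
    """
    tf = Counter(slide_words)
    tfidf = {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0)))
             for w in tf}
    avg = sum(tfidf.values()) / len(tfidf)
    oov = {w for w in tf if w not in background_vocab}
    high = {w for w, v in tfidf.items() if v > avg}
    return oov | high
```

For example, a rare word like a project name that never occurs in the background corpus is selected both because it is OOV and because its idf term is large.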


[Figure 2 block diagram: the presentation documents (slides) feed "extract important content words"; the important words are sent to a WWW search engine, and from the retrieved WWW documents n-gram contexts are extracted in order to find suitable classes and adapt the class-based LM. A basic class-based LM trained on a background data corpus, together with a Text-to-Phone module, supplies the speech recognizer, whose time-tagged hypotheses are synchronized and aligned with the presentation documents (slides).]

Figure 2: Overview of the lecture assistant

3.2 Document Retrieval from the Internet

Similar to the system described in [5], every important content word from the presentation documents is used for a simple single-word query in a standard internet search engine. The top n documents delivered by the search engine (we used n=100 in our experiments) are analyzed, and all quintuples of words in which the middle word is an important content word are extracted and stored.
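The quintuple extraction step can be sketched as below; the helper name and the token representation are illustrative assumptions, since the paper only describes the step in prose.

```python
def word_quintuples(tokens, important):
    """Collect every 5-gram of tokens whose middle word is an
    important content word, as done for the retrieved WWW documents.
    (Sketch; the paper's actual data structures are not given.)
    """
    return [tuple(tokens[i - 2:i + 3])
            for i in range(2, len(tokens) - 2)
            if tokens[i] in important]
```

These stored contexts later supply the trigram evidence used when assigning a new word to a language model class.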

In addition, one search engine query is made with all important words, with the expectation that the retrieved documents, especially the most highly ranked ones, are more relevant.

3.3 Adapting the Class-Based Language Model

The primary goal of the language model design was to allow the easy addition of new words; therefore a class-based design was chosen. The basis was a trigram model on a 40k vocabulary trained with the HUB-4 (broadcast news) training corpus. This language model did not contain any classes. To define classes that would be suitable to accept new OOV words, 20k more words from the HUB-4 vocabulary were taken. All sentences from the corpus which contained one of the additional 20k words were fed into a Kneser-Ney [9] bigram clustering algorithm. The result of the clustering process was a set of 2500 classes representing the OOV words from the point of view of the base 40k language model.

The 10k most frequent words were removed from these classes. Then only the set Z of classes that consisted of a sufficient number of words (minimum of N = 50) was kept (72 classes); all other classes were ignored and not used any further. Removing the top 10k words first was necessary to ensure that they were not discarded together with the unused classes.
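The pruning step above can be sketched as follows; the dictionary-of-sets representation is an assumption for illustration (the paper used top_k = 10000 and min_size = 50, yielding 72 classes).

```python
def prune_classes(classes, word_freq, top_k=10000, min_size=50):
    """Post-process the clustering result: first drop the top_k most
    frequent words from every class, then keep only the classes that
    still contain at least min_size words. Dropping the frequent
    words *before* filtering ensures they are not thrown away along
    with the discarded classes.
    """
    top = set(sorted(word_freq, key=word_freq.get, reverse=True)[:top_k])
    pruned = {c: ws - top for c, ws in classes.items()}
    return {c: ws for c, ws in pruned.items() if len(ws) >= min_size}
```

The order of the two operations matters: filtering first would silently delete frequent words that happen to sit in small classes.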

The resulting language model thus consisted of trigrams on the 40k most frequent words from HUB-4 plus 72 "OOV classes". The model performed equally well, in terms of recognition accuracy, as a corresponding plain trigram model.

Some of the 72 classes were built of words whose relation to each other is difficult to see; some classes could be clearly identified (see the example in Figure 3).

In the class Z30, which is displayed in Fig. 3, there are 56 non-OOV words plus one word named OOV30 representing all OOV words (min. 50 words) assigned to Z30 during the clustering process.

Now the model was prepared to accept new OOV words by adding them into an appropriate class, as follows.


CLASS Z30 = { OOV30 ASTHMA TUBERCULOSIS DIABETES POLIO PNEUMONIA DIARRHEA CHOLERA RAPHAEL ALCOHOLISM HEPATITIS MALARIA OBESITY MEASLES DEHYDRATION SCHIZOPHRENIA INGENUITY NAUSEA ADVIL MALNUTRITION ALLERGIES VALIUM UNTREATED MELANOMA HERPES ACETAMINOPHEN DYSENTERY ULCER SYPHILIS OSTEOPOROSIS COLDS LONGEVITY VOMITING PEROXIDE FLASHBACKS LIGGETT MENINGITIS ALS DIZZINESS TREMORS INFLUENZA SOYBEANS INDIGESTION DIPHTHERIA INSOMNIA NUMBNESS BULIMIA DEMENTIA LUPUS SIRHAN MENOPAUSAL AFFECTIONS PIMPLES MAIM MARKIE CHLAMYDIA POLYGAMY }

Figure 3: An example language model class

Let v be a word that has to be added to the recognizer. Let \kappa(w) be the index of the class to which the non-OOV word w belongs. For each class c \in Z, we define \kappa_c^v(w) as:

\kappa_c^v(w) =
\begin{cases}
\kappa(w) & w \in V \\
c & w = v \\
\kappa(\mathrm{UNK}) & \text{else}
\end{cases}
\quad (1)

p(w \mid \kappa_c^v(w)) =
\begin{cases}
p(\mathrm{OOV}_c \mid c) \cdot p(v \mid \mathrm{OOV}_c) & w = v \\
\dfrac{\#(w)}{\#(\text{instances in } c)} & w \in V
\end{cases}
\quad (2)

where

p(\mathrm{OOV}_c \mid c) = \dfrac{\#(\text{OOV instances in } c)}{\#(\text{all instances in } c)} \quad (3)

and

p(v \mid \mathrm{OOV}_c) := \dfrac{1}{\#(\text{OOV tokens in } c)} \quad (4)

is the approximation of the probability of observing the word v among the OOV words in c, assuming an equal distribution of all tokens.

C_v = \operatorname{argmax}_{c \in Z} \prod_j p(w_j \mid \kappa_c^v(w_j)) \cdot p(\kappa_c^v(w_j) \mid H),
\quad \text{where } H = \kappa_c^v(w_1), \ldots, \kappa_c^v(w_{j-1})
\quad (5)

Here, w_j denotes the j-th word of a text that was concatenated from all retrieved internet documents that contain the word v. While in eq. 5 H is in general the entire history of word w_j, we restricted the actual history to trigrams, just like in the recognizer's language model.

Actually, it is not necessary to compute C_v over all retrieved text: we get the exact same result if we regard only the word v together with a sufficiently wide context. For training trigrams, it is enough to use a context of ±2 words.

In our experiments, we computed not only the most suitable class as in eq. 5 but also the second- and third-best classes, and added every content word to the top three classes. This approach was chosen because with such a rather small number of 72 classes, one single class cannot cover all semantic properties of a word. Adding a word to all classes of the system did not improve the performance any further.
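The class selection of eq. 5 can be sketched as below. The sketch assumes the two probability tables have already been estimated from the class-based LM and the retrieved contexts; all names are illustrative, not from the paper.

```python
import math

def top_classes(contexts, classes, p_word_given_class,
                p_class_given_hist, k=3):
    """Score each candidate class c for a new word v by the log
    likelihood of the retrieved contexts when v is mapped into c
    (cf. eq. 5), and return the k best-scoring classes.

    contexts: list of (history, center_word) pairs around v.
    p_word_given_class[(w, c)]: within-class word probability.
    p_class_given_hist[(c, hist)]: class n-gram probability.
    Both tables are assumed complete for the keys that occur.
    """
    scores = {}
    for c in classes:
        s = 0.0
        for hist, word in contexts:
            s += math.log(p_word_given_class[(word, c)]) \
               + math.log(p_class_given_hist[(c, hist)])
        scores[c] = s
    return sorted(classes, key=lambda c: scores[c], reverse=True)[:k]
```

With k=3 this reproduces the paper's choice of adding each content word to the three top-scoring classes.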

When new words were added to their classes, they were given an extraordinarily high within-class probability compared to the other OOV words in that class. This approach is justified by the expectation that the important words from the presentation documents will almost certainly be spoken by the lecturer; the appearance of these words is much more likely than would be estimated from the retrieved documents.

What remains to be done when adding new words to the system is to find an entry for the pronunciation lexicon. In our system, we look the word up in a large background lexicon. If it is not found there, a text-to-speech system [7] is used to provide a phone sequence.
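The lookup-with-fallback logic is tiny but worth making explicit; the function and its arguments are illustrative stand-ins for the background lexicon and the text-to-phone system of [7].

```python
def pronunciation(word, lexicon, text_to_phone):
    """Return the phone sequence for word: take it from the large
    background lexicon if present, otherwise fall back to a
    text-to-phone function (cf. [7]). Illustrative sketch only.
    """
    return lexicon.get(word) or text_to_phone(word)
```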

3.4 Recognition and Tracking

A tracking system for a presentation or a lecture uses the same basic technology as most systems that have to monitor people in action and relate their actions (especially their spoken utterances) to corresponding documents or parts of documents. A good lecture tracker can be used as a basis for a good meeting tracker in


Lecture    1      2      3      4      5
WER [%]    58.5   37.9   33.5   43.7   31.0

Table 1: baseline error rates on five lectures

Lecture     3       4       5
baseline    33.5%   43.7%   31.0%
system 1    28.1%   39.7%   29.8%
system 2    26.8%   37.8%   27.6%

Table 2: improvements by LM adaptation

which several people talk about and work with a set of possibly shared documents.

The tasks performed by the lecture tracker during the lecture consist of recognizing the lecturer's speech, possibly switching slides and displaying documents (WWW, video, audio) when appropriate, and, in hopefully rare cases, interacting with the lecturer to resolve problems. While interaction is a problem that is part of the FAME project, we have not addressed it in this paper.

After the lecture, the recorded audio and the presented documents have to be aligned, indexed, and stored, such that it is possible to retrieve the documents and the audio recording of the speaker for later browsing.

4. Experiments

We have conducted experiments on five self-recorded lectures. The speech recognizer used in our experiments is based on the JANUS speech recognition toolkit [6]. It was trained on the HUB-4 broadcast news training corpus and was used in other systems like [1][8]. The baseline word error rates for the five test lectures are shown in table 1. Lectures 1 and 2 were recorded with lapel microphones; lectures 3, 4, and 5 were recorded with a head-mounted microphone.

We trained two systems: system 1 used only OOV words for adapting the language model and the vocabulary; system 2 also used important content words and treated them as if they were OOV. After adapting the language model, the word error rates were as shown in table 2.

The OOV rate of the test-set lectures is approximately 5%, so adding the OOV words found in the presented documents to the recognizer's vocabulary is not sufficient to explain the significant gain in word accuracy. The other major contribution to the improvement comes from increasing the unigram probabilities of the important content words.

In our system, we treat every content word in the analyzed documents like an OOV word. Even if such words are already in the vocabulary, they are added to the three top-scoring classes (see eq. 5) as if they were OOV. In this approach, the new words appear more than once in the vocabulary and can be regarded as homographs (two different words with the same spelling). This makes sense if we consider that these words may well be used in completely different contexts, with completely different meanings, than in the text that was used to train the basic background language model. Using many more than three classes resulted in an increase of the word error rate, as the following graph (averaged over lectures 3, 4, and 5) shows:

[Graph: word error rate (y-axis, 30.5% to 35.5%) versus the number of classes each word is added to (x-axis, 0 to 70); the WER increases as words are added to many more than three classes.]

In the synchronization experiments, every hypothesis word was automatically assigned a slide of the presentation and compared to the actual slide that was displayed at the corresponding time. To find a temporal alignment of the slides, we use a standard dynamic time warping (DTW) algorithm which, for now, assumes that the slides are presented in a linear order in only one direction. Every word of the recognizer's hypothesized output is assigned a slide corresponding to the resulting DTW path. Local distances in the DTW matrix, e.g. the distance between a hypothesis word w_t and a slide s_i, are computed by comparing the content words of s_i with the context w_{t-n} ... w_t ... w_{t+n} around the questioned word w_t. The inverse likelihood


Figure 4: A typical alignment for a presentation of 8 slides (top = actual, bottom = predicted)

Lecture     3       4       5
baseline    34.7%   31.8%   32.1%
adapted     28.3%   31.0%   27.8%

Table 3: Improvements in tracking error rates

of a word from the context appearing on the slide is taken as the distance. We found that the DTW path is rather stable and that choosing n = 3 is sufficient; increasing n did not result in higher tracking accuracies.
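The monotonic DTW alignment described above can be sketched as follows. The local distance here is a simple stand-in (one minus the fraction of the context found on the slide) for the paper's inverse-likelihood distance, which is not fully specified; all names are illustrative.

```python
def align_slides(hyp_words, slides, n=3):
    """Monotonic DTW alignment of hypothesis words to slides.

    slides is a list of sets of content words. The path starts on the
    first slide and may only stay on a slide or advance to the next
    one, matching the paper's linear, one-direction assumption.
    """
    T, S = len(hyp_words), len(slides)
    INF = float("inf")

    def dist(t, s):
        # context overlap between w[t-n .. t+n] and slide s
        ctx = hyp_words[max(0, t - n): t + n + 1]
        hits = sum(1 for w in ctx if w in slides[s])
        return 1.0 - hits / len(ctx)

    # forward pass over the DTW matrix
    D = [[INF] * S for _ in range(T)]
    for t in range(T):
        for s in range(S):
            if t == 0:
                D[t][s] = dist(t, s) if s == 0 else INF
            else:
                prev = min(D[t - 1][s],
                           D[t - 1][s - 1] if s > 0 else INF)
                D[t][s] = dist(t, s) + prev

    # backtrack: assign every hypothesis word a slide index
    path = [0] * T
    s = min(range(S), key=lambda j: D[T - 1][j])
    for t in range(T - 1, -1, -1):
        path[t] = s
        if t > 0 and s > 0 and D[t - 1][s - 1] <= D[t - 1][s]:
            s -= 1
    return path
```

Comparing the returned per-word slide indices against the slides actually shown at the corresponding times yields the tracking error rate of table 3.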

A zero-knowledge approach that assigned every slide a time slot of the same size would produce a tracking error of 50-60% on our lectures. The tracking error rate improved as shown in table 3. Here we found that the tracking accuracy is only loosely correlated with the recognition accuracy.

Of course, the tracking accuracy depends strongly on the amount of information found on the slides. Presentations containing only headlines and images are much harder to track than presentations with lots of text.

5. Conclusion and Further Plans

We have shown that it is possible to improve the speech recognizer's word accuracy significantly if we use data from documents that a speaker plans to present during a lecture. We have defined and evaluated the tracking accuracy and have shown that it, too, can profit from prior exploitation of the presentation documents.

In the future, we plan to also spot explicit commands to switch slides or to refer to specific slides, either by naming them or their number or by addressing their contents. From human experience, we can also expect some improvements in the tracking accuracy when we take prosody into account, because many speakers lower their voice or pause when finishing a slide and switching to the next one.

Acknowledgements

Part of this work was carried out within the FAME project and has been funded by the European Union as IST project No. IST-2000-28323.

6. References

[1] Alex Waibel, Michael Bett, Michael Finke, Rainer Stiefelhagen: "Meeting Browser: Tracking and Summarizing Meetings", Proceedings of the Human Technology Conference, San Diego, 2001.

[2] Tanja Schultz, Alex Waibel, Michael Bett, Florian Metze, Yue Pan, Klaus Ries, Thomas Schaaf, Hagen Soltau, Martin Westphal, Hua Yu, and Klaus Zechner: "The ISL Meeting Room System", Proceedings of the Workshop on Hands-Free Speech Communication (HSC-2001), Kyoto, Japan, April 2001.

[3] Alex Waibel, Michael Bett, Florian Metze, Klaus Ries, Thomas Schaaf, Tanja Schultz, Hagen Soltau, Hua Yu, Klaus Zechner: "Advances in Automatic Meeting Record Creation and Access", ICASSP 2001, Salt Lake City.

[4] Ralph Gross, Michael Bett, Hua Yu, Xiaojin Zhu, Yue Pan, Jie Yang, Alex Waibel: "Towards a Multimodal Meeting Record", ICASSP 2001, Salt Lake City.

[5] Thomas Kemp and Alex Waibel: "Reducing the OOV Rate in Broadcast News Speech Recognition", Proceedings of ICSLP 98, Sydney, Australia, 30 November - 4 December 1998.

[6] Ivica Rogina and Alex Waibel: "The JANUS Recognizer", ARPA Workshop on Spoken Language Technology, 1995.

[7] William M. Fischer: "A Statistical Text-to-Phone Function Using N-Grams and Rules", Proceedings of ICASSP 99, Phoenix, Arizona, March 1999.

[8] Thomas Schaaf: "Detection of OOV Words Using Generalized Word Models and a Semantic Class Language Model", Proceedings of Eurospeech 2001, Aalborg, September 2001.

[9] Reinhard Kneser and Hermann Ney: "Improved Clustering Techniques for Class-Based Statistical Language Modelling", Proceedings of Eurospeech 1993, pages 973-976.

[10] Michael L. Seltzer, Bhiksha Raj, and Richard M. Stern: "Speech Recognizer-Based Microphone Array Processing for Robust Hands-Free Speech Recognition", Proceedings of ICASSP 2002.