
Spoken Dialogue Technology: Enabling the Conversational User Interface

MICHAEL F. MCTEAR

University of Ulster

Spoken dialogue systems allow users to interact with computer-based applications such as databases and expert systems by using natural spoken language. The origins of spoken dialogue systems can be traced back to Artificial Intelligence research in the 1950s concerned with developing conversational interfaces. However, it is only within the last decade or so, with major advances in speech technology, that large-scale working systems have been developed and, in some cases, introduced into commercial environments. As a result, many major telecommunications and software companies have become aware of the potential for spoken dialogue technology to provide solutions in newly developing areas such as computer-telephony integration. Voice portals, which provide a speech-based interface between a telephone user and Web-based services, are the most recent application of spoken dialogue technology. This article describes the main components of the technology—speech recognition, language understanding, dialogue management, communication with an external source such as a database, language generation, speech synthesis—and shows how these component technologies can be integrated into a spoken dialogue system. The article describes in detail the methods that have been adopted in some well-known dialogue systems, explores different system architectures, considers issues of specification, design, and evaluation, reviews some currently available dialogue development toolkits, and outlines prospects for future development.

Categories and Subject Descriptors: H.5.2 [Information Interfaces and Presentation]: User Interfaces—Natural language, Voice I/O; I.2.7 [Artificial Intelligence]: Natural Language Processing—Discourse, Speech recognition and synthesis

General Terms: Human Factors

Additional Key Words and Phrases: Dialogue management, human computer interaction, language generation, language understanding, speech recognition, speech synthesis

Author’s address: School of Information and Software Engineering, University of Ulster, Shore Road, Newtownabbey BT37 0QB, Northern Ireland, UK; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 2002 ACM 0360-0300/02/0300-0090 $5.00

ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 90–169.

1. INTRODUCING SPOKEN DIALOGUE TECHNOLOGY

The “conversational computer” has been the goal of researchers in speech technology and artificial intelligence (AI) for more than 30 years. A number of large-scale research programs have addressed this goal, including the DARPA Communicator Project, Japan’s Fifth Generation program, and the European Union’s ESPRIT and Language Engineering programs. The impression of effortless spontaneous conversation with a computer has been fostered by examples from science fiction such as HAL in 2001: A Space Odyssey or the computer on the Star Ship Enterprise. It is only recently, however, that spoken language interaction with computers has become a practical possibility, both in scientific and in commercial terms. This is due to advances in speech technology, language processing, and dialogue modeling, as well as the emergence of faster and more powerful computers to support these technologies. Applications such as voice dictation and the control of appliances using voice commands are appearing on the market, and an ever-increasing number of software and telecommunications companies are seeking to incorporate speech technology into their products. It is important, however, to be aware of the limitations of these applications. Statements are commonly made in sales and marketing literature such as “Talk to your computer as you would talk to your next-door neighbor” or “Teach your computer the art of conversation.” However, the technologies involved would not be sufficient to enable a computer to engage in a natural conversation with a user. Voice dictation systems provide a transcription of what the user dictates to the system, but the system does not attempt to interpret the user’s input or to discuss it with the user. Command-and-control applications enable users to perform commands with voice input that would otherwise be performed using the keyboard or mouse. The computer recognizes the voice command and carries out the action, or reports that the command was not recognized. No other form of dialogue is involved. Similar restrictions apply to most other forms of voice-based system in current use.

Spoken dialogue systems, on the other hand, can be viewed as an advanced application of spoken language technology. Spoken dialogue systems provide an interface between the user and a computer-based application that permits spoken interaction with the application in a relatively natural manner. In so doing, spoken dialogue systems subsume most of the major fields of spoken language technology, including speech recognition and speech synthesis, language processing, and dialogue management.

The aim of the current survey is to describe the essential characteristics of spoken dialogue technology at a level of technical detail that should be accessible to computer scientists who are not specialists in speech recognition and computational linguistics. The survey provides an overview for those wishing to research into or develop spoken dialogue systems, and hopefully also for those who are already experienced in this field. Most published work to date on spoken dialogue systems tends to report on the design, implementation, and evaluation of individual systems or projects, as would be expected with an emerging technology. The present paper will not attempt to survey the growing number of spoken dialogue systems currently in existence but rather will focus on the underlying technologies, using examples of particular systems to illustrate commonly occurring issues.1

1.1. Overview of the Paper

The remainder of the paper is structured as follows. In the next section spoken dialogue systems are defined as computer systems that use spoken language to interact with users to accomplish a task. Dialogue systems are classified in terms of different control strategies and some examples are presented in Section 3 that illustrate this classification and give a feel for the achievements as well as the limitations of current technology. Section 4 describes the components of a spoken dialogue system—speech recognition, language understanding, dialogue management, external communication, response generation, and text-to-speech synthesis.

1 Inevitably there are omissions, in some cases of well-known and important systems, but this is unavoidable, as the aim is not to provide a comprehensive review of dialogue systems but to focus on the general issues of the technology. Interested readers can follow up particular systems in the references provided at the end of the survey and in Appendix A.


The key to a successful dialogue system is the integration of these components into a working system. Section 5 reviews a number of architectures and dialogue control strategies that provide this integration. Methodologies to support the specification, design, and evaluation of a spoken dialogue system are reviewed in Section 6. Particular methods have evolved for specifying system requirements, such as user studies, the use of speech corpora, and Wizard-of-Oz studies. Methods have also been developed for the evaluation of dialogue systems that go beyond the methods used for evaluation of the individual elements such as the speech recognition and spoken language understanding components. This section also examines some current work on guidelines and standards for spoken language systems. A recent development is the emergence of toolkits and platforms to support the construction of spoken dialogue systems, similar to the toolkits and development platforms that are used in expert systems development. Some currently available toolkits are reviewed and evaluated in Section 7. Finally, Section 8 examines directions for future research in spoken dialogue technology.

2. SPOKEN DIALOGUE SYSTEMS—A DEFINITION

Spoken dialogue systems have been defined as computer systems with which humans interact on a turn-by-turn basis and in which spoken natural language plays an important part in the communication [Fraser 1997]. The main purpose of a spoken dialogue system is to provide an interface between a user and a computer-based application such as a database or expert system. There is a wide variety of systems that are covered by this definition, ranging from question-answer systems that answer one question at a time to “conversational” systems that engage in an extended conversation with the user. Furthermore, the mode of communication can range from minimal natural language, consisting perhaps of only a small set of words such as the digits 0–9 and the words yes and no, through to large-vocabulary systems supporting relatively free-form input. The input itself may be spoken or typed and may be combined with other input modes such as DTMF (touch-tone) input, while the output may be spoken or displayed as text on a screen, and may be accompanied by visual output in the form of tables or images.

Spoken dialogue systems enable casual and naive users to interact with complex computer applications in a natural way using speech. Current IVR (Interactive Voice Response) systems limit users in what they can say and how they can say it. However, users of speech-based computer systems often do not know exactly what information they require and how to obtain it—they require the support of the computer to determine their precise requirements. For this reason it is essential that speech-based computer systems should be able to engage in a dialogue with users rather than simply respond to predetermined spoken commands. At the same time spoken dialogue systems are more restricted than conversational computers in that their conversational topics are limited, usually to a single domain such as flight enquiries.

Spoken dialogue systems can be classified into three main types, according to the methods used to control the dialogue with the user:

(1) finite state- (or graph-) based systems;
(2) frame-based systems; and
(3) agent-based systems.

The type of dialogue control strategy used has a bearing on how the system accomplishes two of its main tasks: processing the user’s input and recovering from errors.

2.1. Finite State-Based Systems

In a finite state-based system the user is taken through a dialogue consisting of a sequence of predetermined steps or states. Most commercially available spoken dialogue systems use this form of dialogue control strategy. The dialogue flow is specified as a set of dialogue states with transitions denoting various alternative paths through the dialogue graph. The system maintains control of the dialogue, produces prompts at each dialogue state, recognizes (or rejects) specific words and phrases in response to the prompt, and produces actions based on the recognized response. The following is an example of an interaction with a basic finite state-based system in which the system verifies the user’s input at each state of the dialogue:

System: What is your destination?
User: London.
System: Was that London?
User: Yes.
System: What day do you want to travel?
User: Friday.
System: Was that Sunday?
User: No.
System: What day do you want to travel?

Generally systems with state-based dialogue control restrict the user’s input to single words or phrases that provide responses to carefully designed system prompts. A major advantage of this form of dialogue control is that the required vocabulary and grammar for each state can be specified in advance, resulting in more constrained speech recognition and language understanding. However, the disadvantage is that these systems restrict the user’s input to predetermined words and phrases, making correction of misrecognized items difficult as well as inhibiting the user’s opportunity to take the initiative and ask questions or introduce new topics. If augmented with a natural language component, a state-based system can accept more natural input in the form of sentences or partial sentences. Furthermore, with a simple data structure such as a form or frame that keeps track of which information the user has provided and what the system still needs to know, a more flexible and more natural dialogue flow is possible. Verification can also be delayed until the system has gathered all the required information. The Nuance demo banking system to be described in Section 3 is an example of a state-based system with these additional functionalities.
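To make the state-based control strategy concrete, the following minimal sketch (in Python) walks a caller through a fixed sequence of states and verifies each value before moving on, as in the example dialogue above. The states, prompts, and console-based recognize() stub are illustrative assumptions rather than the design of any particular commercial system.

    # Minimal sketch of finite state-based dialogue control with explicit
    # verification at each state. Prompts, states, and the recognize() stub
    # are illustrative assumptions.

    def recognize(prompt):
        """Stub recognizer: read the 'user' utterance from the console."""
        return input(prompt + " ").strip()

    # Each state pairs a slot with its prompt; the fixed order of the list
    # is the predetermined path through the dialogue graph.
    STATES = [
        ("destination", "What is your destination?"),
        ("day", "What day do you want to travel?"),
    ]

    def run_dialogue():
        values = {}
        for slot, prompt in STATES:
            while True:
                answer = recognize(prompt)
                confirm = recognize("Was that %s?" % answer)  # verify this state
                if confirm.lower().startswith("y"):
                    values[slot] = answer
                    break          # transition to the next dialogue state
                # otherwise stay in the same state and re-prompt
        return values

    if __name__ == "__main__":
        print(run_dialogue())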

2.2. Frame-Based Systems

In a frame- (or template-) based system the user is asked questions that enable the system to fill slots in a template in order to perform a task such as providing train timetable information. In this type of system the dialogue flow is not predetermined but depends on the content of the user’s input and the information that the system has to elicit. For example:

System: What is your destination?
User: London.
System: What day do you want to travel?
User: Friday.

System: What is your destination?
User: London on Friday around 10 in the morning.
System: I have the following connection. . . .

In the first example the user provides one item of information at a time and the system performs rather like a state-based system. However, if the user provides more than the requested information, as in the second example, the system can accept this information and check if any additional items of information are required before searching the database for a connection. Frame-based systems function like production systems, taking a particular action based on the current state of affairs. The questions and other prompts that the system should ask can be listed, along with the conditions that have to be true for a particular question or prompt to be relevant. Some form of natural language input is required by frame-based systems to permit the user to respond more flexibly to the system’s prompts, as in the second example. Natural language is also required to correct errors of recognition or understanding by the system. Generally, however, it is sufficient for the system to be able to recognize the main concepts in the user’s utterance. The Philips train timetable information system, to be described in Section 3, is an example of a frame-based system.
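The following minimal sketch illustrates the frame-based strategy just described: a template of slots drives the dialogue, a toy concept-spotting step picks out any values the user volunteers, and the next question is always chosen from whatever slots remain empty. The slot names, prompts, and keyword lists are illustrative assumptions, not a description of the Philips system.

    # Minimal sketch of frame-based dialogue control: any slots the user
    # volunteers are filled regardless of which question was asked.
    # Slot names, prompts, and the keyword-based concept spotter are
    # illustrative assumptions.

    FRAME = {"destination": None, "day": None, "time": None}

    PROMPTS = {
        "destination": "What is your destination?",
        "day": "What day do you want to travel?",
        "time": "At what time do you want to travel?",
    }

    CITIES = {"london", "manchester", "glasgow"}
    DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"}

    def spot_concepts(utterance):
        """Toy concept spotting: pick out any slot values mentioned."""
        found = {}
        for word in utterance.lower().replace(",", " ").split():
            if word in CITIES:
                found["destination"] = word
            elif word in DAYS:
                found["day"] = word
            elif word.isdigit():
                found["time"] = word
        return found

    def next_question(frame):
        """The flow is not predetermined: ask about any slot still empty."""
        for slot, value in frame.items():
            if value is None:
                return slot
        return None

    def run_dialogue():
        frame = dict(FRAME)
        while (slot := next_question(frame)) is not None:
            answer = input(PROMPTS[slot] + " ")
            frame.update(spot_concepts(answer))  # accept over-informative answers
        print("Query:", frame)                   # now build the database query

    if __name__ == "__main__":
        run_dialogue()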

2.3. Agent-Based Systems

Agent-based or AI systems are designed to permit complex communication between the system, the user, and the underlying application in order to solve some problem or task. There are many variants on agent-based systems, depending on what particular aspects of intelligent behavior are included in the system. The following dialogue, taken from Sadek and de Mori [1998], illustrates a dialogue agent that engages in mixed-initiative cooperative dialogue with the user:

User: I’m looking for a job in the Calais area. Are there any servers?
System: No, there aren’t any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?

In this example the system’s answer to the user’s request is negative. But rather than simply responding No, the system attempts to provide a more cooperative response that might address the user’s needs.

In agent-based systems communication is viewed as interaction between two agents, each of which is capable of reasoning about its own actions and beliefs, and sometimes also about the actions and beliefs of the other agent. The dialogue model takes the preceding context into account, with the result that the dialogue evolves dynamically as a sequence of related steps that build on each other. Generally there are mechanisms for error detection and correction, and the system may use expectations to predict and interpret the user’s next utterances. These systems tend to be mixed initiative, which means that the user can take control of the dialogue, introduce new topics, or make contributions that are not constrained by the previous system prompts. For this reason the form of the user’s input cannot be determined in advance as consisting of a set number of words, phrases, or concepts, and, in the most complex systems, a sophisticated natural language understanding component is required to process the user’s utterances. The Circuit-Fix-It Shop system, to be presented in Section 3, is an example of one type of agent-based system. Other types will be discussed in Section 5.

2.4. Verification

In addition to the different levels of language understanding required by different types of dialogue system, there are also different methods for verifying the user’s input. In the most basic state-based systems, in which user input is restricted to single words or phrases elicited at each state of the dialogue, the simplest verification strategy involves the system confirming that the user’s words have been correctly recognized. The main choice is between confirmations associated with each state of the dialogue (i.e., every time a value is elicited the system verifies the value before moving on to the next state), or confirmations at a later point in the transaction. The latter option, which is illustrated in the example from the Nuance banking system in Section 3, provides for a more natural dialogue flow. The more natural input permitted in frame-based systems also makes possible a more flexible confirmation strategy in which the system can verify a value that has just been elicited and, within the same utterance, ask the next question. This strategy of implicit verification is illustrated in the example from the Philips train timetable information system in Section 3. Implicit verification provides for a more natural dialogue flow as well as a potentially shorter dialogue, and is made possible because the system is able to process the more complex user input that may arise when the user takes the initiative to correct the system’s misrecognitions and misunderstandings. Finally, in agent-based systems, more complex methods of verification (or “grounding”) are required, along with decisions as to how and when the grounding is to be achieved. Verification will be discussed in greater detail in Section 4.3.2, and some examples of verification strategies can be seen in the examples presented in Section 3.

2.5. Knowledge Sources for Dialogue Management

The dialogue manager may draw on a number of knowledge sources, which are sometimes referred to collectively as the dialogue model. A dialogue model might include the following types of knowledge relevant to dialogue management (a minimal data-structure sketch follows the list):

A dialogue history: A record of the dialogue so far in terms of the propositions that have been discussed and the entities that have been mentioned. This representation provides a basis for conceptual coherence and for the resolution of anaphora and ellipsis.

A task record: A representation of the information to be gathered in the dialogue. This record, often referred to as a form, template, or status graph, is used to determine what information has not yet been acquired (see Section 5.2). This record can also be used as a task memory [Aretoulaki and Ludwig 1999] for cases where a user wishes to change the values of some parameters, such as an earlier departure time, but does not need to repeat the whole dialogue to provide the other values that remain unchanged.

A world knowledge model: This model contains general background information that supports any commonsense reasoning required by the system, for example, that Christmas day is December 25.

A domain model: A model with specific information about the domain in question, for example, flight information.

A generic model of conversational competence: This includes knowledge of the principles of conversational turn-taking and discourse obligations—for example, that an appropriate response to a request for information is to supply the information or provide a reason for not supplying it.

A user model: This model may contain relatively stable information about the user that may be relevant to the dialogue—such as the user’s age, gender, and preferences—as well as information that changes over the course of the dialogue, such as the user’s goals, beliefs, and intentions.
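As a rough illustration of how these knowledge sources might be grouped in code, the sketch below collects them into a single dialogue-model object. The field names and types are assumptions made for the example; actual systems differ widely in which of these models they represent explicitly.

    # Sketch of the knowledge sources listed above grouped into one dialogue
    # model. Field names are illustrative assumptions only.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class DialogueHistory:
        propositions: List[str] = field(default_factory=list)  # what has been said
        entities: List[str] = field(default_factory=list)      # for anaphora resolution

    @dataclass
    class TaskRecord:
        # Form/template/status graph: which values are still missing?
        slots: Dict[str, Optional[str]] = field(default_factory=dict)

        def missing(self):
            return [s for s, v in self.slots.items() if v is None]

    @dataclass
    class UserModel:
        expertise: str = "novice"                        # relatively stable information
        preferences: Dict[str, str] = field(default_factory=dict)
        goals: List[str] = field(default_factory=list)   # changes during the dialogue

    @dataclass
    class DialogueModel:
        history: DialogueHistory = field(default_factory=DialogueHistory)
        task: TaskRecord = field(default_factory=TaskRecord)
        user: UserModel = field(default_factory=UserModel)
        domain: Dict[str, str] = field(default_factory=dict)  # domain model
        world: Dict[str, str] = field(default_factory=dict)   # world knowledge model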

These knowledge sources are used in different ways and to different degrees according to the dialogue strategy chosen. In the case of a state-based system these models, if they exist at all, are represented implicitly in the system. For example, the items of information and the sequence in which they are acquired are predetermined and thus represented implicitly in the dialogue states. Similarly, if there is a user model, it is likely to be simple and to consist of a small number of elements that determine the dialogue flow. For example, the system could have a mechanism for looking up user information to determine whether the user has previous experience of this system. This information could then be used to allow different paths through the system (for example, with less verbose instructions), or to address user preferences without having to ask for them.

Frame-based systems require an explicit task model as this information is used to determine what questions still need to be asked. This is the mechanism used by these systems to control the dialogue flow. Generally the user model, if one exists, would not need to be any more sophisticated than that described for state-based systems. Agent-based systems, on the other hand, require complex dialogue and user models as well as mechanisms for using these models as a basis for decisions on how to control the dialogue. Information about the dialogue history and the user can be used to constrain how the system interprets the user’s subsequent utterances and to determine what the system should say and how it should be said.


Table I. Dialogue Control Strategies

Input
  State-based: single words or phrases
  Frame-based: natural language with concept spotting
  Agent-based: unrestricted natural language

Verification
  State-based: explicit confirmation, either of each input or at the end of the transaction
  Frame-based: explicit and implicit confirmation
  Agent-based: grounding

Dialogue model
  State-based: information state represented implicitly in the dialogue states; dialogue control represented explicitly with a state diagram
  Frame-based: explicit representation of information states; dialogue control represented with a control algorithm
  Agent-based: model of the system's intentions, goals, and beliefs; dialogue history and context

User model
  State-based: simple model of user characteristics or preferences
  Frame-based: simple model of user characteristics or preferences
  Agent-based: model of the user's intentions, goals, and beliefs

These sorts of modeling involve representations of discourse structure, of intentions, goals, and beliefs, and of dialogue as a collaborative activity. Various approaches to dialogue and user modeling in agent-based systems will be exemplified and discussed in later sections of this survey.

2.6. Summary

This section has provided a classification of spoken dialogue systems in terms of the type of dialogue control strategy used and has shown how the choice of dialogue control strategy is closely related to the form of the user’s input as well as the use of particular error recovery mechanisms. These points are summarized in Table I. The next section presents three examples of spoken dialogue systems that illustrate these different control strategies.

3. EXAMPLES OF SPOKEN DIALOGUE SYSTEMS

3.1. State-Based Dialogue Control: The Nuance Automatic Banking System

The Nuance automatic banking system enables callers to conduct transactions over the telephone such as paying a bill, transferring funds between accounts, or obtaining an account balance. The system conducts a dialogue in English with the user, who can speak to the system using relatively unrestricted speech. It should be stressed that the following dialogues are based on recorded interactions with a demo system available in the UK (available at: +44 1223 516959), used here to illustrate some important aspects of state-based dialogue control. The shortcomings that are presented in some of the examples would have been addressed in a fully developed commercial application.

System: What would you like to do?
User: Pay a bill.
System: What company would you like to pay?
User: Midland Bank.
System: How much would you like to pay?
User: One hundred and twenty-five pounds fifty-seven pence.
System: What date would you like the payment to be made on?
User: Monday.
System: You want to pay HSBC one hundred fifty-five pounds and fifty-seven pence on April 5th? Is this correct?
User: Yes.
System: Your payment will be sent. Would you like to pay another bill?

In order to complete the transaction the system has to elicit three specific values from the caller—payee, amount, transfer date. These are confirmed and the transaction is completed with a return to the sub-menu for bill payments.


Fig. 1. Dialogue flow for payment of a bill.

Some points worth noting in this example are that the system is able to process phrases specifying currency amounts, is able to transform relative dates such as Monday into absolute dates such as April 5th, and is able to handle multiple synonyms (Midland Bank or HSBC).
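The date and synonym handling noted above can be illustrated with a small sketch; this is not the Nuance implementation, and the synonym table and the next-occurrence date policy are assumptions made purely for illustration.

    # Sketch of the normalization described above: map a relative day name
    # to an absolute date and collapse payee synonyms onto a canonical name.

    import datetime

    PAYEE_SYNONYMS = {"midland bank": "HSBC", "hsbc": "HSBC"}

    WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]

    def canonical_payee(name):
        return PAYEE_SYNONYMS.get(name.lower().strip(), name)

    def resolve_weekday(word, today=None):
        """Map e.g. 'Monday' to the date of the next Monday on or after today."""
        today = today or datetime.date.today()
        target = WEEKDAYS.index(word.lower())
        ahead = (target - today.weekday()) % 7
        return today + datetime.timedelta(days=ahead)

    # Example: canonical_payee("Midland Bank") -> "HSBC"
    #          resolve_weekday("Monday")       -> date of the coming Monday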

The dialogue flow, which is system-directed with a predetermined sequence of questions, can be modeled as a set of interconnected dialogue states, as shown in Figure 1.

Transferring amounts between accounts involves a different type of transaction in which the system can accept more complex natural language input from the caller, thus reducing the number of questions to be asked and the time required for the transaction:

System: What would you like to do?
User: I would like to transfer two hundred and fifty pounds from my current to my savings account.
System: You want to transfer two hundred fifty pounds from current to savings. Is this correct?
User: Yes.

This transaction is also modeled using dialogue states combined with multiple slot filling. Additional transitions need to be provided depending on which combinations of the values of amount, source account, and destination account are provided by the user in the natural language input. Problems can arise if this functionality is not available throughout the dialogue, as illustrated in the following excerpt:

System1: What company would you like to pay?
User1: Abbey National.
System2: How much would you like to pay?
User2: One hundred pounds next Monday.
System3: What date would you like the payment to be made on?
User3: Next Monday.
System4: You want to pay Abbey National one hundred pounds and sixty-one pence on April 5th? Is this correct?
User4: No.
System5: Would you like to change the company, amount, or date?
User5: Amount.
System6: How much would you like to pay?
User6: One hundred pounds.
System7: You want to pay Abbey National one hundred pounds on April 5th? Is this correct?
User7: Yes.

In User2 the user provides an overinformative answer, specifying not only the amount requested in System2 but also the date of the transaction. The system’s next question (System3) indicates either that the date was misrecognized or that the system is not set up for multiple slot filling at this dialogue state. Whatever the reason, the system goes on to ask the question required at the next dialogue state, that is, for the date of the transaction. Thus the user is requested to supply information that had just been given in the previous response. Moreover, as shown in System4, the system has recognized next Monday not as a date but as part of the amount, that is, as and sixty-one pence. As shown in System5–User7, the system is able to recover from this error by getting the user to reconfirm the amount.

A final example illustrates some limitations on the system’s mechanisms for error recovery:

System: Would you like to change the company, amount, or date?
User: The company and the date.
System: Please say company, amount, or date.

From this example it can be seen that the system is unable to correct more than one error at a time. Thus, while the strategy of leaving the confirmation of the three items of information until they have all been elicited will normally result in a more natural and faster transaction, it may lead to a frustrating and rather unnatural dialogue where more than one item has to be changed.

The Nuance banking system is an advanced spoken dialogue system that employs state-based dialogue control in conjunction with natural language understanding and multiple slot filling. Given good recognition rates the system is likely to perform well and complete transactions efficiently and effectively. These sample interactions were chosen carefully to illustrate some potential problems that are likely to arise when a state-based system is extended to provide more sophisticated functionality. The examples would appear to indicate that the multiple slot filling afforded by the system’s natural language processing is not available at each dialogue state. Although this can be easily remedied, a further complication is introduced, as now the system has to maintain a dialogue model of which slots have been filled so as to direct the dialogue down alternative paths. With small numbers of slots this is not a problem, but with an increasing number of slots the combinations of state transitions soon reach unmanageable proportions. This is particularly the case when further transitions are added to deal with verification of the slot values. Thus, while state-based dialogue control is useful for implementing simple transactions, introducing more sophisticated functionality may require additional mechanisms, such as an explicit dialogue model, that obscure the semantics of the state-based control. This issue will be discussed in greater detail in Section 5. The next example will show how a frame-based system is able to handle these issues more effectively.

3.2. Frame-Based Dialogue Control: The Philips Automatic Train Timetable Information System

The Philips Automatic Train Timetable Information System (available at: +49 241 604020) provides information over the telephone about train connections between 1,200 German cities. The system conducts a natural dialogue in German with the user, who can speak to the system using fluent, unrestricted speech. The aim of the dialogue is to enable the system to construct an appropriate database query that retrieves the information required by the user and then to present that information in a suitable form. This system falls within a class of similar systems concerned with the provision of information, such as travel schedules, banking, stocks and shares, and weather forecasts. The German version of the system is a research prototype, which is not connected with the official information service provided by German Rail (Deutsche Bahn AG). The following is an example of a dialogue with the Philips Automatic Train Timetable Information System, translated from the original German [Aust et al. 1995]:

System1: Good morning. This is the automatic timetable information service. How can I help you?
User1: I need a connection from Hamburg to Munich.
System2: When would you like to go from Hamburg to Munich?
User2: Next Friday at eight o’clock.
System3: So you want to travel on Sunday the 20th of November at eight o’clock a.m.?
User3: No, on Friday.
System4: So you want to travel on Friday the 18th of November at eight o’clock a.m.?
User4: Yes.
System5: There is the following connection: With IntercityExpress 793, departure from Hamburg at 8:02, arrival in Munich at 14:05. Do you want me to repeat the connection?
User5: No, but I need a train for the return trip on Sunday.
System6: At what time would you like to go from Munich to Hamburg on Sunday the 20th of November?
User6: I have to arrive in Hamburg late in the evening.
System7: So you want to arrive in Hamburg between nine p.m. and eleven p.m.?
User7: Yes.
System8: There are three connections. This is the first one: With IntercityExpress 880, departure from Munich at 15:54, arrival in Hamburg at 21:58. Do you want the next connection?
User8: No, thanks, that one is fine.
System9: Do you need yet another connection?
User9: No, thank you.
System10: We wish you a pleasant journey. Thank you for calling. Good bye.

It is informative to examine the dialogue flow in this example. The purpose of the system is to guide the user toward providing all the required values—departure city, destination city, departure time, and departure date, as well as ascertaining other values, such as whether the time is a.m. or p.m. and whether the time refers to the departure or arrival. These values are acquired over the course of the dialogue, and once they have been acquired the system can execute a suitably constrained database query and provide one or more connections.

This system could have been implemented using state-based control, in which the system would progress through a number of states eliciting and verifying the required values. Alternatively, the system could have been implemented as a question-answer or natural language database system in which the user would have been required to provide all the required parameters in one utterance, which could then have been translated directly into a database query. However, one of the aims of the designers of this system was to enable a more flexible interaction in which the user would not be constrained either to input one value at a time or to input all the values within one utterance. This flexibility is necessary as it cannot be determined in advance what a user might know regarding the information required to make a valid query. For example, the system may need to know if the user wishes to travel on an Intercity train, requires a train with restaurant facilities, and so on. If the user is not aware of all the possibilities, the system has to issue relevant queries and elicit suitable values in order to find the best connection.

A second aspect of dialogue flow concerns the sequencing of the system’s questions. There should be a logical order to the questions. This order may be largely determined by what information is to be elicited in a well-structured task such as a travel information enquiry. The disadvantage of a state-based approach combined with natural language processing capabilities is that users may produce overinformative answers that provide more information than the system has requested at that point. In the Philips example at System2–User2, the system’s more open-ended prompt When would you like to go from Hamburg to Munich? is ambiguous in that it can allow the user to supply departure time or date or both—as happens in User2. Even with a more constrained prompt such as On which day would you like to go from Hamburg to Munich? the user might supply both date and time. A system that followed a predetermined sequence of questions might then ask At what time would you like to go from Hamburg to Munich?—an unacceptable question, as the time has already been given. The Philips system uses a status graph to keep track of which slots have already been filled. This mechanism will be described in greater detail in Section 5.

A close examination of the dialogue also shows that the system is able to deal with recognition errors and misunderstandings. For example, in System3 the system attempts to confirm the departure date and time but has misrecognized the departure date and is corrected by the user in User3. More subtly, the system uses different strategies for confirmation. In System2 an implicit confirmation request is used, in which the values for departure city and destination provided by the user in User1 are echoed back within the system’s next question, which also includes a request for the value for the departure date and/or time. If the system’s interpretation is correct, the dialogue can proceed smoothly to the next value to be obtained and the user does not have to confirm the previous values. Otherwise, if the system has misunderstood the input, the user can correct the values before answering the next question. Conversely, an explicit confirmation request halts the dialogue flow and requires an explicit confirmation from the user. An example occurs in System3–User3, in which the system makes an explicit request for the confirmation of the departure date and time and the user corrects the date. The next exchange, System4–User4, is a further example of an explicit confirmation request to verify the departure date and time.
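The contrast between the two confirmation strategies can be sketched as follows; the prompt templates are loosely modeled on the translated dialogue above and are illustrative assumptions rather than the actual Philips prompts.

    # Sketch contrasting implicit and explicit confirmation. The templates
    # are illustrative assumptions, not the Philips system's prompts.

    def implicit_confirmation(confirmed, next_prompt):
        """Echo the just-elicited values inside the next question, e.g.
        'When would you like to go from Hamburg to Munich?'."""
        return next_prompt.format(**confirmed)

    def explicit_confirmation(confirmed):
        """Halt the flow and require a yes/no answer before continuing."""
        return "So you want to travel on {date} at {time}?".format(**confirmed)

    # Example usage
    print(implicit_confirmation({"origin": "Hamburg", "dest": "Munich"},
                                "When would you like to go from {origin} to {dest}?"))
    print(explicit_confirmation({"date": "Friday the 18th of November",
                                 "time": "eight o'clock a.m."}))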

One further aspect of the Philips system is its robustness. An example can be seen at System6–User6. In response to the system prompt for the departure time, the user does not provide a direct response containing the required time but states a constraint on the arrival time, expressed vaguely as late in the evening. The system is able to interpret this expression in terms of a range (between 9 p.m. and 11 p.m.) and to find an appropriate departure time that meets this constraint. More generally, the system is robust enough to be able to handle a range of different expressions for dates and times (e.g., three days before Christmas, within this month) and to be able to deal with cases of missing and contradictory information.

The provision of information such as train times is a typical application of spoken dialogue technology. Philips has developed a system with similar functionality for Swiss Rail, which has been an official part of Swiss Rail’s information service since 1996. Public reaction to the system has been favorable, with over 80% of the people who used the service rating it as “excellent.” Strik et al. [1996] reported on a project involving adaptation of the German system to the Dutch public transport network, while the European R&D project ARISE, which includes the Dutch, French, and Italian railway operators, builds on earlier European projects and on the Philips systems to provide more elaborate services with a multilingual component. The strengths and weaknesses of frame-based dialogue control are discussed further in Section 5.

3.3. Agent-Based Dialogue Control for Problem Solving—The Circuit-Fix-It Shop System

Problem-solving tasks usually involve some degree of cooperation between agents, particularly when one of the agents is unable to solve the problem alone. Cooperation may take the form of assigning particular subtasks to particular agents, with the accomplishment of the subtasks contributing to the completion of the overall task. Such cooperative problem solving will require communication between the agents to report on the state of play of the problem solving and to supply the information required by the other agent. The next example illustrates collaborative problem solving in the Circuit-Fix-It Shop system [Smith and Hipp 1994]. Other agent-based systems will be discussed in Section 5.

Fig. 2. Goal tree before utterance System1.

The Circuit-Fix-It Shop system helps users to fix an electronic circuit by engaging in a spoken dialogue with the user. Communication between system and user is motivated by the fact that neither participant is fully equipped to solve the problem alone. The system has all the necessary knowledge about the task but does not have the required sensory abilities to monitor the state of the world or to perform actions in the world. These are the concern of the human user, who also has some knowledge of the task, which can range from that of a novice to that of an expert, depending on the individual user. The system employs user modeling—estimating the user’s knowledge of particular objects and actions based on inferences from earlier exchanges in the dialogue—to determine the content of the information to be communicated to the user. Context-dependent expectations are also used to correct speech recognition hypotheses and to track when the user shifts to a different subdialogue. The following is an example of a dialogue between a user and the system:

System1: Is there a wire between connector 84 and connector 99?
User1: It is not there.
System2: Add a wire between connector 84 and connector 99.
User2: I need help.
System3: Obtain a wire.
User3: Done.
System4: Connect metal of one free end of the wire to connector 84.
User4: How?
System5: Bend back connector 84.
User5: Done.
System6: Insert free end of wire into connector 84.
User6: Wire between 84 and 99 is connecting.
System7: What is the LED displaying?

The dialogue evolves in the form of a proof, which can be illustrated using a goal tree. The goal tree represents the solution of the problem as it evolves dynamically. The system invokes rules to prove the goal in a top-down fashion—as in Prolog-style theorem proving. The proof may succeed using internally available knowledge, in which case no dialogue is required. However, the system is designed to deal with cases where the proof fails because the information required to complete the proof is not available to the system. In this case the system engages in dialogue with the user to obtain the missing information (described as “missing axioms”) so that the proof can succeed.
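A much simplified sketch of this “missing axiom” mechanism is given below: the system tries to prove a goal top-down from rules and known facts, and whenever a needed fact is absent it asks the user and adds the answer to its knowledge base. The goals, rules, and questions are illustrative assumptions and do not reproduce the Circuit-Fix-It Shop implementation.

    # Sketch of dialogue driven by theorem proving with "missing axioms".
    # Rules, goals, and questions are illustrative assumptions.

    facts = {}   # axioms the system already knows

    rules = {    # goal -> subgoals that together prove it
        "circuit_ok": ["wire_84_99_present", "led_ok"],
    }

    questions = {   # questions that can supply a missing axiom
        "wire_84_99_present": "Is there a wire between connector 84 and connector 99?",
        "led_ok": "Is the LED displaying the correct pattern?",
    }

    def prove(goal):
        """Prove goal top-down, Prolog-style; when an axiom is missing,
        ask the user and add the answer to the knowledge base."""
        if goal in facts:
            return facts[goal]
        if goal in rules:
            result = all(prove(sub) for sub in rules[goal])
        else:                                     # missing axiom: ask the user
            answer = input(questions[goal] + " ")
            result = answer.strip().lower().startswith("y")
        facts[goal] = result
        return result

    if __name__ == "__main__":
        print("Proof succeeded:" if prove("circuit_ok") else "Proof failed:", facts)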

At the beginning of the dialogue, the system does not know whether there is a wire between connector 84 and connector 99. As this is a missing axiom in the current proof, the system produces utterance System1 to ask the user. The state of the proof at this point is shown in the goal tree displayed in Figure 2. The user confirms that the wire is missing. From this the system can infer that the user knows the location of the connectors, and these facts are added to the user model. Figure 3 shows the current state of the goal tree. So that the current goal can be completed, the system instructs the user to add a wire between the connectors. This yields the goal tree shown in Figure 4. As the user does not know how to do this, a subgoal is inserted instructing the user on how to accomplish this task. This subgoal consists of the actions: locate connector 84, locate connector 99, obtain a wire, connect one end of the wire to 84, and connect the other end of the wire to 99. These items are added to the goal tree depicted in Figure 5. However, as the user model contains the information that the user can locate these connectors, instructions for the first two actions are not required, and so the system proceeds with instructions for the third action, which is confirmed in User3, and for the fourth action. Here the user requires further instructions, which are given in System5, with the action confirmed in User5. At this point the user asserts that the wire between 84 and 99 is connecting, so that the fifth instruction, to connect the second end to 99, is not required. A further missing axiom is discovered, which leads the system to ask what the LED is displaying (System7).

Fig. 3. Goal tree after utterance User1.
Fig. 4. Goal tree after utterance System2.
Fig. 5. Goal tree after utterance User2.

3.4. Summary

The examples presented in this section have illustrated three different types of dialogue control strategy. The selection of a dialogue control strategy determines the degree of flexibility possible in the dialogue and places requirements on the technologies employed for processing the user’s input and for correcting errors. There are many variations on the dialogue strategies illustrated here, and these will be discussed in greater detail in Section 5. The next section will examine the component technologies of spoken dialogue systems.

4. COMPONENTS OF A SPOKEN DIALOGUE SYSTEM

A spoken dialogue system involves the integration of a number of components that typically provide the following functionalities [Wyard et al. 1996]:

Speech recognition: The conversion of an input speech utterance, consisting of a sequence of acoustic-phonetic parameters, into a string of words.

Language understanding: The analysis of this string of words with the aim of producing a meaning representation for the recognized utterance that can be used by the dialogue management component.

Dialogue management: The control of the interaction between the system and the user, including the coordination of the other components of the system.

Communication with an external system: For example, with a database system, expert system, or other computer application.

Response generation: The specification of the message to be output by the system.

Speech output: The use of text-to-speech synthesis or prerecorded speech to output the system’s message.

These components are examined in the following subsections in relation to their role in a spoken dialogue system (for a recent text on speech and language processing, see Jurafsky and Martin [2000]).
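Before looking at the components individually, the following sketch shows one way the six functionalities might be chained together for a single dialogue turn. Every function is a stub standing in for a real component, and all names, signatures, and return values are assumptions made for illustration.

    # Sketch of one dialogue turn through the six components. Every function
    # is a stub; names and return values are illustrative assumptions.

    def recognize_speech(audio):            # speech recognition
        return "i want to fly to london on friday"

    def understand(words):                  # language understanding
        return {"intent": "book_flight", "destination": "london", "day": "friday"}

    def manage_dialogue(meaning, state):    # dialogue management
        state.update(meaning)
        if "time" not in state:
            return state, ("ask", "time")
        return state, ("query", state)

    def query_backend(request):             # communication with external system
        return [{"flight": "BA123", "departs": "09:40"}]

    def generate_response(action, data=None):   # response generation
        if action == "ask":
            return "At what time would you like to travel?"
        return "The first flight is %s, departing at %s." % (data[0]["flight"],
                                                             data[0]["departs"])

    def speak(text):                        # speech output (TTS or recorded prompts)
        print("SYSTEM:", text)

    def one_turn(audio, state):
        words = recognize_speech(audio)
        meaning = understand(words)
        state, (action, payload) = manage_dialogue(meaning, state)
        data = query_backend(payload) if action == "query" else None
        speak(generate_response(action, data))
        return state

    if __name__ == "__main__":
        one_turn(b"...", {})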

4.1. Speech Recognition

The task of the speech recognition component of a spoken dialogue system is to convert the user’s input utterance, which consists of a continuous-time signal, into a sequence of discrete units such as phonemes (units of sound) or words. One major obstacle is the high degree of variability in the speech signal. This variability arises from the following factors:

Linguistic variability: Effects on the speech signal caused by various linguistic phenomena. One example is coarticulation, that is, the fact that the same phoneme can have different acoustic realizations in different contexts, determined by the phonemes preceding and following the sound in question.

Speaker variability: Differences between speakers, attributable to physical factors such as the shape of the vocal tract as well as factors such as age, gender, and regional origin; and differences within speakers, due to the fact that even the same words spoken on a different occasion by the same speaker tend to differ in terms of their acoustic properties. Physical factors such as tiredness, congested airways due to a cold, and changes of mood have a bearing on how words are pronounced, but the location of a word within a sentence and the degree of emphasis it is given are also factors which result in intraspeaker variability.


Channel variability: The effects of background noise, which can be either constant or transient, and of the transmission channel, such as the telephone network or a microphone.

The speech recognition component of a typical spoken dialogue application has to be able to cope with the following additional factors:

Speaker independence: As the application will normally be used by a wide variety of casual users, the recognizer cannot be trained on an individual speaker (or small number of speakers) who will use the system, as is the case for dictation systems; instead, for speaker-independent recognition, samples have to be collected from a variety of speakers whose speech patterns should be representative of the potential users of the system. Speaker-independent recognition is more error-prone than speaker-dependent recognition.

Vocabulary size: The size of the vocabulary varies with the application and with the particular design of the dialogue system. Thus a carefully controlled dialogue may constrain the user to a vocabulary limited to a few words expressing the options that are available in the system, while in a more flexible system the vocabulary may amount to more than a thousand words.

Continuous speech: Users of spoken dialogue systems expect to be able to speak normally to the system and not, for example, in the isolated-word mode employed in some dictation systems. However, it is difficult to determine word boundaries in continuous speech since there is no physical separation in the continuous-time speech signal.

Spontaneous conversational speech: Since the speech that is input to a spoken dialogue system is normally spontaneous and unplanned, it is typically characterized by disfluencies, such as hesitations and fillers (for example, amm and er), false starts, in which the speaker begins one structure then breaks off midway and starts again, and extralinguistic phenomena such as coughing. The speech recognizer has to be able to extract from the speech signal a sequence of words from which the speaker’s intended meaning can be computed.

The basic process of speech recognition involves finding a sequence of words, using a set of models acquired in a prior training phase, and matching these with the incoming speech signal that constitutes the user’s utterance. The models may be word models, in the case of systems with a small vocabulary, but more typically the models are of units of sound such as phonemes or triphones, which model a sound as well as its context in terms of the preceding and succeeding sounds. The most successful approaches view this pattern matching as a probabilistic process which has to be able to account both for temporal variability—due to different durations of the sounds resulting from differences in speaking rate and the inherently inexact nature of human speech—and acoustic variability—due to the linguistic, speaker, and channel factors described earlier. The following formula expresses this process:

W* = argmax_W P(O | W) P(W)

In this formula W* represents the word sequence with the maximum a posteriori probability, while O represents the observation that is derived from the speech signal. Two probabilities are involved: P(O | W), known as the acoustic model, which has been derived through a training process and which is the probability that a sequence of words W will produce an observation O; and a language model P(W), derived from an analysis of a language corpus, giving the prior probability distribution assigned to the sequence of words W.

The observation O comprises a series of vectors representing acoustic features of the speech signal. These feature vectors are derived from the physical signal, which is sampled and then digitally encoded. Perceptually important speaker-independent features are extracted and redundant features are discarded.


Fig. 6. A simple hidden Markov model.

Acoustic modeling is a process of mapping from the continuous speech signal to the discrete sounds of the words to be recognized. The acoustic model of a word is represented in hidden Markov models (HMMs), as in Figure 6. Each state in the HMM might represent a unit of sound, for example, the three sounds in the word dog. Transitions between each state, A = a_12 a_13 ... a_n1 ... a_nn, represent the probability of transitioning from one state to the next and model the temporal progression of the speech sounds. Due to variability in the duration of the sounds, a sound may spread across several frames, so that the model can take a loop transition and remain in the same state. For example, if there were five frames for the word dog, the state sequence S1, S1, S2, S2, S3 might be produced, reflecting the longer duration of the sounds representing d and o. A hidden Markov model is doubly stochastic, as in addition to the transition probabilities the output of each state, B = b_i(o_t), is probabilistic. Instead of each state having a single unit of sound as output, all units of sound are potentially associated with each state, each with its own probability. The model is "hidden" because, given a particular sequence of output symbols, it is not possible to determine which sequence of states produced these output symbols. It is, however, possible to determine the sequence of states that has the highest probability of having generated a particular output sequence. In theory this would require a procedure that would examine all possible state sequences and compute their probabilities. In practice, because of the Markov assumption that being in a given state depends only on the previous state, an efficient dynamic programming procedure such as the Viterbi algorithm or A∗ decoding can be used to reduce the search space. If a state sequence is viewed as a path through a state-time lattice, at each point in the lattice only the path with the highest probability is selected.
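The following sketch illustrates Viterbi decoding for a small left-to-right HMM of the kind shown in Figure 6; the states, transition probabilities, and discrete observation symbols are invented for illustration and do not come from any system described here:

import math

# A minimal Viterbi sketch (illustrative numbers) for a three-state
# left-to-right HMM like the /d/-/o/-/g/ model in Figure 6. States may loop
# on themselves, modeling sounds that span several frames.

states = ["d", "o", "g"]
log = math.log
# Transition log-probabilities: stay in a state or move to the next one.
trans = {
    ("d", "d"): log(0.4), ("d", "o"): log(0.6),
    ("o", "o"): log(0.5), ("o", "g"): log(0.5),
    ("g", "g"): log(1.0),
}
# Output log-probabilities b_i(o_t) over a toy discrete observation alphabet.
emit = {
    "d": {"x": log(0.7), "y": log(0.2), "z": log(0.1)},
    "o": {"x": log(0.2), "y": log(0.7), "z": log(0.1)},
    "g": {"x": log(0.1), "y": log(0.2), "z": log(0.7)},
}

def viterbi(observations):
    """Return the most probable state sequence for the observation frames."""
    NEG_INF = float("-inf")
    # delta[s] = best log-probability of any path ending in state s; psi backtracks.
    delta = {s: (0.0 if s == "d" else NEG_INF) + emit[s][observations[0]]
             for s in states}
    psi = []
    for obs in observations[1:]:
        new_delta, back = {}, {}
        for s in states:
            best_prev, best_score = None, NEG_INF
            for p in states:
                score = delta[p] + trans.get((p, s), NEG_INF)
                if score > best_score:
                    best_prev, best_score = p, score
            new_delta[s] = best_score + emit[s][obs]
            back[s] = best_prev
        delta, psi = new_delta, psi + [back]
    # Backtrack from the best final state.
    last = max(delta, key=delta.get)
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    # Five frames, as in the "dog" example: d and o each stretch over two frames.
    print(viterbi(["x", "x", "y", "y", "z"]))  # ['d', 'd', 'o', 'o', 'g']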

The output of the acoustic modeling stage is a set of word hypotheses which can be examined to find the best word sequence, using a language model P(W). The language model contains knowledge about which words are more likely in a given sequence. Two types of model are possible. A finite state network predicts all the possible word sequences in the language model. This approach is useful if all the phrases that are likely to occur in the speech input can be specified in advance. The disadvantage is that perfectly legal strings that were not anticipated are ruled out. Finite state networks can be used to parse well-defined sequences such as expressions of time.

Alternatively, an N-gram model can be used. The use of N-grams involves computing the probability of a sequence of words as a product of the probabilities of each word, assuming that the occurrence of each word is determined by the preceding N − 1 words. This relationship is expressed in the formula

P(W) = P(w_1, ..., w_N) = ∏_{n=1}^{N} P(w_n | w_1, ..., w_{n−1}).

However, because of the high computational cost involved in calculating the probability of a word given a large number of preceding words, N-grams are usually reduced to bigrams (N = 2) or trigrams (N = 3). Thus in a bigram model P(w_i | w_{i−1}) the probability of all possible next words is based only on the current word, while in a trigram model P(w_i | w_{i−2}, w_{i−1}) it is based on two preceding words. N-gram models may also be based on classes rather than words, i.e., the words are grouped into classes representing either syntactic categories, such as noun or verb, or semantic categories, such as days of the week or names of airports. A language model reduces the perplexity of a system, which will usually result in greater recognition accuracy. Perplexity is roughly defined as the average branching factor, or average number of words, that might follow a given word. If the perplexity is low, recognition is likely to be more accurate as the search space is reduced.
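As an illustration of how such a model is estimated and how perplexity is measured, the following sketch builds a bigram model from a tiny corpus; the add-one smoothing and the toy training sentences are assumptions made for the example, not part of any system discussed in this article:

import math
from collections import Counter

# A minimal bigram language-model sketch with add-one (Laplace) smoothing and
# the usual perplexity computation over a held-out word sequence.

train = ("show flights from boston to denver </s> "
         "show flights from denver to boston </s>").split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
vocab = set(train)

def p_bigram(w_prev, w):
    # P(w | w_prev) with add-one smoothing.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

def perplexity(words):
    log_prob = sum(math.log(p_bigram(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

if __name__ == "__main__":
    test = "show flights from boston to denver </s>".split()
    print("perplexity:", round(perplexity(test), 2))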

The output of the speech recognizer may be a number of scored alternatives, as in the following example representing the recognizer's best guesses for the input string what time does the flight leave? [Wyard et al. 1996]:

(1) what time does the white leaf 1245.6
(2) what time does the flight leave 1250.1
(3) what time does a flight leave 1252.3
(4) what time did the flight leave 1270.1
(5) what time did a flight leave 1272.3

Sometimes there are only small differences between the alternatives, caused by one or two words that may not contribute to the meaning of the string. For this reason, the alternatives can be more economically represented in a directed graph or as a word lattice. The selection of the most likely sequence may be the responsibility of other system components. For example, if the domain of the dialogue system is flight enquiries, then the first sequence, which had the best score from the speech recognizer, would be discarded as contextually irrelevant. Similarly, dialogue information would assist the choice between 2 and 3, which ask about a flight departure that has not yet taken place, and 4 and 5, which ask about some departure that has already happened.

As an alternative to returning the complete sequence of words that matches the acoustic signal, the recognizer can search for key words. This technique is known as word spotting. Word spotting is useful for dealing with extraneous elements in the input, for example, detecting yes in the string well, uh, yes, that's right. The main difficulty for word spotting is to detect non-key-word speech. One method is to train the system with a variety of non-key-word examples, known as sink (or garbage) models. A word-spotting grammar network can then be specified that allows any sequence of sink models in combination with the key words to be recognized.

Users of spoken dialogue systems are generally constrained to having to wait until the system has completed its output before they can begin speaking. Once users are familiar with a system, they may wish to speed up the dialogue by interrupting the system. This is known as barge-in. The difficulty with simultaneous speech, which is common in human-human conversation, is that the incoming speech becomes corrupted with echo from the ongoing prompt, thus affecting the recognition. Various techniques are under development to facilitate barge-in.

4.1.1. Summary. This section has outlined the main characteristics of the speech recognition process, describing the uncertain and probabilistic nature of this process, in order to clarify the requirements that are put on the other system components. In a linear architecture the output of the speech recognizer provides the input to the language understanding module. Difficulties may arise for this component if the word sequence that is output does not constitute a legal sentence, as specified by the component's grammar. In any case, the design of the language understanding component needs to take account of the nature of the output from the speech recognition module. Similarly, in an architecture in which the dialogue management component interacts with each of the other components, one of the roles of dialogue management will be to monitor when the user's utterance has not been reliably recognized and to devise appropriate remedial steps. These issues will be discussed in greater detail in subsequent sections. For more extensive accounts of speech recognition, see, for example, Rabiner and Juang [1993] and Young and Bloothooft [1997]. For tutorial overviews, see Makhoul and Schwartz [1995] and Power [1996].

4.2. Language Understanding

The role of the language understanding component is to analyze the output of the speech recognition component and to derive a meaning representation that can be used by the dialogue control component. Language understanding involves syntactic analysis, to determine the constituent structure of the recognized string (i.e., how the words group together), and semantic analysis, to determine the meanings of the constituents. These two processes may be kept separate at the representational level in order to maintain generalizability to other domains, but they tend to be combined during processing for reasons of efficiency. On the other hand, some approaches to language understanding may involve little or no syntactic processing and derive a semantic representation directly from the recognized string. The advantages and disadvantages of these approaches, and the particular problems involved in the processing of spoken language, will be reviewed in this section.

The theoretical foundations for language processing are to be found in linguistics, psychology, and computational linguistics. Current grammatical formalisms in computational linguistics share a number of key characteristics, of which the main ingredient is a feature-based description of grammatical units, such as words, phrases, and sentences [Uszkoreit and Zaenen 1996]. These feature-based formalisms are similar to those used in knowledge representation research and data type theory.

Feature terms are sets of attribute-value pairs in which the values can be atomic symbols or further feature terms. Feature terms belong to types, which may be organized in a type hierarchy or as disjunctive terms, functional constraints, or sets. The following simple example shows a feature-based representation for the words lions, roar, and roars as well as a simple grammar using the PATR-II formalism [Shieber 1986] that defines how the words can be combined in a well-formed sentence:

lexicon
lions: [cat:NP, head: [agreement: [number:plural, person:third]]]
roar: [cat:V, head: [form: finite, subject: [agreement: [number:plural, person:third]]]]
roars: [cat:V, head: [form: finite, subject: [agreement: [number:singular, person:third]]]]

grammar
S → NP VP
  <S head> = <VP head>
  <S head subject> = <NP head>
VP → V
  <VP head> = <V head>

The lexicon consists of complex feature structures describing the syntactically relevant characteristics of the words, such as whether they are singular or plural. The grammar consists of phrase structure rules and equations that determine how the words can be combined.

The means by which feature terms may be combined to produce well-formed feature terms is through the process of unification. For example: the words lions and roar can be combined as their features unify, whereas lions and roars cannot, as the agreement features are incompatible. This basic formalism has been used to account for a wide range of syntactic phenomena and, in combination with unification, to provide a standard approach to sentence analysis using string-combining and information-combining operations.
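The following minimal sketch, which is not the PATR-II implementation itself, illustrates unification over feature terms represented as nested dictionaries; the lions/roar/roars agreement features correspond to the lexicon example above:

# Unification over feature terms represented as nested Python dicts:
# two terms unify if their shared attributes have compatible values;
# the result merges both.

def unify(a, b):
    """Return the unification of feature terms a and b, or None on failure."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None          # incompatible values: unification fails
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None         # atomic values must match exactly

# Agreement features from the lexicon example above.
lions = {"agreement": {"number": "plural", "person": "third"}}
roar  = {"agreement": {"number": "plural", "person": "third"}}
roars = {"agreement": {"number": "singular", "person": "third"}}

if __name__ == "__main__":
    print(unify(lions, roar))   # succeeds: features are compatible
    print(unify(lions, roars))  # None: the number features clash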

Feature-based grammars are often subsumed under the term unification grammars. One major advantage of unification grammars is that they permit a declarative encoding of grammatical knowledge that is independent of any specific processing algorithm. A further advantage is that a similar formalism can be used for semantic representation, with the effect that the simultaneous use of syntactic and semantic constraints can improve the efficiency of the linguistic processing.

In computational semantics, sentences are analyzed on the basis of their constituent structure, under the assumption of the principle of compositionality, that is, that the meaning of a sentence is a function of the meanings of its parts. Each syntactic rule has a corresponding semantic rule, and the analysis of the constituent structure of the sentence will lead to the semantic analysis of the sentence as the meanings of the individual constituents identified by the syntactic analysis are combined. The meaning representation from this form of semantic analysis is typically a logical formula in first order predicate calculus (FOPC) or some more powerful intermediate representation language such as Montague's intensional logic or Discourse Representation Theory (DRT). The advantage of a representation of the meaning of a sentence in a form such as a formula of FOPC is that it can be used to derive a set of valid inferences based on the inference rules of FOPC. For example, as Pulman [1996] showed, a query such as:

Does every flight from London to San Francisco stop over in Reykjavik?

cannot be answered straightforwardly by a relational database that does not store propositions of the form every X has property P. Instead a logical inference has to be made from the meaning of the sentence based on the equivalence between every X has property P and there is no X that does not have property P. Based on this inference the system simply has to determine if a nonstopping flight can be found, in which case the answer is no; otherwise it is yes.
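The following sketch illustrates the inference Pulman describes: the universally quantified query is answered by searching for a counterexample in the database; the flight records are invented for illustration:

# "Every X has property P" is equivalent to "there is no X without P", so the
# query is answered by looking for a flight that lacks the stopover.

flights = [
    {"origin": "London", "destination": "San Francisco", "stopovers": ["Reykjavik"]},
    {"origin": "London", "destination": "San Francisco", "stopovers": []},
    {"origin": "London", "destination": "New York",      "stopovers": ["Dublin"]},
]

def every_flight_stops_in(origin, destination, city):
    relevant = [f for f in flights
                if f["origin"] == origin and f["destination"] == destination]
    # Answer "no" as soon as one non-stopping flight is found, otherwise "yes".
    return all(city in f["stopovers"] for f in relevant)

if __name__ == "__main__":
    answer = every_flight_stops_in("London", "San Francisco", "Reykjavik")
    print("yes" if answer else "no")   # prints "no": the second flight is direct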

While linguistics and psychology provide a theoretical basis for computational linguistics, the characteristics of spoken language require additional (or even alternative) techniques. One problem is that naturally occurring text, both in written form, as in newspaper stories, as well as in spoken form, as in spoken dialogues, is far removed from the well-formed sentences that constitute the data for theoretical linguistics and psychology. In linguistics the main concern is with developing theories that can account for items of theoretical interest, often rare phenomena that demonstrate the wide coverage of the theory, while in psychology the main concern is with identifying the cognitive processes involved in language understanding. Traditionally a symbolic representation is used, with hand-crafted rules that produce a complete parsing of grammatically correct sentences but with a target coverage based on a relatively small set of exemplar sentences. When confronted with naturally occurring texts such as newspaper stories, these theoretically well-motivated grammars tend to generate a very large number of possible parses, due to ambiguous structures contained in the grammar rules, while, conversely, they often fail to produce the correct analysis of a given sentence, often having a failure rate of more than 60% [Marcus 1995].

Spoken language introduces a further problem in that the output from the speech recognizer will often not have the form of a grammatically well-formed string that can be parsed by a conventional language understanding system. Rather it is likely to contain features of spontaneous speech, such as sentence fragments, after-thoughts, self-corrections, slips of the tongue, or ungrammatical combinations. The following examples of utterances (cited in Moore [1995]), from a corpus collected from subjects using either a simulated or an actual spoken language Air Travel Information System (ATIS), would not be interpreted by a traditional linguistic grammar:

What kind of airplane goes from Philadelphia to San Francisco Monday stopping in Dallas in the afternoon (first class flight)

(Do) (Do any of these flights) Are there any flights that arrive after five p.m.

The first example is a well-formed sentence followed by an additional fragment or after-thought, enclosed in parentheses. The second example is a self-correction in which the words intended for deletion are enclosed in parentheses.

Some of these performance phenomena occur sufficiently regularly that they could be described by special rules. For example, in some systems rules have been developed that can recognize and correct self-repairs in an utterance [Dowding et al. 1993; Heeman and Allen 1997]. A conventional grammar could be enhanced with additional rules that could handle some of these phenomena, but the problem is that it would be impossible to predict all the potential occurrences of these features of spontaneous speech in this way. An alternative approach is to develop more robust methods for processing spoken language.

Robust parsing aims to recover syntactic and semantic information from unrestricted text that contains features that are not accounted for in hand-crafted grammars. Robust parsing often involves partial parsing, in which the aim is not to perform a complete analysis of the text but to recover chunks, such as nonrecursive noun phrases, that can be used to extract the essential items of meaning in the text. Thus the aim is to achieve a broad coverage of a representative sample of language which represents a reasonable approximate solution to the analysis of the text [Abney 1997]. In some systems mixed approaches are used, such as first attempting to carry out a full linguistic analysis on the input and only resorting to robust techniques if this is unsuccessful. BBN's Delphi system [Stallard and Bobrow 1992], MIT's TINA system [Seneff 1992], and SRI International's Gemini system [Dowding et al. 1993] work in this way. As Moore [1995] reported, different results have been obtained. The SRI team found that a combination of detailed linguistic analysis and robust processing resulted in better performance than robust processing alone, while the best performing system at the same evaluation (the November 1992 ATIS evaluation) was the CMU Phoenix system, which uses only robust processing methods and does not attempt to account for every word in an utterance.

4.2.1. Integration of the Speech Recognition and Natural Language Understanding Components. So far it has been assumed that the speech recognizer and the natural language understanding module are connected serially and that the speech module outputs a single string to be analyzed by the language understanding module. Typically, however, the output from the speech recognition component is a set of ranked hypotheses, of which only a few will make sense when subjected to syntactic and semantic analysis. The most likely hypothesis may turn out not to be the string that is ranked as the best set of words identified by the speech recognition component (see the example in Section 4.1). What this implies is that, in addition to interpreting the string (or strings) output by the speech recognizer to provide a semantic interpretation, the language understanding module can provide an additional knowledge source to constrain the output of the speech recognizer. This in turn has implications for the system architecture, in particular for the ways in which the speech recognition and natural language understanding components can be linked or integrated.

The standard approach to integration involves selecting as a preferred hypothesis the string with the highest recognition score that can be processed by the natural language component. The disadvantage of this approach is that strings may be rejected as unparsable that nevertheless represent what the speaker had actually said. In this case the recognizer would be overconstrained by the language component. Alternatively, if robust parsing were applied, the recognizer could be underconstrained, as a robust parser will attempt to make sense out of almost any word string.

One alternative approach to integration is word lattice parsing, in which the recognizer produces a set of scored word hypotheses and the natural language module attempts to find a grammatical utterance spanning the input signal that has the highest acoustic score. This approach becomes unacceptable in the case of word lattices containing large numbers of hypotheses, particularly when there is a large degree of word boundary uncertainty. Another alternative is to use N-best filtering, in which the recognizer outputs the N best hypotheses (where N may range from 10 to 100 sentence hypotheses), and these are then ranked by the language understanding component to determine the best-scoring hypothesis [Price 1996]. This approach has the advantage of simplicity but the disadvantage of a high computational cost given a large value for N. Many practical systems have, however, produced acceptable results with values as low as N = 5, using robust processing if strict grammatical parsing was not successful with the top five recognition hypotheses [Kubala et al. 1992].
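The following sketch illustrates the N-best filtering strategy just described: hypotheses are tried in rank order against a strict grammar, with a fall-back to robust (keyword-based) processing if none of them parses; the toy grammar and hypotheses are assumptions for illustration:

# N-best filtering: try the recognizer's ranked hypotheses against a strict
# grammar; if none of the top N parses, apply a robust analysis instead.

STRICT_PATTERNS = [
    ("what time does the flight leave", {"query": "departure_time"}),
    ("what time did the flight leave",  {"query": "past_departure_time"}),
]

def strict_parse(hypothesis):
    for pattern, meaning in STRICT_PATTERNS:
        if hypothesis == pattern:
            return meaning
    return None

def robust_parse(hypothesis):
    # Fall back to spotting content words only.
    return {"query": "departure_time"} if "time" in hypothesis.split() else None

def n_best_filter(hypotheses, n=5):
    for hyp in hypotheses[:n]:
        meaning = strict_parse(hyp)
        if meaning is not None:
            return hyp, meaning
    # No hypothesis passed the strict grammar: robustly process the top one.
    return hypotheses[0], robust_parse(hypotheses[0])

if __name__ == "__main__":
    n_best = [
        "what time does the white leaf",
        "what time does the flight leave",
        "what time does a flight leave",
    ]
    print(n_best_filter(n_best))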

4.2.2. Some Solutions. Various solutions have been adopted to the problem of deriving a semantic representation from the string provided by the speech recognition component. These include: comprehensive linguistic analysis, methods for dealing with ill-formed and incomplete input, and methods involving concept spotting. Some of these will be briefly reviewed in the following paragraphs.

4.2.2.1. SUNDIAL. In the SUNDIAL project [Peckham 1993], which was concerned with travel information in English, French, German, and Italian, several different approaches were adopted, with the following common features:

—a rich linguistic analysis;
—robust methods for handling partial and ill-formed input;
—a semantic representation language for task-oriented dialogues.

Linguistic analysis in the German version is based on a chart parser using a unification categorial grammar [Eckert and Niemann 1994]. Syntactic and semantic structures are built in parallel by unifying complex feature structures during parsing. The aim is to find a consistent maximal edge of the utterance, but if no single edge can be found, the best interpretation is selected for the partial descriptions returned by the chart parser. These partial descriptions are referred to as utterance field objects (UFOs). Various scoring measures are applied to the chart edges to determine the best interpretation. Additionally some features of spontaneous speech, such as pauses, filled pauses, and ellipses, are represented explicitly in the grammar. The following example illustrates the use of UFOs in the analysis of the string I want to go—at nine o'clock from Koeln [Eckert and Niemann 1994]:

U1: syntax: [string: 'I want to go']
    semantics: [type: want, theagent: [type: speaker], thetheme: [type: go]]
U2: syntax: [string: 'at nine o'clock']
    semantics: [type: time, thehour: 9]
U3: syntax: [string: 'from Koeln']
    semantics: [type: go, thesource: [type: location, thecity: koeln]]

This sequence of UFOs is a set of partial descriptions that cannot be combined into a longer spanning edge, as U2, an elliptical construction, is not compatible with U1 and U3. However, it is still possible to build a semantic representation from these partial descriptions, as shown in this example.

This example also illustrates the semantic interface language (SIL) which is used in SUNDIAL to pass the content of messages between modules. Two different levels of detail are provided in SIL, both in terms of typed feature structures: a linguistically oriented level, as shown above, and a task-oriented level, which contains information relevant to an application, such as relations in the application database. The task-oriented representation for the partial descriptions in the example above would be:

U1: [task param: [none]]
U2: [task param: [sourcetime: 900]]
U3: [task param: [sourcecity: koeln]]

This task-oriented representation is used by the dialogue manager to determine whether the information elicited from the user is sufficient to permit database access or whether further parameters need to be elicited.

Reporting on a comparative evaluation between earlier versions of the system, which did not include a robust semantic analysis, and a later version that did, Eckert and Niemann [1994] found a much better dialogue completion rate in the later system, even though word accuracy rate (the results from the speech recognizer) had remained roughly constant across the systems.

4.2.2.2. SpeechActs. The SpeechActs system [Martin et al. 1996], which enables professionals on the move to obtain on-line information and services, uses a feature-based approach to encode the semantic content of utterances in terms of the underlying application that is being accessed, such as the Mail or Calendar tool. The analysis aims to provide an accurate understanding of the input while tolerating misrecognized words. Thus the analysis is more comprehensive than keyword matching, which would miss subtle shades of meaning, but less brittle than a full semantic analysis, which might fail due to missing or misrecognized words from the speech recognizer. Because there is wide variation in the linguistic demands of the different applications, the system is reset with a new lexicon and grammar for each new application.

Generally a grammar for a speech recognizer differs from the grammar of a language understanding component. The function of a speech recognition grammar is to determine the words that were spoken and to specify every possible utterance that a user might say to the system. For this purpose a finite state grammar or a language model is normally used, as described earlier. The role of the language understanding component is to extract the meanings of the words, and for this purpose a more general grammar is required. Two problems arise, however. First, given that the grammar formalisms for different recognizers vary widely, a developer would have to write a different version of the recognizer grammar each time a new recognition system was used.² A second problem concerns the degree of integration between the speech recognition and language understanding components. With different grammar types this integration would be less feasible.

² Note: due to rapid changes in technology, the developers of SpeechActs did not want to restrict future developers to specific speech recognizers but to allow them to use newly available technologies as plug-in components.

The solution that was adopted was a Unified Grammar, which could be compiled into a speech recognizer grammar that would include constraints to help reduce perplexity, as well as into a corresponding grammar for natural language processing [Martin et al. 1996]. A Unified Grammar consists of a collection of rules that include finite state patterns and augmentations. An example of a pattern is

"what" root="be" ("in"–"on") namePossessive sem=calendar;

This pattern requires the first word to be what, then a form of the verb be, then either in or on, followed by the output of the rule namePossessive (for example: Paul's), and finally a word with the semantics of "calendar." Augmentations include tests such as

be.past-participle != t; be.ing-form != t;

stating that the form of the verb must not be been or being, and the action

action := ‘lookup’;

which adds the feature "lookup" to the action associated with this pattern. One advantage of these grammar rules for the language understanding process is that a number of different utterances with the same meaning will result in the same analysis. A second advantage is flexibility, which is achieved through the use of wild cards in the patterns to cope with confusions arising from the speech recognizer between insignificant words such as of, a, or the, which, if misrecognized, would pose problems for a conventional language processor.
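The following sketch, which only approximates the Unified Grammar formalism, shows how a finite-state-style pattern with a semantic test and an action augmentation might be applied to produce a frame for a calendar query; the regular expression and frame format are assumptions for illustration:

import re

# A pattern for utterances like "what is on Paul's calendar", with a semantic
# test (the final word must mean "calendar") and an action augmentation.

CALENDAR_WORDS = {"calendar", "schedule", "appointments"}

PATTERN = re.compile(
    r"^what\s+(is|was|are)\s+(in|on)\s+(\w+)'s\s+(\w+)$", re.IGNORECASE)

def analyze(utterance):
    """Return a semantic frame if the calendar pattern matches, else None."""
    m = PATTERN.match(utterance.strip())
    if not m:
        return None
    verb, _prep, owner, noun = m.groups()
    if noun.lower() not in CALENDAR_WORDS:   # word must have "calendar" semantics
        return None
    if verb.lower() in {"been", "being"}:    # augmentation-style test on the verb form
        return None
    return {"action": "lookup", "sem": "calendar", "owner": owner}

if __name__ == "__main__":
    print(analyze("What is on Paul's calendar"))
    print(analyze("What is on Paul's desk"))  # None: fails the semantic test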

4.2.2.3. The Philips Automatic Train Timetable Information System. A rather different approach was adopted in the Philips train timetable information dialogue system [Aust et al. 1995], which was illustrated in Section 3. This system also accepts partial structures as input, similar to those accepted by the SUNDIAL and SpeechActs systems. The speech recognition module generates word graphs, consisting of about 10 edges per graph after graph optimization. Each path through the graph represents a sentence hypothesis. The language understanding module has to find the best path through the graph and then determine its meaning. However, language analysis in this system involves a search for meaningful concepts using an attributed context-free stochastic grammar to identify the relevant phrases and compute their meanings. Thus the grammar is semantic rather than syntactic since it is not concerned with the structure of the sentence but with its meaning. In addition, each rule may have a probability indicating how likely it is to be applied, given its left hand-side nonterminal. The following is an example of some grammar rules for a train timetable application:

<station> STRING: 'London' {'London';}
                | 'Paris' {'Paris';};
<origin> STRING: 'from' <station> {#2;}
               | 'not' 'from' <station> {NOT #3;};

Rules have a syntactic part in which the left-hand side—preceding the colon—comprises a nonterminal (in angular brackets) and the right-hand side comprises a sequence of nonterminals and/or terminals (enclosed in single quotes). The semantic rules, which contain assignments and expressions, are enclosed in braces. The semantic part may also contain a speech understanding expression represented as an attribute reference, with a number following the # symbol indicating the position of the nonterminal within an assigned sequence. Links are provided between the attributes of the grammar rules and the variables (or slots) that represent the concepts for which values are to be obtained during the dialogue.

Although many of the input strings are incomplete or ungrammatical, in terms of a traditional sentence-based grammar, the system can generally still derive a meaning for these strings at a fairly low computational cost. This involves procedures for dealing with filler arcs, that is, those parts of a graph that do not contain identified concepts (what are described as meaningful fillers). In addition to this, a concept bigram model is used to model more frequently observed concept sequences.

A semantics-driven approach such as this permits the analysis of the ungrammatical strings that characterize naturally occurring spoken language. For example: given an input string such as

I want to uh let me see from Frankfurt yes is there a train to Hamburg from Frankfurt at about 10 o'clock?

the system could identify the essential concepts such as source (from + Placename), destination (to + Placename), time (at + time expression) and compute a meaning for the sentence, ignoring the meaningful fillers in the string (I want to, let me see, yes is there a train, etc.). However, as Moore [1995] pointed out, concept spotting might not be able to handle more complex examples such as
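The following sketch illustrates concept spotting in the spirit of this approach: semantic patterns pick out the origin, destination, and time slots and ignore the filler material around them; the patterns and station list are illustrative assumptions, not the Philips grammar itself:

import re

# Semantic patterns for the origin, destination, and time concepts; everything
# else in the utterance is treated as filler.

STATIONS = ["Frankfurt", "Hamburg", "Koeln"]
STATION_RE = "|".join(STATIONS)

CONCEPTS = {
    "origin":      re.compile(rf"\bfrom\s+({STATION_RE})\b", re.IGNORECASE),
    "destination": re.compile(rf"\bto\s+({STATION_RE})\b", re.IGNORECASE),
    "time":        re.compile(r"\bat\s+(?:about\s+)?(\d{1,2})\s+o'?clock\b", re.IGNORECASE),
}

def spot_concepts(utterance):
    """Return the slot values found in the utterance, ignoring filler material."""
    return {slot: m.group(1)
            for slot, pattern in CONCEPTS.items()
            if (m := pattern.search(utterance))}

if __name__ == "__main__":
    utterance = ("I want to uh let me see from Frankfurt yes is there a train "
                 "to Hamburg from Frankfurt at about 10 o'clock?")
    print(spot_concepts(utterance))
    # {'origin': 'Frankfurt', 'destination': 'Hamburg', 'time': '10'}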

What cities does the train from Hamburg stop at?

as the case marking word at is separated from its associated case word what cities. Only a more sophisticated grammatical analysis could determine this sort of relationship between disjoint constituents. Generally speaking, this type of construction is relatively infrequent and, if encountered sufficiently frequently, could be modeled in the grammar. In any case, a system with adequate repair facilities should be capable of eliciting the required information from the user through more directed questions and prompts. Some of these techniques for achieving this type of robust behavior will be described below.

Fig. 7. An architecture for spoken dialogue systems.

4.2.3. Summary. This section has examined the role of the language understanding component and issues such as grammar representation and robust parsing. The construction of the language understanding component is determined on the one hand by the nature of the input as received from the speech recognizer, and, on the other hand, by the type of input required by the dialogue manager. Many systems, including the ATIS systems and the Philips train timetable information system, are essentially driven by a template-filling mechanism. In the ATIS systems the dialogue is driven by the user, who submits queries to the system, while the system only takes the initiative to elicit missing information in the template. The Philips system is system-led, with the selection of questions determined by the system and the next step in the dialogue being determined by what is missing or needs to be clarified or confirmed in the template. The SUNDIAL system is similarly driven by the goal of eliciting the information required to fill a template and execute a database query, but it is based on a more complex architecture for dialogue management that requires more extensive processing of the linguistic input. The SpeechActs language understanding component is carefully engineered in terms of the underlying application, with the main aim being to provide a semantic analysis that can be handled by the system's dialogue manager in spite of speech recognition errors. How a dialogue manager handles the input from the language understanding component and generates output to the user will be examined in the next section.

4.3. Dialogue Management

The main function of the dialogue management component is to control the flow of the dialogue. This involves the following tasks:

—determining whether sufficient information has been elicited from the user in order to enable communication with the external application;
—communicating with the external application;
—communicating information back to the user.

In a simple architecture these tasks could be seen in terms of a serially ordered set of processes: the system finds out what the user wants to know or do, consults the external application, and reports the results back to the user. More typically, however, the process of determining what the user wants to know or do will be more problematic, as the information elicited from the user may not be sufficient to enable the system to consult the external application, for reasons such as the following:

—the user's input may be ill-formed, with the result that it could not be sufficiently interpreted by the speech recognition and language understanding components;
—the user's input may be incomplete or inaccurate, with the result that insufficient information is available to consult the external application.

This subsection will describe the issues involved in dealing with ill-formed, incomplete, or inaccurate input from the user. Methods for controlling the dialogue flow will be discussed in Section 5.
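The following sketch illustrates the simple serially ordered control loop described above: the dialogue manager keeps eliciting missing parameters, consults the external application, and reports the result; the slot names, prompts, and toy parser are assumptions for illustration, not any particular system's design:

# A minimal slot-filling dialogue-control loop: ask for missing parameters,
# then consult a placeholder external application and report back.

REQUIRED_SLOTS = ["origin", "destination", "time"]
PROMPTS = {
    "origin":      "Where do you want to leave from?",
    "destination": "Where do you want to go to?",
    "time":        "At what time do you want to travel?",
}

def lookup_timetable(slots):
    # Placeholder for the external application (e.g., a timetable database).
    return (f"There is a train from {slots['origin']} "
            f"to {slots['destination']} at {slots['time']}.")

def dialogue_manager(understand, ask, tell):
    """Run one enquiry: `understand` parses user input, `ask`/`tell` produce output."""
    slots = {}
    while any(s not in slots for s in REQUIRED_SLOTS):
        missing = next(s for s in REQUIRED_SLOTS if s not in slots)
        reply = ask(PROMPTS[missing])
        slots.update(understand(reply))      # may fill several slots at once
    tell(lookup_timetable(slots))

if __name__ == "__main__":
    canned_replies = iter(["from Torino to Milano", "early in the morning"])

    def toy_understand(text):
        slots = {}
        if "Torino" in text:
            slots["origin"] = "Torino"
        if "Milano" in text:
            slots["destination"] = "Milano"
        if "morning" in text:
            slots["time"] = "08:00"
        return slots

    def toy_ask(prompt):
        print("System:", prompt)
        return next(canned_replies)

    dialogue_manager(toy_understand, toy_ask, lambda text: print("System:", text))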

Various error-handling processes are required to deal with these situations, involving clarification and verification subdialogues between the system and the user. This implies a more complex and more integrated system architecture, in which the dialogue management component has a central controlling function, as shown in Figure 7. Variations on this architecture will be discussed in greater detail in Section 5. The next two subsections will examine two important functions of the dialogue manager:

—dealing with input that the system recognizes as ill-formed or incomplete;
—using confirmation strategies to verify that the input recognized by the system is indeed what was intended by the user.

4.3.1. Dealing with Ill-Formed or Incomplete Input. The simplest way of dealing with ill-formed or incomplete input is simply to report the problem back to the user and to request a reformulation of the input. This method is clearly inadequate as it fails to distinguish the different ways in which the input is ill-formed or incomplete, and it fails to support the user in reformulating the input. A number of different solutions have been proposed.

Assuming that the user's input has been processed by the speech recognition and language understanding components using the methods described in Sections 4.1 and 4.2, a number of higher level knowledge sources can be brought to bear to assist in the interpretation of ill-formed input. The main knowledge sources that have been used for this purpose include interpretation based on

—speech acts;
—the discourse context;
—the dialogue structure;
—a user model.

4.3.1.1. Interpretation based on speech acts. The concept of speech acts emerged out of the work of Austin [1962] in the philosophy of language and was further developed by Searle [1969]. A speech act is defined as the function of an utterance—for example, a request, promise, warning, or piece of information. Speech act analysis involves a higher level of analysis than syntactic and semantic analysis, requiring reference to external information such as the discourse context and the speaker's beliefs and desires. For example: the utterance It's cold in here has the syntactic form of a declarative and a literal semantic reading describing the level of temperature. However, while in a particular context the utterance might have this literal reading, in a different context it could function as a request (for example, to close a window or turn on the heating). Speech act theory has been used as a basis for computational theories of communication in which responses to a speaker's utterances are guided by the hearer's recognition of the intentions underlying the utterances [Allen and Perrault 1980; Cohen and Levesque 1990].

Speech act analysis has been used in the TRAINS project to support the interpretation of ill-formed input. The TRAINS project [Allen et al. 1995] is concerned with the development of dialogue technology in support of collaborative problem solving. The current work has involved an interactive planning assistant that engages in dialogue with a user to solve route-planning problems. The linguistic analysis of the user's utterances is constructed by a bottom-up parser with a feature-based grammar, in which each rule specifies syntactic and semantic constraints. However, the output of the parser is not a syntactic or semantic analysis, but rather a sequence of speech acts that provide the "minimal covering" of the input, that is, the shortest sequence that accounts for the input. This enables the parser to output an analysis even if an utterance is otherwise uninterpretable [Allen et al. 1996]. For example: the utterance Okay now let's take the last train and go from Albany to Milwaukee was output from the speech recognizer as okay now I take the last train in go from albany to is. The best sequence of speech acts that covered this input was

(1) a CONFIRMATION/ACKNOWLEDGE (okay);
(2) a TELL, with content to take the last train (now I take the last train);
(3) a REQUEST to go from Albany (go from albany).

Although the system was unable to perform a complete syntactic and semantic analysis of this utterance, enough information was extracted to enable the system to continue the dialogue, in this case by generating a clarification subdialogue. Thus robust parsing using speech acts enabled the system to compensate for the errors emanating from the speech recognition component.

4.3.1.2. Interpretation based on the discourse context. Discourse context is often used to assist with the interpretation of items that are otherwise uninterpretable out of context, such as pronouns (he, she, they, etc.) and deictic expressions such as this one, the next one, the previous flight. These expressions, which usually refer to some item that has been mentioned previously in the dialogue, can only be interpreted if some record of previously mentioned items has been kept. Similarly, it is often not possible to interpret input that is incomplete due to ellipsis. Ellipsis involves clauses that are syntactically incomplete in which the missing parts can be recovered from a previous main clause—for example, where in response to the question Which flight do you wish to book? the user says The London one. In this case the incomplete input has to be interpreted in terms of the preceding question, that is, I wish to book the London flight.

Considerable attention has been devoted in computational theories of discourse to the issues associated with discourse context. The simplest method involves maintaining a history list of elements mentioned in the preceding discourse that can be referred to using pronouns. A more elaborate approach involves the concept of centering, which states a preference for pronominalization of entities in the history list that play a central role in a main clause over those in subordinate and adjunct clauses. In other words, certain entities, such as those in the subject position in a main clause, are the center of the discourse focus and can be referred to using pronouns in the next few sentences until the focus shifts to another entity. Grosz et al. [1983] introduced the concept of centering to computational discourse analysis, while Walker [1989] has compared the simpler use of history lists with analysis based on centering.

More generally, discourse phenomena such as anaphora and ellipsis require some representation of the local context that contains the syntactic and semantic structures of previous clauses. Sophisticated language understanding systems normally include such a discourse-processing component, sometimes referred to as the pragmatics module, to deal with issues of context.

4.3.1.3. Interpretation based on the dialogue structure. Interpretation based on the dialogue structure makes use of the expectations provided by the dialogue model. The essential idea is that at each point in a dialogue there are constraints on what can be said next. These constraints can be of various types and can assist several components of a dialogue system. The constraint that the next user utterance is likely to be some form of the words yes or no, because the system prompt was in the form of a yes/no question, constrains the recognizer to having to deal with only this limited vocabulary at this point in the dialogue. Similar expectation-based constraints can help determine which grammatical and semantic rules need to be active at any given point in the dialogue.
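The following sketch illustrates the idea of expectation-based constraints: the vocabulary and grammar rules made available to the recognizer depend on the current dialogue state, so that after a yes/no question only confirmations need to be recognized; the states and word lists are invented for illustration:

# Dialogue-state-dependent recognition constraints: each state predicts the
# vocabulary and grammar rules that should be active for the next user turn.

EXPECTATIONS = {
    "confirm_departure":  {"vocabulary": ["yes", "no"],
                           "grammar": ["yes_no_answer"]},
    "ask_departure_city": {"vocabulary": ["Torino", "Milano", "Trento", "from"],
                           "grammar": ["city_phrase"]},
}

def active_constraints(dialogue_state):
    """Return the recognition constraints predicted for the next user turn."""
    return EXPECTATIONS[dialogue_state]

if __name__ == "__main__":
    print(active_constraints("confirm_departure"))
    # {'vocabulary': ['yes', 'no'], 'grammar': ['yes_no_answer']}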

In contrast to the methods described for the resolution of anaphora and ellipsis, which are usually only applied after a sentence has been completely recognized by the speech recognition component, Young et al. [1989] in the MINDS-II system proposed the use of higher-level knowledge sources to assist speech recognition by reducing the search space for the words in the speech signal. Unlike the language models described earlier, this approach involves the dynamic prediction of the concepts that are likely to be referred to in the user's next utterance, based on the previous user query, the database response, and the current state of the dialogue. The system predicts which goals and subgoals the user is likely to try to complete at the next point in the dialogue. These are combined with information from a user model that represents the domain concepts and relations among concepts that the user is likely to know about. The predicted concepts are then translated into word sequences that denote these concepts, and these word sequences are combined into a semantic network that represents a maximally constrained grammar at this particular point in the dialogue. The results indicated that the use of these higher-level knowledge sources reduced test set perplexity from 279.2 for a complete grammar to 17.8 when best predictions were used, while the speech recognition error rate decreased from 17.9% to 3.5%.

Expectation-driven processing is also used in the Circuit-Fix-It Shop system to assist with the interpretation of the user's utterances. The approach used is based on the notion of attentional state as described in the theory of discourse structure of Grosz and Sidner [1986]. Essentially the attentional state refers to the focus of attention of the conversational participants. In this system the key idea is that at any point in the dialogue there is a particular task step which is under discussion and that, given this information, the system can derive a set of expectations of what the user will say next. To take a particular example [Smith and Hipp 1994]:

Following a statement by the system that is an attempt to have a specific task step S performed, where S and another task step T must be completed as part of the performance of task step R, there are expectations for any of the following types of response:

(1) a statement about missing or uncertain background knowledge necessary for the accomplishment of S (e.g. how do I do substep S1?)
(2) a statement about a subgoal of S (e.g. I have completed substep S1)
(3) a statement about the underlying purpose for S (e.g. why does S need to be done?)
(4) a statement about ancestor task steps of which accomplishment of S is a part (e.g. I have completed R)
(5) a statement about another task step which, along with S, is needed to accomplish some ancestor task step (e.g. how do I do T?)
(6) a statement indicating accomplishment of S (e.g. S is done)

These expectations are computed from two sources:

(1) the domain processor, which provides situation-specific expectations based on the actions applicable within the domain;
(2) the dialogue controller, which has knowledge about the general nature of task-oriented dialogues.

For example: if the topic is What is the LED displaying? then the expectations provided by the domain processor would consist of the possible descriptions of the LED display. With the same topic, the dialogue controller would examine the input at a more abstract level as a query about the observing of the value for a property of an object. Expectations about observing property values of an object would include questions about its location, how to perform the observation, or definitions of the property.

The expectations that are computed are used to provide several types of contextual interpretation of the user's utterances, including the resolution of anaphoric reference and elliptical responses as well as the maintenance of dialogue coherence when dealing with clarification subdialogues. The expectations also assist the language understanding component, as in the MINDS-II system, by providing a set of strings that the dialogue controller expects the user to say next, along with an estimate of the probability of each string.

4.3.1.4. Interpretation based on a user model. Information about the user, often referred to as a user model, is a further source of information to assist with the interpretation of the user's utterances. User modeling first emerged in the context of the natural language dialogue systems of the early 1980s as a means of providing more cooperative conversational behavior through the use of a model of the user's beliefs, goals, and plans [Wahlster and Kobsa 1989]. One aspect of cooperative conversational behavior that was investigated is the ability to respond to a user's queries that are underspecified or ill-formed by inferring the plan underlying the user's query and responding in terms of the plan. Carberry [1989] developed a system for incrementally constructing a model of the user's plans from an ongoing dialogue and then using this model to interpret subsequent utterances. At the initial point in the dialogue the system has no prior context to consider. The utterance is interpreted in terms of a set of domain-dependent plans, and a set of candidate goals is derived that are incorporated into one or more models of the user's plan that can be used to provide a context for the interpretation of future utterances. As further utterances are input, these are matched with the current context models, which may either be expanded to incorporate a further aspect of the current focused plan, or may be discarded as no longer relevant. User models were also employed in other work by Carberry and others to handle elliptical queries and pragmatically ill-formed queries—that is, queries involving misconceptions, where the user's beliefs differ from those of the system (see also Pollack [1986]). The use of information about the user's beliefs, goals, and intentions in more recent dialogue systems such as TRAINS will be reviewed in Section 5 in terms of BDI (Belief, Desire, and Intention) architectures.

4.3.2. Confirmations and Verifications. The various knowledge sources described in Section 4.3.1 enable the dialogue system to compensate for input that is ill-formed or incomplete without having to consult the user with requests for repetition or clarification. On the other hand, verification is required to deal with potentially misrecognized input where the system "realizes" that it may have misrecognized or misunderstood what the user said. Verification is common in human-human dialogues to ensure that the information conveyed is mutually understood and that a common ground is established and maintained [Clark 1992]. Confirming that the system has understood what the user actually intended is even more necessary in spoken dialogues with computers given the possibility of recognition and understanding errors. There are several different ways in which a dialogue system can verify or confirm that the user's utterances have been correctly understood. Some examples (from Gerbino and Danieli [1993]) are provided to illustrate these different types.

4.3.2.1. Explicit Verification. Explicit verification takes the form of a question that asks explicitly for confirmation of the input. This may be accompanied by a request to answer with yes or no:

Do you want to go from Trento to Milano? Yes or no?

In this case two values are confirmed at the same time. If one value is incorrect, the user is able to correct it (e.g., No, to Torino), although it is problematic if both values are incorrect, as the potential for misrecognition is increased if the user has to correct both values (e.g., No, from Torino to Merano). An alternative is to confirm each value separately:

Do you want to go from Trento? Do you want to go to Milano?

Although this is a more robust method for confirming values, the disadvantage is that the number of turns required to complete the dialogue is increased.

4.3.2.2. Implicit Verification. With implicit verification the system embeds in its next question a repetition of its understanding of what the user said. The user can still correct the repeated value, but if the user answers the question without correcting the value, then that value has been implicitly confirmed:

User: I want to travel from Milano to Roma.
System: At what time do you want to leave from Merano to Roma?
User: No I want to leave from Milano in the evening.

In this case the user rejects the departure city proposed by the system and offers a correction, at the same time answering the question about the departure time. This strategy is closer to natural conversation and can be employed if the system is reasonably confident with the output from the speech recognition and language understanding components. The main problem is that implicit requests for verification can give rise to a wider range of possible responses, which may subsequently put greater demands on the recognition and understanding processes through increased complexity. A further problem is that verification requests involve a combination of the particular values that were misrecognized and those new values that are to be elicited in the system's question. With even a relatively small number of values the combinatorial possibilities expand rapidly, so that the questions cannot be determined in advance and a specific language generation component is required to generate the questions on the fly.
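The following sketch illustrates how a system might choose between these strategies on the basis of recognition confidence, using implicit confirmation when confidence is high, explicit confirmation when it is lower, and a re-prompt otherwise; the thresholds and prompt wording are assumptions for illustration:

# Selecting a verification strategy from a recognition confidence score.

HIGH, LOW = 0.8, 0.4

def verification_prompt(slot, value, confidence, next_question):
    if confidence >= HIGH:
        # Implicit: embed the recognized value in the next question.
        return f"{next_question.rstrip('?')} {value}?"
    if confidence >= LOW:
        # Explicit: ask directly for confirmation.
        return f"Do you want the {slot} to be {value}? Yes or no?"
    # Very low confidence: re-prompt for the value.
    return f"I haven't understood. Please tell me the {slot} again."

if __name__ == "__main__":
    print(verification_prompt("departure city", "Milano", 0.9,
                              "At what time do you want to leave from"))
    print(verification_prompt("departure city", "Merano", 0.6, ""))
    print(verification_prompt("departure city", "", 0.2, ""))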

Verification may take place over a series of utterances that involve the application of a sequence of different types of question to ensure a correct response. The following example illustrates an explicit recovery strategy [Gerbino and Danieli 1993]:

User1: I want to go to Milano leaving from Torino early in the morning. (Recognition problems—EMPTY INPUT)
System1: I haven't understood. Please rephrase your sentence.
User2: I want to go from Torino to Milano and I want to leave early in the morning. (Recognition problems—EMPTY INPUT)
System2: I haven't understood. Where do you want to leave from?
User3: From Torino, early in the morning. (Recognition problems—EMPTY INPUT)
System3: I haven't understood. Tell me the name of the departure city, only one word.
User4: Torino.
System4: Do you want to leave from Torino?
User5: Yes.

At User1 the Dialogue Manager receives empty input from the speech recognizer and asks for a rephrase (System1). After the next user utterance, the input is still empty, so the Dialogue Manager starts a sequence of requests to collect the parameters required to query the database (System2). The Dialogue Manager uses predictions—expecting a departure city—to interpret the user's next input (User3). As this strategy also fails, the Dialogue Manager switches to isolated word recognition (System3). Following recognition of the input, the Dialogue Manager confirms the parameters before continuing (System4). The interaction style can now switch back to continuous speech. The following example illustrates an alternative implicit recovery strategy [Gerbino and Danieli 1993]:

User1: I want to go to Milano leaving from Torino early in the morning. (Recognition problems)
System1: Sorry, where do you want to leave from?
User2: From Torino, early in the morning.
System2: Do you want to go from Torino to Milano early in the morning?
User3: Yes.

When the Dialogue Manager receives ill-formed input (either fewer parameters than expected, or some unexpected parameters), it uses predictions based on the current point in the dialogue to interpret the input, that is, it expects a departure city (System1–User2). The user inputs the departure time as well as the departure city (User2). Here the departure time can be ignored as it was already introduced in User1. All the parameters introduced in User1 (departure time, arrival city) are stored and used in the confirmation in System2, while at the same time asking for the missing parameter (departure city).

Verification is one of the most challenging issues in spoken dialogue systems. A failure to verify correctly may lead to miscommunication, while an explicit verification strategy may result in an unreasonably lengthy dialogue, which has an adverse effect on user satisfaction. For example: in one experiment in which a user had to read a credit card number consisting of four blocks of four digits, it was reported that the worst case involved 100 spoken prompts before the dialogue was successfully completed [Vergeynst et al. 1993]. Considerable research is being directed toward the development of effective and efficient verification strategies that allow the system to degrade gracefully, for example, by moving down a hierarchy from implicit to explicit verification, and finally to spelling mode, to elicit the required value, and then switching back to higher-level modes once the problem has been resolved. Applications of these and similar strategies have been reported in papers on the SUNDIAL project [Heisterkamp and McGlashan 1996] (see also Section 5.3.4), as well as in an AAAI workshop on miscommunication in human-machine dialogue [McRoy 1996].

Fig. 8. An architecture including an Information Manager.

4.4. External Communication

Generally dialogue systems require some form of communication with an outside source such as a database in order to retrieve the information requested during the course of the dialogue. Most dialogue systems communicate with a database. For example: in the Philips train timetable information system the user supplies the required parameters, such as source and destination stations, date, and time of departure or arrival, that enable the system to execute a database query. In other cases, the external communication may be with a knowledge base or with a planning system. Each of these possibilities will be examined in the following subsections.

4.4.1. Communicating with a Database. The processes involved in communicating with a database are not generally discussed in the spoken dialogue systems literature, presumably on the assumption that once the parameters of the query have been elicited during the course of the dialogue, accessing the database to retrieve the required information is a straightforward process. However, problems may arise if the vocabulary of the dialogue does not map directly onto the vocabulary of the application, if the query makes false assumptions concerning the actual contents of the database so that no straightforward response is possible, or if the data that is retrieved is ambiguous or indeterminate.

One solution to the problem of mapping between the vocabularies of the application and of the dialogue is to add an additional component to the architecture, as shown in Figure 8. This is the approach adopted by Whittaker and Attwater [1996], who separated the dialogue and information aspects of the system and assigned any complex information processing that is required to an Information Manager. The Information Manager interacts on the one hand with the Dialogue Manager to communicate information events, and on the other hand with the application database to handle queries and search results. An additional link in the architecture shown in Figure 8 is to the Speech Platform to provide recognition constraints through the generation and manipulation of speech recognition vocabularies. Variations on this architecture will be discussed in greater detail in Section 5.

Some of the functions of the Information Manager may include resolving items of information that can be expressed in several ways, such as different spellings for surnames, shortened forms of first names, and problems associated with place names. More generally, the Information Manager may be responsible for relationships within vocabularies, such as synonyms and homophones, the manipulation of database hypotheses, such as scoring of partial and complete entries, and interactions with the application database. The key structure is the data model, which contains a number of vocabulary models that are each associated with one vocabulary within the application, so that a distinction can be made between how items are represented in the database and how they may be referenced within a spoken dialogue.
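
To make the role of the data model more concrete, the following sketch shows one way a vocabulary model might map spoken references (alternative spellings, shortened first names, synonyms) onto canonical database entries and expose residual ambiguity to the dialogue manager. It is a minimal illustration in Python, not the Whittaker and Attwater implementation; the class name, method names, and sample entries are invented for the example.

# Minimal sketch of a vocabulary model: for one application vocabulary
# (here, first names), it records how items stored in the database may be
# referred to in spoken dialogue. All names and data are illustrative.

class VocabularyModel:
    def __init__(self, variants):
        # variants maps a canonical database value to the spoken forms
        # (diminutives, alternative spellings, synonyms) that may denote it.
        self.lookup = {}
        for canonical, forms in variants.items():
            for form in [canonical] + forms:
                self.lookup.setdefault(form.lower(), set()).add(canonical)

    def canonical(self, spoken_form):
        # Return the set of database values a spoken form could denote;
        # more than one value signals an ambiguity to be resolved in dialogue.
        return self.lookup.get(spoken_form.lower(), set())

first_names = VocabularyModel({
    "Catherine": ["Kate", "Cathy", "Katherine"],
    "Kathleen": ["Kate", "Kathy"],
})

print(first_names.canonical("Kate"))   # {'Catherine', 'Kathleen'}: ambiguous
print(first_names.canonical("Cathy"))  # {'Catherine'}: unique match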

Database queries may be ill-formed because the user has misconceptions about the contents of the database. This issue has received wide attention in the field of natural language interfaces to databases, but has not yet been incorporated into spoken dialogue systems, due to the need to deal with more basic problems resulting from speech recognition and language processing errors. Early work by Kaplan [1983] addressed the issue of false assumptions, as illustrated in the following example:

User: How many students got As in Linguistics in 1991?

System: None.

The system's response is correct if the set of students that got an A in Linguistics is empty, but it would also be correct if there were no students taking Linguistics in 1991. However, in the latter case the system's response is misleading, as it does not correct the user's false assumptions. Kaplan proposed solutions to this problem using corrective indirect responses and suggestive indirect responses.

Problems may also arise if the user has misconceptions about the world model represented in the database. Carberry [1986] discussed the query Which apartments are for sale?, which (in an American real-estate context) is inappropriate, as apartments are rented, not sold, although apartment blocks may be sold, for example, to property developers. Resolving this problem involved discerning the user's goal that gave rise to the ill-formed query. Other approaches involve identifying objects and their attributes that have been incorrectly referenced and substituting a viable alternative [McCoy 1986].

Problems of ambiguous or indeterminate data have been treated to some extent in spoken dialogue systems, usually with some mechanism that has been specifically devised to handle problems that have been predicted in advance. For example, the Philips Dialogue Description Language [Philips Speech Processing 1997] has a mechanism for handling underspecified or ambiguous values, such as more than one train station with the same name. There are also mechanisms for combining values; for example, if a user calls in the afternoon with the utterance Today at 8, the two values can be combined into the single value 20.00 hours. Similarly, values that contradict, such as the same value for a destination and a source, can be detected.
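
The kind of value combination and contradiction checking described here can be pictured with a small sketch; the following Python fragment is only an illustration of the idea, not the Philips Dialogue Description Language, and the function names and the afternoon heuristic are assumptions made for the example.

# Illustrative sketch: combine an underspecified clock hour with a
# part-of-day cue, and detect contradictory slot values. Not the actual DDL.

def combine_time(hour, call_in_afternoon):
    # "Today at 8", said in the afternoon, is interpreted as 20.00 hours.
    if call_in_afternoon and hour < 12:
        return hour + 12
    return hour

def contradictory(slots):
    # Flag a query whose source and destination are the same station.
    return slots.get("origin") is not None and slots["origin"] == slots.get("destination")

print(combine_time(8, call_in_afternoon=True))                          # 20
print(contradictory({"origin": "Hamburg", "destination": "Hamburg"}))   # True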

Fig. 9. The architecture of the Circuit-Fix-It Shop system.

Database access may be unsuccessful because a value did not find an exact match in the database. For example, a query concerning a flight to London at 8 might be unsuccessful, although there may be flights to London just before or just after this time. Enabling such a query in the SUNDIAL system involved relaxing some of the parameters of the query [Giachin and McGlashan 1997]. In this case, the destination would not be an obvious candidate for relaxation, as the user would probably not want a flight to some other destination that leaves at 8. Relaxing the time would involve a simple relaxation algorithm that computes time intervals at increasing distances (for example, 5 minutes per iteration) from before and after the requested time, up to a given threshold.
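
A simple relaxation of the departure time of the kind described for SUNDIAL might be sketched as follows; the 5-minute step, the 30-minute threshold, and the toy timetable are illustrative assumptions standing in for the real parameters and database.

# Sketch of time relaxation: widen the requested time in 5-minute steps,
# before and after, until a match is found or a threshold is reached.
# Step size, threshold, and timetable are invented for illustration.

def relax_time(requested, timetable, step=5, threshold=30):
    # requested and timetable entries are minutes past midnight.
    for offset in range(0, threshold + 1, step):
        for candidate in (requested - offset, requested + offset):
            if candidate in timetable:
                return candidate
    return None  # nothing close enough: report failure or relax further

flights_to_london = {7 * 60 + 45, 8 * 60 + 20, 9 * 60}   # 07:45, 08:20, 09:00
print(relax_time(8 * 60, flights_to_london))              # 465, i.e., 07:45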

Finally, there are problems concerning how the output is to be presented to the user. If a number of database solutions have been found, it is necessary to decide how much to present. In SUNDIAL the interval of solutions was divided into four subintervals:

0 . . . MinGoal . . . MaxGoal . . . Threshold . . . ,

where MinGoal and MaxGoal represented the optimum range of solutions that were presented directly, and entries between MaxGoal and Threshold were presented only in summary form [Giachin and McGlashan 1997]. Some of these issues are dealt with in more detail in Section 4.5 (response generation) and Section 4.6 (speech output) of this article.
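
Read as control logic, the subintervals amount to a small decision rule for presenting results; the sketch below uses invented threshold values and is not the SUNDIAL code.

# Sketch of the MinGoal/MaxGoal/Threshold presentation rule: present a
# comfortable number of solutions directly, summarize a larger set, and
# ask the user to constrain the query beyond that. Values are invented.

MIN_GOAL, MAX_GOAL, THRESHOLD = 1, 4, 20

def presentation_strategy(num_solutions):
    if num_solutions < MIN_GOAL:
        return "relax the query or report failure"
    if num_solutions <= MAX_GOAL:
        return "present all solutions directly"
    if num_solutions <= THRESHOLD:
        return "present the solutions in summary form"
    return "ask the user to constrain the query further"

for n in (0, 3, 12, 50):
    print(n, "->", presentation_strategy(n))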

4.4.2. Communicating with a Knowledge Base. Communication with a knowledge base is required for systems that support problem solving rather than information retrieval. The Circuit-Fix-It Shop system [Smith and Hipp 1994], introduced in Section 3.3, which helps users to fix an electronic circuit, is a good example of such a system. In addition to the usual components that have been described so far, this system also includes the types of component found in knowledge-based systems, such as a domain processor, a general reasoning module, and a general knowledge component. The relationships between these components and the linguistic and dialogue processing components of the system are shown in Figure 9 [Smith and Hipp 1994]. The domain processor is the application-dependent component of the system, dealing in this case with electronic circuit repair. This component contains all the information about the application domain that enables it to recommend to the dialogue controller the steps that are required to accomplish a task, using a special notation, GADL (Goal and Action Description Language), to represent goals, actions, and states. The domain processor also receives information back from the dialogue controller, which will have communicated with the user to obtain information about the current actions and states. This in turn enables the domain processor to update its world model and then to propose the next task step to be achieved.

While the domain processor is application-specific and in principle can be substituted by any other application domain, the general reasoning component is domain-independent and incorporates the general mechanisms for reasoning with the knowledge contained in the domain processor. In the case of the Circuit-Fix-It Shop system this reasoning takes the form of an interruptible theorem prover that requires interaction with the outside world to resolve missing axioms. This component provides the basis for the dialogue control mechanisms of the Circuit-Fix-It Shop system and will be discussed in greater detail in Section 5.

The knowledge component, which is also application-independent, includes knowledge relevant to task-oriented dialogues, such as how actions decompose into subactions and how theorems are used to prove goal completion. Among the other types of knowledge included in this component are general dialogue knowledge about the linguistic realisations of task expectations and knowledge about the user that is acquired during the course of the dialogue.

Problem solving in the Circuit-Fix-It Shop system involves cooperating with the user to solve a specific goal, such as how to repair a particular circuit. Problem solving is achieved through communication between the system and the user to establish what actions have to be carried out and what the current state of the task might be. The components described in this section support the system in its reasoning about the steps required to complete a task, in deciding what information to communicate to the user, and in the integration of information provided by the user into the system's model of the current state of the task.

4.4.3. Communication with a Planning System. Problem solving can also be achieved through the use of a planning system that supports reasoning about goals, plans, and actions. While the task structures used in the Circuit-Fix-It Shop system involve task decomposition into subtasks and subsequently into primitive actions to be carried out, the problem-solving mechanisms are different from those that are used in conventional planning systems. The Circuit-Fix-It Shop system is presented with an explicit goal at the beginning of the dialogue and its task is to collaborate with the user in proving the goal in much the same way as a theorem is proved. Planning systems incorporate further complications in that often the system has to infer the user's goals from statements or actions that may not explicitly represent the goals (plan recognition). Planning systems typically include an explicit representation of beliefs, desires, and intentions that are reasoned about during the course of the problem solving. These elements are assumed implicitly in the Circuit-Fix-It Shop system.

The TRAINS project [Allen et al. 1995] is concerned with the integration of natural language dialogue and plan reasoning to support collaborative problem solving. The purpose of the dialogue is to negotiate and develop a plan. The speech acts that comprise the dialogue are motivated by reasoning about the plan and are at the same time interpreted in the light of the current plan. Figure 10 provides a simplified view of the components of the TRAINS system.

Fig. 10. The TRAINS architecture.

Plan reasoning in the TRAINS system involves two algorithms: the incorporation algorithm and the elaboration algorithm. The incorporation algorithm is concerned essentially with plan recognition, that is, with finding causal and motivational connections between potential interpretations of the current utterance and the current plan. The algorithm searches through a space of plan graphs with nodes representing events and states, and links representing relations between events and states such as enablement, effect, generation, and justification. The elaboration algorithm supports the system's construction of a plan using means-ends planning. If the user encounters some choice that requires confirmation, for example, an element in the plan that is ambiguous, the system generates an utterance to request confirmation.

4.4.4. Summary. This discussion of the role of the external communication component in a spoken dialogue system has shown how an integrated system architecture, as illustrated in Figure 7, is required in order to support interaction between the dialogue management component and the other system components. In addition to the problem of determining whether sufficient information has been elicited from the user to provide input to the external application, as discussed in Section 4.3, obtaining the required information from the external source is not necessarily a straightforward task, and complex interactions may be required involving mediations between the dialogue manager and the user. In the case of a database query the requested information may not be available in the form that was requested, so that a reformulated query is required. In a plan reasoning application such as TRAINS the plan reasoner may fail to find a connection between an event, goal, or fact inferred from the user's utterance and a node in the plan graph, in which case it could be assumed that the user's utterance had been misinterpreted and the language understanding component would be required to search for an alternative interpretation, failing which the system would request clarification or repair. Thus the interpretation and resolution of the user's query may involve complex interaction with the external source before the system can report a result back to the user.

4.5. Response Generation

Assuming that the requested information has been retrieved from the external source, the response generation component now has to construct the message that is to be sent to the speech output component to be spoken to the user. Broadly speaking, the construction of the message consists of three decisions involving:

(1) what information should be included;
(2) how the information should be structured;
(3) the form of the message—for example, the choice of words and syntactic structure.

Response generation can be achieved using simple methods, such as the insertion of the retrieved data into predefined slots in a template. On the other hand, complex methods using natural language generation techniques may be used, although generally these more complex methods have only been applied in research prototype systems.
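
At the simple end of this spectrum, response generation amounts to filling retrieved values into a predefined message frame, as in the sketch below; the template wording and field names are invented for the example.

# Minimal sketch of slot-and-template response generation: the fixed text
# is authored in advance and the retrieved values are inserted into slots.

TEMPLATE = "There is a train from {origin} to {destination} at {time}."

def generate_response(query_result):
    return TEMPLATE.format(**query_result)

print(generate_response({"origin": "Hamburg",
                         "destination": "Munich",
                         "time": "08:20"}))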

Response generation in a dialogue system involves additional tasks beyond those required for other language generation tasks. Given that the information to be generated is in the form of some nonlinguistic representation—for example, the results of a database query or a chain of reasoning from an expert system—the dialogue manager has to relate the information to what was previously said (using a discourse history) as well as to the user's goals and knowledge (using a user model).

Use of a discourse history enables the system to provide a response that is consistent and coherent with the preceding dialogue. For example, if some entity that has already been mentioned is to be referred to again, the system should check whether an anaphoric expression can be used unambiguously to refer to the entity on a second mention, as in the following example taken from Reiter and Dale [1997]:

The next train is the Caledonian Express. It leaves at 10 am. Many tourist guidebooks highly recommend this train.

Little research has been done on the use of pronouns in language generation, although there has been some research on generating definite descriptions—for example, the use of the train if the Caledonian Express and no other train has been previously mentioned [Dale and Reiter 1995].
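
A toy version of this decision, choosing between a pronoun, a definite description, and the full name on the basis of a simple discourse history, might look as follows; the rules are deliberately crude and purely illustrative.

# Illustrative sketch of referring-expression choice during response
# generation, driven by a simple discourse history: a pronoun if the entity
# was the most recently mentioned item, a definite description if it is the
# only train mentioned so far, otherwise the full name.

def refer(train, last_mentioned, trains_mentioned):
    if last_mentioned == train:
        return "it"
    if trains_mentioned == {train}:
        return "the train"
    return f"the {train}"

print(refer("Caledonian Express", "Caledonian Express", {"Caledonian Express"}))
print(refer("Caledonian Express", "the 10 am departure", {"Caledonian Express"}))
print(refer("Caledonian Express", "Flying Scotsman",
            {"Caledonian Express", "Flying Scotsman"}))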

As mentioned earlier, user modeling in the early 1980s was concerned with making natural language dialogue systems more cooperative. In addition to supporting the interpretation of the user's utterances by modeling the user's beliefs, goals, and plans, the other main application of user models was to enable a system to adapt its output to the user's perceived needs [Wahlster and Kobsa 1989]. A number of research projects addressed this issue, of which the following are indicative.

The KNOME system [Chin 1989] provided different levels of explanation of Unix commands depending on its categorization of the user's level of competence and the degree of difficulty of the command in question. The TAILOR system [Paris 1989] adapted its output to the user's level of expertise by selecting the type of description and the particular information that would be appropriate for a given user. Based on an extensive analysis of scientific texts, it was found that texts from adult encyclopedias and manuals for experts mainly included structural information that could be represented using constituency schemas describing the parts of the objects, while encyclopedias for young children and manuals for novices contained mainly process-oriented information that described the functional characteristics of the objects. TAILOR was able to generate appropriate descriptions to different types of users and to produce a range of descriptions for users falling between the two extremes of novice or expert. Finally, in the IMP system, Jameson [1989] investigated the use of anticipation feedback to determine the bias of the system's output. Basically, what this involves is that the system attempts to anticipate the user's reaction to its output and then takes this anticipated reaction into account in finalizing its output. This technique is particularly appropriate for evaluation-oriented dialogues, such as personnel selection interviews and dialogues involving travel agents, hotel managers, and sales people.

A user model was used in the Circuit-Fix-It Shop system to enable the system to determine what needed to be said to the user and what could be omitted because of existing user knowledge (see the example discussed in Section 3.3). In this system the dialogue controller invoked inferences to derive additional axioms about the user based on the user's utterances. These inferences, which are similar to those used by Chin [1989] in the KNOME system, included the following [Smith and Hipp 1994, p. 60]:

If the axiom meaning is that the user has a goal to learn some information, then conclude that the user does not know about the information. If the axiom meaning is that an action was completed, then conclude that the user knows how to perform the action.

These inferences, which are based on abstract descriptions of actions and their effects, were used to provide user model axioms that could be used by the theorem prover along with other axioms that were available to prove goal completion. Thus the user model information was employed within the dialogue system to determine the selection of the information to be presented to the user.

A considerable amount of research in text generation has been concerned with the organization of messages, that is, their discourse structure. One of the most widely known approaches involves the use of rhetorical relations between elements of a text, as described in Rhetorical Structure Theory (RST) [Mann and Thompson 1988]. Examples of rhetorical relations are elaboration, exemplification, and contrast. Alternatively, schemas have been used to provide the structure of the information to be presented [McKeown 1985]. A schema sets out the main components of a text, using elements such as identification, analogy, comparison, and particular-illustration, which have a sequential ordering in a text and can occur recursively. Schema-based systems often use general programming constructs such as local variables and conditional tests.

The form of the output is known as the linguistic realization. This involves the choices of lexical items and syntactic structures to express the desired meaning. The choice of lexical items might involve deciding between the words leave and depart to express the concept of "departure," while syntactic decisions might involve the choice of an active or a passive sentence [Reiter and Dale 1997]. Linguistic realization also involves the generation of grammatically correct structures, for example, selecting the appropriate tense and rules of agreement. From the perspective of the construction of a text, four different categories of content may be involved [Reiter and Dale 1997]:

(1) unchanging text, that is, parts of the message that are always present in the output text;
(2) directly-available data, that is, information that has been retrieved from a database or knowledge base;
(3) computable data, that is, information that is derived from the data as a result of some computation or reasoning (for example, the number of records found in the database for trains between two cities);
(4) unavailable data, that is, information that is not present in the data but which supplements the information (this is common in texts authored by humans, for example, extra information that a railway line may be blocked by snow).

A dialogue system may make use of at least the first three types, using unchanging text for the constant parts of a message, retrieved data to convey the information that was requested, and computable data to summarize the information or to require a more specific choice from the user.

4.6. Speech Output

Speech output involves the translation of the message constructed by the response generation component into spoken form. In the simplest cases prerecorded canned speech may be used, sometimes with spaces to be filled by retrieved or previously recorded samples, as in

You have a call from <Jason Smith>. Do you wish to take the call?

in which most of the message is prerecorded and the element in angular brackets is either synthesized or played from a recorded sample. This method works well when the messages to be output are constant, but synthetic speech is required when the text is variable and unpredictable, when large amounts of information have to be processed and selections spoken out, and when consistency of voice is required. In these cases text-to-speech synthesis (TTS) is used.

Text-to-speech synthesis can be seen as a two-stage process involving

(1) text analysis;
(2) speech generation [Edgington et al. 1996a, 1996b].

Text analysis produces a linguistic representation of the input text; the speech generation stage then uses this representation to synthesize a speech waveform. The text analysis stage is sometimes referred to as text-to-phoneme conversion, although this description does not cover the analysis of linguistic structure that is involved. The second stage, which is often referred to as phoneme-to-speech conversion, involves the generation of a prosodic description (including rhythm and intonation), followed by speech generation that produces the final speech waveform. A considerable amount of research has been carried out in text-to-speech synthesis which is beyond the scope of the present survey (see, for example, Edgington et al. [1996a, 1996b] and Carlson and Granstrom [1997] for recent overviews). This research has resulted in several commercially available text-to-speech systems, such as DECTalk and the BT Laureate system [Page and Breen 1996]. The main aspects of text-to-speech synthesis that are relevant to spoken dialogue systems will be reviewed briefly. The text analysis stage of text-to-speech synthesis comprises four tasks:

(1) text segmentation and normalization;
(2) morphological analysis;
(3) syntactic tagging and parsing;
(4) the modeling of continuous speech effects.

Text segmentation is concerned with the separation of the text into units such as paragraphs and sentences. In some cases this structure will already exist in the retrieved text, but there are many instances of ambiguous markers. For example, a full stop may be taken as a marker of a sentence boundary, but it is also used for several other functions, such as marking an abbreviation (St.), as a component of a date (12.9.97), or as part of an acronym (M.I.5). Normalization involves the interpretation of abbreviations and other standard forms such as dates, times, and currencies, and their conversion into a form that can be spoken. In many cases ambiguity in the expressions has to be resolved—for example, St. can be "street" or "saint."
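
A very small normalization step of the kind described here might be sketched as follows; the abbreviation table, the date rule, and the context rule for St. are simplified assumptions rather than the rules of any particular text-to-speech system.

# Illustrative sketch of text normalization for speech output: expand
# abbreviations and simple numeric dates into speakable words. The rules
# and word lists are simplified assumptions, not a production system.

import re

ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def normalize(text):
    # "St." is ambiguous: read it as "Saint" before a capitalized name,
    # otherwise as "Street" (a crude context rule used only for illustration).
    text = re.sub(r"\bSt\.\s+(?=[A-Z])", "Saint ", text)
    text = text.replace("St.", "Street")
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Expand dates such as 12.9.97 into "12 September 97".
    text = re.sub(r"\b(\d{1,2})\.(\d{1,2})\.(\d{2,4})\b",
                  lambda m: f"{m.group(1)} {MONTHS[int(m.group(2)) - 1]} {m.group(3)}",
                  text)
    return text

print(normalize("Dr. Smith lives at 4 Castle St. near St. Anne's, since 12.9.97."))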

Morphological analysis is required, on the one hand, to deal with the problem of storing pronunciations of large numbers of words that are morphological variants of one another, and, on the other, to assist with pronunciation. Typically a pronunciation dictionary will store only the root forms of words, such as write. The pronunciations of related forms, such as writes and writing, can be derived using morphological rules. Similarly, words such as staring need to be analyzed morphologically to establish their pronunciation. Potential root forms are star + ing and stare + ing. The former is incorrect on the basis of a morphological rule that requires consonant doubling (starring), while the latter is correct because of the rule that requires e-deletion before the -ing form.

Tagging is required to determine the parts of speech of the words in the text and to permit a limited syntactic analysis, usually involving stochastic processing. A small number of words—estimated at between 1 and 2% of words in a typical lexicon [Edgington et al. 1996a]—have alternative pronunciations depending on their part of speech. For example, live as a verb will rhyme with give, but as an adjective rhymes with five. The part of speech also affects stress assignment within a word—for example, record as a noun is pronounced 'record (with the stress on the first syllable), and as a verb re'cord (with the stress on the second syllable).

Modeling continuous speech effects is concerned with achieving natural sounding speech when the words are spoken in a continuous sequence. Two problems are encountered. First, there are weak forms of words, involving mainly function words such as auxiliary verbs, determiners, and prepositions. These words are often unstressed and given reduced or amended articulations in continuous speech. Without these adjustments the output sounds stilted and unnatural. The second problem involves coarticulation effects across word boundaries, which have the effect of deleting or changing sounds. For example, if the words good and boy are spoken together quickly, the /d/ in good is assimilated to the /b/ in boy. Modeling these coarticulation effects is important for the production of naturally sounding speech.

There has been an increasing concern with the generation of prosody in speech synthesis, as poor prosody is often seen as a major problem for speech systems that tend to sound unnatural despite good modeling of the individual units of sound. Prosody includes phrasing, pitch, loudness, tempo, and rhythm, and is used to convey differences in meaning as well as to convey attitude.

The speech generation process involves mapping from an abstract linguistic representation of the text, as provided by the text analysis stage, to a parametric continuous representation. Two main methods have been used to model speech: articulatory synthesis, which models characteristics of the vocal tract and speech articulators, and formant synthesis, which models characteristics of the acoustic signal. Formant synthesis has been the more successful method and has produced commercial systems such as DECTalk that yield a high degree of intelligibility.

An alternative method that is used in recent work, for example, in BT's Laureate system, involves concatenative speech synthesis, in which prerecorded units of speech are stored in a speech database and selected and joined together in speech generation. The relevant units are usually not phonemes, due to the problems that arise with coarticulation, but diphones, which assist in the modeling of the transitions from one unit of sound to the next. Various algorithms have been developed for joining the units together smoothly.

Generally, relatively little emphasis has been put on the speech output process by developers of spoken dialogue systems. This is partly due to the fact that text-to-speech systems are commercially available that can be used to produce reasonably intelligible output. However, there are certain applications where more naturally sounding output is desirable, for example, in applications involving the synthesis of speech for the handicapped, or in systems for foreign language instruction.

4.7. Summary

The basic components of spoken dialogue systems that have been described in the preceding sections represent various technologies, each of which constitutes a major research and development area in its own right. An interesting aspect of spoken dialogue systems is that these separate technologies have to be somehow harnessed and integrated to produce an acceptable, working system. It is essential that the components of the system should work in integration—indeed, the efficiency of the individual components is less important than the efficiency of the complete system. For example, it can be argued that a system with a high-performance speech recognizer would still be ineffective if the dialogue control component functioned poorly. Conversely, a good dialogue control component can often help compensate for the weaknesses of the speech recognizer by producing a reasonable response in the face of unreliable input. One of the major challenges for developers of spoken dialogue systems is to integrate the component technologies to produce a robust and acceptable system, in which the whole is greater than the sum of its parts.

5. DIALOGUE CONTROL

There are two aspects to dialogue control: the extent to which one of the agents maintains the initiative in the dialogue and the ways in which the flow of the dialogue is managed. Dialogue control may be system-led, user-led, or mixed-initiative. In a system-led dialogue the system asks a sequence of questions to elicit the required parameters of the task from the user. In a user-led dialogue the user controls the dialogue and asks the system questions in order to obtain information. In a mixed-initiative dialogue control is shared. The user can ask questions at any time, but the system can also take control to elicit required information or to clarify unclear information. The management of dialogue control is not an issue for user-led dialogue, as the user decides which questions to ask, as in Question-Answer and Natural Language Database Systems. In system-led and mixed-initiative dialogue the control has to be managed in order to determine what questions the system should ask, in what order, and when. Three main strategies for dialogue control were identified in Section 2.5 and illustrated in Section 3. Finite state-based dialogue control supports a system-led dialogue in which all the questions that the system asks have been predetermined. Frame-based systems are also primarily system-led, although they permit a limited degree of user initiative. Agent-based systems tend to be mixed-initiative. These distinctions, along with other aspects of dialogue control, will be examined in the following subsections.

5.1. Finite State-Based Systems

5.1.1. The Basic Model. In a basic finite state-based system the dialogue structure is represented in the form of a state transition network in which the nodes represent the system's questions and the transitions between the nodes determine all the possible paths through the network, thus specifying all legal dialogues. Each state represents an information state in which some information is elicited from or confirmed with the user. Subdialogues can be used within the basic network to support a more modular design approach and to provide libraries of commonly occurring transactions. Figure 11, taken from the Danish Dialogue Project, illustrates the use of a basic finite state network to model the dialogue flow for an automatic book club service [Larsen and Baekgaard 1994]. In this dialogue the system progresses through a series of states, with the transitions between states being determined by the user's responses. There are various choice points and loops, as well as subdialogues (for example, for the tasks CHECK MEMBER, ORDER, CANCEL, and OVERVIEW). Furthermore, in this particular architecture, the user can use the key words repeat and change, to request repetition of the system's output and to change a previously accepted parameter.
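
A network of this kind can be written down directly as a transition table; the sketch below is a much-reduced illustration with invented states, prompts, and answers, not the Danish Dialogue Project network shown in Figure 11.

# Minimal sketch of finite state dialogue control: each state carries a
# system prompt, and the user's (already recognized) answer selects the
# next state. States, prompts, and answers are invented for illustration.

NETWORK = {
    "greeting":      ("Welcome to the book club. Are you a member?",
                      {"yes": "get_member_id", "no": "register"}),
    "get_member_id": ("Please say your membership number.",
                      {"*": "main_menu"}),
    "register":      ("I will connect you to an operator to register.", {}),
    "main_menu":     ("Do you want to order or cancel?",
                      {"order": "order", "cancel": "cancel"}),
    "order":         ("Which book would you like to order?", {}),
    "cancel":        ("Which order do you want to cancel?", {}),
}

def run(answers):
    state = "greeting"
    while state is not None:
        prompt, transitions = NETWORK[state]
        print("SYSTEM:", prompt)
        if not transitions:
            break
        answer = next(answers)
        print("USER:  ", answer)
        state = transitions.get(answer, transitions.get("*"))

run(iter(["yes", "1234", "order"]))

Key words such as repeat and change would be handled as extra transitions available from every state, which is one reason the number of transitions grows quickly as the task becomes less rigid.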

Fig. 11. Dialogue graph for an automatic book service.

5.1.2. Advantages of Finite State Models. A major advantage of the finite state model is its simplicity. From a developer's perspective, state transition networks are particularly suitable for modeling dialogue flow in a well-structured task involving information to be exchanged in a predetermined sequence, with the system retaining control over the dialogue and deciding which question to ask next. In this way the semantics of the system is clear and intuitive. Moreover, as the user's responses are restricted, fewer technological demands are put on the system components, particularly the speech recognizer. The lack of flexibility and naturalness may be justified as a trade-off against these technological demands. For these reasons most currently available commercial systems use some form of finite state dialogue modeling.

It is interesting to note that there is some support in empirical studies for the use of state-based dialogue control. Hone and Baber [1995] examined the relationship between dialogue control and transaction times, finding that more constrained dialogues that employed a menu-like interaction style with yes/no confirmation of all user input tended to result in dialogues with longer transaction times, as would be expected. However, this effect depended on the system's level of recognition accuracy, which was manipulated in the experiments. It was found that there was a greater likelihood of errors in the less constrained system as it permitted a larger active recognition vocabulary.

In another study two versions of a simple call assistance application were built [Potjer et al. 1996]. The system-led version used isolated word recognition and word spotting, while the mixed-initiative version used continuous speech recognition and more complex natural language processing. In the system-led version the user was prompted for the required service in two steps, while in the mixed-initiative version the user could request the service in a single utterance. The minimum number of turns per transaction was lower for the mixed-initiative system, although more additional turns were required for the mixed-initiative system on account of the greater number of recognition errors. Thus the system-led interface was not slower than its mixed-initiative counterpart. Moreover, a subjective analysis of user satisfaction indicated that users were satisfied with both versions.

Similar results were found in a study involving train timetable information, in which it was found that for simple services a system-driven dialogue using isolated word recognition achieved good user acceptance [Billi et al. 1996]. This finding was supported in a study of dialogue strategies comparing explicit and implicit recovery from communication breakdowns [Danieli and Gerbino 1995]. The version incorporating explicit confirmation and repair, which made greater use of isolated word recognition and spelling, was found to be robust and safe, even though it increased the number of turns required to complete the transaction. The conclusion from these studies is that system-led dialogue using state transitions would appear to be suitable for simple tasks with a flat menu structure and a small list of options, bringing also the advantage of less complex spoken language and dialogue modeling technology. The lack of flexibility and naturalness may be justified as a trade-off against these technological demands.

As mentioned earlier, state transition networks are particularly suitable for modeling dialogue flow in well-structured tasks. The automatic book service illustrated in Figure 11 is a good example. Other examples are directory assistance, questionnaires, and travel inquiries, provided the dialogue is constrained to a basic, system-led series of questions to elicit a number of well-defined responses. Consider the part of a directory inquiry dialogue in which the system elicits the name of the person to be called: here the system has to identify a unique individual, which generally requires eliciting a first and last name. This might be accomplished in a single step—Request First and Last Name—or in a series of steps—Request Surname > Request Spelling of Surname > Request First Name > Confirm First and Last Name. A finite state dialogue model could be created for this task with subdialogues for subtasks such as requesting the surname and first name. Additional states would be required for cases of multiple individuals with the same name, variations on first names, and names that are pronounced similarly (homophones) and thus require spelling to disambiguate. The main characteristic of this task is that there is a finite and clearly defined set of information items to be exchanged, the information can be elicited in a natural order, and the task may be decomposed into a hierarchy of well-ordered subtasks [McTear 1998]. A finite state model could also be used for similarly structured tasks such as obtaining weather forecasts or football scores, ordering items from a catalogue, or making simple bank transactions.

Dialogues for questionnaires are also highly structured, even though a large number of questions may be required to elicit the required information. For questionnaires the user can be constrained through carefully designed prompts to produce an acceptable range of responses [Hansen et al. 1996]. In a large-scale project involving the U.S. Census the dialogue was implemented using a finite state network, as the information had to be elicited in a fixed order, for example, Name > Gender > Birth date > Marital Status, etc., with subdialogues being used for the more complex items [Cole et al. 1997]. Finite state models can be used for similar tasks such as eliciting a person's personal details for financial transactions or obtaining information for insurance quotes. The key characteristic of this class of dialogues is that they are well structured. Even though there may be several items of information to be elicited, these can be broken down into well-structured subtasks that are independent of one another.

5.1.3. Disadvantages of Finite State Networks. Finite state dialogue models are not suitable for modeling less well-structured tasks characterized by subtasks whose order is difficult to predict, by information modeled at different levels of abstraction, or by complex dependencies between items of information [Kamm 1995]. A good example is the Flight Reservation System of the Danish Dialogue Project [Dybkjær et al. 1998]. Although the reservation task would appear to be well structured, as it consists of a series of ordered subtasks, there are complex dependencies in this system between various parameters, for example, between discounted fares and flight availability. As a result a client could opt for a discounted fare and go on to confirm several parameters, only to have to backtrack to a different dialogue path because the desired departure time was not available at the discounted price. The key word change can be spoken by the user of this system to correct the latest piece of information given to the system, but to correct earlier information change has to be used repeatedly to cause the system to backtrack sequentially until the item to be changed is reached. Thus, when there are dependencies between the items of information, the use of a finite state dialogue model becomes unwieldy, leading to a combinatorial explosion of states and transitions.

Finite state dialogue models are inflexible. This characteristic is not a problem if the interaction with the user is controlled by the system and restricted to a well-ordered sequence of questions. However, because the dialogue paths are specified in advance, there is no way of managing deviations from these paths. Problems arise if the user needs to correct an item or introduce some information that was not foreseen at the time of the design of the dialogue. Adding natural language facilities, while providing the user with greater flexibility in what they can say, can add to these problems. Taking the example of a simple travel inquiry system, a natural order for the system's questions might be: destination > origin > date > time. However, when answering the system's question concerning destination the user might reply with a destination as well as the departure time (or indeed other combinations of the four required parameters). A finite state-based system would simply progress through its set of predetermined questions, ignoring or failing to process the additional information and then asking an irrelevant question concerning the departure time.

The solution to this problem would be to include a dialogue model so that the system "knows" what it has already elicited as well as what has still to be asked. The system could then loop through the dialogue model until all the required information has been elicited. In this way the problem of irrelevant questions would also be avoided. However, the problem is that, as soon as the number of items grows, the number of transitions to cater for each required dialogue path grows to unmanageable proportions. This problem is further augmented if adequate repair mechanisms are to be included at each node for confirmation or clarification of the user's input. Thus it was estimated that in the Philips system there were about 1,000 system questions. Allowing for flexible adaptation to the user's input—for example, in the case where a user says more than the system expected or provides an unanticipated response—given that almost any question could follow almost any other, the network would require tens of thousands of transitions [Aust and Oerder 1995].

Dialogues involving some form of negotiation between system and user cannot be modeled using finite state methods, as the course of the dialogue cannot be determined in advance. For example, planning a journey may require the discussion of constraints that are unknown by either the system or the user at the outset. In these interactions some form of negotiation and discussion of constraints is required. For example, in the TRAINS project, to be discussed below, the user and the system collaborate to construct an agreed upon executable plan that has to be developed incrementally in order to incorporate new constraints that arise during the course of the dialogue [Allen et al. 1995].

5.2. Frame-Based Systems

Rather than build a dialogue according to a predetermined sequence of questions to be asked, a frame-based system takes the analogy of a form-filling task in which a predetermined set of information is to be gathered. This frame (or template) fulfils the role of a dialogue model that keeps account of the items for which the system requires information. Naturally this will also involve questions, but the questions do not have to be asked in a particular sequence. For example, in the Philips train timetable system, the questions that the system might ask are listed together with their preconditions—that is, the conditions under which that question should be asked. Some questions for a travel system might be

condition: unknown(origin) & unknown(destination)
question: "Which route do you want to travel?"

condition: unknown(origin)
question: "Where do you want to travel from?"

condition: unknown(destination)
question: "Where do you want to travel to?"

Given all the questions and their preconditions, which do not need to be stated in chronological order, the dialogue control component can decide the next question to be asked based on those questions whose preconditions are true. If several questions can be asked at a particular stage in the dialogue, other factors can be used to choose a question to be asked. For example, in the Philips SpeechMania system, each dialogue action (including the questions) is coded with a key word that determines the dialogue action's priority and thus the dialogue flow. Some examples of these key words, in their default order of priority, are

ONCE: for an action that has not yet been executed, for example, the initial greeting;
MULTIPLE: if more than one value has been returned for a variable, so that ambiguity resolution is required;
VERIFIABLE: to be used if a value has not been confirmed by the user;
UNDEFINED: to be used when no value has been defined for a variable and a question is required to elicit the value from the user.

Given this priority mechanism, problems relating to what is ambiguous are resolved before attempts to verify a value, which are in turn resolved before questions for new values. Thus a sequence of questions evolves based on the current context of the system (what has been asked so far, what information is ambiguous, what has to be confirmed), without having to specify predetermined paths through a dialogue network.
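
A frame-based control loop of this general kind can be sketched as follows; the slot names, statuses, prompts, and the exact priority ordering are assumptions made for illustration and do not reproduce the SpeechMania implementation.

# Sketch of frame-based dialogue control: each slot has a status, and the
# next dialogue action is chosen by priority (resolve ambiguity, then
# verify unconfirmed values, then ask for missing values) rather than by
# a fixed script. Slots, prompts, and priorities are illustrative only.

frame = {
    "origin":      {"value": None,       "confirmed": False, "candidates": []},
    "destination": {"value": "Muenchen", "confirmed": False,
                    "candidates": ["Muenchen Hbf", "Muenchen Ost"]},
    "date":        {"value": "tomorrow", "confirmed": True,  "candidates": []},
}

def next_action(frame):
    for slot, s in frame.items():                # MULTIPLE: resolve ambiguity first
        if len(s["candidates"]) > 1:
            return f"Did you mean {' or '.join(s['candidates'])}?"
    for slot, s in frame.items():                # VERIFIABLE: confirm known values
        if s["value"] is not None and not s["confirmed"]:
            return f"You said the {slot} is {s['value']}. Is that correct?"
    for slot, s in frame.items():                # UNDEFINED: ask for missing values
        if s["value"] is None:
            return f"Please tell me the {slot} of your journey."
    return "All values collected: the database query can now be executed."

print(next_action(frame))   # ambiguity about the destination is resolved first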

A similar mechanism has been used in the Communicator system developed at the University of Colorado, Boulder [Ward and Pellom 1999]. This system obtains information from the Internet on airline flights, hotels, and rental cars. The dialogue control is described as "event-driven," meaning that the dialogue manager decides what to do next based on the current system context rather than a predetermined script. In this case the context consists of the semantic content of the user's input together with a template of slots to be filled. On assimilating a parsed user's utterance with the dialogue context, the system decides on its next action according to a set of priorities similar to those used in the Philips system:

—clarify if necessary;
—finish if all done;
—retrieve data and present to user;
—prompt user for required information.

A variation on frames is the use of a form consisting of a number of slots for the relevant attributes in the domain. Dahlback and Jonsson [1999] described their use of information specification forms for a bus timetable information system. Forms are also used as the main dialogue items in VoiceXML documents. A form in VoiceXML consists of field and control items. A field gathers information from the user using speech or DTMF input, while control items involve sequences of procedural statements for prompting and computation. The Form Interpretation Algorithm determines which items in a form to visit depending on the status of their guard condition. Thus, unless a field variable within a form has the value undefined, that field will not be visited. In a directed form the form items are executed once in a sequential order, resulting in a rigid, system-directed dialogue. A mixed-initiative form, combined with a grammar, enables the user to input all the required items in one utterance, giving a more flexible dialogue.
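
The behavior of such a form can be approximated by a simple interpretation loop: visit the first field whose variable is still undefined, prompt for it, and fill it with the recognized value. The sketch below is a Python approximation of this idea rather than the VoiceXML Form Interpretation Algorithm itself; the field names, prompts, and scripted answers are invented.

# Sketch of a form interpretation loop in the spirit of VoiceXML forms:
# repeatedly select the first field whose guard condition holds (its
# variable is undefined), prompt for it, and assign the recognized value.
# An approximation for illustration, not the VoiceXML algorithm itself.

form = [
    {"name": "origin",      "prompt": "Where are you leaving from?", "value": None},
    {"name": "destination", "prompt": "Where are you going to?",     "value": None},
    {"name": "date",        "prompt": "On what date?",               "value": None},
]

def interpret(form, recognize):
    while True:
        field = next((f for f in form if f["value"] is None), None)
        if field is None:                      # all guards fail: form is complete
            return {f["name"]: f["value"] for f in form}
        print("SYSTEM:", field["prompt"])
        field["value"] = recognize(field["name"])

answers = {"origin": "Belfast", "destination": "Dublin", "date": "Friday"}
print(interpret(form, lambda slot: answers[slot]))

In a mixed-initiative form backed by a grammar, a single user utterance could fill several of these variables at once, so the corresponding prompts would simply be skipped on the next pass through the loop.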

Goddeau et al. [1996] discussed a more complex type of form, the E-form (electronic form), which has been used in a spoken language interface to a database of classified advertisements for used cars. E-forms differ from the types of form and frame described so far in that the slots may have different priorities for different users—for example, for some users the color of a car may be more critical than the model or mileage. Furthermore, information in slots can be related—for example, a more recent model usually costs more. The E-form allows users to explore multiple combinations to find the car that best suits their preferences. Thus the selection of an appropriate car is viewed as an optimization task which involves more than the retrieval of a set of records from a database. However, it is the user who performs this optimization, whereas in a problem-solving system the optimization would be performed by the system or, ideally, as a result of a negotiation dialogue between system and user. The E-form is used to determine the system's next response, which is based on the current status of the E-form, the most recent system prompt, and the number of items returned from the database:

—If no records found, ask user to be more general.
—If fewer than five records found, consider the search complete and generate a response that outputs the retrieved records.
—Otherwise cycle through an ordered list of prompts choosing the first prompt whose slot in the E-form is empty.
—If too many records have been found and all the prompt fields have been filled, ask the user to be more specific.
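
Viewed as control logic, the strategy above is a small decision procedure over the retrieval result; the sketch below is an illustration with an invented prompt ordering and cut-off values, not the implementation described by Goddeau et al.

# Sketch of E-form-driven response selection: the next system action depends
# on how many records the current E-form retrieves and on which prompt
# fields are still empty. Prompt order and cut-offs are illustrative only.

PROMPT_ORDER = ["make", "model", "year", "price", "color"]

def next_response(eform, records):
    if len(records) == 0:
        return "No matching cars were found; could you be more general?"
    if len(records) < 5:
        return f"I found {len(records)} cars: " + "; ".join(records)
    for field in PROMPT_ORDER:
        if eform.get(field) is None:
            return f"What {field} are you interested in?"
    return "There are still many matches; could you be more specific?"

eform = {"make": "Ford", "model": None, "year": None, "price": None, "color": None}
records = [f"car {i}" for i in range(40)]
print(next_response(eform, records))   # -> "What model are you interested in?"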

Other data structures that can be used to control the dialogue are schemas, task structure graphs, and type hierarchies. Schemas are used in the Carnegie Mellon Communicator system to model more complex tasks than the basic information retrieval tasks that use forms [Constantinides et al. 1998; Rudnicky et al. 1999]. A schema is a strategy for completing a goal in a task-based dialogue, such as determining an itinerary. The itinerary is represented as a hierarchical data structure that is constructed interactively over the course of the dialogue. At the same time the nodes in the tree are filled with specific information about the trip. While there is a default sequence of actions to populate the tree that is maintained as a stack-based agenda, the user and the system can both control this ordering and cause the focus of the dialogue to shift (for example, Let's talk about the first leg [of the itinerary] again). Task structure graphs provide a similar semantic structure to the E-form and are used to determine the behavior of the dialogue control module as well as the language understanding module [Wright et al. 1998]. The graph depicts relationships between the elements of a customer-services application and is used to provide a contextual interpretation of spoken utterances in a dialogue. Similarly, type hierarchies can be used to model the domain of a dialogue and as a basis for clarification questions [Denecke and Waibel 1997]. Given that information in a type hierarchy can be missing or underspecified, clarification requests are generated to enable the user to achieve their communicative goal.

In summary, there are a number of different types of data structure, such as the frame, E-form, schema, task structure graph, and type hierarchy, that can be used to model the structure of the information required by the user and to determine the actions to be taken by the dialogue system to obtain this information.

5.2.1. Advantages of Frame-Based Systems. The frame-based approach has several advantages over the finite-state-based approach, for the user as well as the developer. As far as the user is concerned, there is greater flexibility. For example, there is some evidence that it can be difficult to constrain users to the responses required by the system, even when the system prompts have been carefully designed to do just that [Eckert et al. 1995]. The ability to use natural language and the use of multiple slot filling enable the system to process the user's overinformative answers and corrections. In this way the transaction time for the dialogue can be reduced, resulting in a more efficient and more natural dialogue flow. The frame-based system fulfilled a number of dialogue design requirements identified by the Philips dialogue team, including the following:

—there should not be a rigid question-answer scheme to obtain the required values;
—no more questions than necessary should be asked;
—no more confirmation than necessary should be required;
—information given by the caller, prior to the system asking for it, should be used [Philips Speech Processing 1997].

Similarly, the Communicator system developed at the University of Colorado [Ward and Pellom 1999] enables a mixed-initiative dialogue in which the user can take control. This degree of user control is greater than in the Philips system, where the system has control of the dialogue flow but the user can insert corrections to items that the system has misrecognized or misunderstood. In the Communicator system the user can respond with anything to the system's question, that is, not necessarily the answer to the question. The system will parse the utterance and decide whether and how to respond to it, putting on hold prompts for any additional missing information that is required by the system to build a database query.

From a developer's perspective, implementing this degree of flexibility in a graph-based system becomes cumbersome, if not impossible. A large number of states and transitions are required to deal with the number of different paths that dialogues might take. A frame-based system can be specified declaratively, with the system's questions listed as in a rule-based expert system (or production system).

5.2.2. Disadvantages of Frame-Based Systems. Finite state-based and frame-based approaches are appropriate for well-defined tasks in which the system takes the initiative in the dialogue and elicits information from the user to complete a task, such as performing a database query. Frame-based systems provide greater flexibility than state-based systems, as the dialogue flow is event-driven and not predetermined. However, the system context that contributes to the determination of the system's next action is fairly limited, being confined essentially to the analysis of the user's previous utterance in conjunction with a template of slots to be filled and a number of priorities for control of the dialogue. More complex transactions cannot be modeled using these approaches for the following reasons:

—different users may vary in the level of knowledge they bring to the task, so that a wide range of responses is required by the system;
—the state of the world may change dynamically during the course of the dialogue, with the result that it is not possible to specify all possible configurations in advance;
—the aim of the dialogue is not just to obtain sufficient information from the user to execute a database query or carry out some action—instead, the dialogue involves the negotiation of some task to be achieved, involving planning and other types of collaborative interaction.

For the developer, a frame-based approach has the disadvantage of any production system with a large number of rules and contexts, in that it is difficult to predict which rule (or question) is likely to fire in a particular context. A considerable amount of experimentation may be required to ensure that the system does not produce an inappropriate question under some circumstances that had not been foreseen at design time.

5.3. Agent-Based Systems

Agent-based approaches draw on techniques from Artificial Intelligence (AI) and focus on the modeling of dialogue as collaboration between intelligent agents. Several classes of agent-based system will be described and evaluated, including systems using theorem proving, planning, distributed architectures, and conversational agents.

5.3.1. Dialogue Control Using Theorem Proving. The functionality of the Circuit-Fix-It Shop system was illustrated in Section 3.3 and its architecture presented in Section 4.4.2. A key aspect of the system is that solutions are developed dynamically, on the basis of task steps recommended by the domain processor, taking into account the current situation and the user's current state of knowledge. This dynamic development contrasts with the approach adopted in some plan-based systems, in which a complete solution is developed by the system and then communicated to the user. Theorem proving is used in the Circuit-Fix-It Shop system to determine task completion, and dialogue is required for the acquisition of axioms that are missing but are required to complete a task step. Thus dialogue is integrated closely with task processing and is invoked when the system is unable to complete the task using its own resources. A brief overview of the theorem-proving approach and of the role of missing axioms in the production of dialogue is presented in the following paragraphs.

The role of the theorem prover is to determine task completion. The dialogue controller receives a suggested task (or goal) from the domain processor and selects which goal is to be solved. The dialogue controller decides when theorem proving is to be activated, whether language can be used to acquire missing axioms, and when theorems must be dynamically modified. The following example (discussed earlier in Section 3.3) illustrates this process.

The task to be completed involvesthe system finding out whether there isa wire between connectors 84 and 99,represented in GADL (Goal and ActionDescription Language) as

goal(computer,learn(ext_know(phys_state(prop(wire(84,99),exist,X),true)))).

Fig. 12. Inserting a substep into the proof tree.

X represents an unknown value for the existence of the wire. One way to solve this goal is to look for an appropriate axiom in the knowledge base, such as:

axiom(phys_state(prop(wire(84,99),exist,present),true))—that is, the wire is present,

or

axiom(phys_state(prop(wire(84,99),exist,absent),true))—that is, the wire is absent.

As there is no appropriate axiom in the knowledge base, the system determines that the next step should be for the user to add the wire, which should then provide the missing axiom that the wire is present. This step is conveyed by the system to the user using language. However, at this point the user is unable to complete this task step and requests help. This response does not satisfy the missing axiom. In conventional theorem proving, a new rule for the proof would be sought using backtracking. However, for this to be successful, theorem descriptions would have to be listed for every possible response that provides an axiom other than the required one. A more general approach is to modify the original theorem by inserting the substep that needs to be resolved (in this case, how to add the wire), and then resume the theorem-proving process once the substep has been solved. To do this, an Interruptible Prolog SIMulator (IPSIM) is employed to enable the dynamic modification of theorems. IPSIM inserts into the active theorem specification the required substep (learning how to add the wire) before the theorem step of acquiring the required axiom about adding the wire (see Figure 12).

As can be seen, the substep of learning how to add the wire has to be solved before the missing axiom of adding the wire can be acquired. Thus the problem solving that is involved in the Circuit-Fix-It Shop system is accomplished using theorem proving, while the Missing Axiom theory provides the mechanism for using language. The interruptible theorem prover creates subdialogues to solve substeps that are inserted into the partially completed proof. Given the overall control exerted by the dialogue controller, it is possible to select which substep to work on first and also to move between substeps (or subdialogues) if this is appropriate given the current state of the proof, the system’s inferences concerning the user’s knowledge, and the user’s responses. In this way, dialogue is closely integrated with problem solving, but the theorem-proving paradigm that is used with the interruptible theorem prover allows the dialogue to evolve dynamically.
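The missing-axiom mechanism can be sketched in a few lines of Python. This is an illustrative simplification, not the IPSIM implementation: the string-based goal and axiom representation, the scripted user, and the way a help request inserts a clarification substep in front of the interrupted step are all assumptions made for the example.

knowledge_base = set()   # axioms the system already has

def prove(goal, ask_user):
    """Try to discharge each step from the knowledge base; otherwise use
    language. A request for help inserts a substep before retrying."""
    steps = [goal]
    while steps:
        step = steps[0]
        if step in knowledge_base:
            steps.pop(0)                               # axiom available: step is proved
            continue
        reply = ask_user("Please do the following: " + step)
        if reply == "help":
            steps.insert(0, "explain how to: " + step)  # inserted clarification substep
        else:
            knowledge_base.add(step)                   # user supplied the missing axiom

replies = iter(["help", "ok, I see how to do it", "done, the wire is added"])

def scripted_user(prompt):
    print("SYSTEM:", prompt)
    answer = next(replies)
    print("USER:  ", answer)
    return answer

prove("add a wire between connectors 84 and 99", scripted_user)
print(knowledge_base)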

5.3.2. Plan-Based Approaches. In plan-based approaches to dialogue, utterances are treated in the same way as actions in a planning system that are performed in order to achieve some goal [Cohen 1994]. The goal of the utterance may be some desired physical state, such as having a drink of beer in a bar. In this case, an utterance that functions as a request for a beer is incorporated into a plan that also involves physical actions, such as the customer handing over some money and the bartender giving the beer. On the other hand, an utterance that has the function of conveying information effects a change in the listener’s mental state. Much of the early work involving plans in the 1980s was concerned with recognizing the intention behind an utterance and matching this intention with some part of a plan that might achieve a particular goal [Allen 1983]. A cooperative system would adopt the user’s goal, anticipate any obstacles to the plan, and produce a response that would promote the completion of the goal [Allen and Perrault 1980].

A key element of this approach was the modeling of utterances as speech acts, which, like action operators in planning, consisted of roles, preconditions, constraints, and effects. For example: the following is a definition of the communicative act ConvinceByInform [Allen 1995]:

Roles: Speaker, Hearer, Prop
Constraints: Agent(Speaker), Agent(Hearer), Proposition(Prop), Bel(Speaker,Prop)
Preconditions: At(Speaker, Loc(Hearer))
Effects: Bel(Hearer,Prop)

With this communicative act, if a speaker informs a hearer of some proposition and convinces the hearer of that proposition, one constraint (a condition that must be true) is that the speaker must believe the proposition. A precondition of the act—which, if not true, can be made true through a further action—is that the speaker should be at the same location as the hearer (for face-to-face communication). As a result of the communicative act, the hearer will believe the proposition. A plan to achieve a goal involving language would typically involve chaining together a series of such communicative acts, including acts that involve representing the intentions and communicative actions of another agent. Allen [1995] provided a detailed account of the planning underlying a simple dialogue concerned with the purchase of a train ticket. Part of this plan involves the agent purchasing the ticket finding out the price of the ticket from the clerk, which in turn requires the clerk to produce a ConvinceByInform act stating the price of the ticket. But to achieve this, the agent buying the ticket has to produce a MotivateByRequest act that has as its effect an intention on the part of the clerk to produce the required ConvinceByInform act. These acts are chained together in the correct sequence to achieve the goal.
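The following sketch shows how such a communicative act might be encoded as a planning operator over a set of believed facts. The dictionary-of-strings representation is a simplification invented for this illustration and is not Allen’s formalism; the example merely checks the constraint and precondition of a ConvinceByInform-style act and applies its effect.

def convince_by_inform(speaker, hearer, prop, world):
    # Constraint: the speaker must believe the proposition.
    constraints = {f"Bel({speaker},{prop})"}
    # Precondition: speaker and hearer are at the same location.
    preconditions = {f"At({speaker},Loc({hearer}))"}
    if not constraints <= world:
        raise ValueError("constraint violated: speaker does not believe the proposition")
    if not preconditions <= world:
        return None   # a planner would now plan a further action to achieve the precondition
    # Effect: the hearer comes to believe the proposition.
    return world | {f"Bel({hearer},{prop})"}

world = {"Bel(clerk,price_is_30)", "At(clerk,Loc(customer))"}
world = convince_by_inform("clerk", "customer", "price_is_30", world)
print("Bel(customer,price_is_30)" in world)   # True

Chaining several such operators together, each one establishing the preconditions of the next, is what a plan to achieve a goal involving language amounts to in this framework.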

There are several problems with plan-based approaches. In order to infer a speaker’s plan, the listening agent has to be able to recognize the communicative act performed by the speaker’s utterance and locate this act within a particular plan schema. If the communicative act has been incorrectly recognized, this will result in an incorrect identification of the speaker’s plan. At the very least this will require additional mechanisms for repairing the incorrect assignment of the speaker’s intention. In addition to this, however, there is the problem that the processes of plan recognition and planning, which involve chaining from preconditions of plans to actions, can in more complex cases become combinatorially intractable. As planning algorithms require reasoning from first principles, they are best suited for restricted domains in which the reasoning is kept to manageable proportions.

Much of the early work in planning focused on the analysis of individual utterances rather than on how these utterances could be combined to form a coherent dialogue. Litman and Allen [1987] extended the basic model in two directions. First, they introduced the notion of a hierarchy of plans, which included subdialogues for clarification or correction of a plan. They also introduced a distinction between domain plans and discourse plans. Domain plans, which involve task-related speech acts, are modeled using traditional plan-based approaches. Discourse plans, which are task-independent, involve acts used to control the dialogue such as “clarification” and “correction.” However, this approach still required an assumption of cooperativity between the agents and did not provide an explanation of the dialogue activity of agents who did not share mutual goals.

Some alternative approaches that address these issues will be presented in the next subsections. One direction, as illustrated in the TRAINS system, extends the traditional plan-based approach by modeling conversational agency within a multiagent action theory. A second direction, implemented in the SUNDIAL system, does not attempt to infer a speaker’s intentions but bases decisions about dialogue continuation on the current state of the dialogue and the system’s current belief state. Finally, a third approach, as exemplified by the ARTEMIS system, models dialogue as a process of rational interaction in which instances of dialogue structure emerge as a consequence of the dynamics of rationality principles.

5.3.3. Conversational Agency in the TRAINS System. The TRAINS project has extended early work in plan-based systems in two main directions. First, the approach to problem solving has required more elaborate reasoning to support collaborative and incremental planning. This is to enable the mixed-initiative planning that is usually involved when two agents collaborate to solve a problem. In the TRAINS project the system has the ability to plan low-level details such as routes between cities and has knowledge of constraints such as congestion at certain stations or whether an engine is available. The human manager, on the other hand, has knowledge of high-level goals, is aware of the motivations and justifications of particular plans of action, and can decide whether to relax soft constraints such as accepting a route that involves congestion in a particular city on the route. The system and the manager collaborate to construct an agreed upon executable plan. However, unlike in conventional planning, where the initial goal is completely specified and a complete plan can be produced, in the TRAINS project the initial goal is typically underspecified and the plan has to be developed incrementally in order to incorporate new constraints that arise, such as information about congestion. As a result modifications to the original plan may be required. The general model for this mixed-initiative planning involves four steps [Ferguson et al. 1996], sketched in code after the list:

Focus: Identify the goal or subgoal under consideration.

Gather constraints: Collect constraints involving resources, background information, and preferences.

Instantiate solution: Generate a solution as soon as one can be produced efficiently.

Criticize, correct, or accept: If the solution is criticized and modifications are required, continue with Step 2; otherwise, if the solution is acceptable, go to Step 1 and select a new goal or subgoal.
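A schematic rendering of this loop is given below. The helper functions (get_constraints, solve, critique) are placeholders for behavior that, in TRAINS, is realized through dialogue between the system and the human manager; this is a sketch of the control structure only, not of the project’s implementation, and the toy domain values are invented.

def mixed_initiative_planning(goals, get_constraints, solve, critique):
    """Schematic version of the four-step loop: focus, gather constraints,
    instantiate a solution, then criticize/correct or accept it."""
    plans = []
    while goals:
        goal = goals.pop(0)                         # Step 1: focus on a goal or subgoal
        constraints = list(get_constraints(goal))   # Step 2: gather constraints
        while True:
            solution = solve(goal, constraints)     # Step 3: instantiate a solution
            objections = critique(solution)         # Step 4: criticize, correct, or accept
            if not objections:
                plans.append(solution)              # accepted: return to Step 1
                break
            constraints.extend(objections)          # modifications needed: back to Step 2
    return plans

# Toy usage: route a single goal with one round of criticism.
result = mixed_initiative_planning(
    goals=["ship oranges to Avon"],
    get_constraints=lambda g: ["engine available at Bath"],
    solve=lambda g, c: (g, tuple(c)),
    critique=lambda s: [] if "avoid congestion at Corning" in s[1] else ["avoid congestion at Corning"],
)
print(result)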


Fig. 13. The BDI model of conversational agency.

This model provides a basis for the tasks to be performed in the dialogue and the order in which they are to be performed. It would be possible to develop a corresponding task hierarchy and translate this directly into a dialogue structure. However, a second aim of the project was to develop a more elaborate model of dialogue that would provide the required flexibility for mixed-initiative interaction and would be suitably integrated with the complex commonsense knowledge and reasoning of the problem-solving component.

The dialogue manager in the TRAINS system functions as a conversational agent that performs communicative acts when it sends messages to other agents, observes the communicative acts of the other agents, and maintains and monitors its own mental state. Thus the model of dialogue in the TRAINS system is more ambitious than in most other systems, where the dialogue component is a tool that provides a front-end interface to the other components of the system [Traum 1996]. The theoretical foundation of the conversational agent is provided by the BDI (Belief, Desire, and Intention) architecture of Bratman et al. [1988], which models agents that plan and execute actions in the physical world. The original BDI model is specialized to conversational actions, as shown in Figure 13 [Allen 1995]. Based on its current beliefs about the domain, including nested beliefs about shared knowledge, and the discourse obligations that each conversational agent has, the agent selects communicative goals, decides what speech act to perform next, generates an utterance, analyzes the manager’s response, and updates its beliefs about the discourse state and its own discourse obligations.

Discourse obligations are an important element of the conversational agent. Earlier plan-based models of dialogue were based mainly on an analysis of the intentions of speakers. A dialogue agent would construct a model of a speaker’s intentions, infer the speaker’s goals, and adopt a goal to achieve these goals. However, this model made strong assumptions of cooperativeness and failed to explain why an agent would still respond when it did not know an answer or did not wish to adopt the speaker’s goals. The solution was to make a distinction between the intentions that an agent might have in performing a task and the obligations that constitute the conventions of cooperative conversation [Traum and Allen 1994].


Table II. Some Obligation Rules [Traum and Allen 1994]

Source of obligation                          Obliged action
S1 Accept or Promise A                        S1 achieve A
S1 Request A                                  S2 address Request: accept A or reject A
S1 YesNo-Question whether P                   S2 Answer-if P
S1 WH-Question P(x)                           S2 Inform-ref x
Utterance not understood or incorrect         Repair utterance

These discourse obligations are socially based—they represent what an agent should normally do and they are generally addressed before task-related goals and intentions. Table II presents some discourse obligation rules. Thus, to take the example of a request by S1, there is a discourse obligation on S2 to respond to the request, though not necessarily to accept it.

In addition to these discourse obligations, the agent has to consider various general conversational obligations, such as acknowledging the utterances of others and not interrupting, coordinating mutual beliefs (grounding), and planning the execution of domain goals. The following discourse actor algorithm shows how these competing goals are prioritized [Traum and Allen 1994]:

while conversation is not finished
    if system has obligations
        then address obligations
    else if system has turn
        then if system has intended conversation acts
                 then call generator to produce NL utterances
             else if some material is ungrounded
                 then address grounding situation
             else if some proposal is not accepted
                 then consider proposals
             else if high-level goals are unsatisfied
                 then address goals
             else release turn or attempt to end conversation
    else if no one has turn
        then take turn
    else if long pause
        then take turn

This algorithm produces a reactive-deliberative model of dialogue agency in which the agent reasons about its discourse obligations and domain goals. Where there is a conflict, the agent displays a “relaxed” conversational style in which discourse obligations are addressed before the agent’s own goals. This gives rise to an interaction in which the initiative of the other agent is followed. However, in a less cooperative situation, the agent can continue to address its discourse obligations but can respond in different ways, for example, by rejecting requests and refusing to answer questions. Finally, the system can take the initiative, for example, in a situation where the other agent does not take a turn, by taking the turn and using this opportunity to address its own goals. Thus, depending on the flow of the dialogue and on the behavior of the other agent, the TRAINS conversational agent can shift its focus from the obligation-driven process of following the other’s initiative to the goal-driven process of taking the lead in the conversation.
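The priority ordering in the discourse actor algorithm can be captured compactly in code. The following sketch is a paraphrase of the published pseudocode, not the TRAINS implementation; the state flags are placeholders for the checks a real agent would perform.

def next_move(state):
    """Return the next dialogue move, with obligations addressed first,
    then grounding, proposals, and goals, and turn-taking last."""
    if state.get("obligations"):
        return "address obligations"
    if state.get("has_turn"):
        if state.get("intended_acts"):
            return "generate utterance"
        if state.get("ungrounded_material"):
            return "address grounding"
        if state.get("unaccepted_proposals"):
            return "consider proposals"
        if state.get("unsatisfied_goals"):
            return "address goals"
        return "release turn or end conversation"
    if state.get("no_one_has_turn") or state.get("long_pause"):
        return "take turn"
    return "wait"

print(next_move({"has_turn": True, "unsatisfied_goals": True}))  # -> address goals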

5.3.4. Event-Driven Dialogue in a Distributed Architecture—the SUNDIAL Dialogue Manager. As mentioned earlier, plan-based approaches depend on the system being able to assign an intention to the user’s utterances that can be related to the user’s plan. An alternative approach models dialogue on the basis of a combination of the system’s belief and intention states, and does not attempt to model these states in the user. This is the approach adopted in the SUNDIAL system [McGlashan et al. 1990], in which dialogue continuation is based on the results of a contextual semantic interpretation of the user’s utterance and on monitoring changes in the system’s belief state. This approach is event-driven in the same way as a frame-based system, but the difference is that the representation of context in SUNDIAL is more complex than that used in frame-based systems. SUNDIAL uses a distributed architecture, in which the various functions of the dialogue component are realized in different modules, as shown in Figure 14. The task module represents the task structure of an application, the belief module represents the interpretation of utterances in the current dialogue context, and the dialogue module, which contains rules for dialogue behavior, deals with predictions for the next user utterance and strategies for how the dialogue should continue.


Fig. 14. The SUNDIAL architecture.

The following dialogue can be used to illustrate this method:

System1: Hello Sundial reservation system. Can I help you?

User1: I’d like a ticket from London to Paris.

System2: London to Paris. When do you plan to leave?

At the point in the dialogue where the user has just produced User1, the Belief Module has recorded that the items Departure City and Arrival City were recognized with average recognition scores. The next items on the agenda for the task module are to request the Departure Date and Departure Time. Given these circumstances, the system could choose from a range of possibilities for continuing the dialogue:

(1) explicit confirmation of departure city (Did you say you want to fly from London?);

(2) explicit confirmation of arrival city (Did you say you want to fly to Paris?);

(3) explicit confirmation of both departure and arrival cities (Did you say you want to fly from London to Paris?);

(4) request for departure date (On which date do you wish to fly?);

(5) request for departure time (At what time do you wish to depart?).

There are also many more possibilities involving further combinations of these items.

The solution that was adopted in the current example was System2: London to Paris. When do you plan to leave? Here two possible dialogue allowances are combined: an implicit (rather than an explicit) request for confirmation, as the recognition score for the two parameters indicated reasonably reliable recognition, and a request for information about the departure time. Figure 15 presents a snapshot of the system at this point in the dialogue.

The processes that underlie this behavior are based on mappings from the system’s belief state to the dialogue level [Heisterkamp and McGlashan 1996]. The user’s utterance is analyzed semantically, yielding a surface semantic description in the SIL representation (see Section 4.2.2.1), and then interpreted in the context of the current task to provide a task-level interpretation. In this example, two new items are added to the system’s contextual model: the departure place and the arrival place. The contextual functions of these semantic items derived from the user’s utterance result in some change in the system’s contextual model and are represented as follows:

—new for system(goalcity:paris).
—new for system(sourcecity:london).


Fig. 15. Snapshot of the dialogue state.

Table III. Contextual Functions and Goals

new for system(X)            confirm(X) or specify([X,Y])
repeated by user(X)          cancel ‘confirm’ goals for X
inferred by system(X)        introduce goal confirm(X)
modified by user(X)          introduce repair goal confirm(X)
negated by user(X)           repair(X)

The next level of processing involves evaluating the contextual functions to see whether they solve a goal, modify a goal, or introduce a new goal. The complete list of contextual functions used and their associated goals is shown in Table III. The dialogue goals form a conflict set over which the dialogue strategy operates to determine the best dialogue continuation. The goals are grouped into classes: initiatives, responses, and evaluations. Generally, evaluations are given a higher priority than responses, and responses have precedence over initiatives, with the result that confirmations and answers to the user’s questions will be resolved earlier than questions to be asked by the system.
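The sketch below gives one way such a mapping and priority ordering could be coded. The class assignments loosely follow Table III, but the function names, the grouping into classes, and the tie-breaking are assumptions for illustration rather than details of the SUNDIAL implementation.

GOAL_FOR = {
    "new_for_system":     ("response",   "confirm"),
    "inferred_by_system": ("evaluation", "confirm"),
    "modified_by_user":   ("evaluation", "repair_confirm"),
    "negated_by_user":    ("evaluation", "repair"),
}
CLASS_RANK = {"evaluation": 0, "response": 1, "initiative": 2}

def plan_continuation(contextual_functions, pending_requests):
    """Turn contextual functions into dialogue goals, add the task module's
    pending requests as initiatives, and order the conflict set so that
    evaluations precede responses, which precede initiatives."""
    goals = []
    for function, item in contextual_functions:
        goal_class, goal = GOAL_FOR[function]
        goals.append((CLASS_RANK[goal_class], goal, item))
    for item in pending_requests:
        goals.append((CLASS_RANK["initiative"], "request", item))
    return [(goal, item) for _, goal, item in sorted(goals)]

print(plan_continuation(
    [("new_for_system", "sourcecity:london"), ("new_for_system", "goalcity:paris")],
    ["departure_date", "departure_time"]))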

Fig. 16. Metastrategy of degradation and recovery.

Combining goals, as in this example, where the system provides both an implicit confirmation of two values as well as a further question within the same utterance (London to Paris, when do you plan to leave?), depends on the current state of the dialogue. The default setting is for any number of responses (except confirm goals for inferred items) and one initiative to be combined, thus enabling the dialogue to proceed more quickly. However, a metastrategy of degradation and recovery, in which the dialogue goals are organized hierarchically from more open to more restricted interaction, as shown in Figure 16 [Giachin and McGlashan 1997], is used for situations when problems arise and the system has to focus on repair, and also to return to more flexible behavior as dialogue processing improves. If problems occur in a dialogue, the confirmation strategy changes from implicit through single explicit confirmation, moving at the lowest level to the use of spelling or menus. Once the problems have been resolved, the opposite sequence comes into operation.
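A minimal sketch of this metastrategy follows. The three levels and the move-one-level-per-turn policy are taken from the description above, but the success signal and the exact recovery policy are assumptions made for the example.

LEVELS = ["implicit confirmation", "explicit confirmation", "spelling or menu"]

class ConfirmationStrategy:
    def __init__(self):
        self.level = 0                  # start with the most open interaction

    def update(self, turn_succeeded):
        """Degrade one level after a problematic turn, recover one level
        after a successful one, and report the current confirmation style."""
        if turn_succeeded:
            self.level = max(0, self.level - 1)               # recovery
        else:
            self.level = min(len(LEVELS) - 1, self.level + 1)  # degradation
        return LEVELS[self.level]

s = ConfirmationStrategy()
print(s.update(turn_succeeded=False))   # explicit confirmation
print(s.update(turn_succeeded=False))   # spelling or menu
print(s.update(turn_succeeded=True))    # back to explicit confirmation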

The approach adopted in the SUNDIAL project overcomes the objections encountered in a state transition network approach, in which the range of possible dialogue transitions would be extremely large at each node if flexible dialogue behavior were permitted that accounted for all the required circumstances, such as the degree of confidence in the recognition of items in the user’s input. Thus, instead of a global dialogue control structure in which the dialogue paths are determined in advance, this approach adopts a local management strategy in which dialogue control is determined on a turn-by-turn basis, depending on the information available in the various modules of the dialogue component. The focus on system beliefs and intentions avoids the difficulties of identifying the user’s goals and intentions that are associated with plan-based approaches. This approach also offers a more generic architecture which in principle can be more easily ported to other applications, whereas in the state transition network approach each new application has to be constructed ab initio.

5.3.5. Dialogue as Rational Interaction. The “Rational Interaction” approach to dialogue views communication as a special case of intelligent behavior. The following simple example illustrates this approach [Sadek and de Mori 1998]:

User: Is server 36-68-02-22 operated by Meteo France?

System: (answers ‘yes’ or ‘no’).

In order to make a decision as to how to answer the user’s query, the system engages in the following sequence of actions:

(1) system infers the intention of the user to know if p (p = server 36-68-02-22 is operated by Meteo France);

(2) system adopts the intention that the user eventually comes to know if p;

(3) system adopts the intention of informing the user that p or of informing the user that not p.

This chain of reasoning may appear at first glance to be a case of “overkill” with regard to this simple example. However, the rationale for this approach becomes clear if the answer to the query is no, as in order to cooperatively address the user’s intention, the system would have to plan to supply additional information, for example, that the server is operated by some other operator, or that it is some other server that is operated by Meteo France. This plan would involve reasoning about the relevance of the information, that is, what the user needs to know as well as what the user does not need to be told. Answering yes could also involve a chain of reasoning, for example, deciding whether the user is authorised to know p or inferring the consequences of telling the user that p.

In this approach dialogue structure emerges dynamically as a consequence of principles of rational cooperative interaction. Processes of dialogue can be explained in terms of the plans, goals, and intentions of the agents involved in the dialogue. Plans themselves are not predetermined schemas of action but are derived deductively from rationality principles. To take a simple example: since dialogue is a joint activity between two agents, each agent in the dialogue has a commitment to being understood. This commitment explains and motivates the need for confirmations and requests for clarification in dialogue [Cohen 1994]. While agents normally have the goal of behaving cooperatively in dialogue, an agent does not necessarily have to adopt another agent’s goals, if there is good reason not to. For example, an agent should not supply information that is confidential or assist in actions that it knows to be illegal or harmful. In other words, an agent has to attempt to achieve a rational balance between its own mental attitudes and those of other agents and between these mental attitudes and desired plans of action.

The theoretical framework for rational agency is based on a set of logical axioms that formalize basic principles of rational action and cooperative communication. The foundations for this framework, developed by Cohen and Levesque [1990], were extended by Sadek and implemented as the basis of the dialogue management component of the ARTEMIS system. ARTEMIS is an agent technology developed at France Telecom-CNET as a generic framework for specifying and implementing intelligent dialogue agents [Sadek and de Mori 1998]. The AGS application, which was developed using ARTEMIS, provides information in the areas of employment and weather forecasts [Sadek et al. 1997]. In addition to the usual components of a spoken dialogue system, the dialogue manager of ARTEMIS applications includes a “rational unit,” which supports reasoning about knowledge and actions and enables the dialogue agent to produce rationally motivated plans of action, including communicative acts, in response to the user’s utterances. The essential elements of the rational unit are a number of principles of rationality and cooperation.

The rationality principles describe an agent’s reasoning about actions, beliefs, and plans to modify the world, for example, through actions, including communicative acts that change another agent’s mental state. An agent’s intentions are defined in terms of the agent’s beliefs about the world, as well as its goals and commitments. The principles are formalized as a set of logical axioms. Some examples will give a flavor of the formalism and how it is applied to dialogue agency.

One of the properties of a rational agent is consistency of beliefs. This property is modeled in terms of mental attitudes such as belief (B) and choice (C), as follows:

B(i, φ) ⇒ ¬B(i, ¬φ).

That is, if agent i believes formula schema φ, then he or she will not believe that φ is not true. The logical model for choice specifies that an agent chooses the logical consequences of its goals:

|= (C(i, φ) ∧ B(i, φ ⇒ ψ)) ⇒ C(i, ψ).

That is, it is a valid formula that, if agent i desires that φ should be true and believes that ψ is a consequence of φ, then he or she will adopt the desire that ψ be true. Within the world of action, there is a need for an action model that specifies the rational effect (RE) of an action, that is, the reasons for which an action is performed. Also needed are the action’s feasibility preconditions (FP), that is, the conditions that have to be true for the action to be feasible. The following simple example illustrates the act of informing:

〈i, Inform(j, φ)〉
FP: B(i, φ) ∧ ¬B(i, B(j, φ))
RE: B(j, φ).

In other words, for i to inform j about φ, it must be the case that i believes φ and that i believes that j does not already know φ. The first part of the feasibility precondition models the sincerity condition that is an integral element in cooperative communication, while the second part models principles of quantity and relation, that is, that an agent should not tell another agent something that they already know. The rational effect of this communicative act is that agent j comes to know φ.
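As a concrete illustration, the sketch below checks the feasibility preconditions of such an Inform act over a simple set of belief formulas and, if they hold, asserts its rational effect. The string-based belief store and the function signature are simplifications invented for the example, not part of the ARTEMIS formalism.

def inform(i, j, prop, beliefs):
    """i informs j that prop if the feasibility preconditions hold:
    i believes prop, and i does not already believe that j believes prop."""
    fp = (f"B({i},{prop})" in beliefs) and (f"B({i},B({j},{prop}))" not in beliefs)
    if not fp:
        return beliefs, False          # the act is not feasible in this state
    # Rational effect: j comes to believe prop (and i now believes that j believes it).
    return beliefs | {f"B({j},{prop})", f"B({i},B({j},{prop}))"}, True

beliefs = {"B(system,server_operated_by_meteo_france)"}
beliefs, performed = inform("system", "user", "server_operated_by_meteo_france", beliefs)
print(performed, "B(user,server_operated_by_meteo_france)" in beliefs)   # True True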

The cooperation principles express the motivation for an agent to behave cooperatively with respect to another agent. Agents in dialogue accept a number of minimal commitments, for example, to participate actively in the conversation, to attempt to understand the other agent’s concerns, and to generate answers to these concerns that are cooperative. For example: there is a commitment to ensure mutual understanding, which can involve providing confirmations or making requests for clarification, as well as taking care that the other agent does not have erroneous beliefs. Cooperation also involves adopting the other agent’s intentions, provided that this does not conflict with the first agent’s own intentions. These principles of cooperation have been formalized in a similar way to the rationality principles.


The view that dialogue is a special case of rational behavior brings several advantages. Given that dialogue involves a joint commitment to mutual understanding, there is a motivation for agents to make their intentions clear through confirmations, clarifications, repairs, and elaborations. Although these behaviors are included in other approaches, there is no theoretical motivation for their inclusion. The theory also accounts for different contexts of interaction and explains why an agent might provide more information than is required by the other agent’s query. For example: if a user asks for an address, the system might also provide a telephone number, if one is available. However, this additional information should not be implemented as an automatically generated response schema but rather as something to be determined within a particular context of interaction on the basis of the rationality principles. Finally, the theory provides a basis for more advanced dialogues, for example, those involving negotiation rather than simple information retrieval, where various types of cooperative and corrective responses may be required.

5.3.6. Advantages of Agent-Based Systems. As mentioned earlier, finite state-based and frame-based dialogue control methods are not suitable for more complex dialogues that involve more than the elicitation of a number of predetermined parameters to enable some information to be retrieved from a database or a simple transaction to be performed. More complex technologies are required for dialogue systems that support collaborative problem solving. Such dialogues involve negotiation of a task in which both system and user ask questions, request clarifications, make corrections and suggestions, and change the topic. In a mixed-initiative dialogue such as this, the dialogue agent has to be able to maintain and reason over explicit models of the task at hand, of the current dialogue state, and of its own and the other agent’s beliefs and intentions. The agent-based dialogue systems described in this section, although mainly still at the research laboratory stage, illustrate a number of different approaches to the long-term goal of enabling conversational interaction with computers.

5.3.7. Disadvantages of Agent-Based Systems. The main disadvantage of agent-based dialogue control is that it requires much more complex resources and processing than the simpler dialogue control methods. In order to engage in a mixed-initiative dialogue involving negotiation and collaboration, a system requires more sophisticated natural language capabilities. Whereas simple pattern-matching and concept-spotting techniques are sufficient for finite state and frame-based systems, a deeper semantic representation is required to interpret the user’s input in more open-ended mixed-initiative dialogues.

Integration of dialogue control with domain and task knowledge is a further challenge for agent-based systems. It is possible to handcraft domain-specific and task-specific knowledge into a dialogue system and then to use simpler dialogue control methods. However, this raises the issue of reusability, as systems have to be completely redesigned when porting to new domains. A more satisfactory solution is to develop a generic domain-independent dialogue management component that can be easily adapted to new tasks. Essentially this is the solution adopted in the agent-based systems described in this section. So, for example, in the theorem-proving approach, while the domain processor is application-specific, the general reasoning component and the knowledge component are domain-independent and incorporate general mechanisms for reasoning with the knowledge contained in the domain processor (see Section 4.4.2). The same applies to systems based on planning (Section 5.3.2) and rational agency (Section 5.3.5). In the TRIPS project, the successor to the TRAINS project (Section 5.3.3), an architecture has been developed that includes an abstract problem-solving model supporting the underlying structure for a collaborative task-based dialogue in terms of key concepts such as objectives, solutions, resources, and situations. Porting to a new domain and task is a matter of specifying mappings from this model to operations in the new domain [Allen et al. 2000].

Many of the techniques required to support agent-based dialogue control, such as intention recognition and reasoning, are computationally intensive. Finding ways of implementing these techniques to enable real-time performance and to support open-ended, mixed-initiative dialogues remains a challenge for researchers in spoken dialogue.

5.4. Summary

With such a number of different approaches to dialogue management, it is reasonable to ask which approach is most appropriate for a particular application. Conversational agents that incorporate principles of rationality and cooperation would seem to be the obvious choice, as they come closest to modeling human conversational competence. Certainly, for applications that involve cooperative problem solving with negotiated solutions, the simpler types of dialogue control are not sufficient. On the other hand, for simple applications and for constrained subtasks within some applications, more basic techniques such as finite state and template-based control may be appropriate.

Finite state networks are suitable for simple, well-structured tasks that can be decomposed into clearly defined subtasks. They are also suitable for tasks involving sequential form filling. A finite state-based system will be system-driven, with a predefined dialogue path from which the user cannot deviate. Ideally the user’s responses should be short, in the form of single words or simple phrases, and a major task in the design of such a system is the careful wording of the system prompts so as to constrain the user to answer in this way. When designed using a state transition diagram, these systems have a clear and intuitive semantics. The disadvantage of finite state systems is that they are inflexible and cannot easily be designed to permit the user to correct some previously elicited value or to change easily from one subtask to another. Some of these deficiencies have been addressed by incorporating additional functionalities such as natural language processing and frames into the basic model [McTear et al. 2000]. However, these additions tend to obscure the semantics of the system and increase the number of alternative paths through the dialogue, with the potential of leading to combinatorial explosion. For these reasons finite state systems are best used for simple tasks involving the elicitation of a small number of items from the user, or for well-defined subtasks within a larger application, such as eliciting a date or time.

Frame-based systems permit a greater degree of flexibility, as the user can provide more information than required by the system’s question and, in addition, does not need to supply all the information at one time if this is not possible. The system keeps track of what information is required and asks its questions accordingly. The design of a frame-based system is declarative, as in a production system, thus providing a clear semantics. However, frame-based systems are restricted to basic information retrieval tasks with no facility for negotiation of the information. Some of the dialogue behaviors that a frame-based system might display, such as clarifying old information before asking for new values, are hard-coded into the system’s control structure. In a system based on conversational agency these behaviors would be more flexible and would be based on a process of explicit deliberation by the system. Thus, for applications involving the elicitation of a fixed set of information from the user and the retrieval by the system of a clearly defined set of information in return, a frame-based system provides a suitable solution with greater flexibility than a finite state-based system. The Philips train timetable system is a good example of such an application. However, an application that involved consideration of constraints imposed by the user’s goals, such as the need to make particular connections or to be able to change goals during the course of an interaction, would be beyond the scope of a frame-based system.

There is a wide range of agent-based approaches. These are usually motivated by particular theories of dialogue. The Circuit-Fix-It Shop system views problem solving as theorem proving. Dialogue control evolves dynamically through the mechanism of interruptible theorem proving, which is used to deal with missing axioms and user requests for clarification and help. Axioms that describe aspects of the user’s domain-specific knowledge are used effectively to enable the system to engage in a limited amount of user modeling. Other approaches, such as those adopted in TRAINS, SUNDIAL, and Traum’s conversational agency, are motivated more by linguistic theories of dialogue, where the focus is on a cooperative dialogue strategy that evolves dynamically over the course of the interaction. In these systems goals, beliefs, intentions, and obligations form a conflict set that is resolved according to a particular conversational metastrategy. Finally, in the rational agency approach, communication is viewed as a special case of rational behavior and dialogue control is determined in terms of axioms that encode principles of rational cooperative behavior.

While there have been no comprehensive studies to date of the costs and benefits of these different dialogue management approaches, some of the methods that have been developed for the design and evaluation of spoken dialogue systems address these issues in part. These will be reviewed in the next section, while Section 7 will examine some dialogue development tools that are available to build finite state and frame-based systems.

6. SPECIFYING, DESIGNING, AND EVALUATING A SPOKEN DIALOGUE SYSTEM

Developing a spoken dialogue system can be viewed as a special case of software engineering, with its own methods and evaluation criteria that have evolved over the past few years. A recent EU project, DISC (Spoken Language Dialogue Systems and Components), is concerned with specifying a best-practice methodology for the development and evaluation of spoken dialogue systems [Dybkjær et al. 1997]. Various sets of guidelines and standards have emerged as a result of research projects such as the Danish Dialogue Project [Bernsen et al. 1998], the EU-funded EAGLES project on standards for spoken language systems [Gibbon et al. 1997], and projects funded in the US under the ARPA initiatives on spoken language systems [Hirschman 1995]. The main trends in this work are reviewed in this section, looking first at the methods employed to support the specification, design, and development of spoken dialogue systems, and then at methods of evaluation that have been used.

6.1. Development Methodologies

Developing a spoken dialogue system involves deciding on the tasks that the system has to perform in order to solve a problem interactively with a human user; specifying a dialogue structure that will support the performance of the task; determining the recognition vocabularies and language structures that will be involved; and designing and implementing a solution that meets these criteria.

Various methods are in common use for establishing system requirements. These include: literature research, interviews with users to elicit the information required to construct the domain and task models; field-study observations or recordings of humans performing the tasks; field experiments, in which some parameters of the task are simulated; full-scale simulations; and rapid prototyping. In order to illustrate the issues involved, the two most commonly applied methods—design based on an analysis of human-human dialogues, and design based on simulations—will be described, followed by a discussion of usability issues and of design guidelines and standards.

6.1.1. Design Based on the Analysis of Human-Human Dialogues. Human-human dialogues provide an insight into how humans accomplish task-oriented dialogues. Considerable effort has gone into collecting corpora of relevant dialogues, many of which are publicly available, such as the TRAINS corpora and the CSLU corpora. Analysis of the TRAINS corpora can provide information about the structure of the dialogues in support of conversational modeling—for example, whether task-oriented dialogues consist mainly of a single topic—and about the range of vocabulary and language structures involved. The CSLU corpora, on the other hand, are focused mainly on modeling accents and multilingual pronunciations.

Analysis of natural dialogues may also pinpoint some aspects of how humans interact with software such as email and calendar applications when they are using a speech-based rather than a graphical user interface. In the SpeechActs projects at Sun Microsystems Laboratories, predesign studies are used before dialogue design to help the designer view the task from the user’s perspective and to develop a feel for the style of interaction [Yankelovich n.d.]. One of the findings was that users of the calendar application typically used relative dates such as tomorrow or next Monday, whereas absolute dates would be used in the version provided in the graphical user interface. The organization of information, such as the numbering of messages in Sun’s Mail Tool GUI, gave rise to confusion in the spoken language interface, as it became difficult to keep track of which messages were new and to refer back easily to previously read messages [Yankelovich et al. 1995]. Thus it was concluded that users of a speech user interface (SUI) employ a different set of mental abilities compared to when they use a graphical user interface (GUI). For this reason it was recommended that methods be developed to compensate for the lack of visual cues when interacting with software applications over a telephone line. The predesign studies had an important bearing on issues such as these, as well as on the design of prompts, the selection of verification strategies, and the provision of immediate feedback.

6.1.2. Design Based on Simulations: Wizard of Oz and “System in the Loop.” Although the analysis of human-human dialogues can provide useful information to support the design of spoken dialogue systems, the main drawback of this approach is that it is not possible to generalize from unrestricted human-human dialogue to the more restricted human-computer dialogues that can be supported by current technology. Current systems are restricted by limited speech recognition capabilities, limited vocabulary and grammatical coverage, and limited ability to tolerate and recover from error. To investigate how humans might talk to a more restricted dialogue partner, such as a computer system in a situation where no such system presently exists, some sort of simulation of the system is required.

The Wizard-of-Oz (WOZ) method is commonly used to investigate how humans might interact with a computer system [Fraser and Gilbert 1991]. In this method a human simulates the role of the computer, providing answers using a synthesized voice, and the user is made to believe that he or she is interacting with a computer. The situation is controlled with scenarios, in which the user has to find out one or more pieces of information from the system, for example, a flight arrival time and the arrival terminal. The use of a series of carefully designed WOZ simulated systems enables designs to be developed iteratively and evaluation to be carried out before significant resources have been invested in system building [Gibbon et al. 1997]. One of the greatest difficulties facing the WOZ method is that it is difficult for a human experimenter to behave exactly as a computer would, and to anticipate the sorts of recognition and understanding problems that might occur in the real system.

To overcome this disadvantage, the “System in the Loop” method may be used. In this case, a system with limited functionality is used to collect data. For example, the system might incorporate speech recognition and speech understanding modules on the first cycle, while the main dialogue management component may still be missing. On successive cycles additional components can be added and the functionality of the system increased, thus permitting more data to be collected. It is also possible to combine this method with the WOZ method, in which the human wizard simulates those parts of the system that have not yet been implemented.

6.1.3. Usability Analysis. As with any other software, the success of a spoken dialogue system does not depend solely on the functionality and performance of the software but also on its usability and acceptance by the users for whom it is intended.

One aspect of usability is to determine the costs and benefits of the proposed system. Lennig et al. [1995] described the development of a system that aimed to automate the handling of some directory assistance (DA) calls in Bell Canada. One of the initial investigations involved determining where potential savings could be made. Since the average operator work time per call was found to be approximately 25 seconds, and the cost to companies in the US of providing directory assistance was estimated to be over $1.5B, a reduction of 1 second in the work time per call would represent savings of over $60M per year. Operator acceptance was another factor that was investigated in this study. Operators were generally positive and particularly welcomed the fact that with the automatic system they did not have to continually repeat the same information. Using the system was easier on their voices and also required less keying, thus avoiding the problems of repetitive strain injury.

Similar findings emerged in a part-automated directory enquiries system developed by Vocalis for Telia TeleRespons, Sweden’s leading network services provider [Peckham n.d.]. A major concern was how to reduce the running costs of directory enquiries without compromising customer satisfaction. The solution was a semiautomated system using a combination of voice response and speech recognition technology. The main tasks of an operator handling a directory enquiry call are identified and those parts that can be handled automatically are specified. Voice processing technology is used at the beginning and end of calls—to greet the caller and to release the requested number. The search for the number is handled by the human operator. Word spotting allows the speech recognition component to recognize key words in the midst of extraneous words and sounds, while “talkover” allows the caller to speak over the system output if they do not wish to wait for the machine to finish before they begin speaking. The commercial benefits of the system include increased staff productivity and substantial cost savings. From the technical perspective, the system achieves a speech recognition accuracy level of 97%, while Telia TeleRespons benefits from an 8% increase in efficiency and savings of millions of pounds each year. Research from Telia TeleRespons shows that over 90% of people are pleased to use the system and that an additional 6–7% actually prefer it. All the significant factors, including market conditions, pricing, the role of the operators, and the views of the unions, suppliers, and customers, were analyzed at the outset before the technical requirements of the system were considered.

6.1.4. Requirements Specification. Following the analysis of requirements for a spoken dialogue system based on one or more of the methods described in the preceding paragraphs, a formal requirements specification can be produced. The most elaborate approach would appear to be that employed in the Danish Dialogue Project, in which two sets of documents—a Design Space Development (DSD) and a Design Rationale (DR)—are produced [Bernsen 1993]. A DSD document (or frame) represents the design space structure and designer commitments at a given point during system design, so that a series of DSDs provides a series of snapshots of the evolving design process. A DSD contains information about general constraints and criteria as well as the application of these constraints and criteria to the system under development in the Danish Dialogue Project.


Table IV. Sample Entries from a DSD Frame [Dybkjær et al. 1996]

A. General constraints and criteria
  Overall design goal: Spoken language dialogue system prototype operating via the telephone and capable of replacing a human operator.
  Realism criteria: The artefact should be preferable to current technological alternatives. The system should run on machines which could be purchased by a travel agency.
  Usability criteria: Maximize the naturalness of user interaction with the system. Constraints on system naturalness resulting from trade-offs with system feasibility have to be made in a principled fashion, based on knowledge of users, in order to be practicable by users.

B. Application of constraints and criteria to the artefact within the design space
  System aspects: 500 words vocabulary. Max. 100 words in active vocabulary. Limited speaker-independent recognition of continuous speech. Close-to-real-time response. Sufficient task domain coverage.
  Task aspects: User tasks: Obtain information on and perform booking of flights between two specific cities. Use single sentences (of max. 10 words). Use short sentences (average 3–4 words).

C. Hypothetical issues
  Is a vocabulary of 500 words sufficient to capture the sublanguage vocabulary needed in the task domain?

Table IV shows some sample entries from a DSD (from [Dybkjær et al. 1996]). A DR frame represents the reasoning about a particular design problem. An example given in Dybkjær et al. [1996] describes a feature which had not been taken into account in the original specification—that users were not able to get the price of the tickets they had reserved. The DR contains information concerning the justification for the original specification, a list of possible options, the resolution adopted, and comments. In this way the evolving design and its rationale are comprehensively documented.

In addition to requirement specification documents such as these, the EAGLES handbook [Gibbon et al. 1997] recommends a formal and explicit description of the proposed dialogue. One way to do this would be to represent the dialogue flow as a flowchart, state transition network, or dialogue grammar, in which all reachable states in the dialogue are specified, along with information on what actions should be performed in each state and how to decide which state should be the next. Tools exist for displaying this type of information graphically. For example, in the Danish Dialogue Project, DDL (Dialogue Description Language), a graphical language for describing state transition diagrams for event-driven systems, is used to provide a formal specification of spoken dialogue systems.
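The kind of explicit state-based specification recommended here can be illustrated with a small, self-contained sketch. The states, prompts, and transition conditions below are invented for a toy flight-booking dialogue and do not reproduce DDL or any of the systems discussed.

DIALOGUE_SPEC = {
    "greet":    {"action": "prompt: Welcome. Where do you want to fly to?",
                 "next": lambda ctx: "get_date" if ctx.get("destination") else "greet"},
    "get_date": {"action": "prompt: On what date?",
                 "next": lambda ctx: "confirm" if ctx.get("date") else "get_date"},
    "confirm":  {"action": "prompt: Shall I book that?",
                 "next": lambda ctx: "done" if ctx.get("confirmed") else "greet"},
    "done":     {"action": "book ticket and say goodbye", "next": lambda ctx: None},
}

def run(spec, contexts):
    """Step through the specification, using one scripted context per turn."""
    state = "greet"
    for ctx in contexts:
        print(state, "->", spec[state]["action"])
        state = spec[state]["next"](ctx)
        if state is None:
            break

run(DIALOGUE_SPEC, [{"destination": "Paris"}, {"date": "Monday"}, {"confirmed": True}])

Writing the specification down in this explicit form makes every reachable state, its action, and its transition conditions inspectable before any speech components are integrated.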

6.1.5. Design Guidelines. The theoretical basis for much of the work on design guidelines comes from a theory of cooperative conversation developed by the philosopher of language Grice [1975]. Grice identified a number of maxims underlying cooperative conversation concerning the quantity, quality, relation, and manner of communication. For example: the quantity maxim stated that a speaker should be as informative as required, but not more informative than required; the quality maxim related to the truth of a conversational contribution, and the relation (or relevance) maxim to its relevance. Some commonly used evaluation metrics, such as Contextual Appropriateness (Section 6.2.2), as well as the usability guidelines developed in the Danish Dialogue Project (see next section), are based loosely on Grice’s work.

6.1.5.1. Guidelines in the Danish Dialogue Project. Gricean maxims have been developed and extended into a set of usability guidelines in the Danish Dialogue Project [Bernsen et al. 1996]. A first set of the guidelines was developed on the basis of analysis of 120 examples of user-system interaction problems identified in a corpus of dialogues from the Wizard-of-Oz (WOZ) simulations of the Danish dialogue system. The guidelines were subsequently refined and consolidated into a tool called DET (Dialogue Evaluation Tool), which can be used to support the design of cooperative dialogue systems and as a tool for diagnostic evaluation [Dybkjær et al. 1997]. DET consists of 22 guidelines grouped under seven different aspects of dialogue, such as informativeness and partner symmetry, and divided into generic (GG) and specific guidelines (SG). The following are some examples (those marked with * are based on Grice):

Informativeness

GG1. *Make your contribution as informative as is required (for the current purposes of the exchange).

SG1. Be fully explicit in communicating to users the commitments they have made.

Partner Symmetry

GG10. Inform the dialogue partners of important nonnormal characteristics that they should take into account in order to behave cooperatively in dialogue. Ensure the feasibility of what is required of them.

SG4. Provide clear and comprehensible communication of what the system can and cannot do.

(Source: [Dybkjær et al. 1997])

Using the guidelines for evaluation involves analyzing transcripts of dialogues to identify instances of violations of the guidelines, which are marked up in the transcripts. The violations can then be examined in greater detail, disagreements between analyzers resolved, and recommendations developed for enhancing cooperativity in the dialogue system. The generality of the guidelines has been explored by applying them successfully as a dialogue design guide to part of a corpus from the Sundial project [Dybkjær et al. 1997].

6.1.5.2. The EAGLES Guidelines. In the EAGLES handbook a series of recommendations has been proposed to support the design of spoken dialogue systems. These guidelines include recommendations for the design of interactive voice response (IVR) systems and for the design of prompts. The following is a summary of the recommendations for the design of dialogue systems [Gibbon et al. 1997]:

(1) Data collection:
  —Study of recordings of human-human interaction in a situation similar to the one in which the system will be used.
  —Wizard-of-Oz simulations.
  —Transcription of the dialogues.

(2) Specification, design, and implementation of a first version (X) of the dialogue system.

(3) Tests:
  —Laboratory tests using corpora recorded in Wizard-of-Oz simulations, and then with laboratory staff simulating users, recording new data.
  —Field tests with real users, recording new corpora.

(4) Tune the system by iteratively modifying, then testing it.

(5) Design and implement an X + 1 version of the system, integrating new technologies.

(6) Tests (as in step 3).

(7) Return to step 4 unless the system is deemed to be complete.

Specific recommendations concerning the dialogue model and the vocabulary of the system are included in the following additional guidelines:

(1) Conduct a dialogue act analysis of the dialogues collected in the corpora, paying special attention to the conditions that must be satisfied in order to proceed from one dialogue state to the next.

(2) Describe the dialogue state transitions using some formally explicit apparatus (such as a flowchart or formal specification language).

(3) Use the data to identify the total lexicon required, then divide it into sublexicons, where each sublexicon is associated with a dialogue act.

(4) Use the data to identify a covering grammar, then divide it into subgrammars, where each subgrammar is associated with a dialogue act.


6.2. Evaluation

Most of the methods used to support the specification and design of spoken dialogue systems—such as corpus analysis, WOZ, and system-in-the-loop—can also be used to collect data for evaluation. This section will focus on the metrics that are employed rather than on the methods of data collection.

Evaluation of spoken dialogue systems can involve either evaluation of the individual components (glass box evaluation) or evaluation of the system as a whole (black box evaluation). Evaluation of individual components, with measures such as word accuracy and sentence accuracy, has been employed for some time to measure the performance of spoken language systems under the ARPA initiatives [Hirschman 1995]. It is only more recently that measures have been developed for spoken dialogue systems as a whole. Both types of evaluation have been described in some detail in a review paper by Baggia [1996], from which much of the material in this section is derived. See also Smith [1997] and Smith and Gordon [1997] for a comparable set of evaluation methods.

6.2.1. Evaluation of Individual Components. Evaluation of individual components is generally based on the concept of a reference answer, which determines the desired output of the component to be compared with its actual output. Reference answers are easier to determine for components such as the speech recognizer and the language understanding component, but more difficult with the dialogue manager, where the range of acceptable behaviors is greater. The most commonly used measure for speech recognizers is word accuracy (WA). WA accounts for errors at the word level, which include insertion (WI), deletion (WD), and substitution of words (WS). WA is calculated as a percentage using the formula

WA = 100 (1 − (WS + WI + WD) / W) %,

where W is the total number of words in the reference answer.
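To make the computation concrete, the following sketch (an illustration added here, not code from the evaluation literature cited above) computes WA by aligning the recognized word string with the reference answer using dynamic programming, so that the error counts WS, WI, and WD correspond to the best alignment; the example sentences are invented.

# Minimal sketch: word accuracy (WA) computed by aligning a recognized word
# sequence against a reference answer with dynamic programming (word-level
# Levenshtein distance).

def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of substitutions, insertions, and deletions
    # needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    errors = d[len(ref)][len(hyp)]       # WS + WI + WD for the best alignment
    return 100.0 * (1 - errors / len(ref))

# Example: one substitution ("Torino" for "Milano") in a six-word reference
# gives WA of roughly 83.3%.
print(word_accuracy("I want to go to Milano", "I want to go to Torino"))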

Sentence accuracy (SA) is a measure of the percentage of utterances in a corpus that have been completely and correctly recognized. In this case the recognized string of words is matched exactly with the words in the reference answer. Sentence understanding rate (SU), on the other hand, measures the rate of understood sentences in comparison with a reference meaning representation. An alternative measure of understanding is concept accuracy (CA), which measures the percentage of concepts that have been correctly understood. CA is similar to WA, as it measures errors at the concept level, which include insertions, deletions, and substitutions. Text understanding (TA), a measure used in the Message Understanding Conferences, measures the amount of significant information that has been extracted from a text, using templates as the reference answers. Finally, an evaluation method has been developed in the ARPA Spoken Language System program to measure the correctness of database query responses by matching the actual responses with reference answers expressed as a set of minimal and maximal tuples [Hirschman 1995]. The correct answer must include at least the information in the minimal answer and no more information than is in the maximal answer. This measure is similar to some of the measures for dialogue success to be discussed below.

Some interesting results have emerged from evaluation studies using these measures. Hirschman [1995] reported that, in the various ARPA evaluations, the error rate for sentence understanding was much lower than that for sentence recognition (10.4% compared with 25.2%), indicating that it is easier to understand sentences than to recognize them and that sentence understanding, by using robust processing techniques, is able to compensate to some extent for errors produced by the speech recognition component. Similar results were reported by Boros et al. [1996] in a comparison of word accuracy and concept accuracy measures. Boros et al.


found that it is possible to achieve perfect understanding with less than perfect recognition, but only when the misrecognitions affect semantically irrelevant words. When misrecognition affects parts of the utterance that are significant for understanding, CA may be lower than WA. Thus it is important to examine closely the relationships between different measures.

6.2.2. Evaluation of Spoken Dialogue Systems. The performance of a spoken dialogue system can be measured in terms of the extent to which it achieves its task, the costs of achieving the task (for example, the time taken or number of turns required to complete the task), and measures of the quality of the interaction, such as the extent to which the system behaves cooperatively (see the EAGLES handbook [Gibbon et al. 1997] for a detailed account of the measures described below together with annotated examples).

A set of core metrics was identified in the SUNDIAL project that measures these aspects of dialogic interaction [Simpson and Fraser 1993].

Transaction success (TS) is similar to the ARPA measure of the correctness of database query responses discussed earlier, in that this metric measures how successful the system has been in providing the user with the requested information. TS was defined as a four-valued measure to account for cases of partial success as well as instances where the user's goal was not clearly identifiable or changed during the course of the interaction: S (succeed), SC (succeed with constraint relaxation), SN (succeed with no answer), and F (fail).

Number of turns is a measure of the duration of the dialogue in terms of the number of turns taken to complete the transaction. An alternative measure is the time taken to complete the transaction. These measures can be used in conjunction with different dialogue strategies to give an indication of the costs of the dialogue, which may be compared with other measures such as transaction success or user acceptance.

Correction rate (CR) is a measure of the proportion of turns in a dialogue that are concerned with correcting either the system's or the user's utterances, which may have been the result of speech recognition errors, errors in language understanding, or misconceptions. A dialogue that had a high degree of CR might be judged to have high costs in terms of user acceptability, as well as potentially high costs financially.

Contextual appropriateness (CA) is a measure of the extent to which the system provides appropriate responses. The metric can be divided into a number of values, such as: TF (total failure), AP (appropriate), IA (inappropriate), AI (appropriate/inappropriate), and IC (incomprehensible). With TF, the system fails to respond to the user. IA is used for responses that are inappropriate, defined usually in terms of Gricean maxims ([Grice 1975], see above). AI is used when the evaluator is in doubt, and IC when the content of an utterance cannot be interpreted.

The Behavioral Coding Scheme is a similar measure of the quality of responses produced in an interaction with a spoken dialogue system [Sutton et al. 1995]. Behavioral coding classifies the user's utterances to an automated questionnaire into 11 different types, some of which are illustrated in Table V. The Behavioral Coding Scheme has proven useful for the objective evaluation of user behavior, for evaluating the performance of the system, and as a basis for further refinement of the system.

A number of more qualitative measures have been developed, including a metric for evaluating dialogue strategies, such as strategies for recovering from errors [Danieli and Gerbino 1995]. Implicit recovery (IR), which can be compared with implicit verification as illustrated in Section 4.3.2, has been defined as the ability to overcome errors produced by the speech recognizer or parser and to rectify these implicitly. This strategy contrasts with an explicit strategy involving correction that can be measured using the correction rate (CR) metric. The following example illustrates implicit recovery


Table V. Categories from the Behavioral Coding Scheme

AA1 (Adequate Answer 1): Answer is concise and responsive.
    S: Have you ever been married?  U: Yes.
AA2 (Adequate Answer 2): Answer is usable but not concise.
    S: Have you ever been married?  U: No I haven't.
AA3 (Adequate Answer 3): Answer is responsive but not usable.
    S: Have you ever been married?  U: Unfortunately.
IA1 (Inadequate Answer 1): Answer does not appear to be responsive.
    S: What is your sex, female or male?  U: Neither.
IA2 (Inadequate Answer 2): User says nothing at all.
    S: What is your sex, female or male?  U: <silence>

(values understood by the system in angular brackets):

User1:   I want to go from Roma to Milano in the morning.
         <arrival-city = MILANO, departure-time = MORNING>
System1: Sorry, where do you want to leave from?
User2:   From Roma.
         <departure-city = ROMA, cost-of-ticket?>
System2: Do you want to go from Roma to Milano leaving in the morning?

Although the user's first utterance contains the concepts that the system requires to retrieve the desired information, the departure city has not been recognized, so the system takes into account the concepts that have been correctly understood and asks for the concept that was not understood. The user's second utterance contains the required information, but additional words have been inserted at the recognition level that are interpreted as a request for the cost of the ticket. As this concept is not relevant in the current context, it is disregarded by the system and the user is asked to confirm the correct concepts. In this case the IR score is 100% and the system has succeeded in spite of recognition and parsing errors but without having to engage in explicit correction.
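As a rough illustration of this behavior (a sketch only, not code from the system evaluated by Danieli and Gerbino; the slot names are invented), a slot-filling dialogue manager can implement implicit recovery by keeping the concepts it has understood, ignoring concepts that are irrelevant in the current context, and prompting only for what is still missing before confirming:

# Hypothetical sketch of implicit recovery in a slot-filling dialogue manager.
# Concepts understood by the recognizer/parser are kept if they are relevant,
# out-of-context concepts are discarded, and only missing slots are re-queried.

REQUIRED_SLOTS = ("departure-city", "arrival-city", "departure-time")

def update_state(state, understood_concepts):
    """Merge newly understood concepts into the dialogue state."""
    for slot, value in understood_concepts.items():
        if slot in REQUIRED_SLOTS:          # relevant in the current context
            state[slot] = value
        # irrelevant concepts (e.g. cost-of-ticket?) are silently ignored
    return state

def next_system_move(state):
    missing = [s for s in REQUIRED_SLOTS if s not in state]
    if missing:
        return f"Sorry, what is the {missing[0]}?"
    # all slots filled: confirm implicitly by echoing the understood values
    return (f"Do you want to go from {state['departure-city']} to "
            f"{state['arrival-city']} leaving in the {state['departure-time']}?")

state = {}
state = update_state(state, {"arrival-city": "MILANO", "departure-time": "MORNING"})
print(next_system_move(state))   # asks for the missing departure city
state = update_state(state, {"departure-city": "ROMA", "cost-of-ticket?": True})
print(next_system_move(state))   # confirms: Roma to Milano, in the morning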

In a study comparing explicit and implicit recovery strategies and using the other measures described earlier such as contextual appropriateness, Danieli and Gerbino [1995] found that the system that used an explicit recovery strategy achieved greater robustness in terms of dealing with errors, although at the cost of longer transactions. It was also found that, as users became more familiar with the system, the recovery results for the system using implicit recovery improved substantially. Thus several aspects have to be considered and balanced when evaluating a dialogue system, including transaction success and dialogue duration, which measure the ability of the system to find the required information, and contextual appropriateness, which measures the quality of the dialogue. However, it is not possible using these measures to determine whether the higher transaction success of the system using the explicit recovery strategy was more critical to performance than the efficiency of the system using the implicit recovery strategy.

A recently developed tool for the evaluation of spoken dialogue systems, PARADISE (PARAdigm for Dialogue System Evaluation), addresses the limitations of the methods discussed so far by combining various performance measures such as transaction success, user satisfaction, and dialogue cost into a single performance evaluation function, and by enabling performance to be calculated for subdialogues as well as complete dialogues [Walker et al. 1997]. In this framework the overall goal of a dialogue system is viewed in terms of maximizing user satisfaction. This goal is subdivided into the subgoals of maximizing task success and minimizing costs. The latter is in turn subdivided into efficiency measures and qualitative measures. A brief overview of this framework is provided in the


following paragraphs, although for more detail and a comprehensive set of illustrative examples, see Walker et al. [1997, 1998].

Transaction success is calculated by using an attribute value matrix (AVM) that represents the information to be exchanged between the system and the user in terms of a set of ordered pairs of attributes and their possible values. For example: departure-city might have values Milano, Roma, Torino, Trento, while departure-range might have the values morning, evening. The correct values for each attribute are determined by scenarios (for example, the user might be required to find a train that leaves from Torino to Milano in the evening). These values are referred to as the scenario keys, and these are plotted on a confusion matrix along with any incorrect values that occurred during the actual dialogue. This confusion matrix is used to calculate the kappa coefficient, κ [Carletta 1996], which indicates how well the system has performed a particular task within a given scenario, using the following formula:

κ = (P(A) − P(E)) / (1 − P(E)),

where P(A) is the proportion of times that the AVMs for the actual dialogues agree with the AVMs for the scenario keys, and P(E) is the proportion of times that the AVMs for the dialogues and the keys are expected to agree by chance. Unlike other measures of transaction success and concept accuracy, κ takes into account the inherent complexity of the task by correcting for expected chance agreement.
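A small worked example may help; the following sketch (not part of PARADISE itself, and the confusion matrix is invented) computes κ from a confusion matrix whose rows are the values observed in the dialogues and whose columns are the scenario-key values for the same attribute:

# Illustrative sketch: computing the kappa coefficient from a confusion matrix.
# Rows: attribute values observed in the dialogues; columns: scenario-key values.

def kappa(confusion):
    total = sum(sum(row) for row in confusion)
    # P(A): observed agreement, i.e. proportion of counts on the diagonal
    p_a = sum(confusion[i][i] for i in range(len(confusion))) / total
    # P(E): agreement expected by chance, from the row and column marginals
    p_e = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(len(confusion))
    )
    return (p_a - p_e) / (1 - p_e)

# Invented example: departure-city with values Milano, Roma, Torino; most
# dialogues agree with the scenario keys.
matrix = [
    [20,  2,  1],
    [ 1, 18,  2],
    [ 0,  1, 15],
]
print(kappa(matrix))   # roughly 0.82 for this matrix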

Dialogue costs are measured in terms of cost measures ci that can be applied as a function to any subdialogue. The information goals that an utterance contributes to are determined by using the AVM representation to tag the dialogue with the attributes for the task. In this way dialogue strategies used to achieve the task can be evaluated both in the dialogue as a whole as well as in subdialogues. Given a set of measures ci, the different measures are combined to determine their relative contribution to performance, using the formula

Performance = (α ∗ N(κ)) − Σi=1..n wi ∗ N(ci),

in which α is a weight on κ, the cost functions ci are weighted by wi, and N is a Z score normalization function that is used to overcome the problem that the values of ci are not on the same scale as κ and may also be calculated over varying scales. The weights for α and wi are solved using multiple linear regression. Using this formula it is possible to calculate performance involving multiple dialogue strategies, including performance over subdialogues.
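The following sketch shows the shape of this computation (an illustration only: the weights, cost measures, and sample values are invented, and in PARADISE the weights would be obtained by regression against user satisfaction ratings):

# Illustrative sketch of the PARADISE performance function:
#   Performance = alpha * N(kappa) - sum_i w_i * N(c_i)
# N is a Z-score normalization so that kappa and the cost measures, which
# live on different scales, can be combined.

from statistics import mean, stdev

def z_score(value, sample):
    """Normalize a value against the mean and standard deviation of a sample."""
    return (value - mean(sample)) / stdev(sample)

def performance(kappa, costs, alpha, weights, kappa_sample, cost_samples):
    n_kappa = z_score(kappa, kappa_sample)
    n_costs = [z_score(c, s) for c, s in zip(costs, cost_samples)]
    return alpha * n_kappa - sum(w * c for w, c in zip(weights, n_costs))

# Invented example: one task-success measure (kappa) and two cost measures
# (number of turns, correction rate); the weights are assumed to come from
# a regression over a corpus of evaluated dialogues.
kappa_sample = [0.6, 0.7, 0.8, 0.9]              # kappa values over the corpus
cost_samples = [[10, 14, 18, 25], [0.1, 0.2, 0.3, 0.5]]
print(performance(kappa=0.85, costs=[12, 0.15],
                  alpha=0.5, weights=[0.3, 0.2],
                  kappa_sample=kappa_sample, cost_samples=cost_samples))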

PARADISE is a framework for evaluating dialogue systems that incorporates and enhances previously used measures. It supports comparisons between dialogue strategies and separates the tasks to be achieved from how they are achieved in the dialogue. Performance can be calculated at any level of a dialogue, such as subtasks, and performance can be associated with different dialogue strategies. Furthermore, subjective as well as objective measures can be combined and their relative cost factors to overall performance can be specified. PARADISE has been used as a tool to evaluate a number of applications, including accessing train schedules and email as well as voice dialing and messaging [Walker et al. 1998; Kamm et al. 1999].

6.3. Summary

It can be seen that considerable attention has been devoted in recent years to the engineering aspects of spoken dialogue systems and to their specification, design, and evaluation, and that there is some degree of convergence on methodologies and frameworks, due in large part to concerted efforts in large-scale research and development projects sponsored by ARPA and the EU, and also due to the view of best practice in dialogue engineering as a specialization of best practice in software


engineering, exemplified in projects such as DISC. A further factor is the increasing availability of large corpora that can be used both to support initial specification and design of systems and also as data for evaluation studies. It is salutary to conclude this section with a comment on the importance of customer satisfaction for the success of spoken dialogue systems in the marketplace:

From a commercial perspective, the success of a spoken dialogue system is only slightly related to technical matters. . . . I have, for example, seen trial systems with a disgracefully low word accuracy score receiving a user satisfaction rating of around 95%. I have also seen technically excellent systems being removed from service due to negative user attitudes. (Norman Fraser, cited in Dybkjær et al. [1997])

7. TOOLKITS FOR DEVELOPING SPOKEN DIALOGUE SYSTEMS

The development of a spoken dialogue system is a complex process involving the integration of the various component technologies described in Section 4. It would be a formidable task to build and integrate these components from scratch. Fortunately a number of toolkits and authoring environments have become available that support the construction of spoken dialogue systems, even for those who have no specialist knowledge of the component technologies such as speech recognition and natural language processing. The following are some of the dialogue development environments that are currently available:

—the Generic Dialogue System Platform (CPK, Denmark);
—GULAN—An Integrated System for Teaching Spoken Dialogue Systems Technology (CCT/KTH);
—the CSLU toolkit (Center for Spoken Language Understanding at the Oregon Graduate Institute of Science and Technology);
—CU Communicator system;
—the Nuance Developers' Toolkit (Nuance Communications);
—SpeechWorks;
—Natural Language Speech Assistant (NLSA) (Unisys Corporation);
—SpeechManiaTM: A Dialogue Application Development Toolkit (Philips Speech Processing);
—the REWARD Dialogue platform;
—Vocalis SpeechWareTM.

The first three systems were developed mainly to support academic research and to support the teaching of spoken language technology. The CPK toolkit, developed at the Centre for PersonKommunikation at the University of Aalborg in Denmark, has been incorporated into the REWARD dialogue platform and is accompanied by a Web-based course. This material, which includes details of the development platform to be used for implementation, is currently not available publicly. GULAN, a system for teaching spoken dialogue technology, is under development at KTH (Stockholm) and at Linkoping University and Uppsala University [Sjolander et al. 1998]. The system, which is currently in Swedish but due to be ported to English, is presently only runnable locally. The CSLU toolkit, to be described in greater detail below, is available free-of-charge under a license agreement for educational, research, personal, or evaluation purposes. The commercial systems are available under a range of license agreements. Some systems are available as evaluation versions and others can be obtained at a relatively low cost for academic purposes. Web sites with further information about these systems, including pricing, are listed in Appendix B.

A comprehensive description and evaluation of all these systems is beyond the scope of the current survey. To give a flavor of what is available, one academically oriented system, the CSLU toolkit, and one commercial system, the Philips SpeechManiaTM system, will be examined, followed by a brief outline of desirable features of spoken dialogue toolkits.

7.1. The CSLU Toolkit

The CSLU toolkit has been developed at the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute of Science and Technology to support speech-related research and development activities [Sutton et al. 1998]. The toolkit includes core technologies for speech recognition and text-to-speech synthesis, as well as a graphically based authoring environment (RAD) for designing and implementing spoken dialogue systems. This section will focus only on RAD. Information about other components of the CSLU toolkit can be found at the CSLU web site (see Appendix B).

Fig. 17. Using RAD to simulate an auto attendant at a furniture store.

A major advantage of the RAD interface is that users are shielded from many of the complex specification processes involved in the construction of a spoken dialogue system. Building a dialogue system involves selecting and linking graphical dialogue objects into a finite-state dialogue model, which may include branching decisions, loops, jumps, and subdialogues, as illustrated in Figure 17. Each object can be used for functions such as generating prompts, recording and recognizing speech, and performing actions. As far as speech recognition is concerned, the input can be in the form of single words, for which a tree-based recognizer is used, or as phrases or sentences that are specified using a finite state grammar, which also enables key word spotting. There are additional built-in facilities for digit and alpha-digit recognition. The words specified for recognition at a given state are automatically translated by the system into a phonetic representation called Worldbet using built-in word models stored in dictionaries. Pronunciations can also be customised using the Worldbet symbols. It is also possible to implement dynamic recognition, in which case a list of words to be recognized is obtained from some external source, such as a Web page, and pronunciation models for the words are generated dynamically at run-time. Prompts can be specified in textual form and are output using the University of Edinburgh's Festival TTS (text-to-speech) system, or they can be prerecorded, and, with some additional effort, spliced together at run-time. The use of the subdialogue states permits a more modular dialogue design, as subtasks, such as eliciting an account number, can be implemented in a subdialogue that is potentially reusable. Repair dialogues are a special case of subdialogue. A default repair subdialogue is included that is activated if the recognition score for the user's input falls below a given threshold, but it is also relatively easy to design and implement customized repair subdialogues. There is also a special dialogue object for inserting pictures and sound files at appropriate places in the dialogue without the need for complex programming commands. The list-builder object simplifies the programming of a repetitive series of exchanges, such as questions, answers, and hints in an interactive learning program, by allowing the programmer to specify lists of questions, answers, and hints in a simple dialogue box, with the system looping through each of the alternatives either in serial or random order. A number of online tutorials, accompanied by simple illustrative examples of dialogue systems, provide an introduction to the basic functions of RAD.
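To make the finite-state style of dialogue control concrete, the following sketch (not RAD code; the states, prompts, and vocabulary are invented) shows the kind of state graph that a RAD canvas represents, with a prompt, an active vocabulary, and outgoing transitions at each state, plus a simple repair behavior when recognition fails:

# Minimal sketch of a finite-state dialogue model: each state has a prompt,
# the words it can recognize, and a transition for each recognized word.
# States, prompts, and vocabulary are invented for illustration.

DIALOGUE = {
    "greeting": {
        "prompt": "Welcome to the furniture store. Sales or repairs?",
        "transitions": {"sales": "sales_dept", "repairs": "repairs_dept"},
    },
    "sales_dept": {"prompt": "Connecting you to sales.", "transitions": {}},
    "repairs_dept": {"prompt": "Connecting you to repairs.", "transitions": {}},
}

def run_dialogue(recognize):
    """Walk the state graph, calling `recognize` to obtain the user's word."""
    state = "greeting"
    while True:
        node = DIALOGUE[state]
        print("SYSTEM:", node["prompt"])
        if not node["transitions"]:          # final state: no outgoing arcs
            return
        word = recognize(node["transitions"].keys())
        if word in node["transitions"]:
            state = node["transitions"][word]
        else:
            # simple repair subdialogue: re-prompt on a recognition failure
            print("SYSTEM: Sorry, I didn't catch that.")

# A stand-in for the speech recognizer: typed input restricted to the
# vocabulary active in the current state.
if __name__ == "__main__":
    run_dialogue(lambda vocab: input(f"USER ({'/'.join(vocab)}): ").strip().lower())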

Functions are provided in RAD for voice-based Web access. For example, a given URL can be accessed and the HTML document read and parsed, relevant strings


can be identified, tags removed, and the required information output using text-to-speech. Although not documented in the current online tutorials, it is also relatively simple to develop an interface to databases and spreadsheets. Recently a natural language processing component has been developed that allows recognized strings to be parsed and relevant concepts to be extracted [Kaiser et al. 1999]. Finally, the toolkit includes an animated conversational agent (BALDI), developed at the University of California at Santa Cruz, which presents visual speech through facial animation synchronized with synthesized or recorded speech.

RAD is currently being used effectively to provide interactive language learning for profoundly deaf children [Cole et al. 1999a] and to provide a practical introduction to spoken dialogue technology for undergraduate students [McTear 1999]. Plans are underway to develop multilingual versions of the toolkit [Cole et al. 1999b].

7.2. SpeechManiaTM

SpeechManiaTM, a product of Philips Speech Processing, is an application development environment to support the development of telephone-based spoken dialogue systems. The software allows people to talk with computers over the phone to access information services such as railway and flight timetables, bank statements, and stock exchange quotations, or to engage in transactions such as reserving a hotel room or reserving seats for a movie through a call center. The basic system architecture is shown in Figure 18.

Fig. 18. The SpeechManiaTM architecture.

Processing is divided into modules for speech recognition, speech understanding, dialogue control, and speech output, with serial communication between the modules. Speech recognition, which is based on hidden Markov models with continuous mixture densities, is provided as part of the system in the form of acoustic models for particular languages such as American or British English, German, and Dutch. The other modules are specified using HDDL (High-level Dialogue Description Language), a dedicated declarative programming language for automatic enquiry systems. HDDL consists of a number of sections, of which two of the most important are Rules and Actions. The Rules section contains attributed context-free grammars for speech understanding (see Section 4.2.2.3 for an example). Dialogue control is specified in the Actions section using conditional actions (condactions) (see Section 5.2.1). For speech output all the required words and phrases are prerecorded and the appropriate segments are concatenated and replayed as specified by the dialogue control module. Finally, there is a transaction interface to external systems such as databases.

In addition to these modules, SpeechManiaTM includes a range of tools that support the development and evaluation of a spoken dialogue system. Applications can be tested offline using a text-based interface. The HDDL code can be checked using the HDDL parser, which in addition generates a list of words not present in the system's speech recognition lexicon and produces a list of the system prompts for recording using the Recording Station tool. A transcription tool enables the developer to optimize the system's speech recognition by comparing the words spoken by the user in a series of logged dialogues with the words recognised by the system. Dialogues can be marked up for subsequent statistical analysis, system evaluation, and system refinement through training of the language model and the stochastic grammar.

Table VI. Features of Spoken Dialogue Toolkits

Dialogue design: visual programming; subdialogues
Prompts: recording tools; TTS
Simulation and testing: design prototyping and simulation; WOZ; offline testing; data capture
Natural language understanding: grammar development
Multimodality: graphics; facial animation
System training and tuning: training the speech recognizer and other components
Interfaces: databases, Web, APIs
Reusable components: commonly used recognizers, grammars, subdialogues
Platform: programming languages and environment

7.3. Features of Spoken Dialogue Toolkits

Spoken dialogue toolkits can be compared and evaluated across a number of dimensions. Table VI presents a preliminary set of features that can be used. This list needs to be treated with some caution, however, as the presence or absence of a particular feature has to be considered in relation to the users and uses for which a given toolkit is intended. For example, in respect of ease of use, the CSLU toolkit, due to its graphical authoring environment, provides an excellent facility for students to quickly develop and test small dialogue systems. In contrast, developing even a simple system with SpeechManiaTM involves a relatively steep learning curve, as the dialogue control and natural language understanding have to be programmed in HDDL. On the other hand, however, SpeechManiaTM provides a number of powerful tools to support the developer, including a fully functional telephone interface, that are not available in the CSLU toolkit. Thus when comparing the two toolkits it is important to consider their main purposes: the CSLU toolkit is designed mainly as an educational tool, whose graphical interface shields developers from the complexities involved in programming an interface using some sort of programming language. SpeechManiaTM, as well as the Nuance toolkit, in which the functions for state transitions and other state functions are programmed using C-code, are sophisticated development environments intended for commercial software developers.

Most of the toolkits listed above use some sort of visual programming to represent dialogue states and transitions. This facility is useful as long as the dialogues are simple enough to be implemented using finite state methods (see Section 5). To date SpeechManiaTM is the only toolkit that provides an alternative form of dialogue control that is programmed declaratively rather than visually.

Facilities for developing system prompts vary across the toolkits. In RAD the Festival TTS is closely integrated with the graphical authoring environment, with the result that prompts can be designed by simply typing the text to be synthesized into a prompt dialogue box. There are also facilities for adjusting various parameters in the synthesized voice as well as recording prompts as wave files. In SpeechManiaTM, as in other toolkits such as NLSA, there is a sophisticated recording tool that is used to record and manipulate sections of sound files that can be concatenated at run time to provide system prompts.

Support at the design and testing stages is an important feature that is available in most toolkits. The CSLU toolkit supports rapid prototyping through its easy-to-use


graphical authoring environment. There is also a facility for data capture that allows complete dialogues to be recorded and replayed. SpeechManiaTM also supports data capture, but in addition the recorded items of user input can also be used by tools provided within the development environment to optimize speech recognition and understanding and to provide data for training and testing. Similar facilities are provided in the Nuance toolkit. The NLSA Dialogue Assistant not only provides a visual representation of the dialogue flow but also a simulation tool that enables Wizard-of-Oz testing of the viability of an application even before a speech recognition module has been installed. With the NLSA Dialogue Assistant the developer can simulate an application using actual phone lines, select system prompts and responses, and record caller responses, thus collecting data that can be used as a basis for subsequent redesign of the dialogue.

Most toolkits support some form of natural language understanding, generally with a semantic or concept-based grammar with terms that map closely onto domain-specific items such as entities in a database. Grammar development is generally not supported, and it is the role of the developer to design the required grammar rules according to the formalism supported within the toolkit. One exception is the NLSA toolkit, which provides a tool, Speech Assistant, to automate the process of creating grammars. Speech Assistant has a format similar to a spreadsheet. The words and phrases that constitute potential user utterances are entered into rows of the tables, and for each set of related phrases a token is provided that represents the meaning of the responses. BNF grammars that are used by the speech recognition and natural language understanding components are compiled automatically from these tables.
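A minimal sketch of this idea (purely illustrative, and not the NLSA table or grammar format) might map each set of related phrases onto a meaning token as follows:

# Hypothetical illustration: compiling a table of phrases and meaning tokens
# into a simple rule set that maps any of the listed phrasings onto the token
# representing their shared meaning.

PHRASE_TABLE = {
    "yes_token": ["yes", "yeah", "that's right", "correct"],
    "no_token":  ["no", "nope", "that's wrong"],
}

def compile_rules(table):
    """Build a lookup from normalized phrases to meaning tokens."""
    return {phrase.lower(): token
            for token, phrases in table.items()
            for phrase in phrases}

RULES = compile_rules(PHRASE_TABLE)

def interpret(utterance):
    return RULES.get(utterance.strip().lower(), "unknown")

print(interpret("Yeah"))          # -> yes_token
print(interpret("that's wrong"))  # -> no_token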

Most spoken dialogue toolkits assume a telephone interface and so do not need to support pictures, animations, and other visual displays. The main exception is the CSLU toolkit, which, as described earlier, supports the display of pictures and also provides an animated talking head. As requirements emerge for multimodal spoken dialogue systems, as in public information kiosks, there will need to be greater support in toolkits for other modalities in addition to voice.

Each of the toolkits provides a number of additional tools and functions, for example, to customize and optimize speech recognition, and APIs are provided to enable a seamless interface to other applications. For example, the Festival TTS and BALDI animated face are launched automatically in RAD, and there are a number of supporting tools, such as SpeechView, to record, view, and manipulate waveforms. As expected, the commercially oriented toolkits such as Nuance and SpeechManiaTM have greater support for telephony functions, as these systems are most likely to be deployed with a telephone interface. There has also been more attention paid in these commercial toolkits to interfaces with external information sources such as large databases, although these facilities can also be provided with a little effort in RAD. Most of the toolkits provide their own speech recognition system, in some cases with barge-in facilities that allow the user to interrupt system prompts. NLSA is recognizer-independent, with support for a number of commercially available speech recognisers, so that developers can choose the recognizer that is best suited to their particular applications.

Reuse is an important issue for any software development environment. Each of the toolkits addresses this issue in some way. In RAD it is possible to create and save subdialogues and customized repair subdialogues for reuse in other applications. Similarly, files of commonly used items, such as variants of words such as yes and no, can be loaded along with their pronunciations as required. Similar facilities exist in SpeechManiaTM. Other products take the issue of reuse still further. SpeechWorks provides DialogModulesTM, reusable objects containing prebuilt vocabularies and grammars as well as


error-recovery routines that can be easily integrated into an application. Examples of these reusable objects include subdialogues for continuous digits, telephone numbers, zip codes, currency, and credit card numbers. The Nuance toolkit and NLSA have similar reusable components that can be imported into an application.

Most of the toolkits are available on several platforms such as Unix and versions of Windows such as 95, 98, and NT. The most commonly used languages are C, C++, TCL, HDDL, and Java. Many of the toolkits are compliant with speech API standards (SAPI) such as Microsoft SAPI and are customized for telephone API- (TAPI-) compliant telephony cards.

7.4. Summary

In this section a number of toolkits that support the development of spoken dialogue systems have been examined. Most of the toolkits provide fairly similar features with varying degrees of support for particular functionalities. The focus in this section has been mainly on the support provided in the toolkits for the dialogue management component. Naturally, the performance of individual components, in particular the speech recognizer, will have an important bearing on the acceptability of a dialogue system developed with a given toolkit, although this issue may become less relevant with the movement towards open architectures, in which developers can slot in recognizers and other components that suit their particular applications.

8. FUTURE DIRECTIONS

There are a number of ways in which spoken dialogue technology may develop over the next decade. As far as research is concerned, there are several initiatives aiming at conversational systems that support more natural mixed-initiative dialogues. The focus of much of this work is on working systems rather than on the development of the individual components. Thus, instead of concentrating, for example, on the development of a sophisticated natural language understanding component in isolation, research is likely to be directed toward the ways in which such a component can be integrated with the other components of a spoken dialogue system, and on how it can be deployed in real-world applications. Measurement of the performance of such components will be in terms of their contribution to the performance of the complete system.

Studies of human-human dialogue have provided useful insights into how more sophisticated dialogue systems might behave and have resulted in theories of cooperative interaction that inform the design and evaluation of interactive speech systems [Bernsen et al. 1998], as well as models of conversational agency [Traum 1996], which integrate AI work on planning with speech act theory. As speech recognition and natural language understanding become more robust, more sophisticated dialogue managers of the kind that have been developed in text-based systems (as noted, for example, by Carberry and Lambert [1999]) will emerge. Jurafsky and Martin [2000] (Chapter 19) is a recent review of the theoretical background as well as recent developments in dialogue and conversational agency. Another trend is toward the use of statistical techniques in dialogue management. For example, dialogue sequences have been modeled as dialogue-act N-grams in order to help predict upcoming dialogue acts [Nagata and Morimoto 1994]. Probabilistic methods are being used in conjunction with reinforcement learning algorithms to enable the automated learning of optimal dialogue strategies. Dialogue is modeled as a Markov decision process (MDP) and viewed as a trajectory in a state space determined by system actions and user responses. Given multiple action choices at each state, reinforcement learning is used to explore the choices systematically and to compute the best policy for action selection, based on rewards associated with each state transition [Litman et al. 2000].
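As a rough sketch of this approach (not the implementation of Litman et al. [2000]; the states, actions, reward values, and user simulation are all invented), a confirmation strategy could be learned with Q-learning over a simulated user:

# Illustrative sketch of learning a dialogue strategy with reinforcement
# learning. Q[state][action] estimates the long-term reward of choosing an
# action (e.g. explicit vs. implicit confirmation) in a given dialogue state.

import random
from collections import defaultdict

ACTIONS = ["confirm_explicitly", "confirm_implicitly"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def simulate_turn(state, action):
    """Stand-in for a user simulation: returns (next_state, reward, done)."""
    # Invented dynamics: explicit confirmation is safer but costs an extra turn.
    if action == "confirm_explicitly":
        return "slot_confirmed", -1.0, False
    success = random.random() < 0.8                   # implicit sometimes fails
    return ("slot_confirmed", 0.0, False) if success else ("misunderstanding", -5.0, True)

def q_learning_episode():
    state, done = "slot_filled_unconfirmed", False
    while not done:
        # epsilon-greedy action selection over the current Q estimates
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward, done = simulate_turn(state, action)
        if next_state == "slot_confirmed":
            done, reward = True, reward + 10.0        # task completed
        best_next = 0.0 if done else max(Q[next_state].values())
        Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
        state = next_state

for _ in range(2000):
    q_learning_episode()
print(Q["slot_filled_unconfirmed"])   # learned values of the two strategies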


While most of this survey has been concerned with dialogue systems that provide a spoken language interface, there has been a recent development toward the integration of spoken language technology with other modalities [Cohen and Oviatt 1995]. In the TRAINS project, for example, the system displays a map of the area under discussion with the route being planned marked and highlighted. Some of the travel information systems, such as the ATIS (Air Travel Information System) in the United States and the EU Esprit MASK project, involve multimodal interaction. For example, the MASK system is planned as a multimodal, multimedia service kiosk to be located in train stations, with the user being able to speak to the system as well as use a touch screen and keypad, while the system displays information to the user on a screen [Lamel et al. 1995]. The selection and coordination of different media in relation to different types of content to be displayed and the varying needs of the user and the task have been the subject of much research (see, for example, the papers in Maybury [1993]). Although this work is still in its infancy and many of the solutions adopted tend to be ad hoc and application-specific, there has been some progress toward a general theory of input and output modalities and of how speech might be integrated within a multimodal context [Bernsen 1994].

Another important application area is the World Wide Web. With the increasing integration of the Internet and domestic television, there is a potential for applications using spoken dialogue technology to perform services such as home shopping, or to control and program appliances around the home such as microwave ovens and VCRs. These needs are being addressed through VoiceXML (Voice eXtensible Markup Language)—an XML-based mark-up language for creating distributed voice applications that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative dialogues [VoiceXML Forum n.d.]. VoiceXML provides an open environment with standardized dialogue scripting and speech grammar formats. Furthermore, because it is based on XML, a vast selection of editing and parsing tools is available, including both commercial and freely available open-source tools. A VoiceXML document specifies each interaction dialogue to be conducted by a VoiceXML interpreter. A VoiceXML document forms a conversational finite state machine, with some degree of mixed initiative that allows users, in a limited way, to input more than one value in a particular dialogue state. VoiceXML has been accepted as a standard for Web-based spoken dialogue systems, and VoiceXML servers may well replace current proprietary development platforms for spoken dialogue systems. Further work will be required to integrate the more complex functionalities described in this survey into the next versions of the standard.

Bringing these points together, some of the issues that are likely to be important in spoken dialogue research in the next decade are

—more robust speech recognition, including the ability to perform well in noisy conditions, to deal with out-of-vocabulary words, and to integrate more closely with technologies for natural language processing;

—the use of prosody in spoken dialogue systems, both to provide more naturally sounding output and to assist recognition by identifying phrase boundaries as well as the functions of utterances;

—research concerned with component integration and with investigating the extent to which the language understanding and dialogue management components can compensate for deficiencies in speech recognition;

—investigations of the applicability of different technologies for particular application types, such as the costs and benefits of parsing using theoretically motivated grammars compared with robust and partial parsing and with more pragmatically driven methods such as concept spotting;

—studies of different approaches to dialogue management in relation to the requirements of an application, indicating, for example, where state-based methods are applicable and in which circumstances more complex approaches are required;

—the incorporation of more sophisticated approaches to dialogue management deriving from AI-based research;

—research into the use of stochastic and machine learning techniques;

—the development of multimodal dialogue systems;

—dialogue systems with Web integration.

It is unlikely that all of these issues will be addressed in commercial systems in the short term, although there is considerable interest in the commercial potential of voice commerce, involving the integration of spoken language and Internet technologies. In general, however, the emphasis in dialogue research is on developing more advanced systems and on testing theories of dialogue, while in the commercial environment the aim is to produce systems that will work in the real world. Here the performance of the system is measured not in terms of the evaluation measures applied to a laboratory prototype but in terms of its efficiency, effectiveness, usability, and acceptability under real-world conditions. Factors that determine the successful deployment of a system include marketability, profitability, and user acceptance, for which considerable effort has to be directed toward managing user expectations in respect of the constraints of the technology and convincing users of the benefits of the technology.

In conclusion, as can be seen from this survey, there has been a dramatic increase in interest in spoken dialogue systems over the past decade, and there is every indication that this interest will continue, given that there are still many problems to be resolved and given the obvious benefits of the technology.

APPENDIX

A. SELECTED WORLD WIDE WEB ADDRESSES FOR DIALOGUE RESEARCH PROJECTS

The following is a list of Web sites for spoken dialogue technology, covering many of the major projects that could not be discussed within the scope of the survey. The links have been tested and were valid at the time of writing.

— AAAI Workshop on Miscommunication in Dialogue, August 1996
  http://www.cs.uwm.edu/faculty/mcroy/mnmPapers.html

— Center for PersonKommunikation (CPK), Aalborg, Denmark—member of the Danish Dialogue Project
  http://cpk.auc.dk/

— Computers that listen—examples of applications in commercial use
  http://www.voiceio.com/examples.htm

— CONVERSA—voice enabling technologies
  http://www.conversa.com

— CSLR Home Page (Center for Spoken Language Research, University of Colorado)
  http://cslr.colorado.edu

— CSLU Home Page (Center for Spoken Language Understanding, Oregon)
  http://cslu.cse.ogi.edu/

— Dialogues 2000 Project (BT and University of Edinburgh)
  http://www.ccir.ed.ac.uk/d2000/

— DISC—Spoken Language Dialogue Systems and Components: Best practice in development and evaluation
  http://www.disc2.dk/

— EAGLES Project: Expert Advisory Group on Language Engineering Standards
  http://coral.lili.uni-bielefeld.de/gibbon/EAGLES/

— INRIA—dialogue projects at INRIA (France)
  http://www.inria.fr/Equipes/DIALOGUE-eng.html

— LIMSI: Projects on spoken language (France)
  http://www.limsi.fr/Recherche/TLP/projects.html

— Links to spoken dialogue systems (University of Hamburg)
  http://nats-www.informatik.uni-hamburg.de/jum/research/dialog/sys.html

— Microsoft User Interface Research—Persona Project, Conversational Interfaces
  http://www.research.microsoft.com/research/ui/

— Natural Interactive Systems Laboratory (NIS), Odense University, Denmark
  http://www.nis.sdu.dk/

— Research on Dialogue Processing, Ministry of Education, Science, Sports, and Culture, Japan
  http://winnie.kuis.kyoto-u.ac.jp/Text/taiwa/e-abst.html

— SIGDIAL—special interest group of ACL for dialogue and discourse
  http://www.sigdial.org/

— Speech Applications Project (Sun Microsystems)
  http://www.sun.com/research/speech/index.html

— Spoken Language Systems Group (MIT)
  http://sls-www.lcs.mit.edu/SLShome.html

— TOOT project on evaluation of spoken dialogue systems (AT&T Research)
  http://www.research.att.com/diane/TOOT.html

— TRAINS Project Home Page (University of Rochester)
  http://www.cs.rochester.edu/research/trains/

— Verbmobil (Large project based in Germany on spoken language and dialogue)
  http://www.dfki.uni-sb.de/verbmobil/overview-us.html

— VoiceXML Forum
  http://www.voicexml.org

— Waxholm dialog project (Sweden)
  http://www.speech.kth.se/waxholm/waxholm2.html

B. WORLD WIDE WEB ADDRESSES FOR SPOKEN DIALOGUE TOOLKITS

— CPK Generic Dialogue System Platform
  http://www.kom.auc.dk/lbl/IMM/S9 98/SDS course overview.html

— CSLU toolkit
  http://cslu.cse.ogi.edu/toolkit/

— CU Communicator system
  http://cslr.colorado.edu/beginweb/cumove/cucommunicator.html

— GULAN—CCT/KTH
  http://www.speech.kth.se/jocke/publications/gulan.html

— IBM Voice Server
  http://www-4.ibm.com/software/speech/

— Natural Language Speech Assistant (NLSA) (Unisys Corporation)
  http://www.unisys.com/

— NUANCE Developers Toolkit
  http://www.nuance.com/

— SpeechManiaTM (Philips)
  http://www.speech.be.philips.com/

— SpeechWorks
  http://www.speechworks.com/

— Vocalis SpeechWareTM
  http://www.vocalis.com/

ACKNOWLEDGMENTS

I would like to acknowledge the helpful comments on an earlier draft of this paper that I received from many friends and colleagues, in particular, from Harald Aust, Alex Schoenmakers, Ronnie Smith, David James, and Ian O'Neill, and from the anonymous reviewers of the paper.

REFERENCES

ABNEY, S. 1997. Part-of-speech tagging and partial parsing. In Corpus-Based Methods in Language and Speech Processing, S. Young and G. Bloothooft, Eds. Kluwer Academic Publishers, Dordrecht, The Netherlands, 118–136.


ALLEN, J. 1983. Recognising intentions from natural language utterances. In Computational Models of Discourse, M. Brady and R. Berwick, Eds. MIT Press, Cambridge, MA, 107–166.

ALLEN, J. 1995. Natural Language Processing, 2nd ed. Benjamin Cummings Publishing Company Inc., Redwood, CA.

ALLEN, J., BYRON, D., DZIKOVSKA, M., FERGUSON, G., GALESCU, L., AND STENT, A. 2000. An architecture for a generic dialogue shell. Natural Language Engineering 6, 3, 1–16.

ALLEN, J., MILLER, B., RINGGER, E., AND SIKORSKI, T. 1996. A robust system for natural spoken dialogue. In Proceedings of the 34th Annual Meeting of the ACL (Santa Cruz, CA). ACL, 62–70.

ALLEN, J. AND PERRAULT, C. 1980. Analysing intention in utterances. Artificial Intelligence 15, 143–178.

ALLEN, J., SCHUBERT, L., FERGUSON, G., HWANG, C., KATO, T., LIGHT, M., MILLER, B., POESIO, M., AND TRAUM, D. 1995. The TRAINS project: a case study in building a conversational planning agent. Journal of Experimental and Theoretical Artificial Intelligence 7, 7–48.

ARETOULAKI, M. AND LUDWIG, B. 1999. Automaton-descriptions and theorem-proving: a marriage made in heaven? In Proceedings of IJCAI'99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems (Stockholm, Sweden), IJCAI.

AUST, H. AND OERDER, M. 1995. Dialogue control in automatic inquiry systems. In Proceedings of the ESCA Workshop on Spoken Dialogue Systems, P. Dalsgaard, L. Larsen, L. Boves, and I. Thomsen, Eds. ESCA, Vigso, Denmark, 121–124.

AUST, H., OERDER, M., SEIDE, F., AND STEINBISS, V. 1995. The Philips automatic train timetable information system. Speech Communication 17, 249–262.

AUSTIN, J. L. 1962. How to Do Things with Words. Oxford University Press, Oxford, UK.

BAGGIA, P. 1996. Evaluation of spoken dialogue systems. Tutorial, The 14th European Summer School on Language and Speech Communication.

BERNSEN, N. 1993. The structure of the design space. In Computers, Communication, and Usability: Design Issues, Research and Methods for Integrated Services, P. Byerley, P. Barnard, and J. May, Eds. North Holland, Amsterdam, The Netherlands, 221–244.

BERNSEN, N. 1994. Foundations of multimodal representations: a taxonomy of representational modalities. Interacting with Computers 6, 4, 347–371.

BERNSEN, N., DYBKJÆR, H., AND DYBKJÆR, L. 1996. Co-operativity in human-machine and human-human spoken dialogue. Discourse Processes 21, 2, 213–236.

BERNSEN, N., DYBKJÆR, H., AND DYBKJÆR, L. 1998. Designing Interactive Speech Systems: From First Ideas to User Testing. Springer Verlag, New York, NY.

BILLI, R., CASTAGNERI, G., AND DANIELI, M. 1996. Field trial evaluation of two different information inquiry systems. In IVTTA. IEEE, Basking Ridge, NJ, 129–132.

BOROS, M., ECKERT, W., GALLWITZ, F., GORZ, G., HANRIEDER, G., AND NIEMANN, H. 1996. Towards understanding spontaneous speech: word accuracy vs. concept accuracy. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP96, Philadelphia, PA). ICSLP, 1005–1008.

BRATMAN, M., ISRAEL, D., AND POLLACK, M. 1988. Plans and resource-bounded practical reasoning. Computational Intelligence 4, 2, 349–355.

CARBERRY, S. 1986. The use of inferred knowledge in handling pragmatically ill-formed queries. In Communication Failure in Dialogue, R. Reilly, Ed. Elsevier Science Publishers North Holland, Amsterdam, The Netherlands, 187–200.

CARBERRY, S. 1989. Plan recognition and its use in understanding dialogue. In User Models in Dialog Systems, A. Kobsa and W. Wahlster, Eds. Springer Verlag, London, UK, 133–162.

CARBERRY, S. AND LAMBERT, L. 1999. A process model for recognising communicative acts and modeling negotiation subdialogues. Computational Linguistics 25, 1, 1–54.

CARLETTA, J. 1996. Assessing the reliability of subjective codings. Computational Linguistics 22, 2, 249–254.

CARLSON, R. AND GRANSTROM, B. 1997. Speech synthesis. In The Handbook of Phonetic Science, W. J. Hardcastle and J. Laver, Eds. Blackwell, Oxford, UK, 768–788.

CHIN, D. 1989. KNOME: Modeling what the user knows in UC. In User Models in Dialog Systems, A. Kobsa and W. Wahlster, Eds. Springer Verlag, London, UK, 74–107.

CLARK, H. 1992. Arenas of Language Use. University of Chicago Press, Chicago, IL.

COHEN, P. 1994. Models of dialogue. In Proceedings of the 4th NEC Research Symposium, M. Nagao, Ed. SIAM Press, Philadelphia, PA.

COHEN, P. AND LEVESQUE, H. 1990. Rational interaction as the basis for communication. In Intentions in Communication, P. Cohen, J. Morgan, and M. Pollack, Eds. MIT Press, Cambridge, MA, 221–256.

COHEN, P. AND OVIATT, S. 1995. The role of voice in human-machine communication. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 34–75.

COLE, R., MASSARO, D., DE VILLIERS, J., RUNDLE, B., SHOBAKI, K., WOUTERS, J., COHEN, M., BESKOW, J., STONE, P., CONNORS, P., TARACHOW, A., AND SOLCHER, D. 1999a. New tools for interactive speech and language training: Using animated conversational agents in the classrooms of profoundly deaf children. In Proceedings of ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education (London). 45–52.

COLE, R., SERRIDGE, B., HOSOM, J. P., CRONK, A., AND

KAISER, E. 1999b. A platform for multilingualresearch in spoken dialogue systems. In Multi-Lingual Interoperability in Speech Technology(MIST) (Leusden, The Netherlands).

COLE, R., NOVICK, D., VERMEULEN, P., SUTTON, S.,FANTY, M., WESSELS, L., DE VILLIERS, J., SCHALKWYK,J., HANSEN, B., AND BURNETT, D. 1997. Experi-ments with a spoken dialogue system for takingthe U.S. census. In Speech Communication 23, 3,243–260.

CONSTANTINIDES, P., HANSMA, S., TCHOU, C., AND

RUDNICKY, A. 1998. A schema based approachto dialog control. In Proceedings of the 5th Inter-national Conference on Spoken Language Pro-cessing (ICSLP’98, Sydney, Australia), Vol. 2.ICSLP, 409–412.

DAHLBACK, N. AND JONSSON A. 1999. Knowledgesources in spoken dialogue systems. In Proceed-ings of 6th European Conf. on Speech Com-munication and Technology (Eurospeech’99,Budapest, Hungary). ESCA.

DALE, R. AND REITER, E. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science 19, 233–263.

DANIELI, M. AND GERBINO, E. 1995. Metrics for evaluating dialogue strategies in a spoken language system. In Working Notes of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation. AAAI, Stanford, CA, 34–39.

DENECKE, M. AND WAIBEL, A. 1997. Dialogue strategies guiding users to their communicative goals. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech'97, Rhodes, Greece). ESCA.

DOWDING, J., GAWRON, J. M., APPELT, D., BEAR, J., CHERNY, L., MOORE, R., AND MORAN, D. 1993. Gemini: a natural language system for spoken language understanding. In Proceedings of the 31st Annual Meeting of the ACL. ACL, Columbus, OH, 54–61.

DYBKJÆR, L., BERNSEN, N. O., AND DYBKJÆR, H. 1996. Evaluation of spoken dialogue systems. In Proceedings of the Eleventh Twente Workshop on Language Technology (TWLT 11): Dialogue Management in Natural Language Systems, S. LuperFoy, A. Nijholt, and G. V. van Zanten, Eds. Universiteit Twente, Enschede, The Netherlands.

DYBKJÆR, L., BERNSEN, N. O., AND DYBKJÆR, H. 1997. Generality and objectivity: central issues in putting a dialogue evaluation tool into practical use. In Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications. Proceedings of a Workshop Sponsored by the Association for Computational Linguistics (Madrid, Spain), J. Hirschberg, C. Kamm, and M. Walker, Eds. ACL, 17–24.

DYBKJÆR, L., BERNSEN, N. O., AND DYBKJÆR, H. 1998. A methodology for diagnostic evaluation of spoken human-machine interaction. International Journal of Human-Computer Studies 48, 605–625.

ECKERT, W. AND NIEMANN, H. 1994. Semantic analysis in a robust spoken dialog system. In Proceedings of the 3rd International Conference on Spoken Language Processing (Yokohama, Japan). ICSLP, 107–110.

ECKERT, W., NOTH, E., NIEMANN, H., AND SCHUKAT-TALAMAZZINI, E. G. 1995. Real users behave weird: experiences made collecting large human-machine dialog corpora. In Proceedings of the ESCA Workshop on Spoken Dialogue Systems, P. Dalsgaard, L. Larsen, L. Boves, and I. Thomsen, Eds. ESCA, Vigso, Denmark, 193–196.

EDGINGTON, M., LOWRY, A., JACKSON, P., BREEN, A. P., AND MINNIS, S. 1996a. Overview of current text-to-speech synthesis techniques: Part I—text and linguistic analysis. BT Technology Journal 14, 1, 68–83.

EDGINGTON, M., LOWRY, A., JACKSON, P., BREEN, A. P., AND MINNIS, S. 1996b. Overview of current text-to-speech synthesis techniques: Part II—prosody and speech generation. BT Technology Journal 14, 1, 84–99.

FERGUSON, G., ALLEN, J. F., AND MILLER, B. 1996. Trains-95: Towards a mixed-initiative planning assistant. In Proceedings of the 3rd International Conference on AI Planning Systems (AIPS-96, Edinburgh, Scotland, UK). 70–77.

FRASER, N. 1997. Assessment of interactive systems. In Handbook of Standards and Resources for Spoken Language Systems, D. Gibbon, R. Moore, and R. Winski, Eds. Mouton de Gruyter, New York, NY, 564–614.

FRASER, N. AND GILBERT, G. N. 1991. Simulating speech systems. Computer Speech and Language 5, 81–99.

GERBINO, E. AND DANIELI, M. 1993. Managing dialogue in a continuous speech understanding system. In Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech'93, Berlin, Germany). ESCA, 1661–1664.

GIACHIN, E. AND MCGLASHAN, S. 1997. Spoken language dialogue systems. In Corpus-Based Methods in Language and Speech Processing, S. Young and G. Bloothooft, Eds. Kluwer Academic Publishers, Dordrecht, The Netherlands, 69–117.

GIBBON, D., MOORE, R., AND WINSKI, R., Eds. 1997. Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, New York, NY.

GODDEAU, D., MENG, H., POLIFRONI, J., SENEFF, S., AND BUSAYAPONGCHAI, S. 1996. A form-based dialogue manager for spoken language applications. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP'96, Philadelphia, PA). ICSLP, 701–704.

GRICE, P. 1975. Logic and conversation. In Syntax and Semantics Vol. 3: Speech Acts, P. Cole and J. Morgan, Eds. Academic Press, New York, NY, 41–58.

GROSZ, B. J., JOSHI, A. K., AND WEINSTEIN, S. 1983. Providing a unified account of definite noun phrases in discourse. In Proceedings of the 21st Annual Meeting of the ACL. ACL, Boston, MA, 44–50.

GROSZ, B. J. AND SIDNER, C. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12, 3, 175–204.

HANSEN, B., NOVICK, D. G., AND SUTTON, S. 1996. Systematic design of spoken prompts. In CHI'96 (Vancouver, B.C., Canada). ACM Press, New York, NY, 157–164.

HEEMAN, P. A. AND ALLEN, J. F. 1997. Intonational boundaries, speech repairs, and discourse markers: modeling spoken dialog. In Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the Association for Computational Linguistics (Madrid, Spain). ACL, 254–261.

HEISTERKAMP, P. AND MCGLASHAN, S. 1996. Units of dialogue management: an example. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP'96, Philadelphia, PA). ICSLP, 200–203.

HIRSCHMAN, L. 1995. The roles of language processing in a spoken language interface. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 217–237.

HONE, K. S. AND BABER, C. 1995. Using a simulation method to predict the transaction time effects of applying alternative levels of constraint to utterances within speech interactive dialogues. In Proceedings of the ESCA Workshop on Spoken Dialogue Systems, P. Dalsgaard, L. Larsen, L. Boves, and I. Thomsen, Eds. ESCA, Vigso, Denmark, 209–212.

JAMESON, A. 1989. But what will the user think? Belief ascription and image maintenance in dialog. In User Models in Dialog Systems, A. Kobsa and W. Wahlster, Eds. Springer Verlag, London, UK, 255–312.

JURAFSKY, D. AND MARTIN, J. 2000. Speech and Language Processing: An Introduction to Natural Language Processing. Prentice Hall, Englewood Cliffs, NJ.

KAISER, E. C., JOHNSTON, M., AND HEEMAN, P. A. 1999. Profer: Predictive, robust finite-state parsing for spoken language. In Proceedings of ICASSP (Phoenix, AZ), Vol. 2. IEEE, 629–632.

KAMM, C. 1995. User interfaces for voice applications. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 34–75.

KAMM, C., WALKER, M. A., AND LITMAN, D. J. 1999. Evaluating spoken language systems. In Proceedings of the American Voice Input/Output Society (AVIOS). AVIOS.

KAPLAN, J. 1983. Cooperative responses from a portable natural language database query system. In Computational Models of Discourse, M. Brady and R. Berwick, Eds. MIT Press, Cambridge, MA, 167–208.

KUBALA, F., BARRY, C., BATES, M., BOBROW, R., FUNG, P., INGRIA, R., MAKHOUL, J., NGUYEN, L., SCHWARTZ, R., AND STALLARD, D. 1992. BBN Byblos and HARC February 1992 ATIS benchmark results. In Proceedings of the DARPA Speech and Natural Language Workshop, Harriman, N.Y. Morgan Kaufmann Publishers, San Mateo, CA, 72–77.

LAMEL, L. F., BENNACEF, S. K., BONNEAU-MAYNARD, H., ROSSET, S., AND GAUVAIN, J. L. 1995. Recent developments in spoken language systems for information retrieval. In Proceedings of the ESCA Workshop on Spoken Dialogue Systems, P. Dalsgaard, L. Larsen, L. Boves, and I. Thomsen, Eds. ESCA, Vigso, Denmark, 17–20.

LARSEN, L. B. AND BAEKGAARD, A. 1994. Rapid prototyping of a dialogue system using a generic dialogue development platform. In Proceedings of ICSLP'94 (Yokohama, Japan). ICSLP, 919–922.

LENNIG, M., BIELBY, G., AND MASSICOTTE, J. 1995. Directory assistance automation in Bell Canada: Trial results. Speech Communication 17, 227–234.

LITMAN, D. J. AND ALLEN, J. F. 1987. A plan recognition model for subdialogues in conversation. Cognitive Science 11, 163–200.

LITMAN, D. J., KEARNS, M. S., SINGH, S., AND WALKER, M. A. 2000. Automatic optimization of dialogue management. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000, Saarbrucken, Germany). ACL.

MAKHOUL, J. AND SCHWARTZ, R. 1995. State of the art in continuous speech recognition. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 165–197.

MANN, W. AND THOMPSON, S. 1988. Rhetorical structure theory: toward a functional theory of text organisation. Text 3, 243–281.

MARCUS, M. 1995. New trends in natural language processing: statistical natural language processing. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 482–504.

MARTIN, P., CRABBE, F., ADAMS, S., BAATZ, E., AND YANKELOVICH, N. 1996. SpeechActs: a spoken language framework. IEEE Computer 29, 7, 33–40.

MAYBURY, M. T., Ed. 1993. Intelligent Multimedia Interfaces. MIT Press, Cambridge, MA.

MCCOY, K. F. 1986. Generating responses to property misconceptions using perspective. In Communication Failure in Dialogue, R. Reilly, Ed. Elsevier Science Publishers North Holland, Amsterdam, The Netherlands, 149–160.

MCGLASHAN, S., BILANGE, E., FRASER, N., GILBERT, N., HEISTERKAMP, P., AND YOUD, N. 1990. Managing oral dialogues. Research Report, Social and Computer Sciences Research Group, University of Surrey, Surrey, UK.

MCKEOWN, K. 1985. Text Generation. Cambridge University Press, Cambridge, UK.

MCROY, S. 1996. Detecting, repairing, and preventing human-machine miscommunication. In AAAI '96 Workshop (Portland, OR).

MCTEAR, M. 1998. Modelling spoken dialogues with state transition diagrams: experiences with the CSLU toolkit. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98, Sydney, Australia). ICSLP, 1223–1226.

MCTEAR, M. 1999. Using the CSLU toolkit for practicals in spoken dialogue technology. In Proceedings of the ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education (London, UK). ESCA, 113–116.

MCTEAR, M., ALLEN, S., CLATWORTHY, L., ELLISON, N., LAVELLE, C., AND MCCAFFERY, H. 2000. Integrating flexibility into a structured dialogue model: Some design considerations. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP'2000, Beijing, China), Vol. 1. ICSLP, 110–113.

MOORE, R. C. 1995. Integration of speech with natural language understanding. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon, Eds. National Academy Press, Washington, DC, 254–271.

NAGATA, M. AND MORIMOTO, T. 1994. First steps toward statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication 15, 193–203.

PAGE, J. H. AND BREEN, A. P. 1996. The Laureate text-to-speech system—architecture and applications. BT Technology Journal 14, 1, 57–67.

PARIS, C. L. 1989. The use of explicit user models in a generation system for tailoring answers to the user's level of expertise. In User Models in Dialog Systems, A. Kobsa and W. Wahlster, Eds. Springer Verlag, London, UK, 200–232.

PECKHAM, J. 1993. A new generation of spoken dialogue systems: Results and lessons from the SUNDIAL project. In Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech'93, Berlin, Germany). ESCA, 33–40.

PECKHAM, J. n.d. Vocalis develops directory enquiries service with speech recognition for Telia. http://www.callcentres.com.au/speechr1.htm.

PHILIPS SPEECH PROCESSING. 1997. HDDL v2.0—Dialog Description Language—User's Guide. Philips Speech Processing, Aachen, Germany.

POLLACK, M. 1986. Some requirements for a model of the plan-inference process in conversation. In Communication Failure in Dialogue, R. Reilly, Ed. Elsevier Science Publishers North Holland, Amsterdam, The Netherlands, 245–256.

POTJER, J., RUSSEL, A., BOVES, L., AND OS, E. D. 1996. Subjective and objective evaluation of two types of dialogues in a call assistance service. In IVTTA. IEEE, Basking Ridge, NJ, 121–124.

POWER, K. J. 1996. The listening telephone—automating speech recognition over the PSTN. BT Technology Journal 14, 1, 112–126.

PRICE, P. 1996. Spoken language understanding. In Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Cambridge University Press, Cambridge, UK. Online version: http://cslu.cse.ogi.edu/HLTsurvey/.

PULMAN, S. G. 1996. Semantics. In Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Cambridge University Press, Cambridge, UK. Online version: http://cslu.cse.ogi.edu/HLTsurvey/.

RABINER, L. R. AND JUANG, B. H. 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.

REITER, E. AND DALE, R. 1997. Building applied natural language generation systems. Natural Language Engineering 3, 1, 57–87.

RUDNICKY, A. I., THAYER, E., CONSTANTINIDES, P., TCHOU, C., SHERN, R., LENZO, K., XU, W., AND OH, A. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech'99, Budapest, Hungary). ESCA.

SADEK, M. D., BRETIER, P., AND PANAGET, F. 1997. ARTIMIS: Natural dialogue meets rational agency. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97). Morgan Kaufmann Publishers, San Francisco, CA, 1030–1035.

SADEK, M. D. AND DE MORI, R. 1998. Dialogue systems. In Spoken Dialogues with Computers, R. de Mori, Ed. Academic Press, London, UK, 523–561.

SEARLE, J. R. 1969. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, UK.

SENEFF, S. 1992. TINA: A natural language system for spoken language applications. Computational Linguistics 18, 1, 61–86.

SHIEBER, S. M. 1986. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes, CSLI, Stanford, CA.

SIMPSON, A. AND FRASER, N. 1993. Black box and glass box evaluation of the SUNDIAL system. In Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech'93, Berlin, Germany). ESCA, 33–40.

SJOLANDER, K., BESKOW, J., GUSTAFSON, J., LEWIN, E., CARLSON, R., AND GRANSTROM, B. 1998. Web-based educational tools for speech technology. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98, Sydney, Australia). ICSLP, 3217–3220.

SMITH, R. AND GORDON, S. A. 1997. Effects of variable initiative on linguistic behavior in human-computer spoken natural language dialogue. Computational Linguistics 23, 1, 141–168.

SMITH, R. AND HIPP, D. R. 1994. Spoken Natural Language Dialog Systems: A Practical Approach. Oxford University Press, New York, NY.

SMITH, R. W. 1997. Performance measures for the next generation of spoken natural language dialog systems. In Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications. Proceedings of a Workshop Sponsored by the Association for Computational Linguistics (Madrid, Spain), J. Hirschberg, C. Kamm, and M. Walker, Eds. ACL, 37–40.

STALLARD, D. AND BOBROW, R. 1992. Fragment processing in the DELPHI system. In Proceedings of the Speech and Natural Language Workshop, Harriman, N.Y. Morgan Kaufmann Publishers, San Mateo, CA, 305–310.

STRIK, H., RUSSEL, A., VAN DEN HEUVEL, H., CUCCHIARINI, C., AND BOVES, L. 1996. Localizing an automatic inquiry system for public transport information. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP'96, Philadelphia, PA), Vol. 2. ICSLP, 24–31.

SUTTON, S., COLE, R., DE VILLIERS, J., SCHALKWYK, J., VERMEULEN, P., MACON, M., YAN, Y., KAISER, E., RUNDLE, B., SHOBAKI, K., HOSOM, J. P., KAIN, A., WOUTERS, J., MASSARO, D., AND COHEN, M. 1998. Universal speech tools: The CSLU toolkit. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98, Sydney, Australia). ICSLP, 3221–3224.

SUTTON, S., HANSEN, B., LANDER, T., NOVICK, D. G., AND COLE, R. 1995. Evaluating the effectiveness of dialogue for an automated spoken questionnaire. Tech. Rep. CS/E95-12, Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology.

TRAUM, D. R. 1996. Conversational agency: the TRAINS-93 dialogue manager. In Proceedings of the Eleventh Twente Workshop on Language Technology (TWLT 11): Dialogue Management in Natural Language Systems, S. LuperFoy, A. Nijholt, and G. V. van Zanten, Eds. Universiteit Twente, Enschede, The Netherlands.

TRAUM, D. R. AND ALLEN, J. F. 1994. Discourse obligations in dialogue processing. In Proceedings of the 32nd Annual General Meeting of the Association for Computational Linguistics (Las Cruces, NM). ACL, 1–8.

USZKOREIT, H. AND ZAENEN, A. 1996. Grammar formalisms. In Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Cambridge University Press, Cambridge, UK. Online version: http://cslu.cse.ogi.edu/HLTsurvey/.

VERGEYNST, N. A., EDWARDS, K., FOSTER, J. C., AND JACK, M. A. 1993. Spoken dialogues for human-computer interaction over the telephone: complexity measures. In Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech'93, Berlin, Germany). ESCA, 1415–1418.

VOICEXML FORUM. n.d. http://www.voicexml.org.

WAHLSTER, W. AND KOBSA, A. 1989. User models in dialog systems. In User Models in Dialog Systems, A. Kobsa and W. Wahlster, Eds. Springer Verlag, London, UK, 4–24.

WALKER, M. A. 1989. Evaluating discourse processing algorithms. In Proceedings of the 27th Annual General Meeting of the Association for Computational Linguistics (Vancouver, B.C., Canada). ACL, 251–261.

WALKER, M. A., LITMAN, D., KAMM, C., AND ABELLA, A. 1997. PARADISE: a general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual General Meeting of the Association for Computational Linguistics, ACL/EACL (Madrid, Spain). ACL, 271–280.

WALKER, M. A., LITMAN, D., KAMM, C., AND ABELLA, A. 1998. Evaluating spoken dialogue agents with PARADISE: Two case studies. Computer Speech and Language 12, 3, 317–347.

WARD, W. AND PELLOM, B. 1999. The CU Communicator system. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (Keystone, CO). IEEE.

WHITTAKER, S. J. AND ATTWATER, D. J. 1996. The design of complex telephony applications using large vocabulary speech technology. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP-96, Philadelphia, PA). ICSLP, 705–708.

WRIGHT, J., GORIN, A., AND ABELLA, A. 1998. Spoken language understanding within dialogs using a graphical model of task structure. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98, Sydney, Australia), Vol. 5. ICSLP.

WYARD, P. J., SIMONS, A. D., APPLEBY, S., KANEEN, E., WILLIAMS, S. H., AND PRESTON, K. R. 1996. Spoken language systems—beyond prompt and response. BT Technology Journal 14, 1, 187–205.

YANKELOVICH, N. n.d. Using natural dialogues as the basis for speech interface design. In Automated Spoken Dialog Systems, S. Luperfoy, Ed. MIT Press, Cambridge, MA.

YANKELOVICH, N., LEVOW, G., AND MARX, M. 1995. Designing SpeechActs: issues in speech user interfaces. In Proceedings of CHI'95. Addison-Wesley, Reading, MA, 369–375.

YOUNG, S. AND BLOOTHOOFT, G., Eds. 1997. Corpus-Based Methods in Language and Speech Processing. Kluwer Academic Publishers, Dordrecht, The Netherlands.

YOUNG, S. R., HAUPTMANN, A. G., WARD, W. H., SMITH, E. T., AND WERNER, P. 1989. High level knowledge sources in usable speech recognition systems. Communications of the ACM 32, 2, 183–194.

Received September 1998; revised February 2000, October 2000; accepted October 2001
