Spoken Dialogue System Architecture
Joshua Gordon, CS4706


Outline
- Goals of an SDS architecture
- Research challenges
- Practical considerations
- An end-to-end tour of a real-world SDS

SDS Architectures
- Software abstractions that tie together and orchestrate the many NLP components required for human-computer dialogue
- Conduct task-oriented, limited-domain conversations
- Manage the many levels of information processing (e.g., utterance interpretation, turn taking) necessary for dialogue
- Operate in real time, under uncertainty

Examples: Information Seeking, Transactional
- The most common type of SDS
- CMU: bus route information
- Columbia: virtual librarian
- Google: directory service

Let's Go Public

Examples: Virtual Humans
- Multimodal input / output
- Prosody and facial expression
- Auditory and visual cues assist turn taking
- Many limitations: scripting, constrained domain

http://ict.usc.edu/projects/virtual_humans

Examples: Interactive Kiosks

- Multi-participant conversations!
- Surprises and challenges passersby to trivia games [Bohus and Horvitz, 2009]

Examples: Robotic Interfaces

www.cellbots.com

Speech interface to a UAV [Eliasson, 2007]

Conversational skills
SDS architectures tie together:
- Speech recognition
- Turn taking
- Dialogue management
- Utterance interpretation
- Grounding
- Natural language generation
And increasingly include:
- Multimodal input / output
- Gesture recognition

Research challenges in every area
- Speech recognition: accuracy in interactive settings, detecting emotion
- Turn taking: fluidly handling overlap, backchannels
- Dialogue management: increasingly complex domains, better generalization, multi-party conversations
- Utterance interpretation: reducing constraints on what the user can say and how they can say it; attending to prosody, emphasis, speech rate

A tour of a real-world SDS: CMU Olympus
- An open source collection of dialogue system components
- A research platform used to investigate dialogue management, turn taking, and spoken language interpretation
- Actively developed
- Many implementations: Let's Go Public, TeamTalk, CheckItOut
- www.speech.cs.cmu.edu

Conventional SDS Pipeline

Speech signals to words. Words to domain concepts. Concepts to system intentions. Intentions to utterances (represented as text). Text to speech.

Olympus under the hood: the provider pattern
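As a rough sketch, the pipeline above can be read as a chain of transformations. The function names and the toy bus-route domain below are invented for illustration; they are not Olympus components.

```python
# Illustrative sketch of the conventional SDS pipeline stages.
# Every function name and the domain are hypothetical stand-ins.

def recognize(audio):
    """Speech signals to words (ASR). Here: a canned hypothesis."""
    return "when is the next bus to downtown"

def understand(words):
    """Words to domain concepts (SLU)."""
    frame = {"act": "route_query"}
    if "bus" in words:
        frame["vehicle"] = "bus"
    if "downtown" in words:
        frame["destination"] = "downtown"
    return frame

def decide(frame):
    """Concepts to system intentions (dialogue management)."""
    if frame.get("destination"):
        return {"intent": "inform_schedule", "destination": frame["destination"]}
    return {"intent": "request_destination"}

def generate(intention):
    """Intentions to utterances, as text (NLG); a TTS stage would follow."""
    if intention["intent"] == "inform_schedule":
        return f"The next bus to {intention['destination']} leaves in 10 minutes."
    return "Where would you like to go?"

reply = generate(decide(understand(recognize(None))))
```

Each stage consumes the previous stage's output, which is why recognizer noise at the front of the chain propagates into every later component.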

Speech recognition

The Sphinx open source recognition toolkit
- PocketSphinx: a continuous-speech, speaker-independent recognition system
- Includes tools for language model compilation, pronunciation, and acoustic model adaptation
- Provides word-level confidence annotation and n-best lists
- Efficient: runs on embedded devices (including an iPhone SDK)
- Olympus supports parallel decoding engines / models
- Typically runs parallel acoustic models for male and female speech
- http://cmusphinx.sourceforge.net/

Speech recognition challenges in interactive settings
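When several decoders run in parallel, the system must pick among their hypotheses. A minimal sketch of confidence-based selection over invented n-best output (real confidences would come from the decoders themselves):

```python
# Hypothetical n-best output from two parallel decoders; in a real system
# these (hypothesis, confidence) pairs would come from engines such as
# PocketSphinx running different acoustic models.
nbest_male = [("take me to the airport", 0.42), ("take me to their port", 0.31)]
nbest_female = [("take me to the airport", 0.87), ("make tea at the airport", 0.05)]

def best_hypothesis(*nbest_lists):
    """Merge n-best lists from parallel decoders, keep the highest-confidence one."""
    merged = [hyp for lst in nbest_lists for hyp in lst]
    return max(merged, key=lambda h: h[1])

hyp, conf = best_hypothesis(nbest_male, nbest_female)
```

A production system would do something richer (e.g., combine lattices or rescore), but the selection problem is the same.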

- Spontaneous dialogue is difficult for speech recognizers
- Performance is poor in interactive settings compared to one-off applications like voice search and dictation
- Performance phenomena: backchannels, pause-fillers, false starts
- Out-of-vocabulary (OOV) words
- Interaction with an SDS is cognitively demanding for users: what can I say, and when? Will the system understand me?
- Uncertainty increases disfluency, resulting in further recognition errors

WER (word error rate)
Non-interactive settings:
- Google Voice Search: 17% as deployed (0.57% OOV over 10k queries randomly sampled from Sept-Dec 2008)
Interactive settings:
- Let's Go Public: 17% in controlled conditions vs. 68% in the field
- CheckItOut: used to investigate task-oriented performance under worst-case ASR; 30% to 70% depending on the experiment
- Virtual Humans: 37% in laboratory conditions

Examples of (worst-case) recognizer noise
S: What book would you like?
U: The Language of Sycamores
ASR: THE LANGUAGE OF IS .A. COMING WARS
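The WER figures above are word-level edit distance (substitutions + insertions + deletions) divided by reference length; a minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

On the noisy book-title recognition above, every content word after "the language of" is wrong, so WER can reach or exceed 100% (insertions make WER > 100% possible).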

S: Hi Scott, welcome back!
U: Not Scott, Sarah! Sarah Lopez.
ASR: SCOTT SARAH SCOUT LAW

Error propagation
- Recognizer noise injects uncertainty into the pipeline
- Information loss occurs when moving from an acoustic signal to a lexical representation
- Most SDSs ignore prosody, amplitude, and emphasis
- Information provided to downstream components includes an n-best list or word lattice, and low-level features: speech rate, speech energy

Spoken Language Understanding

SLU maps from words to concepts
- Dialog acts (the overall intent of an utterance)
- Domain-specific concepts (like a book, or a bus route)
- Single utterances vs. across turns
- Challenging in noisy settings

Example: "Does the library have The Hitchhiker's Guide to the Galaxy by Douglas Adams on audio cassette?"
  Dialog act: Book Request
  Title: The Hitchhiker's Guide to the Galaxy
  Author: Douglas Adams
  Media: audio cassette

Semantic grammars
- Domain-independent concepts: [Yes], [No], [Help], [Repeat], [Number]
- Domain-specific concepts: [Book], [Author]

[Quit]
    (*THANKS *good bye)
    (*THANKS goodbye)
    (*THANKS +bye);

THANKS
    (thanks *VERY_MUCH)
    (thank you *VERY_MUCH)

VERY_MUCH
    (very much)
    (a lot);
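A toy illustration of how such nets can match word sequences, compiling each rule to a regular expression. This is a simplified stand-in, not the Phoenix implementation: `*X` marks X optional, `+X` one-or-more, as in the rules above.

```python
import re

# Toy semantic-grammar matcher (NOT the Phoenix parser): each net is
# hand-translated to a regex fragment mirroring the rules above.
NETS = {
    "VERY_MUCH": r"(?:very much|a lot)",
    "THANKS": r"(?:thanks|thank you)(?: VERY_MUCH)?",
    "Quit": r"(?:THANKS )?(?:good bye|goodbye|bye(?: bye)*)",
}

def compile_net(name):
    """Expand nested net references, then anchor the pattern."""
    pattern = NETS[name]
    for ref in ("THANKS", "VERY_MUCH"):
        pattern = pattern.replace(ref, NETS[ref])
    return re.compile(r"^" + pattern + r"$")

quit_net = compile_net("Quit")
```

An utterance like "thank you very much goodbye" then matches the [Quit] net, while out-of-grammar input does not, which is exactly the brittleness discussed next.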

Grammars generalize poorly
- Useful for extracting fine-grained concepts, but:
- Hand engineered; time consuming to develop and tune
- Requires expert linguistic knowledge to construct
- Difficult to maintain over complex domains
- Lacks robustness to OOV words and novel phrasing
- Sensitive to recognizer noise

SLU in Olympus: the Phoenix parser
- Phoenix is a semantic parser intended to be robust to recognition noise
- Phoenix parses the incoming stream of recognition hypotheses
- Maps words in ASR hypotheses to semantic frames
- Each frame has an associated CFG grammar, specifying word patterns that match its slots
- Multiple parses may be produced for a single utterance
- The frame is forwarded to the next component in the pipeline

Statistical methods
- Supervised learning is commonly used for single-utterance interpretation
- Given a word sequence W, find the semantic representation of meaning M that has maximum a posteriori probability P(M|W)
- Useful for dialog act identification and determining broad intent
- Like all supervised techniques, this requires a training corpus, and is often domain and recognizer dependent

Belief updating

Cross-utterance SLU
- U: Get my coffee cup and put it on my desk. The one at the back.
- Difficult in noisy settings
- Mostly new territory for SDS

[Zuckerman, 2009]

Dialogue Management

The Dialogue Manager
- Represents the system's agenda
- Many techniques: hierarchical plans, state / transaction tables, Markov processes
- System initiative vs. mixed initiative: system initiative has less uncertainty about the dialog state, but is clunky
- Required to manage uncertainty and error handling: belief updating, domain-independent error-handling strategies
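A minimal sketch of a system-initiative, state-table dialogue manager. The states, prompts, and room-reservation domain are invented for illustration; this is not Olympus's RavenClaw.

```python
# Illustrative system-initiative dialogue manager driven by a state table.
# States, slots, and prompts are invented for a toy reservation domain.
STATE_TABLE = {
    "start":    {"prompt": "What time would you like the room?", "slot": "time", "next": "ask_size"},
    "ask_size": {"prompt": "Small room or large room?",          "slot": "size", "next": "done"},
    "done":     {"prompt": "Booking your room now.",             "slot": None,   "next": None},
}

class DialogueManager:
    def __init__(self):
        self.state = "start"
        self.frame = {}

    def next_prompt(self):
        return STATE_TABLE[self.state]["prompt"]

    def observe(self, value):
        """Fill the current slot and advance the agenda."""
        entry = STATE_TABLE[self.state]
        if entry["slot"]:
            self.frame[entry["slot"]] = value
            self.state = entry["next"]

dm = DialogueManager()
```

Because the system always asks the next question, there is little uncertainty about the dialog state; the cost is the clunky, inflexible interaction noted above.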

Task Specification, Agenda, and Execution

[Bohus, 2007]

Domain-independent error handling

[Bohus, 2007]

Error recovery strategies

Error-handling strategy (misunderstanding) | Example
Explicit confirmation | Did you say you wanted a room starting at 10 a.m.?
Implicit confirmation | Starting at 10 a.m. ... until what time?

Error-handling strategy (non-understanding) | Example
Notify that a non-understanding occurred | Sorry, I didn't catch that.
Ask the user to repeat | Can you please repeat that?
Ask the user to rephrase | Can you please rephrase that?
Repeat the prompt | Would you like a small room or a large one?

Statistical approaches to dialogue management
- Learn a management policy from a corpus
- Dialogue can be modeled as a Partially Observable Markov Decision Process (POMDP)
- Reinforcement learning is applied (either to existing corpora or through user simulation studies) to learn an optimal strategy
- Evaluation functions typically reference the PARADISE framework
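In practice the choice among these strategies is often driven by the recognizer's confidence score. A hedged sketch; the thresholds and strategy names are invented for the example:

```python
# Illustrative policy mapping ASR confidence to an error-handling strategy.
# Thresholds (0.4, 0.8) are invented, not taken from Olympus.
def choose_strategy(confidence, understood):
    if not understood:
        return "ask_repeat"        # non-understanding: "Can you please repeat that?"
    if confidence < 0.4:
        return "explicit_confirm"  # "Did you say you wanted a room starting at 10 a.m.?"
    if confidence < 0.8:
        return "implicit_confirm"  # "Starting at 10 a.m. ... until what time?"
    return "accept"                # confident enough to proceed
```

The statistical approaches below replace such hand-tuned thresholds with a policy learned from data.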

Interaction management

The Interaction Manager
- Mediates between the discrete, symbolic reasoning of the dialog manager and the continuous, real-time nature of user interaction
- Manages timing, turn-taking, and barge-in
- Yields the turn to the user on interruption
- Prevents the system from speaking over the user
- Notifies the dialog manager of interruptions and incomplete utterances

Natural Language Generation and Speech Synthesis

NLG and Speech Synthesis
- Template based, e.g., for explicit error-handling strategies ("Did you say ...?"); more interesting cases arise in disambiguation dialogs
- A TTS engine synthesizes the NLG output
- The audio server allows interruption mid-utterance
- Production systems incorporate prosody and intonation contours to indicate degree of certainty
- Open source TTS frameworks:
  - Festival: http://www.cstr.ed.ac.uk/projects/festival/
  - Flite: http://www.speech.cs.cmu.edu/flite/
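Template-based generation can be as simple as slot substitution; a minimal sketch, with templates invented in the spirit of the confirmation prompts above:

```python
# Minimal template-based NLG: dialog act + slots -> surface text.
# The template strings are invented examples.
TEMPLATES = {
    "explicit_confirm": "Did you say you wanted a room starting at {time}?",
    "implicit_confirm": "Starting at {time} ... until what time?",
    "ask_repeat": "Can you please repeat that?",
}

def generate(act, **slots):
    """Render the template for a dialog act, filling in slot values."""
    return TEMPLATES[act].format(**slots)
```

The resulting string is what the TTS engine would then synthesize, with prosody markup layered on in a production system.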

Asynchronous architectures

- An asynchronous modification of TRIPS; most work is directed toward best-case speech recognition [Blaylock, 2002]
- A backup recognition pass enables better discussion of OOV utterances [Lemon, 2003]

Problem-solving architectures

- FORRSooth models task-oriented dialogue as cooperative decision making
- Six FORR-based services operate in parallel: Interpretation, Grounding, Generation, Discourse, Satisfaction, Interaction
- Each service has access to the same knowledge, in the form of descriptives

Thanks! Questions?