On Organic Interfaces
Victor Zue ([email protected])
MIT Computer Science and Artificial Intelligence Laboratory
Acknowledgements
Research Staff: Eric Brill, Scott Cyphers, Jim Glass, Dave Goddeau, T J Hazen, Lee Hetherington, Lynette Hirschman, Raymond Lau, Hong Leung, Helen Meng, Mike Phillips, Joe Polifroni, Shinsuke Sakai, Stephanie Seneff, Dave Shipman, Michelle Spina, Nikko Ström, Chao Wang
Graduate Students: Anderson, M.; Aull, A.; Brown, R.; Chan, W.; Chang, J.; Chang, S.; Chen, C.; Cyphers, S.; Daly, N.; Doiron, R.; Flammia, G.; Glass, J.; Goddeau, D.; Hazen, T.J.; Hetherington, L.; Huttenlocher, D.; Jaffe, O.; Kassel, R.; Kasten, P.; Kuo, J.; Kuo, S.; Lauritzen, N.; Lamel, L.; Lau, R.; Leung, H.; Lim, A.; Manos, A.; Marcus, J.; Neben, N.; Niyogi, P.; Mou, X.; Ng, K.; Pan, K.; Pitrelli, J.; Randolph, M.; Rtischev, D.; Sainath, T.; Sarma, S.; Seward, D.; Soclof, M.; Spina, M.; Tang, M.; Wichiencharoen, A.; Zeiger, K.
Introduction
Speech interfaces are ideal for information access and management when:
• The information space is broad and complex,
• The users are technically naive,
• The information device is small, or
• Only telephones are available.
Virtues of Spoken Language
Natural: Requires no special training
Flexible: Leaves hands and eyes free
Efficient: Has high data rate
Economical: Can be transmitted/received inexpensively
Communication via Spoken Language
[Figure: human and computer communicating via spoken language. Input: speech → (recognition) → text → (understanding) → meaning. Output: meaning → (generation) → text → (synthesis) → speech.]
Components of a Spoken Dialogue System
[Figure: block diagram. Speech → SPEECH RECOGNITION → words → LANGUAGE UNDERSTANDING → meaning representation → DIALOGUE MANAGEMENT, connected to DISCOURSE CONTEXT and a DATABASE → meaning → LANGUAGE GENERATION → sentence → SPEECH SYNTHESIS → speech. Graphs & tables are a further output.]
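The components named on this slide can be chained as in the minimal sketch below. Everything in it is an assumption made for illustration: the toy weather database, the `DiscourseContext` class, and the function bodies are hypothetical stand-ins (plain strings replace audio), not the architecture of any actual system.

```python
import string
from dataclasses import dataclass
from typing import Dict, List, Optional

# Toy database; the cities and forecasts are made-up illustration data.
WEATHER_DB: Dict[str, str] = {"boston": "sunny", "seattle": "rainy"}


@dataclass
class DiscourseContext:
    """Remembers the last city mentioned so follow-up questions can omit it."""
    last_city: Optional[str] = None


def speech_recognition(speech: str) -> List[str]:
    # Stand-in for a recognizer: the "speech" is already text, so just tokenize.
    cleaned = speech.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def language_understanding(words: List[str], context: DiscourseContext) -> Dict[str, Optional[str]]:
    # Build a meaning representation; fall back on discourse context for the city.
    city = next((w for w in words if w in WEATHER_DB), None)
    if city is None:
        city = context.last_city
    else:
        context.last_city = city
    return {"topic": "weather", "city": city}


def dialogue_management(frame: Dict[str, Optional[str]]) -> Dict[str, str]:
    # Decide what to do next: answer from the database or ask for the missing slot.
    city = frame["city"]
    if city is None:
        return {"act": "request", "slot": "city"}
    return {"act": "inform", "city": city, "forecast": WEATHER_DB[city]}


def language_generation(meaning: Dict[str, str]) -> str:
    if meaning["act"] == "request":
        return "Which city do you mean?"
    return f"It is {meaning['forecast']} in {meaning['city'].title()}."


def speech_synthesis(sentence: str) -> bytes:
    # Stand-in for a synthesizer: encoded text plays the role of a waveform.
    return sentence.encode("utf-8")


def run_turn(speech: str, context: DiscourseContext) -> bytes:
    words = speech_recognition(speech)
    frame = language_understanding(words, context)
    meaning = dialogue_management(frame)
    sentence = language_generation(meaning)
    return speech_synthesis(sentence)
```

Calling `run_turn("What is the weather in Boston?", ctx)` and then `run_turn("And tomorrow?", ctx)` with the same `ctx` shows the discourse context supplying the city for the follow-up question.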
Tremendous Progress to Date
Technological Advances
Inexpensive Computing
Increased Task Complexity
Data Intensive Training
Some Example Systems
BBN, 2007
MIT, 2007
KTH, 2007
Speech Synthesis
• Recent trend moves toward corpus-based approaches
– Increased storage and compute capacity
– Availability of large text and speech corpora
– Modeled after successful utilization for speech recognition
• Many successful implementations, e.g.,
– AT&T
– Cepstral
– Microsoft
[Audio example: "computer science", shown with the labels "compassion", "disputed", "cedar city", "since", "giant", "since"]
But we are far from done …
• Machine performance typically lags far behind human performance
• How can interfaces be truly anthropomorphic?
[Figure: bar chart of word error rate (%) on SWITCHBOARD (spontaneous speech): machine 43%, human 4%. Lippmann, 1997]
Premise of the Talk
• Propose a different perspective on development of speech-based interfaces
• Draw on insights from the evolution of computer science
– Computer systems are increasingly complex
– There is a move towards treating these complex systems like organisms that can observe, grow, and learn
• Will focus on spoken dialogue systems
Organic Interfaces
Computer: Yesterday and Today
Yesterday:
• Computation of static functions in a static environment, with well-understood specification
• Computation is its main goal
• Single agent
• Batch processing of text and homogeneous data
• Stand-alone applications
• Binary notion of correctness
Today:
• Adaptive systems operating in environments that are dynamic and uncertain
• Communication, sensing, and control just as important
• Multiple agents that may be cooperative, neutral, adversarial
• Stream processing of massive, heterogeneous data
• Interaction with humans is key
• Trade off multiple criteria
Increasingly, we rely on probabilistic representation, machine learning techniques, and optimization principles to build complex systems
Properties of Organic Systems
• Robust to changes in environment and operating conditions
• Learn through experience
• Observe their own behavior
• Context-aware
• Self-healing
• …
Research Challenges
Some Research Challenges
• Robustness
– Signal Representation
– Acoustic Modeling
– Lexical Modeling
– Multimodal Interactions
• Establishing Context
• Adaptation
• Learning
– Statistical Dialogue Management
– Interactive Learning
– Learning by Imitation
* Please refer to the written paper for topics not covered in the talk
Robustness: Acoustic Modeling
• Statistical n-grams have masked the inadequacies in acoustic modeling, but at a cost
– Size of training corpus
– Application-dependent performance
• To promote acoustic modeling research, we may want to develop a sub-word based recognition kernel
– Application independent
– Stronger constraints than phonemes
– Closed vocabulary for a given language
• Some success has been demonstrated (e.g., Chung & Seneff, 1998)
[Figure: levels of linguistic knowledge (acoustics, phonetics, phonemics, phonotactics, morphology, word (syllable), syntax, semantics, sentence), with a speech recognition kernel built from acoustic models and sub-word language-model units]
Robustness: Lexical Access
• Current approaches represent words as phoneme strings
• Phonological rules are sometimes used to derive alternate pronunciations (e.g., “temperature”)
• Lexical representation based on features offers much appeal (Stevens, 1995)
– Fewer models, less training data, greater parsimony
– Alternative lexical access models (e.g., Zue, 1983)
• Lexical access based on islands of reliability might be better able to deal with variability
Robustness: Multimodal Interactions
• Other modalities can augment/complement speech
[Figure: SPEECH RECOGNITION, GESTURE RECOGNITION, HANDWRITING RECOGNITION, and MOUTH & EYES TRACKING feed LANGUAGE UNDERSTANDING, which produces meaning]
Challenges for Multimodal Interfaces
• Input needs to be understood in the proper context
– “What about that one”
• Timing information is a useful way to relate inputs
[Figure: inputs aligned on a time axis. Speech: “Move this one over here”; Pointing: (object), (location)]
• Handling uncertainties and errors (Cohen, 2003)
• Need to develop a unifying linguistic framework
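The timing idea on this slide can be sketched as follows: each deictic word in the speech stream is bound to the pointing event closest to it in time. This is a minimal illustration under stated assumptions; the `Word` and `Pointing` types, the word list in `DEICTIC_WORDS`, the timestamps, and the greedy nearest-in-time rule are all hypothetical and not taken from any cited system.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical set of words that need a pointing gesture to be interpreted.
DEICTIC_WORDS = {"this", "that", "here", "there"}


@dataclass
class Word:
    text: str
    time: float  # seconds from the start of the utterance


@dataclass
class Pointing:
    target: str
    time: float  # seconds from the start of the utterance


def bind_deictics(words: List[Word], gestures: List[Pointing]) -> List[Tuple[str, str]]:
    """Greedily pair each deictic word with the nearest unused pointing event."""
    remaining = list(gestures)
    bindings = []
    for word in words:
        if word.text.lower() in DEICTIC_WORDS and remaining:
            nearest = min(remaining, key=lambda g: abs(g.time - word.time))
            remaining.remove(nearest)  # each gesture is used at most once
            bindings.append((word.text, nearest.target))
    return bindings


# "Move this one over here" with two pointing events (made-up timestamps).
utterance = [
    Word("Move", 0.0), Word("this", 0.3), Word("one", 0.5),
    Word("over", 0.9), Word("here", 1.1),
]
pointing = [Pointing("object", 0.35), Pointing("location", 1.2)]
bindings = bind_deictics(utterance, pointing)
```

A greedy rule like this ignores recognition uncertainty in both streams, which is one reason the slide lists the handling of uncertainties and errors as a separate challenge.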
Audio Visual Symbiosis
• The audio and visual signals both contain information about:
– Identity/location of the person
– Linguistic message
– Emotion, mood, stress, etc.
• Integration of these sources of information has been known to help humans (Benoit, 2000)
• Exploiting this symbiosis can lead to robustness, e.g.,
– Locating and identifying the speaker (Hazen et al., 2003)
– Speech recognition/understanding augmented with facial features (Huang et al., 2004)
– Speech and gesture integration (Gruenstein et al., 2006; Cohen, 2005)
– Audio/visual information delivery (Ezzat, 2003)
Establishing Context
• Context setting is important for dialogue interaction
– Environment
– Linguistic constructs
– Discourse
• Much work has been done, e.g.,
– Context-dependent acoustic and language models
– Sound segmentation
– Discourse modeling
• Some interesting new directions
– Tapestry of applications
– Acoustic scene analysis (Ellis, 2006)
[Figure: tapestry of applications: calendar, photos, weather, address, stocks, phonebook, music]
Acoustic Scene Analysis
• Acoustic signals contain a wealth of information (linguistic message, environment, speaker, emotion, …)
• We need to find ways to adequately describe the signals
[Figure: an audio stream segmented along a time axis, with each segment annotated as below]
signal type: speech
transcript: although both of the, both sides of the Central Artery …
topic: traffic report
speaker: female
. . .
signal type: speech
transcript: Forecast calls for at least partly sunny weather …
topic: weather, sponsor acknowledgement, time
speaker: male
. . .
signal type: speech
transcript: This is Morning Edition, I’m Bob Edwards …
topic: NPR news
speaker: male, Bob Edwards
. . .
signal type: music
genre: instrumental
artist: unknown
. . .
Some time in the future …
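One way to hold such a description in software is a list of time-stamped segments, each carrying a signal type and free-form attributes. The sketch below is only an assumption about how this could look: the `Segment` class, the `find_segments` helper, and all start and end times are hypothetical, while the attribute values are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    start: float          # seconds; the times below are invented for illustration
    end: float
    signal_type: str      # e.g. "speech" or "music"
    attributes: Dict[str, str] = field(default_factory=dict)


def find_segments(scene: List[Segment], signal_type: str, **required: str) -> List[Segment]:
    """Return segments of the given type whose attributes match all keyword filters."""
    return [
        s for s in scene
        if s.signal_type == signal_type
        and all(s.attributes.get(key) == value for key, value in required.items())
    ]


scene = [
    Segment(0.0, 12.0, "speech", {"topic": "traffic report", "speaker": "female"}),
    Segment(12.0, 20.0, "speech", {"topic": "weather, sponsor acknowledgement, time",
                                   "speaker": "male"}),
    Segment(20.0, 25.0, "speech", {"topic": "NPR news", "speaker": "male, Bob Edwards"}),
    Segment(25.0, 40.0, "music", {"genre": "instrumental", "artist": "unknown"}),
]

female_speech = find_segments(scene, "speech", speaker="female")
```

Keeping the attributes open-ended matches the ". . ." rows in the figure, where further descriptors could be attached to each segment.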
Learning
• Perhaps the most important aspect of organic interfaces
– Use of stochastic modeling techniques for speech recognition, language understanding, machine translation, and dialogue modeling
• Many different ways to learn
– Passive learning
– Interactive learning
– Learning by imitation
Interactive Learning: An Example
• New words are inevitable, and they cannot be ignored (Hetherington, 1991)
• Acoustic and linguistic knowledge is needed to
– Detect
– Learn, and
– Utilize new words
• Fundamental changes in problem formulation and search strategy may be necessary
• New words can be detected and incorporated through
– Dynamic update of vocabulary (Chung & Seneff, 2004)
– Speak and Spell (Filisko & Seneff, 2006)
Learning by Imitation
• Many tasks can be learned through interaction
– “This is how you enable Bluetooth.” “Enable Bluetooth.”
– “These are my glasses.” “Where are my glasses?”
• Promising research by James Allen (Allen et al., 2007)
– Learning phase:
* User shows the system how to perform tasks (perhaps through some spoken commentary)
* System learns the task through learning algorithms and updates its knowledge base
– Application phase:
* System looks up tasks in its knowledge base and executes the procedure
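The two phases can be illustrated with a deliberately simplified sketch. It is not the method of the cited work: the `TaskKnowledgeBase` class is hypothetical, the "learning" here merely records the demonstrated steps verbatim instead of generalizing from them, and the step names are invented.

```python
from typing import Callable, Dict, List


class TaskKnowledgeBase:
    """Stores demonstrated procedures by task name (hypothetical structure)."""

    def __init__(self) -> None:
        self._procedures: Dict[str, List[str]] = {}

    def learn(self, task_name: str, demonstrated_steps: List[str]) -> None:
        # Learning phase: record what the user showed. A real system would
        # generalize from the demonstration rather than copy it.
        self._procedures[task_name.lower()] = list(demonstrated_steps)

    def execute(self, task_name: str, perform: Callable[[str], None]) -> bool:
        # Application phase: look the task up and run each stored step.
        steps = self._procedures.get(task_name.lower())
        if steps is None:
            return False  # unknown task; the system would need a new demonstration
        for step in steps:
            perform(step)
        return True


kb = TaskKnowledgeBase()
# "This is how you enable Bluetooth." (step names are made up)
kb.learn("Enable Bluetooth", ["open settings", "select Bluetooth", "switch on"])

performed: List[str] = []
# "Enable Bluetooth."
succeeded = kb.execute("enable bluetooth", performed.append)
```

The `False` return for an unknown task marks the point where an interactive system could ask the user for another demonstration.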
In Summary
• Great strides have been made in speech technologies
• Truly anthropomorphic spoken dialogue interfaces can only be realized if they can behave like organisms
– Observe, learn, grow, and heal
• Many challenges remain …
Thank You
Dynamic Vocabulary Understanding
• Dynamically alter vocabulary within a single utterance
User: “What’s the phone number for Flora in Arlington.”
Recognition, name unknown: “What’s the phone number of ???? in Arlington”
Meaning, name unknown: Clause: wh_question; Property: phone; Topic: restaurant; Name: ????; City: Arlington
Names: Arlington Diner, Blue Plate Express, Tea Tray in the Sky, Asiana Grille, Bagels etc, Flora, …
Recognition, name resolved: “What’s the phone number of Flora in Arlington”
Meaning, name resolved: Clause: wh_question; Property: phone; Topic: restaurant; Name: Flora; City: Arlington
System: “The telephone number for Flora is …”
[Figure: a Hub connecting Audio, ASR, NLU, Context, Dialog, DB, NLG, and TTS components]
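The idea of narrowing the vocabulary mid-utterance can be sketched as a two-step lookup: the frame with the unknown name supplies the city, the city selects a short list of candidate names, and the unknown word is then resolved against that list. The sketch below makes several assumptions that are not in the slide: the `RESTAURANTS_BY_CITY` table, the `resolve_name` helper, and the rough spelling "flaura" for the unknown word are hypothetical, and string similarity from Python's `difflib` stands in for the acoustic rescoring a recognizer would perform.

```python
import difflib
from typing import Dict, List

# Toy database keyed by city; the names are the ones listed on the slide.
RESTAURANTS_BY_CITY: Dict[str, List[str]] = {
    "Arlington": [
        "Arlington Diner", "Blue Plate Express", "Tea Tray in the Sky",
        "Asiana Grille", "Bagels etc", "Flora",
    ],
}


def resolve_name(frame: Dict[str, str], rough_spelling: str) -> Dict[str, str]:
    """Fill the unknown Name slot using only names known for the frame's city."""
    candidates = RESTAURANTS_BY_CITY.get(frame["City"], [])
    by_lower = {name.lower(): name for name in candidates}
    # String similarity is a stand-in for rescoring the audio against the
    # dynamically loaded vocabulary.
    matches = difflib.get_close_matches(
        rough_spelling.lower(), list(by_lower), n=1, cutoff=0.6
    )
    resolved = dict(frame)  # leave the original frame untouched
    if matches:
        resolved["Name"] = by_lower[matches[0]]
    return resolved


first_pass = {
    "Clause": "wh_question", "Property": "phone",
    "Topic": "restaurant", "Name": "????", "City": "Arlington",
}
second_pass = resolve_name(first_pass, "flaura")
```

Restricting the candidates by city keeps the added vocabulary small, so the unknown word competes with a handful of names instead of every name in the database.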