spoken conversational agents - marco ronchetti
Spoken Conversational Agents
Prof. Giuseppe Riccardi
Department of Information Engineering and Computer Science
University of Trento
Spring 2012, LPSMT, G. Riccardi
Outline
• Talking to Computers
• Research and Technological challenges
• Spoken Language Understanding
• Speech Technology on Smartphones
• Speech Demo App
The AI dream
• Design computers that are able to
  • Parse human language (speech, text…)
  • Understand users’ intentions
  • Execute simple/complex tasks
  • Interact autonomously or cooperatively
  • Relate to humans socially, emotionally
  • Operate in a virtual and physical space
Vision (“2001: A Space Odyssey”, 1968, by Stanley Kubrick)
HAL 9000: “Heuristically programmed ALgorithmic computer”
Interactive Systems
Machine: How may I help you?
Customer: Uh hi, I need a flight tomorrow from Boston..
Retrieving directions from the Web
http://maps.google.it/maps?f=d&hl=it&saddr=".$partenza."&daddr=".$arrivo."&sll=41.895888,12.489052&sspn=14.60869,29.882813&layer=&ie=UTF8
onclick=\"this.blur()\"\u003e27\u003c/a\u003e.\u003c/td\u003e\u003ctd_class=\"dirsegtext\"id=\"dirsegtext_0_223\"\u003eSvolta_a\u003cb\u003esinistra\u003c/b\u003ea\u003cb\u003ePiazza_Raffaello_Sanzio\u003c/b\u003e\u003c/td\u003e\u003ctdclass=\"sdist\"\u003e\u003cdivid=\"sxdist\"class=\"nw\"\u003e54\u0026#160;m\u003c/div\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctrclass=\"dirsegment\"id=\"panel_0_225\"polypoint=\"225\"\u003e\u003ctdclass=\"iconpw\"\u003e\u003cimgsrc=\"/
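The request URL and the escaped response fragment above can be handled in a few lines. This is a minimal sketch, not the code used in the course: the function names `build_directions_url` and `extract_direction_text` are hypothetical, the query keys are taken from the URL on the slide, the sample string is a shortened version of the fragment above, and no network request is made because this legacy endpoint may no longer answer in this form.

```python
import html
import re
from urllib.parse import urlencode


def build_directions_url(origin: str, destination: str) -> str:
    """Build the directions URL with properly encoded start/end addresses."""
    params = {"f": "d", "hl": "it", "saddr": origin, "daddr": destination, "ie": "UTF8"}
    return "http://maps.google.it/maps?" + urlencode(params)


def extract_direction_text(raw: str) -> str:
    """Turn an escaped HTML fragment into plain text."""
    # 1. Decode \uXXXX escapes such as \u003c ('<') and \u0026 ('&').
    text = re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), raw)
    # 2. Drop the HTML tags, leaving a space where each tag was.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Resolve HTML entities such as &#160; (non-breaking space).
    text = html.unescape(text)
    # 4. The slide fragment shows '_' where spaces were; restore and collapse.
    return " ".join(text.replace("_", " ").split())


# Shortened sample modelled on the fragment shown on the slide.
SAMPLE = (
    r"\u003ctd\u003eSvolta_a\u003cb\u003esinistra\u003c/b\u003ea"
    r"\u003cb\u003ePiazza_Raffaello_Sanzio\u003c/b\u003e\u003c/td\u003e"
    r"\u003ctd\u003e54\u0026#160;m\u003c/td\u003e"
)

print(build_directions_url("Trento", "Roma"))
print(extract_direction_text(SAMPLE))
```

The point of the example is that scraping a web page for directions yields markup-heavy text that needs decoding and cleaning before it can be read out by a speech interface.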
Human-Machine Spoken Dialog
[Figure: processing loop of a spoken dialog system]
User voice request
→ ASR (Automatic Speech Recognition) → Words
→ SLU (Spoken Language Understanding) → Concepts
→ DM (Dialogue Management) → Action
→ LG (Language Generation) → Words
→ TTS (Text-to-Speech Synthesis)
→ Voice reply to User
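The loop in the figure can be read as a chain of five functions, each consuming the output of the previous one. The sketch below only illustrates that data flow under strong assumptions: every component is a toy stand-in (the "audio" is already text, the understanding step is keyword spotting), and all function names and prompts except “How may I help you?” are hypothetical.

```python
from typing import Dict


def asr(audio: str) -> str:
    """Stand-in recognizer: in this toy the 'audio' is already a text string."""
    return audio.lower()


def slu(words: str) -> Dict[str, str]:
    """Stand-in understanding: spot a task keyword and the word after 'from'."""
    concepts: Dict[str, str] = {}
    if "flight" in words:
        concepts["action"] = "Request-Reservation"
    tokens = words.split()
    if "from" in tokens and tokens.index("from") + 1 < len(tokens):
        concepts["origin"] = tokens[tokens.index("from") + 1]
    return concepts


def dm(concepts: Dict[str, str]) -> str:
    """Stand-in dialogue manager: choose the next action from what is missing."""
    if "action" not in concepts:
        return "ask_task"
    if "origin" not in concepts:
        return "ask_origin"
    return "ask_destination"


def lg(action: str) -> str:
    """Stand-in language generation: map an action to a prompt."""
    prompts = {
        "ask_task": "How may I help you?",
        "ask_origin": "Where are you leaving from?",
        "ask_destination": "Where would you like to fly to?",
    }
    return prompts[action]


def tts(words: str) -> str:
    """Stand-in synthesizer: tag the text instead of producing audio."""
    return f"<audio: {words}>"


def dialog_turn(audio: str) -> str:
    """One pass through the loop: ASR -> SLU -> DM -> LG -> TTS."""
    return tts(lg(dm(slu(asr(audio)))))


print(dialog_turn("Uh hi, I need a flight tomorrow from Boston"))
```

Keeping each stage behind its own function mirrors the figure: any stage can be replaced by a real component as long as it keeps the same input and output types.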
Spring 2010
Automatic Speech Recognition (ASR)
• An ASR system converts the speech signal into words
  • Rich transcript (words, speaker, language, dialect, etc.)
• The recognized words can be
  • The final output, or
  • The input to natural language processing
[Figure: speech signal → Automatic Speech Recognition → “yes I would like to make a reservation..”]
ASR Challenges
• Transducers
  • Telephone handset
  • Close-talking mic
  • Open/desktop mic
• Channel
  • Landline
  • Wireless
  • VoIP
• Speaker
  • Dialect
  • Accent
  • Children
• Noise (SNR)
  • Office room
  • Airplane cockpit
  • Cocktail party problem
[Figure: speech signal → Automatic Speech Recognition → “yes I would like to make a reservation..”]
ASR - Overview
[Figure: speech → Feature Extraction → A → Pattern Matching → Ŵ]
Given the acoustic observation sequence $A = a_1, a_2, \ldots, a_m$, what is the most likely “word” sequence $W = w_1, w_2, \ldots, w_n$?
L EH T AH S P R EY → “Let us pray” or “Lettuce spray”
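The question above is usually written as a maximum a posteriori decision rule. The formulation below is the standard textbook one, added as a reading aid rather than copied from the slide; $P(A \mid W)$ plays the role of the acoustic model and $P(W)$ that of the language model, and $P(A)$ can be dropped because it does not depend on $W$.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

In the “Let us pray” / “Lettuce spray” example the two word sequences match the phoneme string about equally well, so the choice between them falls largely to $P(W)$.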
Sound Units (American English)
Phoneme
• Vowels
  • Front (EVE)
  • Mid (UP)
  • Back (BOOT)
• Diphthongs
• Semivowels
  • Liquids
  • Glides
• Consonants
  • Nasals
  • Stops
  • Fricatives
  • Whisper
  • Affricates
Dictionary: From phonemes to words
Let = L EH T
Us = AH S
Pray = P R EY
Lettuce = L EH T AH S
Spray = S P R EY
• Not a unique mapping!
• Prosodic information (stress markers)
  • Us = AH1 S vs Us = AH0 S
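The non-unique mapping can be made concrete by enumerating every word sequence that the dictionary above allows for one phoneme string. This is a minimal sketch with one assumption that is not on the slide: a word may share its first phoneme with the last phoneme of the previous word, which is how “Lettuce spray” (two S sounds in dictionary form) can surface as the single S in L EH T AH S P R EY. The function name `segmentations` is hypothetical.

```python
# Pronunciations copied from the dictionary slide (stress markers omitted).
LEXICON = {
    "let": ("L", "EH", "T"),
    "us": ("AH", "S"),
    "pray": ("P", "R", "EY"),
    "lettuce": ("L", "EH", "T", "AH", "S"),
    "spray": ("S", "P", "R", "EY"),
}


def segmentations(phonemes, lexicon=LEXICON, prev=None):
    """Return every word sequence whose pronunciations cover `phonemes`.

    `prev` is the last phoneme of the preceding word. Assumption: if the next
    word starts with that same phoneme, the two may merge into one sound.
    """
    phonemes = tuple(phonemes)
    if not phonemes:
        return [[]]
    results = []
    for word, pron in lexicon.items():
        variants = [pron]
        if prev is not None and pron[0] == prev and len(pron) > 1:
            variants.append(pron[1:])  # boundary phoneme shared with previous word
        for variant in variants:
            if phonemes[: len(variant)] == variant:
                for rest in segmentations(phonemes[len(variant):], lexicon, pron[-1]):
                    results.append([word] + rest)
    return results


for words in segmentations("L EH T AH S P R EY".split()):
    print(" ".join(words))
```

Even with five dictionary entries the phoneme string has several readings, which is why the recognizer needs a language model to rank them.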
A Statistical Approach to ASR
• (Almost) no linguistic knowledge is required
• Variable “recognition units” are supported
  • Phoneme, syllable, word, phrase
  • Non-linguistically motivated
• Automatic training of statistical models
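Language Modeling
• Probability of word sequences.
• W = “I wanna fly to Boston”
• Shannon’s game

$$
\begin{aligned}
P(W) &= P(\text{I}) \times P(\text{wanna} \mid \text{I}) \times \cdots \times P(\text{Boston} \mid \text{I, wanna, fly, to}) \\
     &\approx P(\text{I}) \times P(\text{wanna} \mid \text{I}) \times \cdots \times P(\text{Boston} \mid \text{to})
\end{aligned}
$$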
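The second line of the factorization above conditions each word only on its predecessor, which is a bigram model. The sketch below estimates such a model by relative frequency and scores the example sentence. The three-sentence corpus is invented for illustration, and the sentence boundary tokens `<s>` and `</s>` are an addition that the slide formula does not show.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumption, not on the slide)


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = [BOS] + sentence.split() + [EOS]
        history_counts.update(words[:-1])            # every word that has a successor
        bigram_counts.update(zip(words, words[1:]))  # (previous word, word) pairs
    return history_counts, bigram_counts


def sentence_probability(sentence, history_counts, bigram_counts):
    """P(W) as a product of maximum-likelihood bigram probabilities."""
    words = [BOS] + sentence.split() + [EOS]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob


# Hypothetical toy corpus, only for illustration.
CORPUS = [
    "I wanna fly to Boston",
    "I wanna fly to Denver",
    "I need a flight to Boston",
]

history_counts, bigram_counts = train_bigram(CORPUS)
print(sentence_probability("I wanna fly to Boston", history_counts, bigram_counts))
```

Without smoothing, any sentence containing an unseen word pair gets probability zero, which is one reason practical language models redistribute some probability mass to unseen events.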
Spring 2007
Syntax
• Relation amongst words
  • Not random! (frequency-rank plot → Zipf’s law)
• Word classes
  • Verb, Noun, Determiner
• Words grouped into constituents
  • “Al mattino faccio una passeggiata” vs “Faccio una passeggiata al mattino” vs “Faccio al mattino una passeggiata” (Italian: “In the morning I take a walk” vs “I take a walk in the morning” vs “I take in the morning a walk”)
• Constituents
  • Verb Phrase (VP), Noun Phrase (NP), Prepositional Phrase (PP).
  • PP: “Al mattino” (“in the morning”). VP: “faccio” (“I take”). NP: “una passeggiata” (“a walk”).
Search Problem
• Search space of ASR
  • Vocabulary = $10^4$
  • Sentence length = 25 (Wall Street Journal)
  • Number of candidate word strings
    • $10^{100}$ !
• Weighted strings
  • Acoustic and language models
• Dynamic Programming
• Beam search
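The size of the search space follows from the two numbers above: with $10^4$ choices at each of 25 positions there are $(10^4)^{25} = 10^{100}$ candidate strings, so exhaustive scoring is out of the question. Beam search keeps only the best few partial hypotheses at each step. The sketch below is a toy illustration, not a recognizer: the per-position candidate lists and the probabilities in `TOY_BIGRAM` are invented, and a single bigram score stands in for the combined acoustic and language model weight.

```python
import math

# (10^4)^25 candidate word strings, as on the slide.
SEARCH_SPACE = (10 ** 4) ** 25

# Hypothetical weights standing in for combined acoustic + language model scores.
TOY_BIGRAM = {
    ("<s>", "find"): 0.5,
    ("find", "the"): 0.8,
    ("the", "best"): 0.3,
    ("the", "bass"): 0.1,
    ("best", "flight"): 0.5,
    ("bass", "flight"): 0.01,
}


def score(prev, word):
    """Log-probability of `word` after `prev`; unseen pairs get a small floor."""
    return math.log(TOY_BIGRAM.get((prev, word), 1e-6))


def beam_search(candidates, score_fn, beam_width):
    """Extend hypotheses position by position, keeping the `beam_width` best."""
    beam = [(0.0, [])]  # (accumulated log-probability, words so far)
    for options in candidates:
        expanded = [
            (logp + score_fn(words[-1] if words else "<s>", w), words + [w])
            for logp, words in beam
            for w in options
        ]
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_width]  # prune everything outside the beam
    return beam


# Competing words at each position (toy lattice).
LATTICE = [["find"], ["the"], ["best", "bass"], ["flight"]]

best_logp, best_words = beam_search(LATTICE, score, beam_width=2)[0]
print(" ".join(best_words), math.exp(best_logp))
```

A narrow beam is fast but can discard a hypothesis that would have won later; widening it trades speed for a lower risk of such search errors.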
Understanding Natural Language
Natural Language Query to DB
“Find the best flight from New York to Paris tomorrow business class”
[Figure: query → ? → Flight Database]
Natural Language Query to DB
“Find the best flight from New York to Paris tomorrow business class”
[Figure: query → ? → Flight Database record (JFK, CDG, Z, 1300, 0700, V, ..)]
Natural Language Query to DB
“Find the best flight from New York to Paris tomorrow business class”
[Figure: query → Interactive Machine → Flight Database]
Natural Language Query to DB
“Find the best flight from New York to Paris tomorrow business class”
[Figure: the query annotated with three labels: TASK: Transactional, Object, USER CONSTRAINTS]
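Once the task, the object and the user constraints have been identified, the database lookup itself is straightforward. The sketch below assumes a hypothetical frame layout and an invented in-memory flight table; the field names, the prices, and the reading of “best” as “cheapest” are illustrative assumptions, and the “tomorrow” constraint is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    cabin: str
    price: int


# Hypothetical flight table; only the airport codes JFK and CDG come from the slides.
FLIGHTS = [
    Flight("JFK", "CDG", "business", 2400),
    Flight("JFK", "CDG", "economy", 700),
    Flight("JFK", "CDG", "business", 2100),
    Flight("EWR", "CDG", "business", 1900),
]

# Hypothetical frame for the example query after understanding.
FRAME = {
    "task": "transactional",
    "object": "flight",
    "constraints": {"origin": "JFK", "destination": "CDG", "cabin": "business"},
}


def query(flights, frame):
    """Return flights satisfying every constraint, cheapest first."""
    constraints = frame["constraints"]
    matches = [
        flight
        for flight in flights
        if all(getattr(flight, field) == value for field, value in constraints.items())
    ]
    return sorted(matches, key=lambda flight: flight.price)


print(query(FLIGHTS, FRAME)[0])
```

The hard part of the slide sequence is producing the frame from speech; the lookup only shows why a structured representation is the target of understanding.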
Machine Understanding
• User: “Find the best flight from New York to Paris tomorrow business class”
• Speech Recognition: “Find the bass flight from Newark to Paris tomorrow business class”
• Speech Understanding:
  • @action=Request-Reservation (0.9)
  • @origin=Newark (0.5)
  • @time-departure=Tuesday (0.7)
  • @destination=Paris (0.8)
Handling Uncertainty
• How likely is it that it was a “Request about a Reservation”, given the speech utterance U?
  • @action=Request-Reservation (0.9)
  • P( @action=Request-Reservation | U ) = 0.9
• How likely is it that the “Departure city is Newark”?
  • @origin=Newark (0.5)
• And so on:
  • @time-departure=Tuesday (0.7)
  • @destination=Paris (0.8)
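One common way to act on such posteriors is to compare each one with two thresholds: accept confident values, ask the user to confirm uncertain ones, and ask again for unreliable ones. The sketch below applies that idea to the four hypotheses above. The threshold values 0.8 and 0.3 and the function name `plan_confirmations` are assumptions chosen for illustration; the slides do not prescribe a policy.

```python
# Slot hypotheses and posteriors P(slot=value | U) taken from the slide.
HYPOTHESES = {
    "action": ("Request-Reservation", 0.9),
    "origin": ("Newark", 0.5),
    "time-departure": ("Tuesday", 0.7),
    "destination": ("Paris", 0.8),
}


def plan_confirmations(hypotheses, accept=0.8, reject=0.3):
    """Map each slot to 'accept', 'confirm' or 'reask' (thresholds are assumed)."""
    plan = {}
    for slot, (_value, posterior) in hypotheses.items():
        if posterior >= accept:
            plan[slot] = "accept"    # confident enough to use silently
        elif posterior >= reject:
            plan[slot] = "confirm"   # e.g. "Are you leaving from Newark?"
        else:
            plan[slot] = "reask"     # too unreliable, ask the question again
    return plan


for slot, decision in plan_confirmations(HYPOTHESES).items():
    print(slot, HYPOTHESES[slot][0], decision)
```

With these thresholds the misrecognized origin (Newark instead of New York) is flagged for confirmation rather than silently accepted, which gives the user a chance to correct it.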
Machine Handling Uncertainty
IBM Watson in the Jeopardy! game
D. Ferrucci et al. “Building Watson: An Overview of the DeepQA Project”, AI Magazine, v. 31, n. 3, 2010
Conversational Agents
Dinarelli M., Stepanov E., Varges S. and Riccardi G., “The LUNA Spoken Dialogue System: Beyond Utterance Classification”, ICASSP, 2010.
Voice Applications Smartphones/Tablets
• Natural interface replacing typing
  • For task execution (“finding the telephone numbers of nearby restaurants”)
• Voice Search
  • Spoken query translation into text for search
  • Intent recognition triggered by target words/phrases (“weather”, “restaurants”)
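Intent recognition triggered by target words or phrases can be as simple as a lookup over the recognized text. The sketch below is a minimal illustration of that idea, not a description of any particular product: the intent names, the fallback `web_search`, and the extra triggers “forecast”, “restaurant” and “places to eat” are hypothetical, while “weather” and “restaurants” come from the slide.

```python
import re

# Intent -> trigger words/phrases.
INTENT_TRIGGERS = {
    "weather": ("weather", "forecast"),
    "restaurant_search": ("restaurants", "restaurant", "places to eat"),
}


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def recognize_intent(query_text: str, triggers=INTENT_TRIGGERS) -> str:
    """Return the first intent whose trigger occurs as a whole word or phrase."""
    padded = f" {normalize(query_text)} "
    for intent, phrases in triggers.items():
        if any(f" {phrase} " in padded for phrase in phrases):
            return intent
    return "web_search"  # no trigger matched: treat the text as a plain search query


print(recognize_intent("What's the weather in Trento tomorrow?"))
print(recognize_intent("Indian restaurants nearby"))
```

Trigger matching is cheap and predictable, but it cannot resolve requests that need several turns, which is where the conversational agents of the next slide come in.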
Conversational Agents
• Multimodal agent
  • Speech, touch, sensory information..
  • Requires interaction to resolve/negotiate/commit to the task requirements
  • Task resolution/execution (“send email to my brother”, “make a reservation for the best Indian restaurant nearby”)
Speech Technology on Android
android.speech, android.speech.tts
• Automatic Speech Recognition
  • API Level 3 (Cupcake)
    • N-best, two types of LM, error tags
  • API 8 (Froyo)
    • Language support, partial results, speech timeouts
  • API 14 (Ice Cream Sandwich)
    • Confidence scores
Speech Demo App
• Available on the course website
• Starter app
  • Experiment with different ASR parameters and ASR output
  • Use as a base for developing your class/master project.
References (Research)
1. Tutorial, “Talking to Computers: From Speech Sounds to Human Computer Interaction”, sisl.disi.unitn.it/~riccardi/
2. G. Tur and R. De Mori (eds.), Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, Wiley, 2011
3. D. Jurafsky and J. Martin, Speech and Language Processing,