spoken conversational agents - marco ronchetti



Spoken Conversational Agents

Prof. Giuseppe Riccardi Department of Information Engineering and Computer Science

University of Trento [email protected]


Spring 2012, LPSMT, G. Riccardi

Outline

• Talking to Computers
• Research and Technological challenges
• Spoken Language Understanding
• Speech Technology on Smartphones
• Speech Demo App

Talking to Machines


The AI dream

• Design computers that are able to
    • Parse human language (speech, text…)
    • Understand users’ intentions
    • Execute simple/complex tasks
    • Interact autonomously or cooperatively
    • Relate to humans socially, emotionally
    • Operate in a virtual and physical space

Talking to Computers

1968


Vision (“2001: A Space Odyssey”, 1968, by Stanley Kubrick)

HAL 9000: “Heuristically programmed ALgorithmic computer”

Present

2006

Interactive Systems

Machine: How may I help you?

Customer: Uh hi, I need a flight tomorrow from Boston..

“892424” Application (2006)

Pronto PagineGialle 892424

Travel Directions

Departure

Arrival


Retrieving directions from the Web

http://maps.google.it/maps?f=d&hl=it&saddr=".$partenza."&daddr=".$arrivo."&sll=41.895888,12.489052&sspn=14.60869,29.882813&layer=&ie=UTF8

onclick=\"this.blur()\"\u003e27\u003c/a\u003e.\u003c/td\u003e\u003ctd_class=\"dirsegtext\"id=\"dirsegtext_0_223\"\u003eSvolta_a\u003cb\u003esinistra\u003c/b\u003ea\u003cb\u003ePiazza_Raffaello_Sanzio\u003c/b\u003e\u003c/td\u003e\u003ctdclass=\"sdist\"\u003e\u003cdivid=\"sxdist\"class=\"nw\"\u003e54\u0026#160;m\u003c/div\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctrclass=\"dirsegment\"id=\"panel_0_225\"polypoint=\"225\"\u003e\u003ctdclass=\"iconpw\"\u003e\u003cimgsrc=\"/
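A minimal Python sketch of the two steps shown on this slide: building the query URL from a departure and an arrival, and recovering readable text from an escaped response fragment. The function names and the sample place names are hypothetical, the optional `sll`, `sspn` and `layer` parameters are left out, and the endpoint is simply the one printed on the slide, which may no longer behave this way.

```python
import re
from urllib.parse import urlencode

# Endpoint copied from the slide; the live service may no longer answer this way.
BASE_URL = "http://maps.google.it/maps"


def build_directions_url(departure: str, arrival: str) -> str:
    """Build the query URL, percent-encoding the two place names."""
    params = {"f": "d", "hl": "it", "saddr": departure, "daddr": arrival, "ie": "UTF8"}
    return BASE_URL + "?" + urlencode(params)


def step_text(escaped_fragment: str) -> str:
    """Recover readable text from a fragment escaped like the one on the slide."""
    # Turn \u003c style escapes back into characters, then unescape quotes.
    decoded = re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        escaped_fragment,
    )
    decoded = decoded.replace('\\"', '"')
    # Drop tags; the slide's fragment uses underscores where spaces were.
    text = re.sub(r"<[^>]+>", " ", decoded).replace("_", " ")
    return " ".join(text.split())


print(build_directions_url("Trento", "Piazza Raffaello Sanzio, Roma"))
print(step_text(r"\u003ctd\u003eSvolta_a\u003cb\u003esinistra\u003c/b\u003e\u003c/td\u003e"))
```

Encoding the place names with `urlencode` avoids the broken URLs that plain string concatenation produces when a name contains spaces or commas.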


Human-Machine Spoken Dialog

User voice request
→ ASR (Automatic Speech Recognition) → Words
→ SLU (Spoken Language Understanding) → Concepts
→ DM (Dialogue Management) → Action
→ LG (Language Generation) → Words
→ TTS (Text-to-Speech Synthesis)
→ Voice reply to User
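The chain of components in this diagram can be read as a composition of functions. The sketch below only shows the data passed between stages (audio, words, concepts, action, words, audio). Every stage is a hypothetical stub with hard-coded behaviour, not a real recognizer or synthesizer.

```python
from typing import Dict


def asr(audio: bytes) -> str:
    """Stub recognizer: pretends every input decodes to one fixed utterance."""
    return "i need a flight tomorrow from boston"


def slu(words: str) -> Dict[str, str]:
    """Stub understanding: maps a few trigger words to concepts."""
    tokens = words.split()
    concepts: Dict[str, str] = {}
    if "flight" in tokens:
        concepts["action"] = "Request-Reservation"
    if "from" in tokens and tokens.index("from") + 1 < len(tokens):
        concepts["origin"] = tokens[tokens.index("from") + 1]
    return concepts


def dm(concepts: Dict[str, str]) -> str:
    """Stub dialogue manager: asks for the destination if it is missing."""
    if "destination" not in concepts:
        return "ask-destination"
    return "search-flights"


def lg(action: str) -> str:
    """Stub language generation: one canned prompt per action."""
    prompts = {
        "ask-destination": "Where would you like to fly to?",
        "search-flights": "Let me look for flights.",
    }
    return prompts[action]


def tts(text: str) -> bytes:
    """Stub synthesizer: returns the text as bytes instead of audio."""
    return text.encode("utf-8")


def dialog_turn(audio: bytes) -> bytes:
    """One user turn through ASR, SLU, DM, LG and TTS."""
    return tts(lg(dm(slu(asr(audio)))))


print(dialog_turn(b"").decode("utf-8"))
```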

Automatic Speech Recognition (ASR)

• An ASR system converts the speech signal into words
• Rich Transcript (Words, Speaker, Language, Dialect, etc.)
• The recognized words can be
    • The final output, or
    • The input to natural language processing

Speech → Automatic Speech Recognition → “yes I would like to make a reservation..”

ASR Challenges

• Transducers
    • Telephone handset
    • Close Talking mic
    • Open/Desktop Mic
• Channel
    • Landline
    • Wireless
    • VoIP
• Speaker
    • Dialect
    • Accent
    • Children
• Noise (SNR)
    • Office Room
    • Airplane Cockpit
    • Cocktail Party Problem

ASR - Overview

Speech → Feature Extraction → A → Pattern Matching → Ŵ

Given the acoustic observation sequence A = a1, a2, …, am, what is the most likely “word” sequence W = w1, w2, …, wn?

L EH T AH S P R EY → “Let us pray” or “Lettuce spray”?
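The question on this slide is usually written as a maximum a posteriori decision. The form below is the standard textbook formulation rather than something printed on the slide. It assumes that P(A | W) is given by an acoustic model and P(W) by a language model.

```latex
\[
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        \;=\; \arg\max_{W} P(A \mid W)\, P(W)
\]
```

The denominator P(A) does not depend on W, so it can be dropped from the maximization.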

Sound Units (American English)

Phoneme
• Vowels
    • Front (EVE)
    • Mid (UP)
    • Back (BOOT)
• Diphthongs
• Semivowels
    • Liquids
    • Glides
• Consonants
    • Nasals
    • Stops
    • Fricatives
    • Whisper
    • Affricates

Dictionary: From phonemes to words

Let = L EH T
Us = AH S
Pray = P R EY
Lettuce = L EH T AH S
Spray = S P R EY

• Not a unique mapping!
• Prosodic Information
    • Stress Markers: Us = AH1 S vs Us = AH0 S
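A small Python sketch of why the mapping is not unique: it enumerates every way to split the phoneme string L EH T AH S P R EY into dictionary words. The lexicon is the one on the slide. The `allow_shared` option is an assumption added here: it lets one phoneme be shared across a word boundary, so that the single S can end "lettuce" and also start "spray".

```python
from typing import List, Tuple

# Pronunciations from the slide's dictionary.
LEXICON = {
    "let": ("L", "EH", "T"),
    "us": ("AH", "S"),
    "pray": ("P", "R", "EY"),
    "lettuce": ("L", "EH", "T", "AH", "S"),
    "spray": ("S", "P", "R", "EY"),
}


def segmentations(
    phones: Tuple[str, ...], pos: int = 0, allow_shared: bool = True
) -> List[List[str]]:
    """Return every word sequence whose pronunciations cover phones[pos:]."""
    if pos == len(phones):
        return [[]]
    results: List[List[str]] = []
    for word, pron in LEXICON.items():
        n = len(pron)
        # A word may start at pos, or (assumption) one phone earlier,
        # re-using the last phone of the previous word.
        starts = [pos]
        if allow_shared and pos > 0 and n > 1:
            starts.append(pos - 1)
        for start in starts:
            if phones[start:start + n] == pron:
                for rest in segmentations(phones, start + n, allow_shared):
                    results.append([word] + rest)
    return results


phones = tuple("L EH T AH S P R EY".split())
for words in segmentations(phones):
    print(" ".join(words))
```

A recognizer needs a language model to choose among such competing segmentations, since the lexicon alone cannot rank them.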

A Statistical Approach to ASR

• (Almost) no linguistic knowledge is required
• Variable “recognition units” are supported
    • Phoneme, syllable, word, phrase
    • Non-linguistically motivated
• Automatic training of statistical models

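Language Modeling

• Probability of word sequences.
• W = “I wanna fly to Boston”
• Shannon’s game

\[
\begin{aligned}
P(W) &= P(\text{I}) \times P(\text{wanna} \mid \text{I}) \times \dots \times P(\text{Boston} \mid \text{I}, \text{wanna}, \text{fly}, \text{to}) \\
     &\approx P(\text{I}) \times P(\text{wanna} \mid \text{I}) \times \dots \times P(\text{Boston} \mid \text{to})
\end{aligned}
\]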
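A minimal Python sketch of the bigram approximation, estimated by counting on a three-sentence toy corpus. The corpus is invented for illustration, there is no smoothing, and a start symbol `<s>` replaces the slide's unconditional P(I). These are all assumptions, not part of the original slide.

```python
from collections import Counter
from typing import List, Tuple

# Toy training corpus (invented for this example).
CORPUS = [
    "I wanna fly to Boston",
    "I wanna fly to Paris",
    "I need a flight to Boston",
]


def train_bigram(sentences: List[str]) -> Tuple[Counter, Counter]:
    """Count histories and bigrams, with <s> marking the sentence start."""
    histories: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams


def sentence_probability(sentence: str, histories: Counter, bigrams: Counter) -> float:
    """P(W) as a product of maximum-likelihood bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if histories[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        probability *= bigrams[(prev, word)] / histories[prev]
    return probability


histories, bigrams = train_bigram(CORPUS)
print(sentence_probability("I wanna fly to Boston", histories, bigrams))
```

On this corpus the product is 1 × 2/3 × 1 × 1 × 2/3 = 4/9. Any sentence containing an unseen bigram gets probability zero, which is why practical language models add smoothing.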

Syntax

• Relation amongst words
    • Not random! (frequency-rank plot → Zipf’s law)
• Word Classes
    • Verb, Noun, Determiner
• Words grouped into constituents
    • “Al mattino faccio una passeggiata” vs “Faccio una passeggiata al mattino” vs “Faccio al mattino una passeggiata” (Italian: “In the morning I take a walk”, with the constituents reordered)
• Constituents
    • Verb Phrase (VP), Noun Phrase (NP), Prepositional Phrase (PP).
    • PP: “Al mattino” (“in the morning”). VP: “faccio” (“I take”). NP: “una passeggiata” (“a walk”).

Languages: how many?


Languages-Speakers Statistics

From www.ethnologue.com (5/2012)

Search Problem

• Search Space of ASR
    • Vocabulary = 10^4
    • Sentence length = 25 (Wall Street Journal)
    • Number of candidate word strings
        • (10^4)^25 = 10^100 !
• Weighted strings
    • Acoustic and Language models
• Dynamic Programming
• Beam search
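A toy Python sketch of beam search over weighted word strings. It is heavily simplified: candidates are given per word position, whereas a real decoder searches time-synchronously over acoustic model states. All scores, the back-off value for unseen bigrams and the function name are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Size of the search space from the slide: 10^4 words, 25 positions.
assert (10 ** 4) ** 25 == 10 ** 100

UNSEEN_BIGRAM = math.log(1e-4)  # assumed score for bigrams missing from the LM


def beam_search(
    steps: List[Dict[str, float]],
    lm: Dict[Tuple[str, str], float],
    beam_width: int = 2,
) -> Tuple[List[str], float]:
    """steps[i] maps candidate words to acoustic log-scores; lm holds bigram log-probs."""
    beam: List[Tuple[float, List[str]]] = [(0.0, ["<s>"])]
    for candidates in steps:
        expanded = []
        for score, words in beam:
            for word, acoustic in candidates.items():
                lm_score = lm.get((words[-1], word), UNSEEN_BIGRAM)
                expanded.append((score + acoustic + lm_score, words + [word]))
        # Pruning: keep only the beam_width best partial hypotheses.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    best_score, best_words = beam[0]
    return best_words[1:], best_score


steps = [
    {"find": math.log(0.9), "fine": math.log(0.1)},
    {"the": math.log(1.0)},
    {"best": math.log(0.4), "bass": math.log(0.6)},  # acoustics prefer "bass"
    {"flight": math.log(1.0)},
]
lm = {("the", "best"): math.log(0.5), ("the", "bass"): math.log(0.01)}
words, score = beam_search(steps, lm)
print(" ".join(words))
```

In this example the acoustic scores slightly prefer "bass", but the language model score makes "find the best flight" the top hypothesis.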

ASR Performance (Word Error Rate)

Understanding Natural Language

Natural Language Query to DB

“Find the best flight from New York to Paris tomorrow business class”

Interactive Machine → Flight Database → (JFK, CDG, Z, 1300, 0700, V, ..) ?

Annotations on the query:
• TASK: Transactional
• Object
• USER CONSTRAINTS

Machine Understanding

• User: “Find the best flight from New York to Paris tomorrow business class”
• Speech Recognition: “Find the bass flight from Newark to Paris tomorrow business class”
• Speech Understanding:
    • @action=Request-Reservation (0.9)
    • @origin=Newark (0.5)
    • @time-departure=Tuesday (0.7)
    • @destination=Paris (0.8)

Handling Uncertainty

• How likely is it that this was a “Request about a Reservation”, given speech utterance U?
    • @action=Request-Reservation (0.9)
    • P( @action=Request-Reservation | U ) = 0.9
• How likely is it that the “Departure city is Newark”?
    • @origin=Newark (0.5)
• And so on:
    • @time-departure=Tuesday (0.7)
    • @destination=Paris (0.8)
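One way a dialogue manager might use these probabilities is sketched below in Python: accept confident concepts, ask the user to confirm uncertain ones, and discard the rest. The concept values and probabilities are the ones on the slide. The two thresholds and the function name are assumptions, not part of the lecture.

```python
from typing import Dict, Tuple

# Concept hypotheses from the slide: name -> (value, P(concept | U)).
HYPOTHESES: Dict[str, Tuple[str, float]] = {
    "@action": ("Request-Reservation", 0.9),
    "@origin": ("Newark", 0.5),
    "@time-departure": ("Tuesday", 0.7),
    "@destination": ("Paris", 0.8),
}


def plan_confirmations(
    hypotheses: Dict[str, Tuple[str, float]],
    accept_at: float = 0.8,     # assumed threshold
    reject_below: float = 0.3,  # assumed threshold
) -> Dict[str, str]:
    """Label each concept as accept, confirm or reject by its probability."""
    decisions: Dict[str, str] = {}
    for concept, (_value, probability) in hypotheses.items():
        if probability >= accept_at:
            decisions[concept] = "accept"
        elif probability < reject_below:
            decisions[concept] = "reject"
        else:
            decisions[concept] = "confirm"
    return decisions


for concept, decision in plan_confirmations(HYPOTHESES).items():
    print(concept, HYPOTHESES[concept][0], decision)
```

With these thresholds the system would ask the user to confirm the misrecognized origin, Newark (0.5), before acting on it.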

Present

2011

Machine Handling Uncertainty IBM Watson in Jeopardy! Game

D. Ferrucci et al. “Building Watson: An Overview of the DeepQA Project”, AI Magazine, v. 31, n. 3, 2010

Conversational Agents

Dinarelli M., Stepanov E., Varges S. and Riccardi G., “The LUNA Spoken Dialogue System: Beyond Utterance Classification”, ICASSP, 2010.

Voice Applications Smartphones/Tablets

• Natural interface replacing typing
    • For task execution (“finding the telephone numbers of restaurants nearby”)
• Voice Search
    • Spoken query translation into text for search
    • Intent Recognition triggered by target words/phrases (“weather”, “restaurants”)
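Intent recognition triggered by target words can be sketched in a few lines of Python, applied to the text produced by the recognizer. The intents "weather" and "restaurants" come from the slide. The extra trigger words and the function name are assumptions.

```python
import re
from typing import Optional

# Intent -> trigger words; "forecast" and "restaurant" are assumed extras.
INTENT_TRIGGERS = {
    "weather": ("weather", "forecast"),
    "restaurants": ("restaurant", "restaurants"),
}


def recognize_intent(query: str) -> Optional[str]:
    """Return the first intent whose trigger word occurs in the query, else None."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    for intent, triggers in INTENT_TRIGGERS.items():
        if any(trigger in tokens for trigger in triggers):
            return intent
    return None  # no trigger found: fall back to plain voice search


print(recognize_intent("What's the weather in Trento tomorrow?"))
print(recognize_intent("Indian restaurants nearby"))
print(recognize_intent("send email to my brother"))
```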

Conversational Agents

• Multimodal agent
    • Speech, touch, sensory information..
    • Requires interaction to resolve/negotiate/commit to the task requirements
    • Task resolution/execution (“send email to my brother”, “make reservation for best Indian restaurant nearby”)

Example 1 (Voice Search) Android


Example 2 (Agent-like app) Android


Speech Technology on Android

android.speech, android.speech.tts

• Automatic Speech Recognition
• API Level 3 (Cupcake)
    • N-Best, Two types of LM, Error tags
• API Level 8 (Froyo)
    • Language Support, Partial Results, Speech Timeouts
• API Level 14 (Ice Cream Sandwich)
    • Confidence Scores

Speech Demo App

• Available on Course website
• Starter App
    • Experiments with different ASR parameters and ASR output
• Use as a base for developing your class/master project.

References (Research)

1. Tutorial, “Talking to Computers: From Speech Sounds to Human Computer Interaction”, sisl.disi.unitn.it/~riccardi/
2. G. Tur and R. De Mori (eds.), Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, Wiley, 2011
3. D. Jurafsky and J. Martin, Speech and Language Processing,