SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS
Intelligent Software Lab. POSTECH
Prof. Gary Geunbae Lee
Introduction to Spoken Dialog System (SDS) for Human-Robot Interaction (HRI) Brief introduction to SDS
Language processing oriented But not signal processing oriented
Mainly based on papers at ACL, NAACL, HLT, ICASSP, INTESPEECH, ASRU, SLT,
SIGDIAL, CSL, SPECOM, IEEE TASLP
2
This Tutorial
OUTLINES INTRODUCTION
AUTOMATIC SPEECH RECOGNITION
SPOKEN LANGUAGE UNDERSTANDING
DIALOG MANAGEMENT
CHALLENGES & ISSUES MULTI-MODAL DIALOG SYSTEM DIALOG SIMULATOR
DEMOS
REFERENCES
INTRODUCTION
Human-Robot Interaction (in Movie)
Human-Robot Interaction (in Real World)
Wikipedia (http://en.wikipedia.org/wiki/Human_robot_interaction)
What is HRI?
Human-robot interaction (HRI) is the study of interactions the study of interactions between people and robots.between people and robots. HRI is multidisciplinary with contributions from the fields of human-computer interaction, artificial intelligence, robotics, natural language natural language understandingunderstanding, and social science.
The basic goal of HRI is to develop principles and algorithms develop principles and algorithms to allow more natural and effective communication and allow more natural and effective communication and interactioninteraction between humans and robots.
• Signal Processing• Speech Recognition• Speech Understanding• Dialog Management• Speech Synthesis
Area of HRIVision
Speech
Haptics
Emotion
Learning
SPOKEN DIALOG SYSTEM (SDS)
Tele-serviceTele-service
Car-navigationCar-navigation Home networkingHome networking
Robot interfaceRobot interface
SDS APPLICATIONS
Talk, Listen and Interact
AUTOMATIC SPEECH RECOGNITION
SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)
AUTOMATIC SPEECH RECOGNITION
x y
Speech Words
(x, y)
Training examples
Learning algorithm
A process by which an acoustic speech signal is converted into a set of words[Rabiner et al., 1993]
NOISY CHANNEL MODEL GOAL
Find the most likely sequence w of “words” in language L given the sequence of acoustic observation vectors O
Treat acoustic input O as sequence of individual observations O = o1,o2,o3,…,ot
Define a sentence as a sequence of words: W = w1,w2,w3,…,wn
)|(maxargˆ OWPWLW
)()|(maxargˆ WPWOPWLW
)(
)()|(maxargˆ
OP
WPWOPW
LW
Bayes rule
Golden rule
TRADITIONAL ARCHITECTURE
FeatureExtraction
Decoding
AcousticModel
PronunciationModel
LanguageModel
버스 정류장이어디에 있나요 ?
Speech Signals Word Sequence
버스 정류장이어디에 있나요 ?
NetworkConstruction
SpeechDB
TextCorpora
HMMEstimation
G2P
LMEstimation
WO
)()|(maxargˆ WPWOPWLW
TRADITIONAL PROCESSES
FEATURE EXTRACTION The Mel-Frequency Cepstrum Coefficients (MFCC) is a
popular choice [Paliwal, 1992]
Frame size : 25ms / Frame rate : 10ms
39 feature per 10ms frame Absolute : Log Frame Energy (1) and MFCCs (12) Delta : First-order derivatives of the 13 absolute coefficients Delta-Delta : Second-order derivatives of the 13 absolute
coefficients
Preemphasis/HammingWindow
FFT(Fast Fourier Transform)
Mel-scalefilter bank
log|.|DCT (Discrete Cosine Transform)
MFCC(12-Dimension)
X(n)
25 ms
10ms . . .
a1 a2
a3
ACOUSTIC MODEL Provide P(O|Q) = P(features|phone) Modeling Units [Bahl et al., 1986]
Context-independent : Phoneme Context-dependent : Diphone, Triphone, Quinphone
pL-p+pR : left-right context triphone
Typical acoustic model [Juang et al., 1986]
Continuous-density Hidden Markov Model Distribution : Gaussian Mixture
HMM Topology : 3-state left-to-right model for each phone, 1-state for silence or pause
),,( BA
K
kjkjktjkjj xNcxb
1
),;()(
codebook
bj(x)
PRONUCIATION MODEL Provide P(Q|W) = P(phone|word) Word Lexicon [Hazen et al., 2002]
Map legal phone sequences into words according to phonotactic rules
G2P (Grapheme to phoneme) : Generate a word lexicon automatically
Several word may have multiple pronunciations Example
Tomato
P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1 P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4
[t]
[ow]
[ah]
[m]
[ey]
[aa]
[t] [ow]
0.2
0.8 1.0
1.0 0.5
0.5 1.0
1.01.0
LANGUAGE MODEL Provide P(W) ; the probability of the sentence [Beaujard et al.,
1999]
We saw this was also used in the decoding process as the probability of transitioning from one word to another.
Word sequence : W = w1,w2,w3,…,wn
The problem is that we cannot reliably estimate the conditional word probabilities, for all words and all sequence lengths in a given language
n-gram Language Model n-gram language models use the previous n-1 words to
represent the history
Bi-grams are easily incorporated in a viterbi search
n
iiin wwwPwwP
111 )|()1(
)|( 11 ii wwwP
)|()|( 1)1(11 iniiii wwwPwwwP
LANGUAGE MODEL Example
Finite State Network (FSN)
Context Free Grammar (CFG)
Bigram
서울부산
에서
출발
세시네시
대구대전 도착
출발하는
기차버스
P( 에서 | 서울 )=0.2 P( 세시 | 에서 )=0.5P( 출발 | 세시 )=1.0 P( 하는 | 출발 )=0.5P( 출발 | 서울 )=0.5 P( 도착 | 대구 )=0.9…
$time = 세시 | 네시 ;$city = 서울 | 부산 | 대구 | 대전 ;$trans = 기차 | 버스 ;$sent = $city ( 에서 $time 출발 | 출발 $city 도착 ) 하는 $trans
Expanding every word to state level, we get a search network [Demuynck et al., 1997]
NETWORK CONSTRUCTION
I
L
S
A
M
일
이
삼
사
I L
I
S A M
S A삼
사
일
이
Acoustic Model Pronunciation Model Language Model
I
I L
S A M
Wordtransition
P( 일 |x)
P( 사 |x)
P( 삼 |x)
P( 이 |x)LM is
applied
S A
startend이
일
사
삼
Between-wordtransition
Intra-wordtransition
Search Network
DECODING Find
Viterbi Search : Dynamic Programming Token Passing Algorithm [Young et al., 1989]
)|(maxargˆ OWPWLW
• Initialize all states with a token with a null history and the likelihood that it’s a start state
• For each frame ak
– For each token t in state s with probability P(t), history H– For each state r
– Add new token to s with probability P(t) Ps,r Pr(ak), and history s.H
HTK Hidden Markov Model Toolkit (HTK)
A portable toolkit for building and manipulating hidden Markov models [Young et al., 1996]
- HShell : User I/O & interaction with OS- HLabel : Label files- HLM : Language model- HNet : Network and lattices- HDic : Dictionaries- HVQ : VQ codebooks- HModel : HMM definitions- HMem : Memory management- HGrf : Graphics- HAdapt : Adaptation- HRec : Main recognition processing functions
SUMMARY
x y
Speech Words
(x, y)
Training examples
Learning algorithm
I
L
S
A
M
일
이
삼
사
I L
I
S A M
S A삼
사
일
이
Acoustic Model Pronunciation Model Language Model
Decoding
Search Network Construction
Speech Understanding= Spoken Language Understanding (SLU)
SPEECH UNDERSTANDING (in general)
Computer Program
Speaker ID /Language ID
Sentiment / Opinion
Named Entity / Relation
Topic / Intent
Speech Segment
Summary
Syntactic / Semantic Role
SQL
Meaning Representation
Dave /English
Nervous
LOC = pod bayOBJ = door
Control the Spaceship
Open the doors.
Open=Verb, the=Det. ...
select * from DOORS where ...
SPEECH UNDERSTANDING (in SDS)
x y
InputSpeech or
Words
OutputIntentions
(x, y)
Training examples
Learning algorithm
A process by which natural langauge speech is mapped to frame structure encoding of its meanings [Mori et al., 2008]
What’s difference between NLU and SLU? Robustness; noise and ungrammatical spoken language Domain-dependent; further deep-level semantics (e.g.
Person vs. Cast) Dialog; dialog history dependent and utt. by utt. Analysis
Traditional approaches; natural language to SQL conversion
ASRSpeech
SLUSQL
GenerateDatabase
TextSemantic
Frame SQL Response
A typical ATIS system (from [Wang et al., 2005])
LANGUAGE UNDERSTANDING
REPRESENTATION Semantic frame (slot/value structure) [Gildea and Jurafsky, 2002]
An intermediate semantic representation to serve as the interface between user and dialog system
Each frame contains several typed components called slots. The type of a slot specifies what kind of fillers it is expecting.
“Show me flights from Seattle to Boston”
ShowFlight
Subject Flight
FLIGHT Departure_City Arrival_City
SEA BOS
<frame name=‘ShowFlight’ type=‘void’> <slot type=‘Subject’>FLIGHT</slot> <slot type=‘Flight’/> <slot type=‘DCity’>SEA</slot> <slot type=‘ACity’>BOS</slot> </slot></frame>
Semantic representation on ATIS task; XML format (left) and hierarchical representation (right) [Wang et al., 2005]
Meaning Representations for Spoken Dialog System Slot type 1: Intent, Subject Goal, Dialog Act (DA)
The meaning (intention) of an utt. at the discourse level Slot type 2: Component Slot, Named Entity (NE)
The identifier of entity such as person, location, organization, or time. In SLU, it represents domain-specific meaning of a word (or word group).
SEMANTIC FRAME
<frame domain=`RestaurantGuide'> <slot type=`DA' name=`SEARCH_RESTAURANT'/> <slot type=`NE' name=`CITY'>Pohang</slot> <slot type=`NE' name=`ADDRESS'>Daeyidong</slot> <slot type=`NE' name=`FOOD_TYPE'>Korean</slot></frame>
Ex) Find Korean restaurants in Daeyidong, Pohang
Two Classification Problems
HOW TO SOLVE
Find Korean restaurants in Daeyidong, PohangInput:
Output: SEARCH_RESTAURANT
Dialog Act Identification
FOOD_TYPE ADDRESS CITY
Find Korean restaurants in Daeyidong, PohangInput:
Output: Named Entity Recognition
Encoding:
x is an input (word), y is an output (NE), and z is another output (DA).
Vector x = {x1, x2, x3, …, xT}
Vector y = {y1, y2, y3, …, yT} Scalar z
Goal: modeling the functions y=f(x) and z=g(x)
PROBLEM FORMALIZATION
x Find Korean restaurants
in Daeyidong , Pohang .
y O FOOD_TYPE-B O O ADDRESS-B O CITY-B O
z SEARCH_RESTAURANT
CASCADE APPROACH I
Classification(Dialog Act / Intent)
Sequential Labeling
(Named Entity / Frame Slot)
Automatic Speech
Recognition
Dialog Management
Sequential Labeling Model (e.g. HMM, CRFs)
Classification Model
(e.g. MaxEnt, SVM)
x,yx x,y,z
Named Entity Dialog Act
Dialog Act Named Entity
Improve NE, but not DA.
CASCADE APPROACH II
Classification(Dialog Act / Intent)
Sequential Labeling
(Named Entity / Frame Slot)
Automatic Speech
Recognition
Dialog Management
Multiple Sequential Models (e.g.
intent-dependent)
Classification Model
(e.g. MaxEnt, SVM)
x,y,zx x,z
z
Named Entity ↔ Dialog Act
JOINT APPROACH
Joint Inference
Classification(Dialog Act / Intent)
Sequential Labeling
(Named Entity / Frame Slot)
Automatic Speech
Recognition
Dialog Management
Joint Model(e.g. TriCRFs)
x x,y,z
[Jeong and Lee, 2006]
MACHINE LEARNING FOR SLU Relational Learning (RL) or Structured Prediction (SP)
[Dietterich, 2002; Lafferty et al., 2004, Sutton and McCallum, 2006]
Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classier
Argmax search (e.g. Sum-Max, Belief propagation, Viterbi etc)
Basically, RL for language processing is to use a left-to-right structure (a.k.a linear-chain or sequence structure)
Algorithms: CRFs, Max-Margin Markov Net (M3N), SVM for Independent and Structured Output (SVM-ISO), Structured Perceptron, etc.
MACHINE LEARNING FOR SLU Background: Maximum Entropy (a.k.a logistic regression)
Conditional and discriminative manner Unstructured! (no dependency in y) Dialog act classification problem
Conditional Random Fields [Lafferty et al. 2001]
Structured versions of MaxEnt (argmax search in inference) Undirected graphical models Popular in language and text processing Linear-chain structure for practical implementation Named entity recognition problem
z
x
yt-1 yt yt+1
xt-1 xt xt+1
fk
gk
hk
SUMMARY
Solve by isolate (or independent) classifiersuch as Naïve Bayes, and MaxEnt
Solve by structured (or relational) classifiersuch as HMM, and CRFs
Find Korean restaurants in Daeyidong, PohangInput:
Output: SEARCH_RESTAURANT
Dialog Act Identification
FOOD_TYPE ADDRESS CITY
Find Korean restaurants in Daeyidong, PohangInput:
Output: Named Entity Recognition
Coffee Break
DIALOG MANAGEMENT
DIALOG MANAGEMENT
x y
InputWords or Intentions
OutputSystem
Response
(x, y)
Training examples
Learning algorithm
A central component of a dialog system to produce system responses with external knowledge sources[McTear, 2004]
DIALOG MANAGEMENT GOAL
Answer your query (e.g., question and order) given the task domain It includes :
Provide query results Ask further slot information Confirm user utterance Notify invalid query Suggest the alternative
Related to dialog complexity and task complexity.
In practice Find the best system action a given the dialog state s
DESIGN ISSUES Task complexity
How hard the task is? How much the system has domain knowledge?
Simple Complex
Call Routing
CollaborativePlanning
WeatherInformation
Conversational English Tutoring
AutomaticBanking
DESIGN ISSUES Dialog complexity
What dialog phenomena are allowed Initiative strategies
e.g., system-initiative vs. user-initiative vs. mixed-initiative
Meta-dialogs; the dialog itself e.g., Could you hold on for a minute?
Subdialogs; clarification/confirmation e.g., You selected KE airlines, is it right?
Multiple dialog threads e.g., domain switching
DIALOG EXAMPLES Example 1
S: Where are you going?U: Gangnam, SeoulS: What kinds of food do you have in Gangnam, Seoul?U: Please African foodS: Sorry, there are no African restaurants. Please select another food.
System-initiative Implicit confirmation Merely informs user of failed query Rigid dialog flow
DIALOG EXAMPLES Example 2
S: May I help you?U: I’d like to have African food in Gannam, Seoul S: Sorry, there are no African restaurants. S: There are Korean, Japanese, Indian, and American restaurants in Gangnam.U: Please Korean restaurants.S: Would you like to find Korean restaurants?U: Yes pleases
Mixed-initiative Implicit/Explicit confirmation Suggests the alternative when query fails More natural dialog flow
DIALOG EXAMPLES Example 3
U: I’d like to have African food in Gangnam, Seoul S: Sorry, there are no African restaurants. S: How about American restaurants in Gangnam, Seoul?U: No I don’t like it.S: What is your favorite food?U: I like grilled and seasoned beef S: So, how about Korean restaurants?U: Good.
Mixed-initiative Implicit/Explicit confirmation Recommends the alternative when query fails Most natural dialog flow
DIALOG CONTROL Finite-state based approach
Input : Single word or phrase State transition network (or graph) It can allow all legal dialog flow which is pre-defined in the state diagram.
Frame-based approach Input : Natural language with concept spotting Form-filling tasks to access information source But the questions do not have to be asked in a predetermined sequence
Plan-based approach Input : Unrestricted natural language The modeling of dialog as collaboration between intelligent agents to
solve some problems or task. For more complex task, such as negotiation and problem solving.
KNOWLEDGE-BASED DM (KBDM) Rule-based approaches
Early KBDMs were developed with handcrafted rules (e.g., information state update).
Simple Example [Larsson and Traum, 2003]
Agenda-based approaches Recent KBDMs were developed with domain-
specific knowledge and domain-independent dialog engine.
VoiceXML What is VoiceXML?
The HTML(XML) of the voice web. The open standard markup language for voice
application VoiceXML Resources : http://www.voicexml.org/
Can do Rapid implementation and management Integrated with World Wide Web Mixed-Initiative dialogue Simple Dialogue implementation solution
VoiceXML EXAMPLES: Say one of: Sports scores; Weather information; Log in.
U: Sports scores
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"> <menu> <prompt>Say one of: <enumerate/></prompt> <choice next="http://www.example.com/sports.vxml"> Sports scores </choice> <choice next="http://www.example.com/weather.vxml"> Weather information </choice> <choice next="#login"> Log in </choice> </menu> </vxml>
AGENDA-BASED DM RavenClaw DM (CMU)
Using Hierarchical Task Decomposition A set of all possible dialogs in the domain Tree of dialog agents Each agent handles the corresponding part of the dialog
task
[Bohus and Rudnicky, 2003]
EXAMPLE-BASED DM (EBDM) Example-based approaches
Dialog State Space
Domain = Building_GuidanceDialog Act = WH-QUESTIONMain Goal = SEARCH-LOCROOM-TYPE=1 (filled), ROOM-NAME=0 (unfilled)LOC-FLOOR=0, PER-NAME=0, PER-TITLE=0Previous Dialog Act = <s>, Previous Main Goal = <s> Discourse History Vector = [1,0,0,0,0]Lexico-semantic Pattern = ROOM_TYPE 이 어디 지 ?System Action = inform(Floor)
Dialog Corpus
USER: 회의 실 이 어디 지 ?[Dialog Act = WH-QUESTION][Main Goal = SEARCH-LOC][ROOM-TYPE = 회의실 ]SYSTEM: 3 층에 교수회의실 , 2 층에 대회의실 , 소회의실이 있습니다 . [System Action = inform(Floor)]
Turn #1 (Domain=Building_Guidance)
Dialog Example
Indexed by using semantic & discourse features
Having the similar state
),(argmax* heSe iEei
[Lee et al., 2009]
STOCHASTIC DM Supervised approaches [Griol et al., 2008]
Find the best system action to maximize the conditional probability P(a|s) given the dialog state Based on supervised learning algorithms
MDP/POMDP-based approaches [Williams and Young, 2007]
Find the optimal system action to maximize the reward function R(a|s) given the belief state Based on reinforcement learning algorithms
In general, a dialog state space is too large So, generalizing the current dialog state is important
Dialog as a Markov Decision Process
User
SpeechUnderstanding
SpeechGeneration
StateEstimator
DialogPolicy
Optimize
k
kk rR
us
ua
ma
ua~
ma~
duum ssas ~,~,~~
MDP
usergoal
userdialog act
noisy estimate ofuser dialog act
dialoghistory
machinestate
machinedialog act
ReinforcementLearning
Reward
),( mm asr
ms~
ds
[Williams and Young, 2007]
SUMMARY
x y
InputWords or Intentions
OutputSystem
Response
Dialog Corpus
Dialog Model
External DB
Agenda-based approachStochastic approachExample-based approach
Demo Building guidance dialog TV program guide dialog Multi-domain dialog with chatting
CHALLENGES & ISSUESMULTI-MODAL DIALOG SYSTEM
MULTI-MODAL DIALOG SYSTEM
x y
InputGesture
OutputSystem
Response
(x, y)
Training examples
Learning algorithm
InputSpeech
Inputface
MULTIMODAL DIALOG SYSTEM A system which supports human-computer
interaction over multiple different input and/or output modes. Input: voice, pen, gesture, face expression, etc. Output: voice, graphical output, etc.
Applications GPS Information guide system Smart home control Etc.
여기에서 여기로 가는 제일 빠른 길 좀 알려 줘 .
voice
pen
MOTIVATION Speech: the Ultimate Interface?
(+) Interaction style: natural (use free speech) Natural repair process for error recovery
(+) Richer channel – speaker’s disposition and emotional state (if system’s knew how to deal with that..)
(-) Input inconsistent (high error rates), hard to correct error e.g., may get different result, each time we speak the
same words. (-) Slow (sequential) output style: using TTS (text-to-speech)
How to overcome these weak points? Multimodal interface!!
ADVANTAGES Task performance and user preference
Migration of Human-Computer Interaction away from the desktop
Adaptation to the environment
Error recovery and handling
Special situations where mode choice helps
TASK PERFORMANCE AND USER PREFERENCE Task performance and user preference for
multimodal over speech only interfaces [Oviatt et al., 1997] 10% faster task completion, 23% fewer words, (Shorter and simpler linguistic constructions) 36% fewer task errors, 35% fewer spoken disfluencies, 90-100% user preference to interact this way.
• Speech-only dialog system
Speech: Bring the drink on the table to the side of bed
• Multimodal dialog System
Speech: Bring this to herePen gesture:
Easy, Simplified
user utterance
!
MIGRAION OF HCI AWAY FROM THE DESKTOP Small portable computing devices
Such as PDAs, organizers, and smart-phones Limited screen real estate for graphical output Limited input no keyboard/mouse (arrow keys, thumbwheel) Complex GUIs not feasible Augment limited GUI with natural modalities such as speech and pen
Use less space Rapid navigation over menu hierarchy
Other devices Kiosks, car navigation system…
No mouse or keyboard
Speech + pen gesture
APPLICATION TO THE ENVIRONMENT Multimodal interfaces enable rapid adaptation to
changes in the environment Allow user to switch modes Mobile devices that are used in multiple environments
Environmental conditions can be either physical or social Physical
Noise: Increases in ambient noise can degrade speech performance switch to GUI, stylus pen input
Brightness: Bright light in outdoor environment can limit usefulness of graphical display
Social Speech many be easiest for password, account number
etc, but in public places users may be uncomfortable being overheard Switch to GUI or keypad input
ERROR RECOVERY AND HANDLING Advantages for recovery and reduction of
error: Users intuitively pick the mode that is less error-prone. Language is often simplified. Users intuitively switch modes after an error
The same problem is not repeated. Multimodal error correction
Cross-mode compensation - complementarity Combining inputs from multiple modalities can reduce
the overall error rate Multimodal interface has potentially
SPECIAL SITUATIONS WHERE MODE CHOICE HELPS Users with disability People with a strong accent or a cold People with RSI Young children or non-literate users Other users who have problems when handle
the standard devices: mouse and keyboard
Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Demo Multimodal dialog in smart home domain English teaching dialog
CHALLENGES & ISSUESDIALOG SIMULATOR
SYSTEM EVALUATION Real User Evaluation
Real Interaction
1. High Cost (-)
2. Human Factor- It looses objectivity (-)
Spoken Dialog System
1. Reflecting Real World (+)
SYSTEM EVALUATION Simulated User Evaluation
Simulated Interaction
Spoken Dialog System Simulated User
Virtual Environment
1. Low Cost (+)
2. Consistent Evaluation- It guarantees objectivity (+)
1. Not Real World (-)
SYSTEM DEVELOPMENT Exposing System to Diverse Environment
•Different users•Noises•Unexpected focus shift
Spoken Dialog System
USER SIMULATION
Spoken Dialog System Simulated Users
Simulated User Input
System Output
PROBLEMS User Simulation for spoken dialog systems
involves four essential problems [Jung et al., 2009]
User Intention Simulation
User Utterance Simulation
ASR Channel SimulationSpoken Dialog System Simulated Users
USER INTENTION SIMULATION Goal
Generating appropriate user intentions given the current dialog state
P( user_intention | dialog_state)
ExampleU1 : 근처에 중국집 가자S1 : 행당동에 북경 , 아서원 , 시온반점이 있고 홍익동에
정궁중화요리 , 도선동에 양자강이 있습니다 . U2 : 삼성동에는 뭐가 있지 ? Semantic Frame
(Intention)
Dialog act WH-QUESTION
Main Goal SEARCH-LOC
Named Entity LOC_ADDRESS
USER UTTERANCE SIMULATION Goal
Generating natural languages given the user intentions
P( user_utterance | user_intention )
Semantic Frame (Intention)
Dialog act WH-QUESTION
Main Goal SEARCH-LOC
Named Entity LOC_ADDRESS
• 삼성동에는 뭐가 있지 ?• 삼성동 쪽에 뭐가 있지 ?• 삼성동에 있는 것은 뭐니 ?• …
ASR CHANNEL SIMULATION Goal
Generating noisy utterances from a clean utterance at certain error rates
P( utternoisy | utterclean , error_rate)
• 삼성동에는 뭐가 있지 ?
• 삼성동에 뭐 있니 ? • 삼정동에는 뭐가 있지 ? • 상성동 뭐 가니 ?•삼성동에는 무엇이 있니 ?• …
Clean utterance Noisy utterance
AUTOMATED DIALOG SYSTEM EVALUATION
[Jung et al., 2009]
Demo Self-learned dialog system Translating dialog system
REFERENCES
REFERENCES ASR (1/2)
L.R. Rabiner and B.H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of 1986 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.49–52.
K.K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMMbased speech recognizer, Digital Signal Processing, vol.2, pp.157–173.
B.H. Juang, S.E. Levinson, and M.M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol.32, no.2, pp.307–309.
T.J. Hazen, I.L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp.99–104.
REFERENCES ASR (2/2)
K. Demuynck, J. Duchateau, and D.V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp.143–146.
S.J. Young, N.H. Russell, and J.H.S Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK book. Entropics Cambridge Research Lab., Cambridge, UK.
HTK website: http://htk.eng.cam.ac.uk/
REFERENCES SLU
R. De Mori et al. Spoken Language Understanding for Conversational Systems. Signal Processing Magazine. 25(3):50-58. 2008.
Y. Wang, L. Deng, and A. Acero. September 2005, Spoken Language Understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5):16-31.
D. Gildea, and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
M. Jeong and G.G. Lee, 2006. Jointly predicting dialog act and named entity for spoken language understanding, IEEE/ACL workshop on SLT.
T. G. Dietterich, 2002. Machine learning for sequential data: A review. Caelli(Ed.) Structural, Syntactic, and Statistical Pattern Recognition.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.
C. Sutton and A. McCallum, 2006. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning. L. Getoor and B. Taskar, Eds. MIT Press.
REFERENCES DM
M. F. McTear, Spoken Dialogue Technology - Toward the Conversational User Interface: Springer Verlag London, 2004.
S. Larsson, and D. R. Traum, “Information state and dialogue management in the TRINDI dialogue move engine toolkit,” Natural Language Engineering, vol. 6, pp. 323-340, 2006.
B. Bohus, and A. Rudnicky, “RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda,” in Proc. of the European Conference on Speech, Communication and Technology, 2003, pp. 597-600.
D. Griol, L. F. Hurtado, E. Segarra et al., “A statistical approach to spoken dialog systems design and evaluation,” Speech Communication, vol. 50, no. 8-9, pp. 666-682, 2008.
J. D. Williams, and S. Young, “Partially observable Markov decision processes for spoken dialog systems,” Computer Speech and Language, vol. 21, pp. 393-422, 2007.
C. Lee, S. Jung, S. Kim et al., “Example-based Dialog Modeling for Practical Multi-domain Dialog System,” Speech Communication, vol. 51, no. 5, pp. 466-484, 2009.
REFERENCES MULTI-MODAL DIALOG SYSTEM & DIALOG
SIMULATOR S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and
synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.
R. Lopez-Cozar, A. D. la Torre, J. C. Segura et al., “Assessment of dialogue systems by means of a new simulation technique,” Speech Communication, vol. 40, no. 3, pp. 387-407, 2003.
J. Schatzmann, B. Thomson, K. Weilhammer et al., “Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System,” in Proc. of the Human Language Technology/North American Chapter of the Association for Computational Linguistics, 2007, pp. 149-152.
S. Jung, C. Lee, K. Kim et al., “Data-driven user simulation for automated evaluation of spoken dialog systems,” Computer Speech and Language, 2009.
Thank You & QAThank You & QA