Learning Dialog Acts for Embodied Agents
Thomas K. Harris
KTH: 19 May 2005
Introduction::SGPUC::Learning::Applicability::Grounding 2
Today’s Talk
• Introduction: Problems talking to robots
• SGPUC: A mini-problem addressed
• Learning
  – Supervised
  – Semi-supervised
• Application: Weak contributors
• Grounding: Back to Sensors
Scenario for Search
We found it!
We are at <x,y>
Issues in Spoken HRI
1. How do people decompose the task into sub-tasks?
2. What language do people use to get the tasks performed by the robots?
3. Given a human command, what is the expected robot behavior?
Explore using Wizard of Oz experiments
WOZ design
Natural spoken communication takes place via walkie-talkie. All communication and robot movements are recorded.
• Participants:
  – 1 experimenter
  – 2 robot teleoperator-actors
  – 1 subject
• Experimenter places treasure.
• Teleoperators can only see what robots can identify.
• Subject can only see map data generated by robots.
Annotation and analysis
Utterances classified into functional categories
For one experiment:
8 major utterance categories
20 minor utterance categories
394 unique utterances
Carnegie Mellon MockBrow annotation tool
Utterance/Task Breakdown
• Controlling team behaviors
• Grounding
  – Positive/negative feedback
  – Informing robot of its state or the world
  – Explanations of commands
  – Orientation Grounding
• Navigation
  – Simple Navigation commands
  – Spatial Referential Navigation
  – Object Referential Navigation
• Manipulation
  – Manipulating the environment
  – Manipulating treasure
  – Manipulating the webcam view
• Coverage
  – Object coverage commands
  – Generic coverage
• Asking about the robot’s abilities
• Filler
• Real-time command modifications
Designing the SDS Input Pass
• Words -> Speech Acts and Concepts is usually a knowledge-engineered “white-box” function.
• Coverage issues:
  – One-to-many mapping from concepts to words
  – The space of words is large (nobody can even say how large)
  – ASR is sensitive to overcoverage
• Input issues:
  – Noisy
  – Probabilistic
  – Dynamic and situational
• Outputs (concepts) are difficult to share or generalize from one domain or system to another.
What do we do?
• A lot of design iterations!
• Restrict the domain
• Share components
• Control the speaker through
  – Training and entrainment
  – Domain-related expectations
  – Influencing or outright directing the dialog
Use the Data
• Words -> Speech Acts and Concepts can also be a data-driven “black-box” function, or a hybrid.
• This has its own set of problems
  – Labeling data is costly
  – The catch-22: data collection requires a working system. Iterate, starting with seed data, which can be
    • nothing
    • designer-hypothesized data
    • WoZ data
    • data from a similar or previous-version SDS
    • data from some human-human analog
  – The performance often seems nice at first, but then asymptotes quickly.
• I’m only going to address the labeling cost issue here.
Avoiding Labeling Costs
• Easily Labeled
  – Observe broad classes of utterances relevant to a domain, e.g. “request for train ticket, request for train schedule, other”
• Automatically Observable Data
  – Observe co-occurring, automatically identifiable phenomena, e.g. record which tickets are purchased by a human agent after which customer utterances.
• Unlabeled Data
A Mini-Problem
• Let’s look at a small part of the words -> speech acts and concepts problem in a real system, the Speech Graffiti Personal Universal Controller (SGPUC).
• Hopefully this small, concrete system and its mini-problem will make it manageable to experiment with different approaches.
• But first, a little about the system itself.
Speech Graffiti Personal Universal Controller
• Protocol-based appliance communication architecture
• Automatically built SDS from appliance description language
• Speech Graffiti style user interface
Appliance Communication Architecture
[Diagram: HAVi adapter, X10 adapter, and the Speech Graffiti Personal Universal Controller]
[Diagram: function tree for “James”, the example controller, covering a stereo, a digital camera, and other devices. Stereo branch: on/off, x-bass, volume (up/down), and the modes tuner, auxiliary, and CD. Tuner: radio band (AM/FM), seek (forward/backward), frequency (#), station (WXXX). CD: play/stop/pause, disc (#), track (#, next track, last track), random (on/off), repeat (off, single track, single disc, all discs). Digital camera branch: control and info; play mode (play, stop, fast fwd, rewind, record, pause, step forward/backward); device mode (camera, VCR); media type (digital video, VHS, unknown, none).]
Speech Graffiti Dialog
• Artificial subset language
• Tree-structured functions
• Universal conversational primitives
• User-directed
• Great for recognition
• Entirely declarative (and automatic)
Six Dialog Acts in SGPUC
• Session Management
• Help
• Execution
• Query
• Response Related
• Other
Session Management
User: blah blah blah...
System: ignoring user
User: hello james
System: stereo, digital camera
User: stereo
System: stereo here
User: goodbye
System: goodbye
User: blah blah blah...
System: ignoring user
Help/Exploration/Orientation
User: alarm clock options
System: alarm, clock, radio, sleep, and more…
User: help
System: to hear what options you have, say options, to repeat an utterance, say repeat…
User: stereo options
System: while turning stereo on: off, am, fm, auxiliary, cd, and more…
User: where am I
System: stereo options
Execution/Specification
User: stereo auxiliary
System: while turning the stereo on and switching to auxiliary: auxiliary
User: cd
System: while switching to cd mode: cd
User: play
System: while playing a cd: playing cd
User: switch disc to four
System: while switching discs: switching cd disc to four
Query
User: what is the am frequency
System: the am frequency is five hundred thirty
User: random status
System: random is off
User: status
System: tuner am, station wabc, volume low, and more…
Response Related
User: radio band am options
System: am options are frequency, kabc, k001, k002, and more…
User: more
System: k003, k004
User: more
System: band options are fm
User: repeat
System: band options are fm
Back to the Mini-Problem
• The language is explicit and regular in classifying dialog acts.
• A grammar will accurately classify dialog acts.
• Users are taught the SG language.
• Users learn the language incompletely and have faulty memories.
• Utterances have false starts, spurious repetitions, etc.
• ASR is error prone.
• 37.5% of utterances’ dialog acts were misclassified.
Data
• Listening to the actual speech, I labeled 2010 utterances (from 10 participants). Each utterance is labeled with one of the six dialog acts.
• Note that this labeling is much faster than transcription and most other kinds of labeling: the 2010 utterances were labeled in 2½ hours, close to real time.
• Each utterance is represented by a boolean vector in which each element records whether a given word appears in the utterance (i.e. word order is ignored!).
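A minimal sketch of the boolean-vector representation, assuming lower-casing and whitespace tokenization (the talk does not say how words were split). The function names and the three example utterances are hypothetical.

```python
def build_vocabulary(utterances):
    """Return a sorted list of every distinct word in the utterances."""
    return sorted({word for utt in utterances for word in utt.lower().split()})


def to_boolean_vector(utterance, vocabulary):
    """Mark, for each vocabulary word, whether it occurs in the utterance."""
    present = set(utterance.lower().split())
    return [word in present for word in vocabulary]


# Hypothetical utterances in the style of the SGPUC dialogs.
utterances = ["stereo options", "options stereo", "what is the am frequency"]
vocabulary = build_vocabulary(utterances)
vectors = [to_boolean_vector(u, vocabulary) for u in utterances]
```

Because only presence is recorded, "stereo options" and "options stereo" map to the same vector, which is exactly the loss of word order noted above.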
A Naïve Bayes Classifier
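\[ \hat{P}(W_x = \mathrm{in} \mid \mathrm{Class} = c) = \frac{\#(\text{utterances in } c \text{ containing word } x)}{\#(\text{utterances in } c)} \]

\[ \hat{P}(\mathrm{Class} = c) = \frac{\#(\text{utterances in } c)}{\#(\text{all utterances})} \]

\[ P(\mathrm{Class} = c \mid W_1, W_2, \ldots, W_n) = \frac{P(\mathrm{Class} = c) \prod_{i=1}^{n} P(W_i \mid \mathrm{Class} = c)}{P(W_1, W_2, \ldots, W_n)} \]

\[ \hat{P}(W_x = \mathrm{out} \mid \mathrm{Class} = c) = 1 - \hat{P}(W_x = \mathrm{in} \mid \mathrm{Class} = c) \]

\[ \mathrm{Class} = \arg\max_{c} \; P(\mathrm{Class} = c) \prod_{i=1}^{n} P(W_i \mid \mathrm{Class} = c) \]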
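The estimators and decision rule on this slide can be sketched as follows. This is an illustrative, unsmoothed implementation over boolean word vectors, not the code used in the talk; the three-word vocabulary and the four training utterances are hypothetical, and a real vocabulary would call for log probabilities to avoid underflow.

```python
from collections import Counter, defaultdict


def train_naive_bayes(vectors, labels):
    """Estimate P(Class=c) and P(W_x=in | Class=c) by relative frequency."""
    class_counts = Counter(labels)
    n_words = len(vectors[0])
    word_counts = defaultdict(lambda: [0] * n_words)
    for vector, label in zip(vectors, labels):
        for x, present in enumerate(vector):
            if present:
                word_counts[label][x] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    p_in = {c: [word_counts[c][x] / class_counts[c] for x in range(n_words)]
            for c in class_counts}
    return priors, p_in


def classify(vector, priors, p_in):
    """Return argmax_c P(Class=c) * prod_i P(W_i | Class=c)."""
    def score(c):
        total = priors[c]
        for x, present in enumerate(vector):
            # P(W_x=out | c) is 1 - P(W_x=in | c), as on the slide.
            total *= p_in[c][x] if present else 1.0 - p_in[c][x]
        return total
    return max(priors, key=score)


# Hypothetical toy data over the vocabulary [options, status, play].
vectors = [[True, False, False], [True, False, True],
           [False, True, False], [False, True, True]]
labels = ["Help", "Help", "Query", "Query"]
priors, p_in = train_naive_bayes(vectors, labels)
predicted = classify([True, False, False], priors, p_in)
```

The denominator P(W_1, ..., W_n) is the same for every class, so the decision rule can drop it.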
Classifier Results
[Figure: error (0% to 50%) plotted against utterances labeled (200 to 1800), with two series: Classifier Error and Grammar.]
Problems with Naïve Bayes
• Independence assumption
  – Word existence in an utterance contributes a fixed amount to class distinction regardless of context.
  – e.g. “bank” contributes the same thing to the classifier in the context of “world bank” and “river bank”
• Estimates a high-dimensional model
  – The model estimates 5 parameters (#classes − 1) for each word. Words that occur infrequently will be severely over-fitted.
• Problems with singleton words
  – If a word appears in an utterance but hasn’t occurred in the training data for a particular class, the probability assigned to that class is zero.
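The singleton problem is easy to reproduce. In this small sketch the per-word estimates for one class are made-up numbers, chosen so that the third word ("um") was never seen with that class in training.

```python
def class_score(prior, p_in, vector):
    """Unsmoothed naive Bayes score: prior times per-word likelihoods."""
    score = prior
    for p, present in zip(p_in, vector):
        score *= p if present else 1.0 - p
    return score


# Hypothetical estimates for one class over the vocabulary [options, help, um].
# "um" never occurred with this class in training, so its estimate is 0.
p_in_help = [0.9, 0.8, 0.0]
clean = class_score(0.5, p_in_help, [True, True, False])
with_filler = class_score(0.5, p_in_help, [True, True, True])
```

One unseen word drives the whole product to zero, no matter how strongly the other words point to the class.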
Latent Semantic Analysis to the Rescue
• Independence assumption
  – LSA models both synonymy and polysemy.
  – Polysemy: Words that occur in different contexts, e.g. “bank” in “world bank” vs. “river bank”, tend to become distinguished.
  – Synonymy: Words that occur in similar contexts, e.g. the “white” and “black” of “white sheep” and “black sheep”, tend to become undistinguished.
• Estimates a high-dimensional model
  – The effective dimension is arbitrarily fixed.
• Problems with singleton words
  – The dimensionality reduction serves as a smoothing function.
How Does LSA Work?
C1: Human machine interface for ABC computer applications.
C2: A survey of user opinion of computer system response time.
C3: The EPS user interface management system.
C4: System and human system engineering testing of EPS.
C…:

{X} =
            C1  C2  C3  C4  …
Human        1   0   0   1  …
Interface    1   0   1   0  …
Computer     1   1   0   0  …
User         0   1   1   0  …
System       0   1   1   2  …
Response     0   1   0   0  …
Time         0   1   0   0  …
EPS          0   0   1   1  …
Survey       0   1   0   0  …
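As a minimal sketch, the four visible columns of X can be rebuilt from the passages by counting. The term list is copied from the slide; lower-casing, whitespace tokenization, and dropping the final periods are assumptions that happen to suffice for these passages.

```python
passages = [
    "Human machine interface for ABC computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
]
terms = ["human", "interface", "computer", "user", "system",
         "response", "time", "eps", "survey"]


def term_passage_matrix(terms, passages):
    """Return rows of counts: one row per term, one column per passage."""
    tokenized = [p.lower().split() for p in passages]
    return [[tokens.count(term) for tokens in tokenized] for term in terms]


X = term_passage_matrix(terms, passages)
```

Note that entries are counts, not booleans: "system" occurs twice in C4.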
Singular Value Decomposition
• Any m×n matrix X where m>n can be decomposed into the product of three matrices, U D V^T, where:
  – U is an m×n matrix and V is an n×n matrix, both with orthonormal columns.
  – D is an n×n diagonal matrix.
• D is a sort-of basis in n dimensions for X.
• In Matlab, [U, D, V] = svd(X, 'econ');
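For readers without Matlab, the same decomposition can be sketched in Python, assuming NumPy is available. The matrix below is the four-passage slice of the LSA example; `numpy.linalg.svd` returns the singular values as a vector and V already transposed, so D is rebuilt with `np.diag`.

```python
import numpy as np

# Term-passage counts for passages C1..C4 of the LSA example (9 terms x 4 passages).
X = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 2],
              [0, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0]], dtype=float)

# full_matrices=False gives the economy-size form: U is m x n, V^T is n x n.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)                 # n x n diagonal matrix of singular values
reconstructed = U @ D @ Vt     # equals X up to floating-point error
```

The singular values in `d` come back sorted from largest to smallest, which is what makes the truncation step in LSA a simple slice.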
LSA Algorithm in 4 Easy Steps
• Build your feature-passage matrix X. (Here I chose word-utterance.)
• [U, D, V] = svd(X, 'econ')
• Zero out all but the g highest values of D to form a new, reduced D.
• Recompose a reduced X as U D V^T.
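The steps above can be sketched in a few lines of Python, assuming NumPy. The function name `lsa_reduce` is hypothetical, and the input is only the four-passage slice of the example matrix, so its output will not reproduce the figures computed from the full passage set.

```python
import numpy as np


def lsa_reduce(X, g):
    """Recompose X from its g largest singular values."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_reduced = d.copy()
    d_reduced[g:] = 0.0        # singular values arrive sorted, largest first
    return U @ np.diag(d_reduced) @ Vt


# Term-passage counts for passages C1..C4 of the LSA example.
X = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 2],
              [0, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0]], dtype=float)
X_reduced = lsa_reduce(X, g=2)
```

The reduced matrix has the same shape as X but rank g, so zero entries generally become small non-zero values. That spreading is the smoothing effect LSA is being used for here.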
The Recomposed Matrix
C1: Human machine interface for ABC computer applications.
C2: A survey of user opinion of computer system response time.
C3: The EPS user interface management system.
C4: System and human system engineering testing of EPS.
C…:

{X} = (original counts in parentheses)
            C1        C2        C3        C4        …
Human       (1) 0.16  (0) 0.40  (0) 0.38  (1) 0.47  …
Interface   (1) 0.14  (0) 0.37  (1) 0.33  (0) 0.40  …
Computer    (1) 0.15  (1) 0.51  (0) 0.36  (0) 0.41  …
User        (0) 0.26  (1) 0.84  (1) 0.61  (0) 0.70  …
System      (0) 0.45  (1) 1.23  (1) 1.05  (2) 1.27  …
Response    (0) 0.16  (1) 0.58  (0) 0.38  (0) 0.42  …
Time        (0) 0.16  (1) 0.58  (0) 0.38  (0) 0.42  …
EPS         (0) 0.22  (0) 0.55  (1) 0.51  (1) 0.63  …
Survey      (0) 0.10  (1) 0.53  (0) 0.23  (0) 0.21  …
And This Means?
• Cosine distances between words show patterns of similarity, as do cosine distances between passages.
• Clustering with these distances produces clusters that feel “semantic” and mimic human choices in standardized tests for word sorting and lexical priming. The match is close enough that people have suggested LSA may be an actual psycholinguistic mechanism.
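A short sketch of the word-to-word comparison, assuming NumPy. The three vectors are rows of the recomposed matrix from the earlier slide, restricted to the four visible columns C1..C4, so the numbers illustrate the idea rather than reproduce a full analysis. The function returns cosine similarity; a cosine distance is commonly taken as one minus this value.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Rows of the recomposed matrix, columns C1..C4 only.
response = np.array([0.16, 0.58, 0.38, 0.42])
time = np.array([0.16, 0.58, 0.38, 0.42])
survey = np.array([0.10, 0.53, 0.23, 0.21])

same = cosine_similarity(response, time)
near = cosine_similarity(response, survey)
```

"Response" and "Time" have identical rows, so their similarity is maximal, while "Survey", which shares only passage C2 with them, comes out close but lower.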
LSA-Discounted NB Estimators
• Why don’t we try to use an LSA-reconstructed matrix to train the NB classifier?
• Used various amounts of labeled data, discounted by various amounts of unlabeled LSA data.
• Unlabeled decoder output boosts classification!
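The slide does not spell out how the discounting was done, so the following is only one possible reading, offered as a sketch under stated assumptions: labeled and unlabeled utterances share one word-by-utterance matrix, LSA smooths it, and the smoothed values of the labeled columns are blended with the raw 0/1 values before the per-class word estimates are taken. The function names, the `weight` mixing factor, the clipping to [0, 1], and the toy data are all hypothetical. NumPy is assumed.

```python
import numpy as np


def lsa_reduce(X, g):
    """Recompose X from its g largest singular values."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d[g:] = 0.0
    return U @ np.diag(d) @ Vt


def discounted_word_estimates(X_labeled, labels, X_unlabeled, g, weight):
    """Blend raw and LSA-smoothed estimates of P(W_x = in | Class = c).

    X_labeled, X_unlabeled: word-by-utterance 0/1 matrices (same word rows).
    weight: assumed mixing factor in [0, 1]; 0 means raw counts only.
    """
    X_all = np.hstack([X_labeled, X_unlabeled])
    # Smooth over all utterances, then keep only the labeled columns.
    smoothed = np.clip(lsa_reduce(X_all, g), 0.0, 1.0)[:, :X_labeled.shape[1]]
    blended = (1.0 - weight) * X_labeled + weight * smoothed
    labels = np.asarray(labels)
    return {str(c): blended[:, labels == c].mean(axis=1)
            for c in np.unique(labels)}


# Hypothetical toy data: rows are the words [options, status, play].
X_labeled = np.array([[1, 1, 0, 0],
                      [0, 0, 1, 1],
                      [0, 1, 0, 1]], dtype=float)
labels = ["Help", "Help", "Query", "Query"]
X_unlabeled = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0]], dtype=float)

raw = discounted_word_estimates(X_labeled, labels, X_unlabeled, g=2, weight=0.0)
mixed = discounted_word_estimates(X_labeled, labels, X_unlabeled, g=2, weight=0.5)
```

With `weight=0.0` this reduces to the plain relative-frequency estimates; raising the weight lets the structure found in unlabeled utterances pull the estimates away from hard zeros and ones.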
Applications for Weak Contributors
• By itself, à la “How may I help you?” systems
• Informing dialog management by adjusting confidence measures of parsed concepts.
  – More effective error correction, e.g. “Please repeat the name of the city to which you want to travel.” vs. “I’m sorry, I didn’t understand that.”
  – More effective confirmation strategies.
• Guided utterance self-correction. A coarse classifier could re-weight the language model or re-order hypotheses to elicit a corrected best hypothesis.
• How much information needs to be understood for the conversation to progress?
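To illustrate the hypothesis re-ordering idea, here is a small sketch. Everything in it is an assumption made for illustration: the n-best list format, the log-score combination, the `weight` parameter, and the example scores are hypothetical and not taken from any particular recognizer.

```python
import math


def rerank(hypotheses, act_posterior, weight=1.0):
    """Order ASR hypotheses by ASR score plus weighted dialog-act evidence.

    hypotheses: list of (text, asr_log_score, dialog_act) tuples.
    act_posterior: dict mapping dialog act to the coarse classifier's P(act).
    """
    def combined(hyp):
        _, asr_log_score, act = hyp
        # Acts the classifier did not score get a small floor probability.
        return asr_log_score + weight * math.log(act_posterior.get(act, 1e-6))
    return sorted(hypotheses, key=combined, reverse=True)


# Hypothetical n-best list: the recognizer slightly prefers a misrecognition.
n_best = [("stereo auctions", -1.0, "Other"),
          ("stereo options", -1.2, "Help")]
posterior = {"Help": 0.8, "Other": 0.1}
best_text = rerank(n_best, posterior)[0][0]
```

Here the classifier's belief that the utterance is a Help act is enough to promote "stereo options" over the recognizer's first choice; with `weight=0.0` the original ASR order is kept.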
Grounding Language for Embodied Agents
• Prediction Functions
  – Concepts and Actions -> Words
  – Concepts and Actions -> Sensor Data
• Perceptive Function
  – Words, Sensor Data, Proprioception, and Predictions -> Concepts
• Planning Function
  – Concepts and Goals -> Actions
Sensory Deprivation
Push: To press forcefully
Force: Energy or strength
Energy: Strength of force
Strength: The power to resist force
From D. Roy, 2004
Prediction
• Why bother with prediction? Among other things, we’d like to see robots find stable meanings of things “in the wild”.
[Diagram: “Tiger”, with two links labeled “predicts”]
Summary
• Spoken dialogue is poorly characterized by engineers.
• Approaches that learn in both supervised and unsupervised settings can help.
• Embodied agents provide an ideal platform for grounded language acquisition.
Controlling team behaviors
• “you guys get together”
• “T- you go first and B- follow”
Grounding
• Positive/negative feedback
  – “ok that’s better”
• Informing robot of state
  – “so that’s up”
  – “I don’t see anything there”
• Explanations of commands
  – “so I can see which direction is up”
• Orientation Grounding
  – “What you’re facing now with the camera – is that the vehicle that you just circumnavigated”
  – “I can tell you’re going in the wrong direction, stop”
Navigation
• Simple Navigation commands
  – “so um T- turn to you left”
  – “T- I want you to turn right 90 degrees”
  – “can you go in that general direction”
  – “can you proceed in that direction”
• Spatial Referential Navigation
  – “go to that open area”
  – “continue around the periphery of that open area”
  – “back out of that alley”
  – “proceed in that direction until you find an opening to turn left”
• Object Referential Navigation
  – “go over by T-”
  – “can you go on the other side of that vehicle”
  – “go over by the posters”
Manipulation
• Manipulating the environment
  – “T- why don’t you move the trash can”
• Manipulating treasure
  – “T- bring the coin to me”
• Manipulating the webcam view
  – “ok B- look to your left”
  – “B- can you look around with the camera a little”
Coverage
• Object coverage commands
  – “ok so examine the shelf”
  – “do you see something on that shelf in front of B-”
  – “can you look over by that table over there”
• Generic Coverage
  – “do you see anything that looks interesting”
Asking about the robot’s abilities
“is that possible”