
Interactive Learning of Grounded Verb Semantics towards Human-Robot Communication

Lanbo She and Joyce Y. Chai
Department of Computer Science and Engineering
Michigan State University
East Lansing, Michigan 48824, USA

{shelanbo, jchai}@cse.msu.edu

Abstract

To enable human-robot communication and collaboration, previous works represent grounded verb semantics as the potential change of state to the physical world caused by these verbs. Grounded verb semantics are acquired mainly based on parallel data of the use of a verb phrase and its corresponding sequences of primitive actions demonstrated by humans. The rich interaction between teachers and students that is considered important in learning new skills has not yet been explored. To address this limitation, this paper presents a new interactive learning approach that allows robots to proactively engage in interaction with human partners by asking good questions to learn models for grounded verb semantics. The proposed approach uses reinforcement learning to allow the robot to acquire an optimal policy for its question-asking behaviors by maximizing the long-term reward. Our empirical results have shown that the interactive learning approach leads to more reliable models for grounded verb semantics, especially in noisy environments that are full of uncertainties. Compared to previous work, the models acquired from interactive learning result in a 48% to 145% performance gain when applied in new situations.

1 Introduction

In communication with cognitive robots, one of the challenges is that robots do not have sufficient linguistic or world knowledge as humans do. For example, if a human asks a robot to boil the water but the robot has no knowledge of what this verb phrase means and how this verb phrase relates to its own actuator, the robot will not be able to execute this command. Thus it is important for robots to continuously learn the meanings of new verbs, and how the verbs are grounded to their underlying action representations, from their human partners.

To support learning of grounded verb semantics, previous works (She et al., 2014; Misra et al., 2015; She and Chai, 2016) rely on multiple instances of human demonstrations of corresponding actions. From these demonstrations, robots capture the state change of the environment caused by the actions and represent verb semantics as the desired goal state. One advantage of such a state-based representation is that, when robots encounter the same verbs/commands in a new situation, the desired goal state will trigger the action planner to automatically plan a sequence of primitive actions to execute the command.

While the state-based verb semantics provides an important link to connect verbs to the robot's actuator, previous works also present several limitations. First of all, previous approaches were developed under the assumption of perfect perception of the environment (She et al., 2014; Misra et al., 2015; She and Chai, 2016). However, this assumption does not hold in real-world situated interaction. The robot's representation of the environment is often incomplete and error-prone due to its limited sensing capabilities. Thus it is not clear whether previous approaches can scale up to handle noisy and incomplete environments.

Second, most previous works rely on multiple demonstration examples to acquire grounded verb models. Each demonstration is simply a sequence of primitive actions associated with a verb. No other type of interaction between humans and robots is explored. Previous cognitive studies (Bransford et al., 2000) on how people learn have shown that social interaction


(e.g., conversation with teachers) can enhance the student learning experience and improve learning outcomes. For robotic learning, previous work (Cakmak and Thomaz, 2012) has also demonstrated the necessity of question answering in the learning process. Thus, in our view, interactive learning beyond demonstration of primitive actions should play a vital role in the robot's acquisition of more reliable models of grounded verb semantics. This is especially important because the robot's perception of the world is noisy and incomplete, human language can be ambiguous, and the robot may lack the relevant linguistic or world knowledge during the learning process.

To address these limitations, we have developed a new interactive learning approach where robots actively engage with humans to acquire models of grounded verb semantics. Our approach explores the space of interactive question answering between humans and robots during the learning process. In particular, motivated by previous work on robot learning (Cakmak and Thomaz, 2012), we designed a set of questions that are pertinent to verb semantic representations. We further applied reinforcement learning to learn an optimal policy that guides the robot in deciding when to ask what questions. Our empirical results have shown that this interactive learning process leads to more reliable representations of grounded verb semantics, which contribute to significantly better action performance in new situations. When the environment is noisy and uncertain (as in a realistic situation), the models acquired from interactive learning result in a performance gain between 48% and 145% when applied in new situations. Our results further demonstrate that the interaction policy acquired from reinforcement learning leads to the most efficient interaction and the most reliable verb models.

2 Related Work

To enable human-robot communication and collaboration, recent years have seen an increasing amount of work that aims to learn semantics of language grounded to agents' perception (Gorniak and Roy, 2007; Tellex et al., 2014; Kim and Mooney, 2012; Matuszek et al., 2012a; Liu et al., 2014; Liu and Chai, 2015; Thomason et al., 2015, 2016; Yang et al., 2016; Gao et al., 2016) and action (Matuszek et al., 2012b; Artzi and Zettlemoyer, 2013; She et al., 2014;

Misra et al., 2014, 2015; She and Chai, 2016). Specifically for verb semantics, recent works explored the connection between verbs and action planning (She et al., 2014; Misra et al., 2014, 2015; She and Chai, 2016), for example, by representing grounded verb semantics as the desired goal state of the physical world that results from the corresponding actions. Such representations are learned based on example actions demonstrated by humans. Once acquired, these representations allow agents to interpret verbs/commands issued by humans in new situations and apply action planning to execute actions. Given its clear advantage in connecting verbs with actions, our work also applies the state-based representation for verb semantics. However, we have developed a new approach which goes beyond learning from demonstrated examples by exploring how rich interaction between humans and agents can be used to acquire models for grounded verb semantics.

This approach was motivated by previous cognitive studies (Bransford et al., 2000) on how people learn, as well as recent findings on robot skill learning (Cakmak and Thomaz, 2012). One of the principles for human learning is that "learning is enhanced through socially supported interactions". Studies have shown that social interaction with teachers and peers (e.g., substantive conversation) can enhance the student learning experience and improve learning outcomes. In recent work on interactive robot learning of new skills (Cakmak and Thomaz, 2012), researchers identified three types of questions that can be used by a human/robot student to enhance learning outcomes: 1) demonstration query (i.e., asking for a full or partial demonstration of the task), 2) label query (i.e., asking whether an execution is correct), and 3) feature query (i.e., asking for a specific feature or aspect of the task). Inspired by these previous findings, our work explores interactive learning to acquire grounded verb semantics. In particular, we aim to address when to ask what questions during interaction to improve learning.

3 Acquisition of Grounded Verb Semantics

This section gives a brief review of the acquisition of grounded verb semantics and illustrates the differences between previous approaches and our approach using interactive learning.


Figure 1: An example of acquiring a state-based representation for verb semantics based on an initial environment Ei, a language command Li, the primitive action sequence Ai demonstrated by the human, and the final environment E′i that results from the execution of Ai in Ei.

3.1 State-based Representation

As shown in Figure 1, the verb semantics (e.g., boil(x)) is represented by the goal state (e.g., Status(x, TempHigh)), which is the result of the demonstrated primitive actions. Given the verb phrase boil the water (i.e., Li), the human teaches the robot how to accomplish the corresponding action through a sequence of primitive actions Ai. By comparing the final environment E′i with the initial environment Ei, the robot is able to identify the state change of the environment, which becomes a hypothesis of the goal state to represent the verb semantics. Compared to procedure-based representations, the state-based representation supports automated planning at execution time; it is environment-independent and more generalizable. In (She and Chai, 2016), instead of a single hypothesis, a specific-to-general hypothesis space is maintained, as shown in Figure 2, to capture all goal hypotheses of a particular verb frame. Specifically, it is assumed that one verb frame may lead to different outcomes under different environments, where each possible outcome is represented by one node in the hierarchical graph and each node is a conjunction of multiple atomic fluents.
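To make the state-change idea concrete, the following is a minimal sketch of inducing a candidate goal state by comparing the fluents of the initial and final environments. This is not the authors' exact procedure (the hypothesis-space construction of She and Chai (2016) is richer), and the fluent tuples are purely illustrative.

```python
# Simplified sketch: induce a candidate goal state from one demonstration by
# comparing the environment before (E_i) and after (E'_i) the primitive actions.
# Fluents are shown as tuples; deterministic sensing is assumed for clarity.

Fluent = tuple[str, ...]

def induce_goal_hypothesis(e_initial: set[Fluent], e_final: set[Fluent]) -> set[Fluent]:
    """Fluents that became true after the demonstration; their conjunction
    is one candidate goal state for the demonstrated verb."""
    return e_final - e_initial

e_i = {("Has", "Kettle1", "WATER"), ("On", "Kettle1", "Table")}
e_i_prime = {("Has", "Kettle1", "WATER"), ("On", "Kettle1", "Stove"),
             ("Status", "WATER", "TempHigh")}

# For "boil the water" this yields {("On","Kettle1","Stove"), ("Status","WATER","TempHigh")}.
goal_hypothesis = induce_goal_hypothesis(e_i, e_i_prime)
```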

Given a language command (i.e., a verb phrase), a robot will engage in the following processes:

• Execution. In this process, the robot will select a hypothesis from the space of hypotheses that is most relevant to the current situation and use the corresponding goal state to plan for actions to execute.

• Learning. When the robot fails to select a hypothesis or fails to execute the action, it will ask the human for a demonstration. Based on the demonstrated actions, the robot will learn a new representation (i.e., new nodes) and update the hypothesis space.

Footnote: In this work, we assume the set of atomic fluents representing the environment state is given, and we do not address the question of whether these predicates are adequate to represent a domain.

Figure 2: An example hypothesis space for the verb frame fill(x, y).

3.2 Noisy Environment

Figure 3: An example probabilistic sensing result.

Previous works represent the environment Ei as a conjunction of grounded state fluents. Each fluent consists of a predicate and one or more arguments (i.e., objects in the physical world, or object statuses), representing one aspect of the perceived environment. An example of a fluent is Has(Kettle1, WATER), meaning that object Kettle1 has some water inside, where Has is the predicate, and Kettle1 and WATER are arguments. The set of fluents includes the status of the robot (e.g., Grasping(Kettle1)), the status of different objects (e.g., Status(WATER, TempHigh)), and relations between objects (e.g., On(Kettle1, Stove)).


Figure 4: A general framework of robot interactive learning. KB stands for knowledge base, θE stands for the Interaction Strategy for Execution, and θD stands for the Interaction Strategy for Learning.

Only demonstration (She and Chai, 2016):

Execution Example
H1: Boil the water.
R1: (select a hypothesis and execute the action accordingly)
H2: (provide true/false feedback on whether the execution is correct)

Learning Example
H3: Boil the water.
R3: I don't know how to boil, could you show me an example?
H4: (human shows the robot how to achieve the corresponding action through a sequence of primitive actions)
R4: (indicate acceptance and induce a hypothesis to represent the verb semantics)

Interactive learning (this work):

Execution Example
H1: Boil the water.
R1: I see a silver object on the table, is that a pot?
H2: Yes.
R2: Should I leave the pot on the stove after boil?
H3: Yes.
R3: Ok. (select a hypothesis to execute)
H4: Thank you.

Learning Example
H5: Boil the water.
R5: I don't know how to boil, could you show me an example?
H6: Ok. (demonstrate the task, at the end put the pot on a heat pad)
R6: So after boiling something, should I put it on a heat pad?
H7: No, you don't have to.
R7: I see. (induce goal state hypothesis)

Table 1: Examples showing the differences between learning through demonstrations as in the previous works (She and Chai, 2016) and the proposed learning from interaction.

One limitation of the previous works is that the environment is assumed to have a perfect, deterministic representation, as shown in Figure 1. This is clearly not the case in the realistic physical world.

In reality, given the limitations of sensor capabilities, the environment representation is often partial, error prone, and full of uncertainties. Figure 3 shows an example of a more realistic representation where each fluent comes with a confidence between 0 and 1 to indicate how likely that particular fluent can be detected in the current environment. Thus, it is unclear whether the previous work is able to handle representations with uncertainties. Our interactive learning approach aims to address these uncertainties through interactive question answering with human partners.
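As a concrete, illustrative rendering of the probabilistic representation in Figure 3, an environment can be encoded as a mapping from grounded fluents to detection confidences. The specific numbers and the thresholding helper below are assumptions, not part of the paper.

```python
# Illustrative sketch: a noisy environment representation where each grounded
# fluent carries a confidence in [0, 1] reflecting how likely the robot's
# sensors are to have detected it correctly.

Fluent = tuple[str, ...]
Environment = dict[Fluent, float]

env: Environment = {
    ("Has", "Kettle1", "WATER"): 0.82,       # Kettle1 probably contains water
    ("Grasping", "Kettle1"): 0.10,           # the robot is probably not holding it
    ("Status", "WATER", "TempHigh"): 0.05,   # the water is probably not hot yet
    ("On", "Kettle1", "Stove"): 0.67,        # an uncertain relation between objects
}

def likely_true(env: Environment, fluent: Fluent, threshold: float = 0.5) -> bool:
    """Treat a fluent as present if its confidence exceeds a chosen threshold."""
    return env.get(fluent, 0.0) > threshold
```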

4 Interactive Learning

4.1 Framework of Interactive Learning

Figure 4 shows a general framework for interactive learning of action verbs. It aims to support a life-long learning cycle for robots, where the robot can continuously (1) engage in collaboration and communication with humans based on its existing knowledge; (2) acquire new verbs by learning from humans and experiencing the change of the world (i.e., grounded verb semantics as in this work); and (3) learn how to interact (i.e., update interaction policies). The lifelong learning cycle is composed of a sequence of interactive learning episodes (Episode 1, 2, ...) where each episode consists of either an execution phase or a learning phase or both.

The execution phase starts with a human request for action (e.g., boil the water). According to its interaction policy, the robotic agent may choose to ask one or more questions (i.e., Q_i^+) and wait for the human answers (i.e., A_i^+), or select a hypothesis from its existing knowledge base to execute the command (i.e., Execute). With the human feedback on the execution, the robot can update its interaction policy and existing knowledge.

In the learning phase, the robot can initiate the learning by requesting a demonstration from the human. After the human performs the task,


the robotic agent can either choose to update its knowledge if it feels confident, or it can choose to ask the human one or more questions before updating its knowledge.

4.2 Examples of Interactive Learning

Table 1 illustrates the differences between the previous approach, which acquires verb models based solely on demonstrations, and our current work, which acquires models based on interactive learning. As shown in Table 1, under the demonstration setting, humans only provide a demonstration of primitive actions and there is no interactive question answering. In the interactive learning setting, the robot can proactively choose to ask questions regarding the uncertainties about either the environment (e.g., R1), the goal (e.g., R2), or the demonstrations (e.g., R6). Our hypothesis is that rich interactions based on question answering will allow the robot to learn more reliable models for grounded verb semantics, especially in a noisy environment.

The question then is how to manage such interaction: when to ask, and what questions to ask, to most efficiently acquire reliable models and apply them in execution. Next we describe the application of reinforcement learning to manage interactive question answering for both the execution phase and the learning phase.

4.3 Formulation of Interactive Learning

Markov Decision Processes (MDPs) and the closely related Reinforcement Learning (RL) have been applied to sequential decision-making problems in dynamic domains with uncertainties, e.g., dialogue/interaction management (Singh et al., 2002; Paek and Pieraccini, 2008; Williams and Zweig, 2016), mapping language commands to actions (Branavan et al., 2009), interactive robot learning (Knox and Stone, 2011), and interactive information retrieval (Li et al., 2017). In this work, we formulate the choice of when to ask what questions during interaction as a sequential decision-making problem and apply reinforcement learning to acquire an optimal policy to manage the interaction.

Specifically, each of the execution and learning phases is governed by one policy (i.e., θE and θD), which is updated by the reinforcement learning algorithm. The use of RL is intended to obtain optimal policies that lead to the highest long-term reward by balancing the cost of interaction (e.g., the length of interaction and the difficulty of questions) against the quality of the acquired models. The reinforcement learning formulation for both the execution phase and the learning phase is described below.

State. For the execution phase, each state s_e ∈ S_E is a five-tuple: s_e = <l, e, KB, Grd, Goal>. l is a language command, including a verb and multiple noun phrases extracted by the Stanford parser. For example, the command "Microwave the ramen" is represented as l = microwave(ramen). The environment e is a probabilistic representation of the currently perceived physical world, consisting of a set of grounded fluents and the confidence of perceiving each fluent (an example is shown in Figure 3). KB stands for the existing knowledge of verb models. Grd accounts for the agent's current belief of object grounding: the probability of each noun in l being grounded to different objects. Goal represents the agent's belief of different goal state hypotheses for the current command. Within one interaction episode, the command l and the knowledge KB stay the same, while e, Grd, and Goal may change due to interactive question answering and robot actions. In the execution phase, Grd and Goal are initialized with the existing knowledge of learned verb models. For the learning phase, a state s_d ∈ S_D is a four-tuple: s_d = <l, e_start, e_end, Grd>, where e_start and e_end stand for the environment before and after the demonstration.
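For concreteness, a minimal sketch of the two state tuples is given below; the field names and container types are illustrative assumptions, since the paper only specifies the tuple components.

```python
from dataclasses import dataclass

# Minimal sketch of the two RL states; field names and container types are
# illustrative, the paper only fixes the tuple components.

Fluent = tuple[str, ...]

@dataclass
class ExecutionState:
    """s_e = <l, e, KB, Grd, Goal> for the execution phase."""
    command: str                             # l, e.g. "microwave(ramen)"
    environment: dict[Fluent, float]         # e: grounded fluents with confidences
    knowledge_base: dict                     # KB: learned verb models
    grounding: dict[str, dict[str, float]]   # Grd: P(noun phrase -> object)
    goal_beliefs: dict[frozenset, float]     # Goal: belief over goal hypotheses

@dataclass
class LearningState:
    """s_d = <l, e_start, e_end, Grd> for the learning phase."""
    command: str
    env_before_demo: dict[Fluent, float]     # e_start
    env_after_demo: dict[Fluent, float]      # e_end
    grounding: dict[str, dict[str, float]]   # Grd
```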

Action. Motivated by previous studies on how humans ask questions while learning new skills (Cakmak and Thomaz, 2012), the agent's question set includes two categories: yes/no questions and wh- questions. These questions are designed to address ambiguities in noun phrase grounding, uncertain environment sensing, and goal states. They are domain independent in nature. For example, one of the questions is np_grd_ynq(n, o). It is a yes/no question asking whether the noun phrase n refers to an object o (e.g., "I see a silver object, is that the pot?"). Other questions are env_pred_ynq(p) (i.e., whether a fluent p is present in the environment; e.g., "Is the microwave door open?") and goal_pred_ynq(p) (i.e., whether a predicate p should be part of the goal; e.g., "Should the pot be on a pot stand?"). Table 2 lists all the actions available in the execution and learning phases. The select_hypo action (i.e., select a goal hypothesis to execute) is only for the execution phase. Ideally, after asking questions, the agent should be more likely to select a goal hypothesis that best describes the current situation. For the learning phase, the include_fluent(∧p) action forms a goal hypothesis by conjoining a set of fluents p, where each p should have a high probability of being part of the goal.


Action name / Explanation / Question example / Reward

1. np_grd_whq(n): Ask for the grounding of a noun phrase. Example: "Which is the cup, can you show me?" Reward: -6.5
2. np_grd_ynq(n, o): Confirm the grounding of a noun phrase. Example: "I see a silver object, is that the pot?" Reward: -1.0 / -2.0
3. env_pred_ynq(p): Confirm a predicate in the current environment. Example: "Is the microwave door open?" Reward: -1.0 / -2.0
4. goal_pred_ynq(p): Confirm whether a predicate p should hold in the final environment. Example: "Is it true the pot should be on the counter?" Reward: -1.0 / -2.0
5. select_hypo(h): Choose a hypothesis to use as the goal and execute. Reward: 100 / -2.0
6. bulk_np_grd_ynq(n, o): Confirm the grounding of multiple noun phrases. Example: "I think the pot is the red object and milk is in the white box, am I right?" Reward: -3.0 / -6.0
7. pred_change_ynq(p): Ask whether a predicate p has been changed by the action demonstration. Example: "The pot is on a stand after the action, is that correct?" Reward: -1.0 / -2.0
8. include_fluent(∧p): Include ∧p in the goal state representation and update the verb semantic knowledge. Reward: 100 / -2.0

Table 2: The action space for reinforcement learning, where n stands for a noun phrase, o for a physical object, p for a fluent representation of the current state of the world, and h for a goal hypothesis. Actions 1 and 2 are shared by both the execution and learning phases; actions 3, 4, and 5 are for the execution phase; and actions 6, 7, and 8 are only used for the learning phase. The rewards -1.0 / -2.0 are typically used for yes/no questions: when the human answers the question with a "yes", the reward is -1.0, otherwise it is -2.0.

Transition. The transition function takes action a in state s and gives the next state s′ according to human feedback. Note that the command l does not change during interaction, but the agent's beliefs about the environment e, object grounding Grd, and goal hypotheses Goal change according to the questions and human answers. For example, suppose the agent asks whether noun phrase n refers to the object o; if the human confirms it, the probability of n being grounded to o becomes 1.0, otherwise it becomes 0.0.

Reward. Finding a good reward function is a hard problem in reinforcement learning. Our current approach follows the general practice in the spoken dialogue community (Schatzmann et al., 2006; Fang et al., 2014; Su et al., 2016). Robot questions are assigned small immediate costs to favor shorter and more efficient interaction. Furthermore, motivated by how humans ask questions (Cakmak and Thomaz, 2012), yes/no questions are easier for a human to answer than open questions (e.g., wh-questions) and thus are given smaller costs. A large positive reward is given at the end of interaction when the task is completed successfully. The detailed reward assignment for the different actions is shown in Table 2.
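A small, hypothetical illustration of this belief update for a grounding yes/no question is sketched below; zeroing out the competing candidates on a "yes" and renormalizing on a "no" are assumptions beyond what the paper states.

```python
# Hypothetical sketch of the belief update after a grounding yes/no question.
# grounding[n][o] is the agent's current probability that noun phrase n refers
# to object o (the Grd component of the state).

def update_grounding(grounding: dict[str, dict[str, float]],
                     noun: str, obj: str, answer_is_yes: bool) -> None:
    candidates = grounding[noun]
    if answer_is_yes:
        # Human confirmed: noun grounds to obj with certainty.
        for o in candidates:
            candidates[o] = 1.0 if o == obj else 0.0
    else:
        # Human denied: rule out obj and renormalize the remaining candidates.
        candidates[obj] = 0.0
        total = sum(candidates.values())
        if total > 0:
            for o in candidates:
                candidates[o] /= total
```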

Footnote: According to the study in (Cakmak and Thomaz, 2012), the frequency of y/n questions used by humans is about 6.5 times the frequency of open questions (wh- questions), which motivates our assignment of -6.5 to wh- questions.

Footnote: bulk_np_grd_ynq asks about multiple object groundings all at once. This is harder to answer than asking about a single noun phrase; therefore, its cost is assigned three times that of the other yes/no questions.

Algorithm 1: Policy learning. The execution and learning phases share the same learning process, but with different state spaces, action spaces, and feature vectors φ. The end environment e_end is only available to the learning phase.

Input: e, l (, e_end); feature function φ; old policy θ (i.e., a weight vector); verb goal-state hypotheses H
Initialize: state s initialized with e, l (, e_end); first action a ∼ P(a|s; θ) with ε-greedy

while s is not terminal do
    Take action a, receive reward r
    s′ = T(s, a)
    Choose a′ ∼ P(a′|s′; θ) with ε-greedy
    δ ← r + γ · θᵀ · φ(s′, a′) − θᵀ · φ(s, a)
    θ ← θ + δ · η · φ(s, a)
end
if s terminates with positive feedback then
    Update H
end

Output: Updated H and θ.

Learning. The SARSA algorithm with linear function approximation is used to update the policies θE and θD (Sutton and Barto, 1998). Specifically, the objective of training is to learn an optimal value function Q(s, a), i.e., the expected cumulative reward of taking action a in state s. This value function is approximated by a linear function Q(s, a) = θᵀ · φ(s, a), where φ(s, a) is a feature vector and θ is a weight vector updated during training. Details of the algorithm are shown in Algorithm 1. During testing, the agent takes the action a that maximizes the Q value at a state s.
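The following is a minimal sketch of this SARSA update with linear function approximation (cf. Algorithm 1), using the hyperparameter values reported in Section 5.1; the feature function `phi` and the `step` interaction interface are assumed placeholders, not part of the paper.

```python
import random
import numpy as np

# Minimal sketch of the SARSA update with linear function approximation.
# `phi(s, a)` is assumed to return a fixed-length numpy feature vector, and
# `step(s, a)` to perform the action and return (reward, next_state, done).

GAMMA, EPSILON, LEARNING_RATE = 0.99, 0.1, 0.01   # values from Section 5.1

def q_value(theta: np.ndarray, phi_sa: np.ndarray) -> float:
    """Q(s, a) = theta^T . phi(s, a)."""
    return float(theta @ phi_sa)

def epsilon_greedy(theta, state, actions, phi):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(theta, phi(state, a)))

def sarsa_episode(theta, initial_state, actions, phi, step):
    """Run one interaction episode and update theta in place."""
    s = initial_state
    a = epsilon_greedy(theta, s, actions, phi)
    while True:
        r, s_next, done = step(s, a)
        if done:
            target = r  # no bootstrapping at a terminal state
        else:
            a_next = epsilon_greedy(theta, s_next, actions, phi)
            target = r + GAMMA * q_value(theta, phi(s_next, a_next))
        delta = target - q_value(theta, phi(s, a))
        theta += LEARNING_RATE * delta * phi(s, a)
        if done:
            return theta
        s, a = s_next, a_next
```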


Features shared by both phases:
• If a is an np_grd_whq(n) action.
• The entropy of the candidate groundings of n.
• If n has more than 4 grounding candidates.
• If a is an np_grd_ynq(n, o) action.
• The probability of n being grounded to o.

Additional features specific to the Execution phase:
• If a is a select_hypo(h) action.
• The probability of hypothesis h not being satisfied in the current environment.
• Similarity between the noun phrases used by command l and the commands from previous experiences.

Additional features specific to the Learning phase:
• If a is a pred_change_ynq(p) action.
• The probability of p having been changed by the demonstration.

Table 3: Example features used by the two phases. a stands for an action; other notations are the same as in Table 2. The "If" features are binary, and the other features are real-valued.

Feature. Example features used by the two phases are listed in Table 3. These features intend to capture different dimensions of information, such as the specific type of question, how well noun phrases are grounded to the environment, uncertainties of the environment, and consistency between a hypothesis and the current environment.
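As an illustration only, the sketch below shows the general shape of φ(s, a) with two of the features from Table 3; the exact feature definitions and the attribute names on `state` and `action` are assumptions.

```python
import math

# Illustration only: the general shape of phi(s, a) with two of the features
# from Table 3. Exact feature definitions are not given in the paper.

def grounding_entropy(candidates: dict[str, float]) -> float:
    """Entropy of the candidate grounding distribution for a noun phrase."""
    return -sum(p * math.log(p) for p in candidates.values() if p > 0)

def phi(state, action) -> list[float]:
    features = [0.0] * 4
    if action.name == "np_grd_whq":
        candidates = state.grounding[action.noun]
        features[0] = 1.0                               # binary "If" feature
        features[1] = grounding_entropy(candidates)     # real-valued feature
        features[2] = 1.0 if len(candidates) > 4 else 0.0
    elif action.name == "np_grd_ynq":
        features[3] = state.grounding[action.noun].get(action.obj, 0.0)
    return features
```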

5 Evaluation

5.1 Experiment Setup

Dataset. To evaluate our approach, we utilized the benchmark made available by (Misra et al., 2015). Individual language commands and the corresponding action sequences are extracted similarly to (She and Chai, 2016). This dataset includes common tasks in the kitchen and living room domains, where each data instance comes with a language command (e.g., "boil the water", "throw the beer into the trashcan") and the corresponding sequence of primitive actions. In total, there are 979 instances, including 75 different verbs and 215 different noun phrases. The length of the primitive action sequences ranges from 1 to 51, with an average of 4.82 (+/- 4.8). We divided the dataset into three groups: (1) 200 data instances were used by reinforcement learning to acquire optimal interaction policies; (2) 600 data instances were used by different approaches (i.e., previous approaches and our interactive learning approach) to acquire grounded verb semantics models; and (3) 179 data instances were used as testing data to evaluate the learned verb models. The performance of applying the learned models to execute actions for the testing data is reported.

To learn interaction policies, a simulated human model is created from the dataset (Schatzmann et al., 2006) to continuously interact with the robot learner. This simulated user can answer the robot's different types of questions and make decisions on whether the robot's execution is correct. During policy learning, one data instance can be used multiple times; at each time, the interaction sequence is different due to exploitation and exploration in RL when selecting the next action. The RL discount factor γ is set to 0.99, the ε in ε-greedy is 0.1, and the learning rate is 0.01.

Footnote: In our future work, interaction with real humans will be conducted through Amazon Mechanical Turk, and the policies acquired with a simulated user in this work will be used as initial policies.
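A hypothetical sketch of such a simulated user is given below; the class and field names are assumptions, and the actual simulator presumably covers all of the question types in Table 2.

```python
# Hypothetical sketch of a simulated user built from the annotated data: it
# answers the robot's yes/no questions against the ground truth and judges
# whether an execution achieved the ground-truth state change.

class SimulatedUser:
    def __init__(self, gold_grounding: dict[str, str], gold_goal: set, gold_env: set):
        self.gold_grounding = gold_grounding   # noun phrase -> correct object
        self.gold_goal = gold_goal             # ground-truth goal fluents
        self.gold_env = gold_env               # ground-truth environment fluents

    def answer_np_grd_ynq(self, noun: str, obj: str) -> bool:
        return self.gold_grounding.get(noun) == obj

    def answer_env_pred_ynq(self, fluent) -> bool:
        return fluent in self.gold_env

    def answer_goal_pred_ynq(self, fluent) -> bool:
        return fluent in self.gold_goal

    def judge_execution(self, achieved_change: set) -> bool:
        return achieved_change == self.gold_goal
```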

Noisy Environment Representation. The original data provided by (Misra et al., 2015) is based on the assumption that environment sensing is perfect and deterministic. To enable an incomplete and noisy environment representation, for each fluent (e.g., grasping(Cup3), near(robot1, Cup3)) in the original data, we independently sampled a confidence value to simulate the likelihood that the particular fluent can be detected correctly from the environment. We applied the following four variations in sampling the confidence values, which correspond to different levels of sensor reliability.
(1) PerfectEnv represents the most reliable sensor. If a fluent is true in the original data, its sampled confidence is 1, and 0 otherwise.
(2) NormStd3 represents a relatively reliable sensor. For each fluent in the original environment, a confidence is sampled according to a normal distribution N(1, 0.3²) restricted to the interval [0, 1]. This distribution has a large probability of sampling a number larger than 0.5, meaning the corresponding fluent is still more likely to be treated as true.
(3) NormStd5 represents a less reliable sensor. The sampling distribution is N(1, 0.5²), which has a larger probability of generating a number smaller than 0.5 compared to NormStd3.
(4) UniEnv represents an unreliable sensor. Each number is sampled from a uniform distribution between 0 and 1. This means the sensor works randomly: a fluent has an equal chance of being treated as true or false no matter what the true environment is.
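A sketch of this confidence sampling is given below; the paper does not specify how the normal samples are restricted to [0, 1], so clipping is assumed here.

```python
import random

# Sketch of how a confidence could be sampled for a fluent that is present in
# the original (deterministic) environment, under the four conditions above.

def sample_confidence(condition: str) -> float:
    if condition == "PerfectEnv":
        return 1.0                                          # true fluents detected with certainty
    if condition == "NormStd3":
        return min(1.0, max(0.0, random.gauss(1.0, 0.3)))   # N(1, 0.3^2), clipped to [0, 1]
    if condition == "NormStd5":
        return min(1.0, max(0.0, random.gauss(1.0, 0.5)))   # N(1, 0.5^2), clipped to [0, 1]
    if condition == "UniEnv":
        return random.uniform(0.0, 1.0)                     # sensor behaves randomly
    raise ValueError(f"unknown condition: {condition}")
```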


Evaluation Metrics. We used the same evaluation metrics as in the previous works (Misra et al., 2015; She and Chai, 2016) to evaluate the performance of applying the learned models to the testing instances for action planning.

• IED: Instruction Editing Distance. This is a number between 0 and 1 measuring the similarity between the predicted action sequence and the ground-truth action sequence. IED equals 1 if the two sequences are exactly the same.

• SJI: State Jaccard Index. This is a number between 0 and 1 measuring the similarity between the predicted and the ground-truth state changes. SJI equals 1 if action planning leads to exactly the same state change as in the ground truth.
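For intuition only, rough sketches of the two metrics are given below, assuming state changes are sets of fluents and action sequences are lists of primitive actions; the exact definitions follow Misra et al. (2015), and the IED similarity here is simply a normalized Levenshtein distance.

```python
# Rough sketches of the two evaluation metrics (approximations for intuition).

def state_jaccard_index(predicted_change: set, gold_change: set) -> float:
    """SJI: Jaccard similarity between predicted and ground-truth state changes."""
    if not predicted_change and not gold_change:
        return 1.0
    return len(predicted_change & gold_change) / len(predicted_change | gold_change)

def instruction_edit_similarity(predicted: list, gold: list) -> float:
    """IED-style similarity: 1 minus a normalized edit distance between sequences."""
    m, n = len(predicted), len(gold)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == gold[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(m, n, 1)
```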

Configurations. To understand the role of interactive learning in model acquisition and action planning, we first compared the interactive learning approach with the previous leading approach (referred to as She16). To further evaluate the interaction policies acquired by reinforcement learning, we also compared the learned policy (i.e., RLPolicy) with the following two baseline policies:

• RandomPolicy, which randomly selects questions to ask during interaction.

• ManualPolicy, which continuously asks for yes/no confirmations (i.e., object grounding questions (GroundQ), environment questions (EnvQ), and goal prediction questions (GoalQ)) until there are no more questions to ask, before making a decision on model acquisition or action execution.

5.2 Results

5.2.1 The Effect of Interactive Learning

Table 4 shows the performance comparison on the testing data between the previous approach She16 and our interactive learning approach, based on environment representations with different levels of noise. The verb models acquired by interactive learning perform consistently better across all four environment conditions.

            She16            RL policy        % improvement
            IED     SJI      IED     SJI      IED       SJI
PerfectEnv  0.430   0.426    0.453   0.468    5.3%*     9.9%*
NormStd3    0.284   0.273    0.420   0.431    47.9%*    57.9%*
NormStd5    0.172   0.168    0.392   0.411    127.9%*   144.6%*
UniEnv      0.168   0.163    0.332   0.347    97.6%*    112.9%*

Table 4: Performance comparison between She16 and our interactive learning based on environment representations with different levels of noise. All the improvements (marked *) are statistically significant (p < 0.01).

Figure 5: Performance (SJI) in action planning on the testing data as a function of the number of data instances used to acquire verb models in the learning phase, comparing the RL policy, Manual policy, Random policy, and She16.

When the environment becomes noisy (i.e., NormStd3, NormStd5, and UniEnv), the performance of She16, which relies only on demonstrations, decreases significantly. While interactive learning improves performance under the perfect environment condition, its effect in noisy environments is more remarkable: it leads to a significant performance gain between 48% and 145%. These results validate our hypothesis that interactive question answering can help to alleviate the problem of uncertainties in environment representation and goal prediction.

Figure 5 shows the performance of the various learned models on the testing data, based on a varying number of training instances and different interaction policies. The interactive learning guided by the policy acquired from RL outperforms the previous approach She16. The RL policy slightly outperforms interactive learning using the manually defined policy (i.e., ManualPolicy). However, as shown in the next section,


                Average number of questions                                      Performance
                Learning Phase                Execution Phase
                GroundQ   EnvQ     TotalQ     GroundQ   EnvQ     GoalQ    TotalQ    IED       SJI
RLPolicy        2.130*    2.615*   4.746*     0.383*    0.650*   2.626    3.665*    0.420     0.430*
                +/-0.231  +/-0.317 +/-0.307   +/-0.137  +/-0.366 +/-0.331 +/-0.469  +/-0.015  +/-0.018
ManualPolicy    2.495     5.338    7.833      1.236     3.202    2.353    6.792     0.406     0.404
                +/-0.025  +/-0.008 +/-0.025   +/-0.002  +/-0.012 +/-0.023 +/-0.025  +/-0.002  +/-0.004
RandomPolicy    0.545     0.368    0.913      0.678     0.081    0.151    0.909     0.114     0.113
                +/-0.016  +/-0.033 +/-0.040   +/-0.055  +/-0.030 +/-0.024 +/-0.018  +/-0.025  +/-0.029

Table 5: Comparison between different policies, including the average number (and standard deviation) of different types of questions asked during the execution phase and the learning phase respectively, and the performance on action planning for the testing data. The results are based on the noisy environment sampled by NormStd3. * indicates a statistically significant difference (p < 0.05) comparing RLPolicy with ManualPolicy.

the ManualPolicy results in much longer interactions (i.e., more questions) than the RL-acquired policy.

5.2.2 Comparison of Interaction Policies

Table 5 compares the performance of different interaction policies. It shows the average number of questions asked under different policies. It is not surprising that the RandomPolicy has the worst performance. The performance of the ManualPolicy is similar to that of the RLPolicy. However, the average interaction length of ManualPolicy is 6.792, which is much longer than that of the RLPolicy (which is 3.127). These results further demonstrate that the policy learned from RL enables efficient interactions and the acquisition of more reliable verb models.

6 Conclusion

Robots live in a noisy environment. Due to the limitations of their external sensors, their representations of the shared environment can be error prone and full of uncertainties. As shown in previous work (Mourao et al., 2012), learning action models from noisy and incomplete observations of the world is extremely challenging. The same problem applies to the acquisition of verb semantics that are grounded to the perceived world. To address this problem, this paper presents an interactive learning approach which aims to handle uncertainties of the environment, as well as incompleteness and conflicts in state representation, by asking human partners intelligent questions. The interaction strategies are learned through reinforcement learning. Our empirical results have shown a significant improvement in model acquisition and action prediction. When applying the learned models in new situations, the models acquired through interactive learning lead to an over 140% performance gain in noisy environments.

The current investigation also has several limitations. As in previous works, we assume the world can be described by a closed set of predicates. This is a significant simplification of the physical world. One of the important questions to address in the future is how to learn new predicates through interaction with humans. Another limitation is that the current utility function is learned based on a set of pre-identified features. Future work can explore deep neural networks to alleviate feature engineering.

As cognitive robots start to enter our daily lives, data-driven approaches to learning may not be possible in new situations. Human partners who work side-by-side with these cognitive robots are great resources that the robots can directly learn from. Recent years have seen an increasing amount of work on task learning from human partners (Saunders et al., 2006; Chernova and Veloso, 2008; Cantrell et al., 2012; Mohan et al., 2013; Asada et al., 2009; Mohseni-Kabir et al., 2015; Nejati et al., 2006; Liu et al., 2016). Our future work will incorporate interactive learning of verb semantics with task learning to enable autonomy that can learn by communicating with humans.

Acknowledgments

This work was supported by the National Science Foundation (IIS-1208390 and IIS-1617682) and the DARPA SIMPLEX program under a subcontract from UCLA (N66001-15-C-4035). The authors would like to thank Dipendra K. Misra and colleagues for providing the evaluation data, and the anonymous reviewers for valuable comments.


References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics 1(1):49-62.

Minoru Asada, Koh Hosoda, Yasuo Kuniyoshi, Hiroshi Ishiguro, Toshio Inui, Yuichiro Yoshikawa, Masaki Ogino, and Chisato Yoshida. 2009. Cognitive developmental robotics: A survey. IEEE Transactions on Autonomous Mental Development 1(1):12-34.

S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '09, pages 82-90.

John D. Bransford, Ann L. Brown, and Rodney R. Cocking. 2000. How People Learn: Brain, Mind, Experience, and School: Expanded Edition. National Academy Press, Washington, DC.

Maya Cakmak and Andrea L. Thomaz. 2012. Designing robot learners that ask good questions. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, USA, HRI '12, pages 17-24.

R. Cantrell, K. Talamadupula, P. Schermerhorn, J. Benton, S. Kambhampati, and M. Scheutz. 2012. Tell me when and why to do it! Run-time planner model updates via natural language instruction. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, USA, HRI '12, pages 471-478.

Sonia Chernova and Manuela Veloso. 2008. Teaching multi-robot coordination using demonstration of communication and state sharing. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, pages 1183-1186.

Rui Fang, Malcolm Doering, and Joyce Y. Chai. 2014. Collaborative models for referring expression generation in situated dialogue. In Proceedings of the 28th AAAI Conference on Artificial Intelligence. AAAI Press, AAAI'14, pages 1544-1550.

Qiaozi Gao, Malcolm Doering, Shaohua Yang, and Joyce Y. Chai. 2016. Physical causality of action verbs in grounded language understanding. In ACL (1). The Association for Computer Linguistics.

P. Gorniak and D. Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science 31(2):197-231.

Joohyun Kim and Raymond J. Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL '12). Jeju Island, Korea, pages 433-444.

W. Bradley Knox and Peter Stone. 2011. Understanding human teaching modalities in reinforcement learning environments: A preliminary report. In IJCAI 2011 Workshop on Agents Learning Interactively from Human Teachers (ALIHT).

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2017. Learning through dialogue interactions. In ICLR.

Changsong Liu and Joyce Y. Chai. 2015. Learning to mediate perceptual differences in situated human-robot dialogue. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press, AAAI'15, pages 2288-2294.

Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. 2014. Probabilistic labeling for efficient referential grounding based on collaborative discourse. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 13-18.

Changsong Liu, Shaohua Yang, Sari Saba-Sadiya, Nishant Shukla, Yunzhong He, Song-Chun Zhu, and Joyce Y. Chai. 2016. Jointly learning grounded task structures from language instruction and visual demonstration. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1482-1492.

Cynthia Matuszek, Nicholas Fitzgerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012a. A joint model of language and perception for grounded attribute learning. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12). ACM, New York, NY, USA, pages 1671-1678.

Cynthia Matuszek, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. 2012b. Learning to parse natural language commands to a robot control system. In Jaydev P. Desai, Gregory Dudek, Oussama Khatib, and Vijay Kumar, editors, ISER. Springer, volume 88 of Springer Tracts in Advanced Robotics, pages 403-415.

Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2014. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. In Proceedings of Robotics: Science and Systems (RSS), Berkeley, USA.

Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 992-1002.

Shiwali Mohan, James Kirk, and John Laird. 2013. A computational model for situated task learning with interactive instruction. In Proceedings of ICCM 2013 - 12th International Conference on Cognitive Modeling.

Anahita Mohseni-Kabir, Charles Rich, Sonia Chernova, Candace L. Sidner, and Daniel Miller. 2015. Interactive hierarchical task learning from a single demonstration. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction. ACM, HRI '15, pages 205-212.

Kira Mourao, Luke S. Zettlemoyer, Ronald P. A. Petrick, and Mark Steedman. 2012. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. Catalina Island, CA, USA, pages 614-623.

Negin Nejati, Pat Langley, and Tolga Konik. 2006. Learning hierarchical task networks by observation. In Proceedings of the 23rd International Conference on Machine Learning. ACM, pages 665-672.

Tim Paek and Roberto Pieraccini. 2008. Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication 50(8-9):716-729.

Joe Saunders, Chrystopher L. Nehaniv, and Kerstin Dautenhahn. 2006. Teaching robots by moulding behavior and scaffolding the environment. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction. ACM, pages 118-125.

Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review 21(2):97-126.

Lanbo She and Joyce Y. Chai. 2016. Incremental acquisition of verb hypothesis space towards physical world interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Y. Chai, and Ning Xi. 2014. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 18-20 June 2014, Philadelphia, PA, USA, pages 89-97.

Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. 2002. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research 16:105-133.

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2431-2441.

Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning 94(2):151-167.

Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond J. Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing "I Spy". In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16). New York City, pages 3477-3483.

Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI). pages 1923-1929.

Jason D. Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.

Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y. Chai. 2016. Grounded semantic role labeling. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 149-159.