
Learning Rewards from Linguistic Feedback

Theodore R. Sumers,1 Mark K. Ho,2 Robert D. Hawkins,2 Karthik Narasimhan,1 Thomas L. Griffiths1,2

1Department of Computer Science, Princeton University, Princeton, NJ
2Department of Psychology, Princeton University, Princeton, NJ
{sumers, mho, rdhawkins, karthikn, tomg}@princeton.edu

Abstract

We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g. commands). We propose a general framework which does not make this assumption. We decompose linguistic feedback into two components: a grounding to features of a Markov decision process and sentiment about those features. We then perform an analogue of inverse reinforcement learning, regressing the teacher's sentiment on the features to infer their latent reward function. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We use our framework to implement two artificial learners: a simple "literal" model and a "pragmatic" model with additional inductive biases. We baseline these with a neural network trained end-to-end to predict latent rewards. We then repeat our initial experiment pairing human teachers with our models. We find our "literal" and "pragmatic" models successfully learn from live human feedback and offer statistically-significant performance gains over the end-to-end baseline, with the "pragmatic" model approaching human performance on the task. Inspection reveals the end-to-end network learns representations similar to our models, suggesting they reflect emergent properties of the data. Our work thus provides insight into the information structure of naturalistic linguistic feedback as well as methods to leverage it for reinforcement learning.

1 Introduction

For autonomous agents to be widely usable, they must be responsive to human users' natural modes of communication. For instance, imagine designing a household cleaning robot. Certain aspects of its behavior will be pre-programmed (e.g. how to use an outlet to recharge itself), while other aspects will need to be learned (e.g. if a user wants it to charge in the living room or the kitchen). But how should the robot infer what a person wants?

Here, we focus on unconstrained linguistic feedback as a learning signal for autonomous agents. Humans use natural language flexibly and intuitively to express their desires via commands, counterfactuals, encouragement, explicit preferences, or other forms of feedback. For example, if a human encounters the robot charging in the living room as desired, they may provide feedback such as "Great job." If it is found charging in the kitchen, the human may respond with "You should have gone to the living room" or "I don't like seeing you in the kitchen." Our approach differs from previous methods for interactive learning that involve non-linguistic demonstrations (Abbeel and Ng 2004; Argall et al. 2009; Dragan, Lee, and Srinivasa 2013; Ho et al. 2016; Hadfield-Menell et al. 2016), rewards/punishments (Knox and Stone 2009; MacGlashan et al. 2017; Christiano et al. 2017), or language commands (Tellex et al. 2011; Wang, Liang, and Manning 2016; Tellex et al. 2020) by utilizing the expressive capacity of open-ended naturalistic language. The agent's challenge is to interpret such feedback in the context of its behavior and environment to infer the teacher's preferences, enabling generalization to new tasks and environments.

We formalize this inference as linear regression over features of a Markov decision process (MDP). We decompose feedback into sentiment and features, then regress the sentiment against the features to infer the latent reward function. This framework enables learning from arbitrary language. To provide a concrete implementation, we draw on naturalistic forms of feedback studied in educational research (Shute 2008; Lipnevich and Smith 2009) which relate feedback to elements of the MDP. For example, "Good job" refers to prior behavior, whereas "You should have gone to the living room" refers to an action. Intuitively, the teacher's sentiment about these elements provides indirect evidence about their reward function: positive sentiment about an action implies positive weight on its features. We implement two versions of our model and pair them with human teachers in an interactive experiment. They achieve scores comparable to human learners, with statistically-significant performance improvements over a baseline neural net. We outline related work in Section 2, then introduce our formalization in Section 3. Section 4 describes our task and experimental data and Section 5 details our model implementations. Finally, Section 6 discusses results and Section 7 concludes.

2 Background and Related Work

The work presented here complements existing methods that enable artificial agents to learn from and interact with humans. For example, a large literature studies how agents can learn latent preferences from non-linguistic human feedback. Algorithms such as TAMER (Knox and Stone 2009) and COACH (MacGlashan et al. 2017) transform human-generated rewards and punishments into quantities that reinforcement learning (RL) algorithms can reason with.


[Figure 1: two panels (A, B); panel labels include the episode stages (action, feedback, update), the learner and teacher roles, an example utterance ("pink is good, but go for the squares next time!"), and the three feedback forms (imperative, evaluative, descriptive).]

Figure 1: A: Episodes involve three stages. B: We factor utterances into sentiment and features to infer latent weights w (solid lines, Section 3.2). This allows us to integrate multiple forms of feedback (dashed lines, Section 3.3).

Preference elicitation, which provides a user with binary choices between trajectories, is a similarly intuitive training method (Christiano et al. 2017). Finally, demonstration-based approaches use a set of expert trajectories to learn a policy—as in imitation learning (Ross and Bagnell 2010)—or infer an underlying reward function—as in inverse reinforcement learning (IRL) (Abbeel and Ng 2004). This idea has been extended to settings in which agents are provided with intentionally informative demonstrations (Ho et al. 2016) or themselves act informatively (Dragan, Lee, and Srinivasa 2013), such as the more general "cooperative IRL" setting (Hadfield-Menell et al. 2016).

Another body of research has focused on linguistic human-agent interaction. Dialogue systems (Artzi and Zettlemoyer 2011; Li et al. 2016) learn to interpret user queries in the context of the ongoing interaction, while robots and assistants (Thomason et al. 2015; Wang et al. 2019; Thomason et al. 2020; Szlam et al. 2019) ground language in their physical surroundings (for a review of language in robotics, see Tellex et al. (2020)). A parallel line of work in machine learning uses language to improve sample efficiency: to shape rewards (Maclin and Shavlik 1994; Kuhlmann et al. 2004), often via subgoals (Kaplan, Sauer, and Sosa 2017; Williams et al. 2018; Chevalier-Boisvert et al. 2018; Goyal, Niekum, and Mooney 2019; Bahdanau et al. 2019; Zhou and Small 2020). While these approaches allow for multiple interactions, they generally interpret and execute independent declarative statements: queries, commands, or (sub)goals. Perhaps the most similar works to ours are (MacGlashan et al. 2015; Fu et al. 2019), which both perform IRL on linguistic input in the form of natural language commands. Our work differs in two key ways: first, we use unconstrained and unfiltered natural language; second, we seek to learn general latent preferences rather than infer command-contextual rewards. A somewhat smaller body of work investigates such open-ended language: to correct captioning models (Ling and Fidler 2017), capture environmental characteristics (Narasimhan, Barzilay, and Jaakkola 2018), or improve hindsight replay (Cideron et al. 2019). For a recent review of language in RL, see Luketina et al. (2019).

To fully leverage linguistic feedback, we aim to recover the speaker's general preferences from naturalistic interactions. Unlike prior approaches, we do not solicit a specific form of language (i.e. commands, corrections, or descriptions). We instead elicit naturalistic human teaching and explore the inferential machinery required to extract information from it. This follows a long history of studying emergent language in other domains including action coordination (Djalali et al. 2011; Djalali, Lauer, and Potts 2012; Potts 2012; Ilinykh, Zarrieß, and Schlangen 2019; Suhr et al. 2019; Allison, Luger, and Hofmann 2018), reference pragmatics (He et al. 2017; Udagawa and Aizawa 2019), and navigation (Thomason et al. 2019).

3 Learning Rewards from Language

In this section, we formalize our approach. We decompose utterances into sentiment and features, then use linear regression to infer the teacher's rewards over those features. This allows us to perform an analogue of IRL (Abbeel and Ng 2004; Jeon, Milli, and Dragan 2020) on unconstrained linguistic feedback. To extract features from utterances, we ground them into the learner's context (Harnad 1990; Mooney 2008), an approach inspired by educational research (Shute 2008; Lipnevich and Smith 2009).

3.1 Setup

We begin by defining a learner agent whose interactions with the environment are defined by a Markov decision process (MDP) (Puterman 1994). Formally, a finite-horizon MDP M = ⟨S, A, H, T, R⟩ is a set of states S, a set of actions A, a horizon H ∈ N, a probabilistic transition function T : S × A → Δ(S), and a reward function R : S × A → R. Given an MDP, a policy is a mapping from states to actions, π : S → A. An optimal policy, π*, is one that maximizes the future expected reward (value) from a state, V^h(s) = max_a [R(s, a) + Σ_{s'} T(s' | s, a) V^{h-1}(s')], where V^0(s) = max_a R(s, a). States and actions are characterized by features φ, where φ : S × A → {0, 1}^K is an indicator function representing whether a feature is present for a particular action a in state s. We denote a state-action trajectory by τ = ⟨s_0, a_0, ..., s_T, a_T⟩. Finally, we summarize arbitrary sets of state-action tuples, including trajectories, by their feature counts: n_φ({⟨s, a⟩}) = Σ_{(s, a)} φ(s, a).
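For illustration, a minimal Python sketch of the feature-count summary n_φ over a set of state-action tuples; the concrete feature names and action strings here are hypothetical, not taken from our implementation:

```python
from typing import Callable, Dict, List, Tuple

State, Action = str, str  # illustrative concrete types

def feature_counts(
    pairs: List[Tuple[State, Action]],
    phi: Callable[[State, Action], Dict[str, int]],
) -> Dict[str, int]:
    """Sum the binary feature indicators phi(s, a) over a set of state-action tuples."""
    counts: Dict[str, int] = {}
    for s, a in pairs:
        for name, present in phi(s, a).items():
            counts[name] = counts.get(name, 0) + present
    return counts

# Example: phi marks which object (if any) an action collects.
phi = lambda s, a: {"PinkSquare": int(a == "collect_pink_square"),
                    "YellowTriangle": int(a == "collect_yellow_triangle")}
trajectory = [("s0", "collect_pink_square"), ("s1", "collect_pink_square")]
print(feature_counts(trajectory, phi))  # {'PinkSquare': 2, 'YellowTriangle': 0}
```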


3.2 Interactive Learning from Language

We consider a learning setting where the reward function is hidden from the learner agent but known to a teacher agent who is allowed to send natural-language messages u (see Fig. 1A). We formulate the online learning task as Bayesian inference over possible rewards: conditioning on the teacher's language and recursively updating a belief state. Formally, we assume that the teacher's reward function is parameterized by a latent variable w ∈ R^K representing the rewards associated with features φ:

R(s, a) = w^\top \phi(s, a). \qquad (1)

The learner is attempting to recover these weights from the teacher's utterances, calculating P(w | u).

Learning unfolds over a series of interactive episodes (Fig. 1A). At the start of episode i, the learner has a belief distribution over the teacher's reward weights, P(w^i), which it uses to identify its policy. First, the learner has an opportunity to act in the world given this policy, sampling a trajectory τ^i. Second, they receive feedback in the form of a natural language utterance u^i from the teacher (and optionally a reward signal from the environment). Third, the learner uses the feedback to update its beliefs about the reward, P(w^{i+1} | u^i, τ^i), which is then used for the next episode.
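A minimal sketch of this three-stage loop (act, feedback, update); the three callables are placeholders for the policy, teacher, and belief-update components developed in Sections 3.3 and 5, not our actual implementation:

```python
from typing import Any, Callable

def run_episodes(
    belief: Any,
    act: Callable[[Any], Any],                   # belief -> trajectory tau^i
    get_feedback: Callable[[Any], str],          # tau^i -> utterance u^i
    update: Callable[[Any, str, Any], Any],      # (belief, u^i, tau^i) -> new belief
    n_episodes: int = 10,
) -> Any:
    for _ in range(n_episodes):
        tau = act(belief)                 # 1. learner acts under its current policy
        u = get_feedback(tau)             # 2. teacher responds with free-form language
        belief = update(belief, u, tau)   # 3. learner updates P(w | u, tau)
    return belief
```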

We now describe our general formal approach for inferring latent rewards from feedback. Similar to IRL methods (Ramachandran and Amir 2007), we formalize reward learning from language as Bayesian linear regression. We assume the learner extracts the sentiment ζ from the teacher's utterance and models it as:

\zeta \sim \mathcal{N}(f^\top w, \sigma_\zeta^2) \qquad (2)

where f ∈ R^K is a vector describing which features φ the utterance relates to. Each utterance is analyzed for its sentiment ζ and its features f. We use a Gaussian prior: w^i ∼ N(μ^i, Σ^i). On each episode, we perform Bayesian updates (Murphy 2007) to obtain a posterior P(w^{i+1} | u^i, τ^i) = N(μ^{i+1}, Σ^{i+1}).
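For reference, the standard conjugate update for this linear-Gaussian model (see Murphy 2007) can be written as:

```latex
% Posterior after a single observation (\zeta, f), given prior
% w \sim \mathcal{N}(\mu^i, \Sigma^i) and likelihood \zeta \sim \mathcal{N}(f^\top w, \sigma_\zeta^2):
\Sigma^{i+1} = \left( (\Sigma^i)^{-1} + \frac{1}{\sigma_\zeta^2} f f^\top \right)^{-1},
\qquad
\mu^{i+1} = \Sigma^{i+1} \left( (\Sigma^i)^{-1} \mu^i + \frac{1}{\sigma_\zeta^2} f \zeta \right).
```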

Factoring utterances this way decomposes learning into two distinct sub-problems: sentiment analysis and contextualization. The features f contextualize the teacher's sentiment ζ. The regression then sets the teacher's latent preferences w ∈ R^K to "explain" the sentiment. Concretely, if the teacher says "Good job," a learner could infer the teacher has positive weights on the features obtained by its prior trajectory (ζ > 0, f = n_φ(τ^i)).

Sentiment analysis is a well-studied problem (Pang and Lee 2008; Liu and Zhang 2012; Hutto and Gilbert 2014; Kim 2014) and thus ready solutions exist. Contextualization is more challenging. To determine f, we must relate the teacher's utterance to the features of the MDP, φ. Many contextualization procedures are possible; each embodies a theory of the teacher's mind when giving feedback. We now consider one such proposal.

3.3 Integrating Different Forms of Feedback

We outline our approach to relate teachers' feedback to MDP features, demonstrating our framework can accommodate a range of natural human expressiveness. We draw on educational literature (Lipnevich and Smith 2009; Shute 2008), which studies the characteristic forms of feedback given by human teachers. We identify correspondences between these forms and prior work in RL, show each naturally maps to an element of the learner's MDP, and derive features f accordingly.

Evaluative Feedback. Perhaps the simplest feedback an agent can receive is a scalar value in response to their actions (e.g. environmental rewards, praise, criticism). The RL literature has previously elicited such feedback (+1/-1) from human teachers (Thomaz and Breazeal 2008; Knox and Stone 2009; MacGlashan et al. 2017). In our setting, we consider how linguistic utterances can be interpreted as evaluative feedback. For example, when a teacher says "Good job," this clearly relates to the learner's behavior, τ^i. We thus interpret evaluative feedback as informative about the feature counts obtained under that trajectory: f = n_φ(τ^i).¹

Imperative Feedback. Another form of feedback tells the learner what the correct action was. This is the general form of supervised learning. In RL, it includes labeling sets of actions as good or bad (Judah et al. 2010; Christiano et al. 2017), learning from demonstrations (Ross and Bagnell 2010; Abbeel and Ng 2004; Ho et al. 2016), and corrections to dialogue agents (Li et al. 2016; Chen et al. 2017). In our setting, imperative feedback specifies a counterfactual behavior: something the learner should (or should not) have done, e.g. "You should have gone to the living room." Imperative feedback is thus a retrospective version of a command. Mapping this language to features takes two steps: we first identify the set of actions the teacher is referring to, then aggregate their feature counts. Formally, we define a state-action grounding function G(u, S, A) which returns a set of state-action tuples from the full set: G : u, S, A ↦ S̄, Ā, where S̄ ⊆ S, Ā ⊆ A. We take the feature counts of these tuples as our reference vector: f = n_φ(G(u, S, A)).

Descriptive Feedback. Finally, descriptive feedback provides explicit information about how the learner should modify their behavior. Descriptive feedback is the most variable form of human teaching, encompassing explanations and problem-solving strategies. It is generally found to be the most effective (Shute 2008; Lipnevich and Smith 2009; van der Kleij, Feskens, and Eggen 2015; Hattie and Timperley 2007). Supervised and RL approaches have used descriptive language to improve sample efficiency (Srivastava, Labutov, and Mitchell 2017; Hancock et al. 2018; Ling and Fidler 2017) or communicate general task-relevant information (Narasimhan, Barzilay, and Jaakkola 2018). In IRL, descriptive feedback explains the underlying structure of the teacher's preferences and thus relates directly to features φ.² If the human says "I don't like seeing you in the kitchen," the robot should assign a negative weight to states and actions where it and the human are both in the kitchen.

¹An example alternative theory of mind could replace n_φ(τ^i) with an inverse choice model (McFadden 1974; Train 2003). This would posit the teacher giving feedback on the learner's implied, latent preferences, rather than their explicit, observed actions.

²In problem settings beyond IRL, such feedback may relate to the transition function T : S × A; we view such extensions, including unifying our work with (Narasimhan, Barzilay, and Jaakkola 2018), as exciting future directions.


Figure 2: A: The learner collected objects with different rewards (top) masked with shapes and colors (bottom). The teacher could see both views. B: Example reward function used to mask objects (here, objects worth 1-3 reward are rendered as pink triangles; the teacher thus "prefers" pink squares, worth 8-10). C: Pairs played 10 episodes, each on a new level. D: Feedback shifted from descriptive to evaluative as learners improved. Learners scored poorly on level 6, reversing this trend.

Formally, we define an indicator function over features designating whether or not that feature is referenced in the utterance: I : u, φ → {0, 1}^K. We then set f = I(u, φ).

Prior RL algorithms generally operate on one of these forms. Interactions are thus limited: the algorithm solicits feedback of a particular type. Our framework allows us to unify them. We define a grounding function f_G : u ↦ {Imperative, Evaluative, Descriptive} that identifies the form, then set f accordingly:

f \in \mathbb{R}^K =
\begin{cases}
n_\phi(\tau^i) & \text{if } f_G(u) = \text{Evaluative} \\
n_\phi(G(u, S, A)) & \text{if } f_G(u) = \text{Imperative} \\
I(u, \phi) & \text{if } f_G(u) = \text{Descriptive}
\end{cases}
\qquad (3)

This is illustrated in Fig. 1. Intuitively, f_G grounds feedback into the learner's context, formalized as an element of the MDP. Features defined on that element then relate the feedback to the teacher's rewards. Table 1 shows examples of this decomposition for various forms. In Section 4, we elicit and analyze naturalistic human-human teaching, observing these three forms. In Section 5, we implement a pair of agents embodying this contextualization procedure. Finally, we train an end-to-end neural network and note that it learns to interpret qualitatively different forms of feedback.
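A minimal sketch of this dispatch, with the form classifier and the two grounding functions passed in as placeholders (Section 5.1 describes our concrete implementations of each):

```python
from typing import Callable, Dict

def contextualize(
    utterance: str,
    trajectory_counts: Dict[str, float],                 # n_phi(tau^i)
    classify_form: Callable[[str], str],                 # f_G: u -> form label
    ground_actions: Callable[[str], Dict[str, float]],   # n_phi(G(u, S, A))
    ground_features: Callable[[str], Dict[str, float]],  # I(u, phi)
) -> Dict[str, float]:
    """Return the feature vector f for an utterance, following Eq. 3."""
    form = classify_form(utterance)
    if form == "evaluative":
        return trajectory_counts          # sentiment refers to the prior behavior
    if form == "imperative":
        return ground_actions(utterance)  # sentiment refers to the referenced actions
    return ground_features(utterance)     # descriptive: sentiment refers to features
```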

4 Human-Human Instruction Dataset

To study human linguistic feedback and generate a dataset to evaluate our method, we designed a two-player collaborative game. One player (the learner) used a robot to collect a variety of colored shapes. Each yielded a reward, which the learner could not see. The second player (the teacher) watched the learner and could see the objects' rewards (Fig. 2A). Teacher-learner pairs engaged in 10 interactive episodes (Fig. 1A, Fig. 2C). We describe the experiment below.

4.1 Experiment and Gameplay

We recruited 208 participants from Amazon Mechanical Turk using psiTurk (Gureckis et al. 2016). Participants were paid $1.50 and a score-based bonus up to $1.00. Teachers and learners received the same bonus, based on the learner's score. The full experiment consisted of instructions and a practice level, followed by 10 levels of gameplay. Each level contained a different set of 20 objects. We generated 110 such levels, using 10 for the experiment and 100 for model evaluation (Section 6). Collecting each object yields a reward between -10 and 10. Objects were distributed to each of the four corners (Fig. 2C). In each episode, the learner had 8 seconds to act (move around and collect objects), then the teacher had unlimited time to provide feedback (send chat messages). Both players were shown the score and running bonus during the feedback phase. This leaked information about the reward function to the learner, but we found it was important to encourage active participation. The primary disadvantage is that the human baseline for human-model comparisons benefits from additional information not seen by our models.

4.2 Task MDP and Rewards

Human control was continuous: the learner used the arrow keys to steer the robot. However, the only rewarding actions were collecting objects and there was no discounting. As a result, the learner's score was the sum of the rewards of the collected objects. Due to object layout and the short time horizon, learners only had time to reach one corner. Each corner had 5 objects, so there were 124 possible object combinations per level.³ We refer to these combinations as trajectories τ, and formalize the task as choosing one. Concretely, the learner samples its beliefs w ∼ P(w^i), then chooses the optimal trajectory: π := argmax_τ V^τ = w^⊤ n_φ(τ). To induce teacher preferences, we assigned each teacher a reward function which masked objects with shapes and colors.
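A minimal sketch of this trajectory choice, assuming the belief is the Gaussian posterior over w maintained by our models and that the feasible trajectories and their feature counts have already been enumerated:

```python
import numpy as np

def choose_trajectory(
    mu: np.ndarray,      # posterior mean over w, shape (K,)
    sigma: np.ndarray,   # posterior covariance over w, shape (K, K)
    candidates: dict,    # {trajectory_id: feature_count_vector of shape (K,)}
    rng: np.random.Generator,
):
    """Sample w from the current belief and pick the highest-value trajectory."""
    w = rng.multivariate_normal(mu, sigma)            # w ~ P(w^i)
    values = {tau: float(w @ n_phi) for tau, n_phi in candidates.items()}
    return max(values, key=values.get)                # argmax_tau w^T n_phi(tau)
```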

³Choice of corner, then up to 5 objects: 4 × ((5 choose 1) + (5 choose 2) + (5 choose 3) + (5 choose 4) + (5 choose 5)) = 124.


Utterance                                 | Feedback Form | Grounding (f_G)  | Features (f)          | Sentiment (ζ)
"Keep it up excellent"                    | Evaluative    | n_φ(τ^i)         | Behavior-dependent    | +17
"Not a good move"                         | Evaluative    | n_φ(τ^i)         | Behavior-dependent    | -10
"Top left would have been better"         | Imperative    | n_φ(G(u, S, A))  | Environment-dependent | +17
"The light-blue squares are high valued"  | Descriptive   | I(u, φ)          | φ_BlueSquare          | +13
"I think Yellow is bad"                   | Descriptive   | I(u, φ)          | φ_Yellow              | -16

Table 1: Example feedback from our experiment with feature / sentiment decomposition.

Thus the distribution of actions and rewards on each level was the same for all players, but the objects were displayed differently depending on the assigned reward function. Our reward functions combined two independent perceptual dimensions, with color (pink, blue, or yellow) encoding sign and shape (circles, squares, or triangles) encoding magnitude (Fig. 2). We permuted the shapes and colors to generate 36 different functions.

4.3 Human-Human Results and Language

Our 104 human pairs played 10 games each, yielding 1040 total messages (see Table 1 for examples; note empty messages were allowed). We use our feedback classifier (see Section 5.1) to explore the prevalence of various forms of feedback. We observe that humans use all three in a curriculum structure known as "scaffolding" (Shute 2008): teachers initially use descriptive feedback to correct specific behavior, then switch to evaluative as the learners' score improves (Fig. 2D). Teachers could send unlimited messages and thus sometimes used multiple forms. Most episodes contained evaluative (63%) or descriptive (34%) feedback; only 6% used imperative. The infrequency of imperative feedback is reasonable given our task: specifying the optimal trajectory via language is more challenging than describing desirable features. Not all pairs fared well: some learners did not listen, leading teachers to express frustration; some teachers did not understand the task or sent irrelevant messages. We do not filter these out, as they represent naturalistic human language productions under this setting.

5 Agent Models

We now describe our three models. The first (Section 5.1) directly implements our framework. The second (Section 5.2) extends it with pragmatic biases based on Gricean maxims (Grice 1975). Finally, we train a neural net end-to-end from experiment episodes (Section 5.3).

5.1 "Literal" Model

Our literal model directly implements our framework.

Utterance Segmentation and Sentiment. Teachers often sent multiple messages per episode, each potentially containing multiple forms of feedback. To process them, we first split each message on punctuation (!.,;), then treated each split from each message as a separate utterance. To extract sentiment, we used VADER (Hutto and Gilbert 2014), which is optimized for social media. VADER provides an output ζ ∈ [−1, 1], which we scaled by 30 (set via grid search).
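A minimal sketch of this segmentation and scoring step, assuming the vaderSentiment Python package; the example message is hypothetical:

```python
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def utterance_sentiments(message: str, scale: float = 30.0):
    """Split a chat message on punctuation and score each piece with VADER's compound score."""
    pieces = [p.strip() for p in re.split(r"[!.,;]", message) if p.strip()]
    return [(p, scale * analyzer.polarity_scores(p)["compound"]) for p in pieces]

# e.g. utterance_sentiments("great job! pink is bad, go for squares")
```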

Utterance Features. To implement f_G, we labeled 685 utterances from pilot experiments and trained a logistic regression on TF-IDF unigrams and bigrams, achieving a weighted F1 of .86. For evaluative feedback, as described in Eq. 3, we simply used the feature counts from the learner's trajectory f = n_φ(τ^i). Imperative feedback requires a task-specific action-grounding function G(u, S, A). While action grounding in complex domains is an open research area (Tellex et al. 2020), in our experiment all imperative language referenced a cluster of objects (e.g. "Top left would have been better"). We thus used regular expressions to identify references to corners and aggregated features over actions in that corner. For descriptive feedback, we defined a similar indicator function I(u, φ) identifying features in the utterance. We used relatively nameable shapes and colors, so teachers used predictable language to refer to object features ("pink", "magenta", "purple", "violet"...). Again, we used regular expressions to match these synonyms. Finally, we normalized ||f||_1 = 1 so all forms carry equal weight.
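A minimal sketch of the descriptive-feedback indicator I(u, φ); the synonym patterns below are illustrative rather than the exact lists used in our experiment:

```python
import re

# Map each color/shape feature to a regex over common synonyms (illustrative only).
FEATURE_PATTERNS = {
    "Pink": r"pink|magenta|purple|violet",
    "Blue": r"blue|cyan|teal",
    "Yellow": r"yellow|gold",
    "Circle": r"circle|round|dot",
    "Square": r"square|box",
    "Triangle": r"triangle",
}

def descriptive_features(utterance: str) -> dict:
    """Return the indicator vector over features mentioned in the utterance, L1-normalized."""
    u = utterance.lower()
    f = {name: float(bool(re.search(pat, u))) for name, pat in FEATURE_PATTERNS.items()}
    total = sum(f.values())
    if total > 0:  # normalize so ||f||_1 = 1, as described above
        f = {k: v / total for k, v in f.items()}
    return f
```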

Belief Updates. Because players had seen object values in practice levels ranging between -10 and 10, we initialized our belief state as μ^0 = 0, Σ^0 = diag(25). This gives an approximately 95% chance of feature weights falling into that range. For each utterance, we perform Bayesian updates to obtain posteriors P(w^{i+1} | u^i, τ^i) = N(μ^{i+1}, Σ^{i+1}). We use σ_ζ² = 12 for all updates, which we set via grid search.
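A minimal sketch of a single such update, implementing the conjugate formulas from Section 3.2 with NumPy:

```python
import numpy as np

def belief_update(mu, Sigma, f, zeta, noise_var=12.0):
    """Conjugate linear-Gaussian update of (mu, Sigma) given one (feature, sentiment) observation."""
    mu, Sigma, f = np.asarray(mu, float), np.asarray(Sigma, float), np.asarray(f, float)
    prior_precision = np.linalg.inv(Sigma)
    posterior_precision = prior_precision + np.outer(f, f) / noise_var
    Sigma_new = np.linalg.inv(posterior_precision)
    mu_new = Sigma_new @ (prior_precision @ mu + f * zeta / noise_var)
    return mu_new, Sigma_new

# Prior in our setting: mu0 = zeros(K), Sigma0 = 25 * I (weights roughly in [-10, 10]).
```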

5.2 "Pragmatic" Model

We augment the "literal" model with two biases based on pragmatic principles (Grice 1975). While pragmatics are often derived as a result of recursive reasoning (Goodman and Frank 2016), we opt for a simpler heuristic approach.

Sentiment Pragmatics. The Gricean "maxim of quantity" states that speakers bias towards parsimony. Empirically, teachers often referenced a feature or an action without an explicit sentiment. Utterances such as "top left" or "pink circles" implied positive sentiment (e.g. "pink circles [are good]"). To account for this, we defaulted to a positive bias (ζ = 15) if the detected sentiment was neutral.

Reference Pragmatics. The Gricean "maxim of relation" posits that speakers provide information that is relevant to the task at hand. We assume utterances describe important features, and thus unmentioned features are not useful for decision making. We implemented this bias by following each Bayesian update with a second, negative update (ζ = −30, set via grid search) to all features not referenced by the original update, gradually decaying weights of unmentioned features.
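A minimal sketch of these two heuristics; update_fn stands in for the conjugate belief update of the literal model, the constants follow the values reported above, and the exact treatment of unmentioned features is one possible reading of the procedure:

```python
import numpy as np

def apply_sentiment_pragmatics(zeta: float, default: float = 15.0) -> float:
    # Maxim of quantity: a bare reference ("pink circles") is read as approval.
    return default if zeta == 0.0 else zeta

def reference_pragmatics_update(mu, Sigma, f, update_fn, decay_sentiment=-30.0):
    # Maxim of relation: after the main update, nudge unmentioned features downward.
    unmentioned = (np.asarray(f) == 0).astype(float)
    if unmentioned.sum() > 0:
        unmentioned = unmentioned / unmentioned.sum()   # keep ||f||_1 = 1
        mu, Sigma = update_fn(mu, Sigma, unmentioned, decay_sentiment)
    return mu, Sigma
```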


Figure 3: Left: learning within our experiment structure. We plot averaged normalized score over the 10 learning episodes; bars indicate 1 SE (68% CI). Right: learning with specific feedback types. We plot averaged normalized score on 100 test levels after each episode.

5.3 End-to-end Inference Network

To baseline our structured models, we train an inference network end-to-end to predict the latent reward function from episodes in our experiment. Conceptually, this is akin to "factory-training" a housecleaning robot, enabling it to subsequently adapt to its owners' particular preferences.

Model Architecture. We use a feed-forward architecture. We tokenize the utterance and generate a small embedding space (D=30), representing phrases as a mean bag-of-words (MBOW) across tokens. We represent the trajectory with its feature counts, n_φ(τ). We concatenate the token embeddings with the feature counts, use a single fully-connected 128-width hidden layer with ReLU activations, then use a linear layer to map down to a 9-dimension output.
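A minimal sketch of this architecture, assuming a PyTorch implementation and K = 9 reward features (the text above does not fix a framework; dimensions follow the description):

```python
import torch
import torch.nn as nn

class InferenceNet(nn.Module):
    """MBOW utterance embedding + trajectory feature counts -> predicted reward weights."""
    def __init__(self, vocab_size: int, n_features: int = 9, embed_dim: int = 30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + n_features, 128),
            nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, token_ids: torch.Tensor, feature_counts: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); feature_counts: (batch, n_features)
        mbow = self.embed(token_ids).mean(dim=1)                      # mean bag-of-words
        return self.mlp(torch.cat([mbow, feature_counts], dim=-1))   # predicted w
```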

Model Training. Our dataset is skewed towards positive-scoring games, as players learned over the course of the experiment. To avoid learning a default positive bias, we first downsample positive-score games to match negative-score ones. This left a total of 388 episodes from 98 different teachers with a mean score of 1.09 (the mean of all games was 8.53). We augment the data by exchanging the reward function (Fig. 2B), simulating the same episode under a different set of preferences. We take a new reward function and switch both feature counts and token synonyms, preserving the relationships between u^i, τ^i, and w. We repeat this for all 36 possible reward functions, increasing our data volume and allowing us to separate rewards from teachers. We used ten-fold CV with 8-1-1 train-validate-test splits, splitting both teachers and reward functions. Thus the network is trained on one set of rewards (i.e. latent human preferences) and teachers (i.e. linguistic expression of those preferences), then tested against unseen preferences and language. We used stochastic gradient descent with a learning rate of .005 and weight decay of 0.0001, stopping when validation set error increased. We train the network (including embeddings) end-to-end with an L2 loss on the true reward.

Multiple Episodes. Given a (u, τ) tuple, our model predicts the reward w associated with every feature. To evaluate it over multiple trials in Section 6, we use an update procedure comparable to our structured models.

Model       Experiment (Offline / Live / n)    Interaction Sampling (All / Eval / Desc / Imp)
Literal     30.6 / 34.7 / 46                   40.5 / 38.7 / 40.6 / 16.7
Pragmatic   38.2 / 42.8 / 47                   52.5 / 50.4 / 58.2 / 31.7
Inference   25.3 / 35.0 / 55                   47.6 / 54.3 / 53.2 / –
Human       – / 44.3 / 104                     – / – / – / –

Table 2: Normalized scores averaged over 10 episodes of learning. "Experiment" plays the 10 experiment episodes with a single human; "Interaction Sampling" draws (u, τ) tuples from the entire corpus and plays 100 test levels after each update.

Concretely, we initialize univariate Gaussian priors over each feature (μ_0 = 0, σ_0 = 25), then run our inference network on each interaction and perform a Bayesian update on each feature using our model's output as an observation with the same fixed noise. For each feature, P(w^{i+1} | u^i, τ^i) = N(μ^i, σ^i) * N(ŵ, 12). In all offline testing, we use the network from the appropriate CV fold to ensure it is always evaluated on its test set (teachers and rewards).
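Written out, each per-feature update is the standard product of two univariate Gaussians; with prior N(μ, σ²) and the network output ŵ treated as an observation with noise variance s²:

```latex
% Per-feature posterior after observing the network prediction \hat{w}
% with fixed noise variance s^2, given prior \mathcal{N}(\mu, \sigma^2):
\mu' = \frac{s^2 \mu + \sigma^2 \hat{w}}{\sigma^2 + s^2},
\qquad
\sigma'^2 = \frac{\sigma^2 s^2}{\sigma^2 + s^2}.
```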

6 Results and Analysis

We seek to answer several questions about our models. First, do they work: are they able to acquire information about the humans' reward function? Second, do our structured models outperform the end-to-end learned inference network? And finally, do the "Pragmatic" augmentations improve the "Literal" model? We run a second interactive experiment pairing human teachers with our models and find the answer to all three is yes (Section 6.1). We then analyze how our models learn by testing forms of feedback separately (Section 6.2).

6.1 Learning from Live Interactions

We ran a second online interactive experiment with the same structure. We recruited 148 participants from Prolific, an online participant recruitment tool, and paired each with one of three model learners. We evaluate the averaged normalized score across all 10 levels (the mean percent of highest possible score on each level). To assess the effect of interactivity, we also replay sequences of (u, τ) tuples from our first human-human experiment and evaluate model performance. The results are shown in Fig. 3 and summarized in Table 2. We performed a linear regression with performance as the dependent variable and level number, model (neural/"Literal"/"Pragmatic"), and interaction mode (offline/interactive) as predictors. We used contrast-coding to compare the neural against the two structured models and compare the two structured models against each other. We used effect-coding to test the influence of interaction mode and level number. We first find a significant effect for level number (p < 1e−15), indicating that model scores do improve over successive levels. We then find significant differences between the baseline neural model and the mean of the "Literal"/"Pragmatic" models (p < 1e−4), and between the "Literal" and "Pragmatic" models (p < 1e−5). This confirms that each level of inductive bias improves information capture from human teaching: indeed, our "Pragmatic" model achieves performance comparable to the human learners in our first experiment (although we note the results are not strictly comparable as the experiments used different platforms to recruit participants).


Figure 4: Left: A trajectory from our experiment. Right: Inference network output given this trajectory. Top row: the model learns to map feature-related tokens (descriptive feedback) directly to rewards, independent of the trajectory. Bottom left / center: the model maps evaluative feedback (praise and criticism) through the feature counts from the trajectory. Bottom right: a failure mode. Due to the MBOW utterance representation, descriptive feedback with negative sentiment is not handled correctly.

Finally, we find a significant effect of interactivity (p < 1e−4). The large improvement in the neural model from offline to interactive is particularly intriguing. Offline, it performs very poorly on several teachers who use idiosyncratic language; it appears that teachers in the interactive setting are able to adjust their language and adapt to the model. We explore this further in the next section.

6.2 Learning from Different Forms of Feedback

To evaluate our specific implementation of "forms of feedback," we test performance on (u, τ) tuples sampled from the corpus as a whole. Our "episode" structure is as follows: we draw a (u, τ) tuple at random from the first experiment, update each model, and have it act on our 100 pre-generated test levels. We take its averaged normalized score on these levels. We repeat this procedure 5 times for each cross-validation fold, ensuring the learned model is always tested on its hold-out teachers and rewards. This draws feedback from a variety of teachers and tests learners on a variety of level configurations, giving a picture of overall learning trends. Normalized scores over test levels are shown in Fig. 3 and Table 2 ("Interaction Sampling"). We note that all models improve when sampling the entire corpus ("All") versus being paired with an individual teacher (Section 6.1). This suggests blending feedback across teachers helps mitigate individual idiosyncrasies. This particularly benefits the inference network, which outperforms the "Literal" model. We then use our feedback classifier (Section 5.1) to selectively sample one form of feedback. This reveals that our "Pragmatic" augmentations help on all forms, but most dramatically on "Descriptive" feedback. Finally, we confirm our inference network learns to contextualize feedback appropriately (Fig. 4). It maps evaluative feedback through its prior behavior and descriptive (feature-based) tokens directly to the appropriate features. Inspection of utterances reveals several failure modes on rarer speech patterns, most notably descriptive feedback with negative sentiment.

7 Conclusion

We presented a general method to recover latent rewards from language by factoring it into two subproblems: sentiment analysis and contextualization. Contextualization grounds language into the feature space of the MDP; we provide an implementation inspired by research on human teaching. We validated our approach on live interactions with humans. We find statistically significant performance gains over a baseline model learned end-to-end from human-human interactions and achieve near-human performance with our "Pragmatic" model. Finally, we note that model performance varies across our three evaluations (offline, live, and sampled interactions), which underscores the importance of interaction with actual humans to validate such models. We anticipate several promising directions for future research. First, our structured models could utilize more formal language parsing and theory-of-mind based pragmatic reasoning. Second, analysis reveals the learned model extracted patterns resembling our structured model, suggesting our "forms of feedback" are emergent properties of the data. It may thus be feasible to replace our implementation with a learned contextualization function. Finally, our end-to-end model could incorporate pretrained embeddings, more sophisticated language models, or recursion (e.g. an RNN over episodes instead of a feed-forward model). These extensions would facilitate application to more complex domains. Besides extending our framework, we see potential synergies from integrating it with instruction following: treating commands as "Imperative" feedback could provide a natural preference-based prior for interpreting future instructions. In general, we hope the methods and insights presented here facilitate the adoption of natural language as an input for learning.


Ethical Statement

Equipping artificial agents with the capacity to learn from linguistic feedback is an important step towards improving value alignment between humans and machines, one aspect of developing systems that support safe and beneficial interactions. However, potential costs include expanding the set of roles that such agents can play to those requiring significant interaction with humans, roles currently restricted to human agents. As a consequence, certain jobs may be more readily replaced by automated systems. On the other hand, being able to provide feedback to artificial agents verbally could expand the group of people who will be able to interact with those agents, potentially creating new opportunities for people with disabilities or less formal training in computer science.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML, 1. New York, NY, USA.

Allison, F.; Luger, E.; and Hofmann, K. 2018. How Players Speak to an Intelligent Game Character Using Natural Language Messages. Transactions of the Digital Games Research Association 4.

Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5): 469–483.

Artzi, Y.; and Zettlemoyer, L. 2011. Bootstrapping Semantic Parsers from Conversations. In EMNLP 2011, 421–432. ACL.

Bahdanau, D.; Hill, F.; Leike, J.; Hughes, E.; Hosseini, S. A.; Kohli, P.; and Grefenstette, E. 2019. Learning to Understand Goal Specifications by Modelling Reward. In ICLR 2019.

Chen, L.; Yang, R.; Chang, C.; Ye, Z.; Zhou, X.; and Yu, K. 2017. On-line Dialogue Policy Learning with Companion Teaching. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 198–204. Valencia, Spain: Association for Computational Linguistics.

Chevalier-Boisvert, M.; Bahdanau, D.; Lahlou, S.; Willems, L.; Saharia, C.; Nguyen, T. H.; and Bengio, Y. 2018. BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop. ArXiv abs/1810.08272.

Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, 4299–4307.

Cideron, G.; Seurin, M.; Strub, F.; and Pietquin, O. 2019. Self-Educated Language Agent With Hindsight Experience Replay For Instruction Following. ArXiv abs/1910.09451.

Djalali, A.; Clausen, D.; Lauer, S.; Schultz, K.; and Potts, C. 2011. Modeling Expert Effects and Common Ground Using Questions Under Discussion. In Proceedings of the AAAI Workshop on Building Representations of Common Ground with Intelligent Agents. Washington, DC: AAAI Press.

Djalali, A.; Lauer, S.; and Potts, C. 2012. Corpus Evidence for Preference-Driven Interpretation. In Aloni, M.; Kimmelman, V.; Roelofsen, F.; Sassoon, G. W.; Schulz, K.; and Westera, M., eds., Proceedings of the 18th Amsterdam Colloquium: Revised Selected Papers, 150–159. Berlin: Springer.

Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and Predictability of Robot Motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, HRI '13, 301–308. IEEE Press.

Fu, J.; Korattikara, A.; Levine, S.; and Guadarrama, S. 2019. From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following. ArXiv abs/1902.07742.

Goodman, N. D.; and Frank, M. C. 2016. Pragmatic Language Interpretation as Probabilistic Inference. Trends in Cognitive Sciences 20(11): 818–829.

Goyal, P.; Niekum, S.; and Mooney, R. J. 2019. Using natural language for reward shaping in reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2385–2391. AAAI Press.

Grice, H. P. 1975. Logic and Conversation. In Syntax and Semantics: Vol. 3: Speech Acts, 41–58. New York: Academic Press.

Gureckis, T. M.; Martin, J.; McDonnell, J.; Rich, A. S.; Markant, D.; Coenen, A.; Halpern, D.; Hamrick, J. B.; and Chan, P. 2016. psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods 48(3): 829–842. ISSN 1554-3528.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative Inverse Reinforcement Learning. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 3909–3917. Curran Associates, Inc.

Hancock, B.; Varma, P.; Wang, S.; Bringmann, M.; Liang, P.; and Ré, C. 2018. Training Classifiers with Natural Language Explanations. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Harnad, S. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3): 335–346.

Hattie, J.; and Timperley, H. 2007. The Power of Feedback. Review of Educational Research 77(1): 81–112.

He, H.; Balakrishnan, A.; Eric, M.; and Liang, P. 2017. Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Ho, M. K.; Littman, M.; MacGlashan, J.; Cushman, F.; and Austerweil, J. L. 2016. Showing versus Doing: Teaching by Demonstration. In Advances in Neural Information Processing Systems, 3027–3035.

Hutto, C. J.; and Gilbert, E. 2014. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In ICWSM. The AAAI Press.

Ilinykh, N.; Zarrieß, S.; and Schlangen, D. 2019. MeetUp! A Corpus of Joint Activity Dialogues in a Visual Environment. ArXiv abs/1907.05084.

Jeon, H. J.; Milli, S.; and Dragan, A. D. 2020. Reward-rational (implicit) choice: A unifying formalism for reward learning. ArXiv abs/2002.04833.

Judah, K.; Roy, S.; Fern, A.; and Dietterich, T. G. 2010. Reinforcement Learning via Practice and Critique Advice. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI'10, 481–486. AAAI Press.

Kaplan, R.; Sauer, C.; and Sosa, A. 2017. Beating Atari with Natural Language Guided Reinforcement Learning. ArXiv abs/1704.05539.


Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP 2014.

Knox, W. B.; and Stone, P. 2009. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 9–16. New York, NY, USA: Association for Computing Machinery.

Kuhlmann, G.; Stone, P.; Mooney, R.; and Shavlik, J. 2004. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. AAAI Workshop - Technical Report.

Li, J.; Miller, A. H.; Chopra, S.; Ranzato, M.; and Weston, J. 2016. Dialogue Learning With Human-In-The-Loop. ArXiv abs/1611.09823.

Ling, H.; and Fidler, S. 2017. Teaching Machines to Describe Images with Natural Language Feedback. In NIPS.

Lipnevich, A.; and Smith, J. 2009. Effects of differential feedback on students' examination performance. Journal of Experimental Psychology: Applied 15(4): 319–333.

Liu, B.; and Zhang, L. 2012. A Survey of Opinion Mining and Sentiment Analysis, 415–463. Boston, MA: Springer US.

Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J.; Andreas, J.; Grefenstette, E.; Whiteson, S.; and Rocktäschel, T. 2019. A Survey of Reinforcement Learning Informed by Natural Language. In IJCAI 2019, volume 57. AAAI Press.

MacGlashan, J.; Babes-Vroman, M.; desJardins, M.; Littman, M. L.; Muresan, S.; Squire, S.; Tellex, S.; Arumugam, D.; and Yang, L. 2015. Grounding English Commands to Reward Functions. In Robotics: Science and Systems.

MacGlashan, J.; Ho, M. K.; Loftin, R.; Peng, B.; Wang, G.; Roberts, D. L.; Taylor, M. E.; and Littman, M. L. 2017. Interactive Learning from Policy-Dependent Human Feedback. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML, 2285–2294. JMLR.

Maclin, R.; and Shavlik, J. W. 1994. Incorporating advice into agents that learn from reinforcements. In Proceedings of the National Conference on Artificial Intelligence, volume 1, 694–699. AAAI.

McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics 105–142.

Mooney, R. J. 2008. Learning to Connect Language and Perception. In AAAI, 1598–1601.

Murphy, K. 2007. Conjugate Bayesian analysis of the Gaussian distribution. URL https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf.

Narasimhan, K.; Barzilay, R.; and Jaakkola, T. 2018. Grounding language for transfer in deep reinforcement learning. Journal of Artificial Intelligence Research 63: 849–874.

Pang, B.; and Lee, L. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2): 1–135.

Potts, C. 2012. Goal-Driven Answers in the Cards Dialogue Corpus. In Arnett, N.; and Bennett, R., eds., Proceedings of the 30th West Coast Conference on Formal Linguistics, 1–20. Somerville, MA: Cascadilla Press.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Ramachandran, D.; and Amir, E. 2007. Bayesian Inverse Reinforcement Learning. In IJCAI, volume 7, 2586–2591.

Ross, S.; and Bagnell, D. 2010. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 661–668.

Shute, V. J. 2008. Focus on Formative Feedback. Review of Educational Research 78(1): 153–189.

Srivastava, S.; Labutov, I.; and Mitchell, T. 2017. Joint Concept Learning and Semantic Parsing from Natural Language Explanations. In EMNLP 2017, 1527–1536. ACL.

Suhr, A.; Yan, C.; Schluger, J.; Yu, S.; Khader, H.; Mouallem, M.; Zhang, I.; and Artzi, Y. 2019. Executing Instructions in Situated Collaborative Interactions. EMNLP 2019.

Szlam, A.; Gray, J.; Srinet, K.; Jernite, Y.; Joulin, A.; Synnaeve, G.; Kiela, D.; Yu, H.; Chen, Z.; Goyal, S.; Guo, D.; Rothermel, D.; Zitnick, C. L.; and Weston, J. 2019. Why Build an Assistant in Minecraft? ArXiv abs/1907.09273.

Tellex, S.; Gopalan, N.; Kress-Gazit, H.; and Matuszek, C. 2020. Robots That Use Language. Annual Review of Control, Robotics, and Autonomous Systems 3(1): 25–55.

Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M. R.; Banerjee, A. G.; Teller, S.; and Roy, N. 2011. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation. In AAAI.

Thomason, J.; Murray, M.; Cakmak, M.; and Zettlemoyer, L. 2019. Vision-and-Dialog Navigation. ArXiv abs/1907.04957.

Thomason, J.; Padmakumar, A.; Sinapov, J.; Walker, N.; Jiang, Y.; Yedidsion, H.; Hart, J.; Stone, P.; and Mooney, R. J. 2020. Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog. The Journal of Artificial Intelligence Research (JAIR) 67: 327–374.

Thomason, J.; Zhang, S.; Mooney, R.; and Stone, P. 2015. Learning to Interpret Natural Language Commands through Human-Robot Dialog. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, 1923–1929. AAAI Press.

Thomaz, A. L.; and Breazeal, C. 2008. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence 172(6-7): 716–737.

Train, K. 2003. Discrete Choice Methods with Simulation. Cambridge University Press.

Udagawa, T.; and Aizawa, A. 2019. A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context. Proceedings of the AAAI Conference on Artificial Intelligence 33: 7120–7127. ISSN 2159-5399.

van der Kleij, F.; Feskens, R.; and Eggen, T. 2015. Effects of Feedback in a Computer-Based Learning Environment on Students' Learning Outcomes: A Meta-Analysis. Review of Educational Research 85.

Wang, S. I.; Liang, P.; and Manning, C. D. 2016. Learning Language Games through Interaction. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Wang, X.; Huang, Q.; Çelikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.-F.; Wang, W. Y.; and Zhang, L. 2019. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In CVPR, 6629–6638.

Williams, E. C.; Gopalan, N.; Rhee, M.; and Tellex, S. 2018. Learning to Parse Natural Language to Grounded Reward Functions with Weak Supervision. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 4430–4436.

Zhou, L.; and Small, K. 2020. Inverse Reinforcement Learning with Natural Language Goals. ArXiv abs/2008.06924.