Alignment-Based Compositional Semantics for Instruction Following

Jacob Andreas and Dan Klein
Computer Science Division

University of California, Berkeley
{jda,klein}@cs.berkeley.edu

Abstract

This paper describes an alignment-based model for interpreting natural language instructions in context. We approach instruction following as a search over plans, scoring sequences of actions conditioned on structured observations of text and the environment. By explicitly modeling both the low-level compositional structure of individual actions and the high-level structure of full plans, we are able to learn both grounded representations of sentence meaning and pragmatic constraints on interpretation. To demonstrate the model's flexibility, we apply it to a diverse set of benchmark tasks. On every task, we outperform strong task-specific baselines, and achieve several new state-of-the-art results.

1 Introduction

In instruction-following tasks, an agent executes a sequence of actions in a real or simulated environment, in response to a sequence of natural language commands. Examples include giving navigational directions to robots and providing hints to automated game-playing agents. Plans specified with natural language exhibit compositionality both at the level of individual actions and at the overall sequence level. This paper describes a framework for learning to follow instructions by leveraging structure at both levels.

Our primary contribution is a new, alignment-based approach to grounded compositional semantics. Building on related logical approaches (Reddy et al., 2014; Pourdamghani et al., 2014), we recast instruction following as a pair of nested, structured alignment problems. Given instructions and a candidate plan, the model infers a sequence-to-sequence alignment between sentences and atomic actions. Within each sentence–action pair, the model infers a structure-to-structure alignment between the syntax of the sentence and a graph-based representation of the action.

At a high level, our agent is a block-structured, graph-valued conditional random field, with alignment potentials to relate instructions to actions and transition potentials to encode the environment model (Figure 3). Explicitly modeling sequence-to-sequence alignments between text and actions allows flexible reasoning about action sequences, enabling the agent to determine which actions are specified (perhaps redundantly) by text, and which actions must be performed automatically (in order to satisfy pragmatic constraints on interpretation). Treating instruction following as a sequence prediction problem, rather than a series of independent decisions (Branavan et al., 2009; Artzi and Zettlemoyer, 2013), makes it possible to use general-purpose planning machinery, greatly increasing inferential power.

The fragment of semantics necessary to complete most instruction-following tasks is essentially predicate–argument structure, with limited influence from quantification and scoping. Thus the problem of sentence interpretation can reasonably be modeled as one of finding an alignment between language and the environment it describes. We allow this structure-to-structure alignment—an "overlay" of language onto the world—to be mediated by linguistic structure (in the form of dependency parses) and structured perception (in what we term grounding graphs). Our model thereby reasons directly about the relationship between language and observations of the environment, without the need for an intermediate logical representation of sentence meaning. This, in turn, makes it possible to incorporate flexible feature representations that have been difficult to integrate with previous work in semantic parsing.

Figure 1: Example tasks handled by our framework. The tasks feature noisy text, over- and under-specification of plans, and challenging search problems. (a) Map reading: ". . . right round the white water but stay quite close 'cause you don't otherwise you're going to be in that stone creek . . ." (b) Maze navigation: "Go down the yellow hall. Turn left at the intersection of the yellow and the gray." (c) Puzzle solving: "Clear the right column. Then the other column. Then the row."

We apply our approach to three established

instruction-following benchmarks: the map reading task of Vogel and Jurafsky (2010), the maze navigation task of MacMahon et al. (2006), and the puzzle solving task of Branavan et al. (2009). An example from each is shown in Figure 1. These benchmarks exhibit a range of qualitative properties—both in the length and complexity of their plans, and in the quantity and quality of accompanying language. Each task has been studied in isolation, but we are unaware of any published approaches capable of robustly handling all three. Our general model outperforms strong, task-specific baselines in each case, achieving relative error reductions of 15–20% over several state-of-the-art results. Experiments demonstrate the importance of our contributions in both compositional semantics and search over plans. We have released all code for this project at github.com/jacobandreas/instructions.

2 Related work

Existing work on instruction following can be roughly divided into two families: semantic parsers and linear policy estimators.

Semantic parsers  Parser-based approaches (Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Kim and Mooney, 2013) map from text into a formal language representing commands. These take familiar structured prediction models for semantic parsing (Zettlemoyer and Collins, 2005; Wong and Mooney, 2006), and train them with task-provided supervision. Instead of attempting to match the structure of a manually-annotated semantic parse, semantic parsers for instruction following are trained to maximize a reward signal provided by black-box execution of the predicted command in the environment. (It is possible to think of response-based learning for question answering (Liang et al., 2013) as a special case.)

This approach uses a well-studied mechanism for compositional interpretation of language, but is subject to certain limitations. Because the environment is manipulated only through black-box execution of the completed semantic parse, there is no way to incorporate current or future environment state into the scoring function. It is also in general necessary to hand-engineer a task-specific formal language for describing agent behavior. Thus it is extremely difficult to work with environments that cannot be modeled with a fixed inventory of predicates (e.g. those involving novel strings or arbitrary real quantities).

Much of the contemporary work in this family is evaluated on the maze navigation task introduced by MacMahon et al. (2006). Dukes (2013) also introduced a "blocks world" task for situated parsing of spatial robot commands.

Linear policy estimators  An alternative family of approaches is based on learning a policy over primitive actions directly (Branavan et al., 2009; Vogel and Jurafsky, 2010).1 Policy-based approaches instantiate a Markov decision process representing the action domain, and apply standard supervised or reinforcement-learning approaches to learn a function for greedily selecting among actions. In linear policy approximators, natural language instructions are incorporated directly into state observations, and reading order becomes part of the action selection process.

1 This is distinct from semantic parsers in which greedy inference happens to have an interpretation as a policy (Vlachos and Clark, 2014).

Almost all existing policy-learning approaches make use of an unstructured parameterization, with a single (flat) feature vector representing all text and observations. Such approaches are thus restricted to problems that are simple enough (and have small enough action spaces) to be effectively characterized in this fashion. While there is a great deal of flexibility in the choice of feature function (which is free to inspect the current and future state of the environment, the whole instruction sequence, etc.), standard linear policy estimators have no way to model compositionality in language or actions.

Agents in this family have been evaluated on a variety of tasks, including map reading (Anderson et al., 1991) and gameplay (Branavan et al., 2009).

Though both families address the same class of instruction-following problems, they have been applied to a totally disjoint set of tasks. It should be emphasized that there is nothing inherent to policy learning that prevents the use of compositional structure, and nothing inherent to general compositional models that prevents more complicated dependence on environment state. Indeed, previous work (Branavan et al., 2011; Narasimhan et al., 2015) uses aspects of both to solve a different class of gameplay problems. In some sense, our goal in this paper is simply to combine the strengths of semantic parsers and linear policy estimators for fully general instruction following. As we shall see, however, this requires changes to many aspects of representation, learning and inference.

3 Representations

We wish to train a model capable of following commands in a simulated environment. We do so by presenting the model with a sequence of training pairs (x, y), where each x is a sequence of natural language instructions (x1, x2, . . . , xm), e.g.:

(Go down the yellow hall., Turn left., . . . )

and each y is a demonstrated action sequence (y1, y2, . . . , yn), e.g.:

(rotate(90), move(2), . . . )

Figure 2: Structure-to-structure alignment connecting a single sentence (via its syntactic analysis) to the environment state (via its grounding graph). The connecting alignments take the place of a traditional semantic parse and allow flexible, feature-driven linking between lexical primitives and perceptual factors.

Given a start state, y can equivalently be characterized by a sequence of (state, action, state)

triples resulting from execution of the environment model. An example instruction is shown in Figure 2a. An example action, situated in the environment where it occurs, is shown in Figure 2e.

Our model performs compositional interpretation of instructions by leveraging existing structure inherent in both text and actions. Thus we interpret xi and yj not as raw strings and primitive actions, but rather as structured objects.

Linguistic structure  We assume access to a pretrained parser, and in particular that each of the instructions xi is represented by a tree-structured dependency parse. An example is shown in Figure 2b.

Action structure  By analogy to the representation of instructions as parse trees, we assume that each (state, action, state) triple (provided by the environment model) can be characterized by a grounding graph. The structure and content of this representation is task-specific. An example grounding graph for the maze navigation task is shown in Figure 2d. The example contains a node corresponding to the primitive action move(2) (in the upper left), and several nodes corresponding to locations in the environment that are visible after the action is performed.

Each node in the graph (and, though not depicted, each edge) is decorated with a list of features. These features might be simple indicators (e.g. whether the primitive action performed was move or rotate), real values (the distance traveled) or even string-valued (English-language names of visible landmarks, if available in the environment description). Formally, a grounding graph consists of a tuple (V, E, L, f_V, f_E), with

– V, a set of vertices

– E ⊆ V × V, a set of (directed) edges

– L, a space of labels (numbers, strings, etc.)

– f_V : V → 2^L, a vertex feature function

– f_E : E → 2^L, an edge feature function

In this paper we have tried to remain agnostic to details of graph construction. Our goal with the grounding graph framework is simply to accommodate a wider range of modeling decisions than allowed by existing formalisms. Graphs might be constructed directly, given access to a structured virtual environment (as in all experiments in this paper), or alternatively from outputs of a perceptual system. For our experiments, we have remained as close as possible to task representations described in the existing literature. Details for each task can be found in the accompanying software package.

Graph-based representations are extremely common in formal semantics (Jones et al., 2012; Reddy et al., 2014), and the version presented here corresponds to a simple generalization of familiar formal methods. Indeed, if L is the set of all atomic entities and relations, f_V returns a unique label for every v ∈ V, and f_E always returns a vector with one active feature, we recover the existentially-quantified portion of first order logic exactly, and in this form can implement large parts of classical neo-Davidsonian semantics (Parsons, 1990) using grounding graphs.

Crucially, with an appropriate choice of L this formalism also makes it possible to go beyond set-theoretic relations, and incorporate string-valued features (like names of entities and landmarks) and real-valued features (like colors and positions) as well.
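To make the data structure concrete, the tuple (V, E, L, f_V, f_E) can be sketched directly. The class, node names, and feature values below are illustrative inventions, not the paper's actual task representations:

```python
# Sketch of a grounding graph (V, E, L, f_V, f_E) as described above.
# All node names and feature values are hypothetical illustrations.

class GroundingGraph:
    def __init__(self):
        self.vertices = set()          # V
        self.edges = set()             # E, a subset of V x V
        self.vertex_features = {}      # f_V : V -> subset of L
        self.edge_features = {}        # f_E : E -> subset of L

    def add_vertex(self, v, features):
        self.vertices.add(v)
        self.vertex_features[v] = set(features)

    def add_edge(self, u, v, features):
        self.edges.add((u, v))
        self.edge_features[(u, v)] = set(features)

# An invented fragment in the spirit of Figure 2d: an action node for
# move(2) connected to a visible location. Labels mix indicator,
# real-valued, and string-valued features, as the text allows.
g = GroundingGraph()
g.add_vertex("action", [("type", "move"), ("distance", 2.0)])
g.add_vertex("loc1", [("wallpaper", "yellow"), ("name", "yellow hall")])
g.add_edge("action", "loc1", [("relation", "visible-after")])
```

Because the labels are arbitrary values, the same structure accommodates indicators, reals, and strings without committing to a fixed predicate inventory.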

Figure 3: Our model is a conditional random field that describes distributions over state–action sequences conditioned on input text. Each variable's domain is a structured value. Sentences align to a subset of the state–action sequences, with the rest of the states filled in by pragmatic (planning) implication. State-to-state structure represents planning constraints (environment model) while state-to-text structure represents compositional alignment. All potentials are log-linear and feature-driven.

Lexical semantics  We must eventually combine features provided by parse trees with features provided by the environment. Examples here might include simple conjunctions (word=yellow ∧ rgb=(0.5, 0.5, 0.0)) or more complicated computations like edit distance between landmark names and lexical items. Features of the latter kind make it possible to behave correctly in environments containing novel strings or other features unseen during training.

This aspect of the syntax–semantics interface has been troublesome for some logic-based approaches: while past work has used related machinery for selecting lexicon entries (Berant and Liang, 2014) or for rewriting logical forms (Kwiatkowski et al., 2013), the relationship between text and the environment has ultimately been mediated by a discrete (and indeed finite) inventory of predicates. Several recent papers have investigated simple grounded models with real-valued output spaces (Andreas and Klein, 2014; McMahan and Stone, 2015), but we are unaware of any fully compositional system in recent literature that can incorporate observations of these kinds.

Formally, we assume access to a joining feature function φ : (2^L × 2^L) → R^d. As with grounding graphs, our goal is to make the general framework as flexible as possible, and for individual experiments we have chosen φ to emulate modeling decisions from previous work.
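A minimal sketch of such a joining feature function follows. The conjunction and string-similarity features, and all feature names, are illustrative stand-ins for the task-specific choices described above (a sparse dict stands in for a vector in R^d):

```python
# A hypothetical joining feature function phi : (2^L x 2^L) -> R^d,
# combining parse-side labels with environment-side labels.
from difflib import SequenceMatcher

def phi(word_labels, node_labels):
    """Map a (word-label-set, node-label-set) pair to a sparse feature
    dict. Labels are (key, value) tuples, as in the grounding graph."""
    feats = {}
    for wk, wv in word_labels:
        for nk, nv in node_labels:
            # Conjunction features, e.g. word=yellow ^ wallpaper=yellow.
            feats["%s=%s^%s=%s" % (wk, wv, nk, nv)] = 1.0
            # String-valued labels: similarity between lexical item and
            # landmark name, allowing generalization to unseen strings.
            if isinstance(wv, str) and isinstance(nv, str):
                feats["sim:%s^%s" % (wk, nk)] = SequenceMatcher(
                    None, wv, nv).ratio()
    return feats

f = phi({("word", "yellow")}, {("wallpaper", "yellow")})
```

The similarity feature is what lets the model score landmarks whose names never appeared in training, which a finite predicate inventory cannot do.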

4 Model

As noted in the introduction, we approach instruction following as a sequence prediction problem. Thus we must place a distribution over sequences of actions conditioned on instructions. We decompose the problem into two components, describing interlocking models of "path structure" and "action structure". Path structure captures how sequences of instructions give rise to sequences of actions, while action structure captures the compositional relationship between individual utterances and the actions they specify.

Path structure: aligning utterances to actions

The high-level path structure in the model is depicted in Figure 3. Our goal here is to permit both under- and over-specification of plans, and to expose a planning framework which allows plans to be computed with lookahead (i.e. non-greedily).

These goals are achieved by introducing a sequence of latent alignments between instructions and actions. Consider the multi-step example in Figure 1b. If the first instruction go down the yellow hall were interpreted immediately, we would have a presupposition failure—the agent is facing a wall, and cannot move forward at all. Thus an implicit rotate action, unspecified by text, must be performed before any explicit instructions can be followed.

To model this, we take the probability of a (text, plan, alignment) triple to be log-proportional to the sum of two quantities:

1. a path-only score ψ(n; θ) + Σ_j ψ(y_j; θ)

2. a path-and-text score, itself the sum of all pair scores ψ(x_i, y_j; θ) licensed by the alignment

(1) captures our desire for pragmatic constraints on interpretation, and provides a means of encoding the inherent plausibility of paths. We take ψ(n; θ) and ψ(y; θ) to be linear functions of θ. (2) provides context-dependent interpretation of text by means of the structured scoring function ψ(x, y; θ), described in the next section.

Formally, we associate with each instruction x_i a sequence-to-sequence alignment variable a_i ∈ 1 . . . n (recalling that n is the number of actions). Then we have2

p(y, a | x; θ) ∝ exp{ ψ(n) + Σ_{j=1}^{n} ψ(y_j) + Σ_{i=1}^{m} Σ_{j=1}^{n} 1[a_i = j] ψ(x_i, y_j) }    (1)

We additionally place a monotonicity constraint on the alignment variables. This model is globally normalized, and for a fixed alignment is equivalent to a linear-chain CRF. In this sense it is analogous to IBM Model 1 (Brown et al., 1993), with the structured potentials ψ(x_i, y_j) taking the place of lexical translation probabilities. While alignment models from machine translation have previously been used to align words to fragments of semantic parses (Wong and Mooney, 2006; Pourdamghani et al., 2014), we are unaware of such models being used to align entire instruction sequences to demonstrations.
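The path score inside the exponential of Equation 1 can be sketched as follows. The toy potentials are invented for illustration and stand in for the learned linear functions of θ:

```python
# Sketch of the unnormalized log score in Eq. (1). psi_len, psi_action,
# and psi_pair stand in for the paper's learned potentials; the
# callables used below are hypothetical toys.

def log_score(x, y, a, psi_len, psi_action, psi_pair):
    """x: instructions; y: actions; a[i] indexes the action aligned to
    instruction i. Returns psi(n) + sum_j psi(y_j) + sum_i
    psi(x_i, y_{a_i}), requiring a monotone alignment."""
    assert all(a[i] <= a[i + 1] for i in range(len(a) - 1)), "non-monotone"
    s = psi_len(len(y)) + sum(psi_action(yj) for yj in y)
    s += sum(psi_pair(xi, y[a[i]]) for i, xi in enumerate(x))
    return s

# Toy potentials: prefer short paths; reward lexical overlap. The first
# action is implicit (aligned to no instruction), as in Figure 1b.
score = log_score(
    ["go down the yellow hall", "turn left"],
    [("rotate", 90), ("move", 2), ("rotate", -90)],
    [1, 2],
    psi_len=lambda n: -0.5 * n,
    psi_action=lambda yj: 0.0,
    psi_pair=lambda xi, yj: 1.0 if yj[0] in xi else 0.0,
)
```

Note that the implicit first rotate contributes only path-only terms, which is exactly how under-specified plans are scored.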

Action structure: aligning words to percepts  Intuitively, this scoring function ψ(x, y) should capture how well a given utterance describes an action. If neither the utterances nor the actions had structure (i.e. both could be represented with simple bags of features), we would recover something analogous to the conventional policy-learning approach. As structure is essential for some of our tasks, ψ(x, y) must instead fill the role of a semantic parser in a conventional compositional model.

Our choice of ψ(x, y) is driven by the following fundamental assumptions: Syntactic relations approximately represent semantic relations. Syntactic proximity implies relational proximity. In this view, there is an additional hidden structure-to-structure alignment between the grounding graph and the parsed text describing it.3 Words line up with nodes, and dependencies line up with relations. Visualizations are shown in Figure 2c and the zoomed-in portion of Figure 3.

As with the top-level alignment variables, this approach can be viewed as a simple relaxation of a familiar model. CCG-based parsers assume that syntactic type strictly determines semantic type,

2 Here and in the remainder of this paper, we suppress the dependence of the various potentials on θ in the interest of readability.

3 It is formally possible to regard the sequence-to-sequence and structure-to-structure alignments as a single (structured) random variable. However, the two kinds of alignments are treated differently for purposes of inference, so it is useful to maintain a notational distinction.

and that each lexical item is associated with a small set of functional forms. Here we simply allow all words to license all predicates, multiple words to specify the same predicate, and some edges to be skipped. We instead rely on a scoring function to impose soft versions of the hard constraints typically provided by a grammar. Related models have previously been used for question answering (Reddy et al., 2014; Pasupat and Liang, 2015).

For the moment let us introduce variables b to denote these structure-to-structure alignments. (As will be seen in the following section, it is straightforward to marginalize over all choices of b. Thus the structure-to-structure alignments are never explicitly instantiated during inference, and do not appear in the final form of ψ(x, y).) For a fixed alignment, we define ψ(x, y, b) according to a recurrence relation. Let x_i be the ith word of the sentence, and let y_j be the jth node in the action graph (under some topological ordering). Let c(i) and c(j) give the indices of the dependents of x_i and children of y_j respectively. Finally, let x_{ik} and y_{jl} denote the associated dependency type or relation. Define a "descendant" function:

d(i, j) = { (k, l) : k ∈ c(i), l ∈ c(j), (k, l) ∈ b }

Then,

ψ(x_i, y_j, b) = exp{ θᵀφ(x_i, y_j) + Σ_{(k,l) ∈ d(i,j)} [ θᵀφ(x_{ik}, y_{jl}) · ψ(x_k, y_l, b) ] }

This is just an unnormalized synchronous derivation between x and y—at any aligned (node, word) pair, the score for the entire derivation is the score produced by combining that word and node, times the scores at all the aligned descendants. Observe that as long as there are no cycles in the dependency parse, it is perfectly acceptable for the relation graph to contain cycles and even self-loops—the recurrence still bottoms out appropriately.

5 Learning and inference

Given a sequence of training pairs (x, y), we wish to find a parameter setting that maximizes p(y|x; θ). If there were no latent alignments a or b, this would simply involve minimization of a convex objective. The presence of latent variables complicates things. Ideally, we would like

Algorithm 1 Computing structure-to-structure alignments

  x_i are words in reverse topological order
  y_j are grounding graph nodes (root last)
  chart is an m × n array
  for i = 1 to |x| do
    for j = 1 to |y| do
      score ← exp{θᵀφ(x_i, y_j)}
      for k ∈ c(i) do
        s ← Σ_{l ∈ c(j)} [ exp{θᵀφ(x_{ik}, y_{jl})} · chart[k, l] ]
        score ← score · s
      end for
      chart[i, j] ← score
    end for
  end for
  return chart[m, n]

to sum over the latent variables, but that sum is intractable. Instead we make a series of variational approximations: first we replace the sum with a maximization, then perform iterated conditional modes, alternating between maximization of the conditional probability of a and θ. We begin by initializing θ randomly.

As noted in the preceding section, the variable b does not appear in these equations. Conditioned on a, the sum over structure-to-structure alignments ψ(x, y) = Σ_b ψ(x, y, b) can be performed exactly using a simple dynamic program which runs in time O(|x||y|) (assuming out-degree bounded by a constant, and with |x| and |y| the number of words and graph nodes respectively). This is Algorithm 1.
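Algorithm 1 can be sketched in Python as follows. The child-index maps and the two potential functions are hypothetical stand-ins for the parse, the grounding graph, and θᵀφ:

```python
# A sketch of Algorithm 1: marginalizing over structure-to-structure
# alignments b with a bottom-up dynamic program. phi_node and phi_edge
# stand in for the learned scores theta^T phi(...).
import math

def sum_over_alignments(words, nodes, word_children, node_children,
                        phi_node, phi_edge):
    """words are in reverse topological order (root last); nodes are
    grounding-graph nodes. word_children / node_children map an index
    to its child indices. Returns sum_b psi(x, y, b) at the root pair,
    filling O(|x||y|) chart entries."""
    chart = {}
    for i in range(len(words)):
        for j in range(len(nodes)):
            score = math.exp(phi_node(words[i], nodes[j]))
            for k in word_children[i]:
                # Marginalize over which graph child each dependent
                # aligns to. (If the node has no children, a word with
                # dependents cannot align here and the score is zero.)
                s = sum(math.exp(phi_edge(words[i], nodes[j],
                                          words[k], nodes[l]))
                        * chart[k, l]
                        for l in node_children[j])
                score *= s
            chart[i, j] = score
    return chart[len(words) - 1, len(nodes) - 1]
```

Because words are visited leaves-first, every chart[k, l] needed in the inner sum is already available, mirroring the recurrence bottoming out.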

In our experiments, θ is optimized using L-BFGS (Liu and Nocedal, 1989). Calculation of the gradient with respect to θ requires computation of a normalizing constant involving the sum over p(x, y′, a) for all y′. While in principle the normalizing constant can be computed using the forward algorithm, in practice the state spaces under consideration are so large that even this is intractable. Thus we make an additional approximation, constructing a set Y of alternative actions and taking

p(y, a | x) ≈ Π_{j=1}^{n} [ exp{ ψ(y_j) + Σ_{i=1}^{m} 1[a_i = j] ψ(x_i, y_j) } / Σ_{y′ ∈ Y} exp{ ψ(y′) + Σ_{i=1}^{m} 1[a_i = j] ψ(x_i, y′) } ]

Y is constructed by sampling alternative actions from the environment model. Meanwhile, maximization of a can be performed exactly using the Viterbi algorithm, without computation of normalizers.

Inference at test time involves a slightly different pair of optimization problems. We again perform iterated conditional modes, here on the alignments a and the unknown output path y. Maximization of a is accomplished with the Viterbi algorithm, exactly as before; maximization of y also uses the Viterbi algorithm, or a beam search when this is computationally infeasible. If bounds on path length are known, it is straightforward to adapt these dynamic programs to efficiently consider paths of all lengths.
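The alignment maximization step can be sketched as a monotone Viterbi-style dynamic program. This O(mn²) version is an illustrative simplification, not the paper's implementation; pair_score stands in for ψ(x_i, y_j):

```python
# A sketch of the Viterbi step over monotone sequence-to-sequence
# alignments: given a fixed candidate path of n actions, choose for
# each of m instructions the action it aligns to.

def best_monotone_alignment(m, n, pair_score):
    """Return a = [a_0, ..., a_{m-1}], each a_i in 0..n-1, maximizing
    sum_i pair_score(i, a_i) subject to a_0 <= a_1 <= ... <= a_{m-1}."""
    NEG = float("-inf")
    best = [[NEG] * n for _ in range(m)]
    back = [[0] * n for _ in range(m)]
    for j in range(n):
        best[0][j] = pair_score(0, j)
    for i in range(1, m):
        for j in range(n):
            # Monotonicity: the previous instruction must align at or
            # before position j.
            prev = max(range(j + 1), key=lambda jp: best[i - 1][jp])
            best[i][j] = best[i - 1][prev] + pair_score(i, j)
            back[i][j] = prev
    # Backtrace from the best final position.
    j = max(range(n), key=lambda jj: best[m - 1][jj])
    a = [j]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        a.append(j)
    return list(reversed(a))
```

Within iterated conditional modes, this step alternates with re-planning y under the fixed alignment until neither changes.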

6 Evaluation

As one of the main advantages of this approach is its generality, we evaluate on several different benchmark tasks for instruction following. These exhibit great diversity in both environment structure and language use. We compare our full system to recent state-of-the-art approaches to each task. In the introduction, we highlighted two core aspects of our approach to semantics: compositionality (by way of grounding graphs and structure-to-structure alignments) and planning (by way of inference with lookahead and sequence-to-sequence alignments). To evaluate these, we additionally present a pair of ablation experiments: no grounding graphs (an agent with an unstructured representation of environment state), and no planning (a reflex agent with no lookahead).

Map reading  Our first application is the map navigation task established by Vogel and Jurafsky (2010), based on data collected for a psychological experiment by Anderson et al. (1991) (Figure 1a). Each training datum consists of a map with a designated starting position, and a collection of landmarks, each labeled with a spatial coordinate and a string name. Names are not always unique, and landmarks in the test set are never observed during training. This map is accompanied by a set of instructions specifying a path from the starting position to some (unlabeled) destination point. These instruction sets are informal and redundant, involving as many as a hundred utterances. They are transcribed from spoken text, so grammatical errors, disfluencies, etc. are common. This is a

                             P     R     F1
Vogel and Jurafsky (2010)    0.46  0.51  0.48
Andreas and Klein (2014)     0.43  0.51  0.45
Model [no planning]          0.44  0.46  0.45
Model [no grounding graphs]  0.52  0.52  0.52
Model [full]                 0.51  0.60  0.55

Table 1: Evaluation results for the map-reading task. P is precision, R is recall and F1 is F-measure. Scores are calculated with respect to transitions between landmarks appearing in the reference path (for details see Vogel and Jurafsky (2010)). We use the same train / test split. Some variant of our model achieves the best published results on all three metrics.

Feature                  Weight
word=top ∧ side=North     1.31
word=top ∧ side=South     0.61
word=top ∧ side=East     −0.93
dist=0                    4.51
dist=1                    2.78
dist=4                    1.54

Table 2: Learned feature values. The model learns that the word top often instructs the navigator to position itself above a landmark, occasionally to position itself below a landmark, but rarely to the side. The bottom portion of the table shows learned text-independent constraints: given a choice, near destinations are preferred to far ones (so shorter paths are preferred overall).

prime example of a domain that does not lend itself to logical representation—grammars may be too rigid, and previously-unseen landmarks and real-valued positions are handled more easily with feature machinery than predicate logic.

The map task was previously studied by Vogel and Jurafsky (2010), who implemented SARSA with a simple set of features. By combining these features with our alignment model and search procedure, we achieve state-of-the-art results on this task by a substantial margin (Table 1).

Some learned feature values are shown in Table 2. The model correctly infers cardinal directions (the example shows the preferred side of a destination landmark modified by the word top). Like Vogel et al., we see support for both allocentric references (you are on top of the hill) and egocentric references (the hill is on top of you). We can also see pragmatics at work: the model learns useful text-independent constraints—in this case, that near destinations should be preferred to far ones.

Maze navigation The next application we consider is the maze navigation task of MacMahon et al. (2006) (Figure 1b). Here, a virtual agent is situated in a maze (whose hallways are distinguished with various wallpapers, carpets, and the presence of a small set of standard objects), and again given instructions for getting from one point to another. This task has been the subject of focused attention in semantic parsing for several years, resulting in a variety of sophisticated approaches.

                                        Success (%)
Kim and Mooney (2012)                   57.2
Chen (2012)                             57.3

Model [no planning]                     58.9
Model [no grounding graphs]             51.7
Model [full]                            59.6

Kim and Mooney (2013) [reranked]        62.8
Artzi et al. (2014) [semi-supervised]   65.3

Table 3: Evaluation results for the maze navigation task. “Success” shows the percentage of actions resulting in a correct position and orientation after observing a single instruction. We use the leave-one-map-out evaluation employed by previous work.4 All systems are trained on full action sequences. Our model outperforms several task-specific baselines, as well as a baseline with path structure but no action structure.

Despite superficial similarity to the previous navigation task, the language and plans required for this task are quite different. The proportion of instructions to actions is much higher (so redundancy much lower), and the interpretation of language is highly compositional.

As can be seen in Table 3, we outperform a number of systems purpose-built for this navigation task. We also outperform both variants of our system, most conspicuously the variant without grounding graphs. This highlights the importance of compositional structure. Recent work by Kim and Mooney (2013) and Artzi et al. (2014) has achieved better results; these systems make use of techniques and resources (respectively, discriminative reranking and a seed lexicon of hand-annotated logical forms) that are largely orthogonal to the ones used here, and might be applied to improve our own results as well.

Puzzle solving The last task we consider is the Crossblock task studied by Branavan et al. (2009) (Figure 1c). Here, again, natural language is used to specify a sequence of actions, in this case the solution to a simple game. The environment is simple enough to be captured with a flat feature representation, so there is no distinction between the full model and the variant without grounding graphs.

                      Match (%)   Success (%)
No text               54          78
Branavan ’09          63          –

Model [no planning]   64          66
Model [full]          70          86

Table 4: Results for the puzzle solving task. “Match” shows the percentage of predicted action sequences that exactly match the annotation. “Success” shows the percentage of predicted action sequences that result in a winning game configuration, regardless of the action sequence performed. Following Branavan et al. (2009), we average across five random train / test folds. Our model achieves state-of-the-art results on this task.

4 We specifically targeted the single-sentence version of this evaluation, as an alternative full-sequence evaluation does not align precisely with our data condition.

Unlike the other tasks we consider, Crossblock is distinguished by a challenging associated search problem. Here it is nontrivial to find any sequence that eliminates all the blocks (the goal of the puzzle). Thus this example allows us to measure the effectiveness of our search procedure.

Results are shown in Table 4. As can be seen, our model achieves state-of-the-art performance on this task when attempting to match the human-specified plan exactly. If we are purely concerned with task completion (i.e. solving the puzzle, perhaps not with the exact set of moves specified in the instructions) we can measure this directly. Here, too, we substantially outperform a no-text baseline. Thus it can be seen that text induces a useful heuristic, allowing the model to solve a considerable fraction of problem instances not solved by naïve beam search.
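The role of text as a search heuristic can be sketched with a generic beam search whose scoring function folds in a text-derived signal. This is a simplified, hypothetical sketch (the actual model scores full plans with alignment and transition potentials); the function names and signatures here are our own.

```python
import heapq

def beam_search(start, successors, score, is_goal, beam_width=10, max_steps=50):
    """Generic beam search: keep the `beam_width` highest-scoring partial
    plans at each step. `score` evaluates a whole partial plan and can
    incorporate a text-derived heuristic, which is what lets instructions
    guide the planner toward solutions plain search would miss."""
    beam = [(score([start]), [start])]
    for _ in range(max_steps):
        # Return the first complete plan found on the beam.
        for _, plan in beam:
            if is_goal(plan[-1]):
                return plan
        # Expand every plan on the beam by one step and re-prune.
        candidates = [
            (score(plan + [nxt]), plan + [nxt])
            for _, plan in beam
            for nxt in successors(plan[-1])
        ]
        if not candidates:
            return None
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return None
```

In this framing, the no-text baseline corresponds to a `score` that ignores instructions, while the full model's score rewards plans whose actions align well with the text.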

The problem of inducing planning heuristics from side information like text is an important one in its own right, and future work might focus specifically on coupling our system with a more sophisticated planner. Even at present, the results in this section demonstrate the importance of lookahead and high-level reasoning in instruction following.

7 Conclusion

We have described a new alignment-based compositional model for following sequences of natural language instructions, and demonstrated the effectiveness of this model on a variety of tasks. A fully general solution to the problem of contextual interpretation must address a wide range of well-studied problems, but the work we have described here provides modular interfaces for the study of a number of fundamental linguistic issues from a machine learning perspective. These include:

Pragmatics How do we respond to presupposition failures, and choose among possible interpretations of an instruction disambiguated only by context? The mechanism provided by the sequence-prediction architecture we have described provides a simple answer to this question, and our experimental results demonstrate that the learned pragmatics aid interpretation of instructions in a number of concrete ways: ambiguous references are resolved by proximity in the map reading task, missing steps are inferred from an environment model in the maze navigation task, and vague hints are turned into real plans by knowledge of the rules in Crossblock. A more comprehensive solution might explicitly describe the process by which instruction-givers’ own beliefs (expressed as distributions over sequences) give rise to instructions.

Compositional semantics The graph alignment model of semantics presented here is an expressive and computationally efficient generalization of classical logical techniques to accommodate environments like the map task, or those explored in our previous work (Andreas and Klein, 2014). More broadly, our model provides a compositional approach to semantics that does not require an explicit formal language for encoding sentence meaning. Future work might extend this approach to tasks like question answering, where logic-based approaches have been successful.

Our primary goal in this paper has been to explore methods for integrating compositional semantics and the pragmatic context provided by sequential structures. While there is a great deal of work left to do, we find it encouraging that this general approach results in substantial gains across multiple tasks and contexts.

Acknowledgments

The authors would like to thank S.R.K. Branavan for assistance with the Crossblock evaluation. The first author is supported by a National Science Foundation Graduate Fellowship.

References

Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC map task corpus. Language and Speech, 34(4):351–366.

Jacob Andreas and Dan Klein. 2014. Grounding language with points and paths in continuous spaces. In Proceedings of the Conference on Natural Language Learning.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1(1):49–62.

Yoav Artzi, Dipanjan Das, and Slav Petrov. 2014. Learning compact lexicons for CCG semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1273–1283, Doha, Qatar, October. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, page 92.

S.R.K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 82–90. Association for Computational Linguistics.

S.R.K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a Monte-Carlo framework. In Proceedings of the Human Language Technology Conference of the Association for Computational Linguistics, pages 268–277.

Peter Brown, Vincent Della Pietra, Stephen Della Pietra, and Robert Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, volume 2, pages 1–2.

David L. Chen. 2012. Fast online lexicon learning for grounded language acquisition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 430–439.

Kais Dukes. 2013. Semantic annotation of robotic spatial commands. In Language and Technology Conference (LTC).


Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In Proceedings of the International Conference on Computational Linguistics, pages 1359–1376.

Joohyun Kim and Raymond J. Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 433–444.

Joohyun Kim and Raymond J. Mooney. 2013. Adapting discriminative reranking to grounded language learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Dong Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2(6):4.

Brian McMahan and Matthew Stone. 2015. A Bayesian model of grounded color semantics. Transactions of the Association for Computational Linguistics, 3:103–115.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Terence Parsons. 1990. Events in the Semantics of English. MIT Press.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Nima Pourdamghani, Yang Gao, Ulf Hermjakob, and Kevin Knight. 2014. Aligning English strings with abstract meaning representation graphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.

Andreas Vlachos and Stephen Clark. 2014. A new corpus and imitation learning framework for context-dependent semantic parsing. Transactions of the Association for Computational Linguistics, 2:547–559.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 806–814. Association for Computational Linguistics.

Yuk Wah Wong and Raymond Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 439–446, New York, New York.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 658–666.