
Learning to Compose Neural Networks for Question Answering

Jacob Andreas and Marcus Rohrbach and Trevor Darrell and Dan Klein
Department of Electrical Engineering and Computer Sciences

University of California, Berkeley
{jda,rohrbach,trevor,klein}@eecs.berkeley.edu

Abstract

We describe a question answering model that applies to both images and structured knowledge bases. The model uses natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules are learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision. Our approach, which we term a dynamic neural module network, achieves state-of-the-art results on benchmark datasets in both visual and structured domains.

1 Introduction

This paper presents a compositional, attentional model for answering questions about a variety of world representations, including images and structured knowledge bases. The model translates from questions to dynamically assembled neural networks, then applies these networks to world representations (images or knowledge bases) to produce answers. We take advantage of two largely independent lines of work: on one hand, an extensive literature on answering questions by mapping from strings to logical representations of meaning; on the other, a series of recent successes in deep neural models for image recognition and captioning. By constructing neural networks instead of logical forms, our model leverages the best aspects of both linguistic compositionality and continuous representations.

Our model has two components, trained jointly: first, a collection of neural “modules” that can be freely composed (Figure 1a); second, a network layout predictor that assembles modules into complete deep networks tailored to each question (Figure 1b).

[Figure 1 illustrates the running example What cities are in Georgia?: a module inventory (find, lookup, and, relate; Section 4.1), a network layout combining find[city], relate[in], and lookup[Georgia] under and (Section 4.2), and a knowledge source from which the answer Atlanta is produced.]

Figure 1: A learned syntactic analysis (a) is used to assemble a collection of neural modules (b) into a deep neural network (c), and applied to a world representation (d) to produce an answer.

Previous work has used manually-specified modular structures for visual learning (Andreas et al., 2016). Here we:

• learn a network structure predictor jointly with module parameters themselves
• extend visual primitives from previous work to reason over structured world representations

Training data consists of (world, question, answer) triples: our approach requires no supervision of network layouts. We achieve state-of-the-art performance on two markedly different question answering tasks: one with questions about natural images, and another with more compositional questions about United States geography.¹

2 Deep networks as functional programs

We begin with a high-level discussion of the kinds of composed networks we would like to learn.

¹We have released our code at http://github.com/jacobandreas/nmn2


Andreas et al. (2016) describe a heuristic approach for decomposing visual question answering tasks into a sequence of modular sub-problems. For example, the question What color is the bird? might be answered in two steps: first, “where is the bird?” (Figure 2a); second, “what color is that part of the image?” (Figure 2c). This first step, a generic module called find, can be expressed as a fragment of a neural network that maps from image features and a lexical item (here bird) to a distribution over pixels. This operation is commonly referred to as the attention mechanism, and is a standard tool for manipulating images (Xu et al., 2015) and text representations (Hermann et al., 2015).

The first contribution of this paper is an extension and generalization of this mechanism to enable fully-differentiable reasoning about more structured semantic representations. Figure 2b shows how the same module can be used to focus on the entity Georgia in a non-visual grounding domain; more generally, by representing every entity in the universe of discourse as a feature vector, we can obtain a distribution over entities that corresponds roughly to a logical set-valued denotation.

Having obtained such a distribution, existing neural approaches use it to immediately compute a weighted average of image features and project back into a labeling decision—a describe module (Figure 2c). But the logical perspective suggests a number of novel modules that might operate on attentions: e.g. combining them (by analogy to conjunction or disjunction) or inspecting them directly without a return to feature space (by analogy to quantification, Figure 2d). These modules are discussed in detail in Section 4. Unlike their formal counterparts, they are differentiable end-to-end, facilitating their integration into learned models. Building on previous work, we learn behavior for a collection of heterogeneous modules from (world, question, answer) triples.

The second contribution of this paper is a model for learning to assemble such modules compositionally. Isolated modules are of limited use—to obtain expressive power comparable to either formal approaches or monolithic deep networks, they must be composed into larger structures. Figure 2 shows simple examples of composed structures, but for realistic question-answering tasks, even larger networks are required.

Figure 2: Simple neural module networks, corresponding to the questions What color is the bird? and Are there any states? (a) A neural find module for computing an attention over pixels. (b) The same operation applied to a knowledge base. (c) Using an attention produced by a lower module to identify the color of the region of the image attended to. (d) Performing quantification by evaluating an attention directly.

Thus our goal is to automatically induce variable-free, tree-structured computation descriptors. We can use a familiar functional notation from formal semantics (e.g. Liang et al., 2011) to represent these computations.² We write the two examples in Figure 2 as

(describe[color] find[bird])

and

(exists find[state])

respectively. These are network layouts: they specify a structure for arranging modules (and their lexical parameters) into a complete network. Andreas et al. (2016) use hand-written rules to deterministically transform dependency trees into layouts, and are restricted to producing simple structures like the above for non-synthetic data. For full generality, we will need to solve harder problems, like transforming What cities are in Georgia? (Figure 1) into

(and find[city]
     (relate[in] lookup[Georgia]))
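As a concrete aid, layouts like these can be written down as simple nested structures. The following sketch (hypothetical, not the released nmn2 code) encodes the three examples above as nested Python tuples: the first element names a module, remaining strings are its parameter arguments, and remaining tuples are sub-layouts.

```python
# Hypothetical encoding of network layouts as nested tuples.
bird_color = ("describe", "color", ("find", "bird"))
any_states = ("exists", ("find", "state"))
cities_in_georgia = ("and",
                     ("find", "city"),
                     ("relate", "in", ("lookup", "Georgia")))
```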

In this paper, we present a model for learning to select such structures from a set of automatically generated candidates. We call this model a dynamic neural module network.

²But note that unlike formal semantics, the behavior of the primitive functions here is itself unknown.


3 Related work

There is an extensive literature on database question answering, in which strings are mapped to logical forms, then evaluated by a black-box execution model to produce answers. Supervision may be provided either by annotated logical forms (Wong and Mooney, 2007; Kwiatkowski et al., 2010; Andreas et al., 2013) or from (world, question, answer) triples alone (Liang et al., 2011; Pasupat and Liang, 2015). In general the set of primitive functions from which these logical forms can be assembled is fixed, but one recent line of work focuses on inducing new predicates automatically, either from perceptual features (Krishnamurthy and Kollar, 2013) or the underlying schema (Kwiatkowski et al., 2013). The model we describe in this paper has a unified framework for handling both the perceptual and schema cases, and differs from existing work primarily in learning a differentiable execution model with continuous evaluation results.

Neural models for question answering are also a subject of current interest. These include approaches that model the task directly as a multiclass classification problem (Iyyer et al., 2014), models that attempt to embed questions and answers in a shared vector space (Bordes et al., 2014), and attentional models that select words from source documents (Hermann et al., 2015). Such approaches generally require that answers can be retrieved directly based on surface linguistic features, without requiring intermediate computation. A more structured approach described by Yin et al. (2015) learns a query execution model for database tables without any natural language component. Previous efforts toward unifying formal logic and representation learning include those of Grefenstette (2013) and Krishnamurthy and Mitchell (2013).

The visually-grounded component of this work relies on recent advances in convolutional networks for computer vision (Simonyan and Zisserman, 2014), and in particular the fact that late convolutional layers in networks trained for image recognition contain rich features useful for other downstream vision tasks, while preserving spatial information. These features have been used for both image captioning (Xu et al., 2015) and visual question answering (Yang et al., 2015).

Most previous approaches to visual question answering either apply a recurrent model to deep representations of both the image and the question (Ren et al., 2015; Malinowski et al., 2015), or use the question to compute an attention over the input image, and then answer based on both the question and the image features attended to (Yang et al., 2015; Xu and Saenko, 2015). Other approaches include the simple classification model described by Zhou et al. (2015) and the dynamic parameter prediction network described by Noh et al. (2015). All of these models assume that a fixed computation can be performed on the image and question to compute the answer, rather than adapting the structure of the computation to the question.

As noted, Andreas et al. (2016) previously considered a simple generalization of these attentional approaches in which small variations in the network structure per-question were permitted, with the structure chosen by (deterministic) syntactic processing of questions. Other approaches in this general family include the “universal parser” sketched by Bottou (2014), the graph transformer networks of Bottou et al. (1997), and the recursive neural networks of Socher et al. (2013), which use a fixed tree structure to perform further linguistic analysis without any external world representation. We are unaware of previous work that succeeds in simultaneously learning both the parameters for and structures of instance-specific neural networks.

4 Model

Recall that our goal is to map from questions and world representations to answers. This process involves the following variables:

1. w, a world representation
2. x, a question
3. y, an answer
4. z, a network layout
5. θ, a collection of model parameters

Our model is built around two distributions: a layout model p(z|x; θℓ), which chooses a layout for a sentence, and an execution model pz(y|w; θe), which applies the network specified by z to w.

For ease of presentation, we introduce these models in reverse order. We first imagine that z is always observed, and in Section 4.1 describe how to evaluate and learn modules parameterized by θe within fixed structures. In Section 4.2, we move to the real scenario, where z is unknown. We describe how to predict layouts from questions and learn θe and θℓ jointly without layout supervision.

4.1 Evaluating modules

Given a layout z, we assemble the corresponding modules into a full neural network (Figure 1c), and apply it to the knowledge representation. Intermediate results flow between modules until an answer is produced at the root. We denote the output of the network with layout z on input world w as ⟦z⟧w; when explicitly referencing the substructure of z, we can alternatively write ⟦m(h1, h2)⟧ for a top-level module m with submodule outputs h1 and h2. We then define the execution model:

pz(y|w) = (⟦z⟧w)y  (1)

(This assumes that the root module of z produces a distribution over labels y.) The set of possible layouts z is restricted by module type constraints: some modules (like find above) operate directly on the input representation, while others (like describe above) also depend on input from specific earlier modules. The two base types considered in this paper are Attention (a distribution over pixels or entities) and Labels (a distribution over answers).
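To make the recursion behind Equation 1 concrete, here is a minimal sketch of bottom-up layout evaluation, assuming a hypothetical `modules` registry of callables and the nested-tuple layout encoding shown in Section 2; it is illustrative only, not the released implementation.

```python
# Sketch: evaluate a layout tree bottom-up to obtain [[z]]_w (Equation 1).
def evaluate(layout, world, modules):
    name, *rest = layout
    # Strings are parameter arguments (lexical items); tuples are sub-layouts
    # whose outputs become the module's ordinary inputs.
    params = [a for a in rest if isinstance(a, str)]
    inputs = [evaluate(a, world, modules) for a in rest if isinstance(a, tuple)]
    return modules[name](world, params, inputs)

# If the root module (describe or exists) returns a distribution over labels,
# then p_z(y | w) is simply the y-th entry of evaluate(z, world, modules).
```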

Parameters are tied across multiple instances of the same module, so different instantiated networks may share some parameters but not others. Modules have both parameter arguments (shown in square brackets) and ordinary inputs (shown in parentheses). Parameter arguments, like the running bird example in Section 2, are provided by the layout, and are used to specialize module behavior for particular lexical items. Ordinary inputs are the result of computation lower in the network. In addition to parameter-specific weights, modules have global weights shared across all instances of the module (but not shared with other modules). We write A, a, B, b, . . . for global weights and ui, vi for weights associated with the parameter argument i. ⊕ and ⊙ denote (possibly broadcasted) elementwise addition and multiplication respectively. The complete set of global weights and parameter-specific weights constitutes θe. Every module has access to the world representation, represented as a collection of vectors w1, w2, . . . (or W expressed as a matrix). The nonlinearity σ denotes a rectified linear unit.
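The tying scheme can be pictured as a small parameter store, sketched below with hypothetical names: one set of global weights per module type, plus one vector per (module, lexical item) pair.

```python
import numpy as np

# Illustrative parameter store for theta_e (hypothetical structure, not the
# authors' code): global weights are shared by every instance of a module,
# and v_i vectors are shared by every use of the same parameter argument.
class ExecutionParameters:
    def __init__(self, dim):
        self.dim = dim
        self.global_weights = {}   # e.g. {"find": {"a": ..., "B": ...}, ...}
        self.arg_vectors = {}      # keyed by (module_name, lexical_item)

    def arg_vector(self, module_name, word):
        key = (module_name, word)
        if key not in self.arg_vectors:
            # Created once, then reused (tied) across all networks that mention it.
            self.arg_vectors[key] = 0.01 * np.random.randn(self.dim)
        return self.arg_vectors[key]
```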

The modules used in this paper are shown below, with names and type constraints in the first row and a description of the module’s computation following.

Lookup (→ Attention)
lookup[i] produces an attention focused entirely at the index f(i), where the relationship f between words and positions in the input map is known ahead of time (e.g. string matches on database fields).

⟦lookup[i]⟧ = e_f(i)  (2)

where e_i is the basis vector that is 1 in the ith position and 0 elsewhere.

Find (→ Attention)
find[i] computes a distribution over indices by concatenating the parameter argument with each position of the input feature map, and passing the concatenated vector through an MLP:

⟦find[i]⟧ = softmax(a ⊙ σ(Bvi ⊕ CW ⊕ d))  (3)
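The following numpy sketch spells out one plausible reading of Equation 3; the weight names and shapes are assumptions made for illustration, and the broadcasted ⊕ adds the projected parameter vector to every position's projected features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def find_module(W, v_i, a, B, C, d):
    """W: (n, d_w) world features, one row per pixel/entity.
    v_i: embedding of parameter argument i. Returns an attention over n positions."""
    hidden = relu(v_i @ B.T + W @ C.T + d)   # (n, h): B v_i broadcast against each C w_k
    scores = hidden @ a                      # read a ⊙ σ(·) as a weighted sum per position
    return softmax(scores)                   # distribution over positions (Equation 3)
```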

Relate (Attention → Attention)
relate directs focus from one region of the input to another. It behaves much like the find module, but also conditions its behavior on the current region of attention h. Let w̄(h) = Σk hk wk, where hk is the kth element of h. Then,

⟦relate[i](h)⟧ = softmax(a ⊙ σ(Bvi ⊕ CW ⊕ Dw̄(h) ⊕ e))  (4)

And (Attention* → Attention)
and performs an operation analogous to set intersection for attentions. The analogy to probabilistic logic suggests multiplying probabilities:

⟦and(h1, h2, . . .)⟧ = h1 ⊙ h2 ⊙ · · ·  (5)

Describe (Attention → Labels)
describe[i] computes a weighted average of w under the input attention. This average is then used to predict an answer representation. With w̄ as above,

⟦describe[i](h)⟧ = softmax(Aσ(Bw̄(h) + vi))  (6)

Exists (Attention → Labels)
exists is the existential quantifier, and inspects the incoming attention directly to produce a label, rather than an intermediate feature vector like describe:

⟦exists(h)⟧ = softmax((maxk hk)a + b)  (7)
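For completeness, here is a matching sketch of the remaining modules (Equations 2 and 4-7) in the same hypothetical numpy conventions; again, an illustration under assumed weight shapes rather than the released code.

```python
import numpy as np

def relu(x):          # repeated from the find sketch above
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lookup_module(n, idx):
    """Equation 2: one-hot attention at the known index f(i)."""
    e = np.zeros(n)
    e[idx] = 1.0
    return e

def relate_module(W, h, v_i, a, B, C, D, e_bias):
    """Equation 4: like find, but also conditioned on the attended average w̄(h)."""
    w_bar = h @ W                                    # w̄(h) = Σ_k h_k w_k
    hidden = relu(v_i @ B.T + W @ C.T + D @ w_bar + e_bias)
    return softmax(hidden @ a)

def and_module(*attentions):
    """Equation 5: elementwise product of attentions (soft set intersection)."""
    out = np.ones_like(attentions[0])
    for h in attentions:
        out = out * h
    return out

def describe_module(W, h, v_i, A, B):
    """Equation 6: predict labels from the attended average of world features."""
    w_bar = h @ W
    return softmax(A @ relu(B @ w_bar + v_i))

def exists_module(h, a, b):
    """Equation 7: quantify over the attention itself via its maximum value."""
    return softmax(h.max() * a + b)
```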

Figure 3: Generation of layout candidates. The input sentence (a) is represented as a dependency parse (b). Fragments of this dependency parse are then associated with appropriate modules (c), and these fragments are assembled into full layouts (d).

With z observed, the model we have described so far corresponds largely to that of Andreas et al. (2016), though the module inventory is different—in particular, our new exists and relate modules do not depend on the two-dimensional spatial structure of the input. This enables generalization to non-visual world representations.

Learning in this simplified setting is straightforward. Assuming the top-level module in each layout is a describe or exists module, the fully-instantiated network corresponds to a distribution over labels conditioned on layouts. To train, we maximize Σ(w,y,z) log pz(y|w; θe) directly. This can be understood as a parameter-tying scheme, where the decisions about which parameters to tie are governed by the observed layouts z.
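A minimal sketch of this fully supervised objective, reusing the hypothetical evaluate helper from above and treating each answer as an index into the label vocabulary:

```python
import numpy as np

def supervised_loss(data, modules):
    """Negative of Σ_(w,y,z) log p_z(y | w; θe) over (world, answer, layout) triples."""
    total = 0.0
    for world, answer, layout in data:
        label_dist = evaluate(layout, world, modules)   # forward pass (Equation 1)
        total += -np.log(label_dist[answer])
    return total

# Minimizing this with any gradient-based optimizer realizes the parameter-tying
# scheme: which weights are shared is dictated entirely by the observed layouts.
```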

4.2 Assembling networks

Next we describe the layout model p(z|x; θℓ). We first use a fixed syntactic parse to generate a small set of candidate layouts, analogously to the way a semantic grammar generates candidate semantic parses in previous work (Berant and Liang, 2014).

A semantic parse differs from a syntactic parse in two primary ways. First, lexical items must be mapped onto a (possibly smaller) set of semantic primitives. Second, these semantic primitives must be combined into a structure that closely, but not exactly, parallels the structure provided by syntax. For example, state and province might need to be identified with the same field in a database schema, while all states have a capital might need to be identified with the correct (in situ) quantifier scope.

While we cannot avoid the structure selection problem, continuous representations simplify the lexical selection problem. For modules that accept a vector parameter, we associate these parameters with words rather than semantic tokens, and thus turn the combinatorial optimization problem associated with lexicon induction into a continuous one. Now, in order to learn that province and state have the same denotation, it is sufficient to learn that their associated parameters are close in some embedding space—a task amenable to gradient descent. (Note that this is easy only in an optimizability sense, and not an information-theoretic one—we must still learn to associate each independent lexical item with the correct vector.) The remaining combinatorial problem is to arrange the provided lexical items into the right computational structure. In this respect, layout prediction is more like syntactic parsing than ordinary semantic parsing, and we can rely on an off-the-shelf syntactic parser to get most of the way there. In this work, syntactic structure is provided by the Stanford dependency parser (De Marneffe and Manning, 2008).

The construction of layout candidates is depicted in Figure 3, and proceeds as follows:

1. Represent the input sentence as a dependency tree.

2. Collect all nouns, verbs, and prepositional phrases that are attached directly to a wh-word or copula.

3. Associate each of these with a layout fragment: Ordinary nouns and verbs are mapped to a single find module. Proper nouns to a single lookup module. Prepositional phrases are mapped to a depth-2 fragment, with a relate module for the preposition above a find module for the enclosed head noun.

4. Form subsets of this set of layout fragments. For each subset, construct a layout candidate by joining all fragments with an and module, and inserting either an exists or describe module at the top (each subset thus results in two parse candidates); a rough sketch of this procedure is given below.
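The sketch below walks through steps 1-4 under a hypothetical dependency-parse interface (the actual system uses the Stanford dependency parser); the helper and attribute names are illustrative assumptions.

```python
from itertools import combinations

def layout_fragments(parse):
    """Steps 2-3: map words attached to the wh-word or copula onto layout fragments."""
    fragments = []
    for tok in parse.attached_to_wh_or_copula():       # hypothetical parse API
        if tok.is_proper_noun():
            fragments.append(("lookup", tok.word))
        elif tok.is_noun() or tok.is_verb():
            fragments.append(("find", tok.word))
        elif tok.is_preposition():
            fragments.append(("relate", tok.word, ("find", tok.object_head().word)))
    return fragments

def candidate_layouts(fragments, wh_word="what"):
    """Step 4: every nonempty subset, joined with `and`, under each root module."""
    candidates = []
    for r in range(1, len(fragments) + 1):
        for subset in combinations(fragments, r):
            body = subset[0] if len(subset) == 1 else ("and",) + subset
            candidates.append(("exists", body))
            candidates.append(("describe", wh_word, body))
    return candidates
```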

All layouts resulting from this process feature a relatively flat tree structure with at most one conjunction and one quantifier. This is a strong simplifying assumption, but appears sufficient to cover most of the examples that appear in both of our tasks. As our approach includes categories, relations, and simple quantification, the range of phenomena considered is generally broader than previous perceptually-grounded QA work (Krishnamurthy and Kollar, 2013; Matuszek et al., 2012).

Having generated a set of candidate parses, we need to score them. This is a ranking problem; as in the rest of our approach, we solve it using standard neural machinery. In particular, we produce an LSTM representation of the question, a feature-based representation of the query, and pass both representations through a multilayer perceptron (MLP). The query feature vector includes indicators on the number of modules of each type present, as well as their associated parameter arguments. While one can easily imagine a more sophisticated parse-scoring model, this simple approach works well for our tasks.

Formally, for a question x, let hq(x) be an LSTM encoding of the question (i.e. the last hidden layer of an LSTM applied word-by-word to the input question). Let {z1, z2, . . .} be the proposed layouts for x, and let f(zi) be a feature vector representing the ith layout. Then the score s(zi|x) for the layout zi is

s(zi|x) = a⊤σ(Bhq(x) + Cf(zi) + d)  (8)

i.e. the output of an MLP with inputs hq(x) and f(zi), and parameters θℓ = {a, B, C, d}. Finally, we normalize these scores to obtain a distribution:

p(zi|x; θℓ) = exp(s(zi|x)) / Σj exp(s(zj|x))  (9)
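Concretely, the scoring model of Equations 8-9 amounts to a one-hidden-layer MLP followed by a softmax over the candidate set; the sketch below assumes hypothetical encode_question (final LSTM state) and layout_features (module-count indicator) helpers.

```python
import numpy as np

def layout_distribution(question, layouts, theta_l):
    a, B, C, d = theta_l["a"], theta_l["B"], theta_l["C"], theta_l["d"]
    hq = encode_question(question)                      # h_q(x), last LSTM hidden state
    scores = np.array([
        a @ np.maximum(B @ hq + C @ layout_features(z) + d, 0.0)   # Equation 8
        for z in layouts
    ])
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                              # Equation 9
```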

Having defined a layout selection model p(z|x; θℓ) and a network execution model pz(y|w; θe), we are ready to define a model for predicting answers given only (world, question) pairs. The key constraint is that we want to minimize evaluations of pz(y|w; θe) (which involves expensive application of a deep network to a large input representation), but can tractably evaluate p(z|x; θℓ) for all z (which involves application of a shallow network to a relatively small set of candidates). This is the opposite of the situation usually encountered in semantic parsing, where calls to the query execution model are fast but the set of candidate parses is too large to score exhaustively.

In fact, the problem more closely resembles the scenario faced by agents in the reinforcement learning setting (where it is cheap to score actions, but potentially expensive to execute them and obtain rewards). We adopt a common approach from that literature, and express our model as a stochastic policy. Under this policy, we first sample a layout z from a distribution p(z|x; θℓ), and then apply z to the knowledge source and obtain a distribution over answers p(y|z, w; θe).

After z is chosen, we can train the execution model directly by maximizing log p(y|z, w; θe) with respect to θe as before (this is ordinary backpropagation). Because the hard selection of z is non-differentiable, we optimize p(z|x; θℓ) using a policy gradient method. The gradient of the reward surface J with respect to the parameters of the policy is

∇J(θℓ) = E[∇ log p(z|x; θℓ) · r]  (10)

(this is the REINFORCE rule (Williams, 1992)). Here the expectation is taken with respect to rollouts of the policy, and r is the reward. Because our goal is to select the network that makes the most accurate predictions, we take the reward to be identically the log-probability from the execution phase, i.e.

E[(∇ log p(z|x; θℓ)) · log p(y|z, w; θe)]  (11)

Thus the update to the layout-scoring model at each timestep is simply the gradient of the log-probability of the chosen layout, scaled by the accuracy of that layout’s predictions. At training time, we approximate the expectation with a single rollout, so at each step we update θℓ in the direction (∇ log p(z|x; θℓ)) · log p(y|z, w; θe) for a single z ∼ p(z|x; θℓ). θe and θℓ are optimized using ADADELTA (Zeiler, 2012) with ρ = 0.95, ε = 1e−6 and gradient clipping at a norm of 10.
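Putting the two updates together, a single training step looks roughly like the following; parse, sample_index, execute, scale, and the gradient helpers are placeholders standing in for whatever framework computes the actual derivatives, not a specific library's API.

```python
import numpy as np

def training_step(world, question, answer, theta_l, theta_e):
    layouts = candidate_layouts(layout_fragments(parse(question)))   # Section 4.2 proposals
    probs = layout_distribution(question, layouts, theta_l)
    z = layouts[sample_index(probs)]                 # single rollout z ~ p(z|x; θℓ)

    label_dist = execute(z, world, theta_e)          # p(y | z, w; θe)
    reward = np.log(label_dist[answer])

    # θe: ordinary backpropagation on log p(y | z, w; θe).
    grad_e = grad_execution_log_likelihood(z, world, answer, theta_e)
    # θℓ: REINFORCE, gradient of log p(z | x; θℓ) scaled by the reward.
    grad_l = scale(grad_layout_log_prob(z, question, theta_l), reward)
    return grad_e, grad_l

# Both gradients are then applied with ADADELTA (ρ = 0.95, ε = 1e-6),
# clipping to norm 10, as described above.
```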


[Figure 4 shows three VQA examples with predicted layouts and answers: What is in the sheep’s ear? → (describe[what] (and find[sheep] find[ear])) → tag; What color is she wearing? → (describe[color] find[wear]) → white; What is the man dragging? → (describe[what] find[man]) → boat (correct answer: board).]

Figure 4: Sample outputs for the visual question answering task. The second row shows the final attention provided as input to the top-level describe module. For the first two examples, the model produces reasonable parses, attends to the correct region of the images (the ear and the woman’s clothing), and generates the correct answer. In the third image, the verb is discarded and a wrong answer is produced.

5 Experiments

The framework described in this paper is general, and we are interested in how well it performs on datasets of varying domain, size and linguistic complexity. To that end, we evaluate our model on tasks at opposite extremes of both these criteria: a large visual question answering dataset, and a small collection of more structured geography questions.

5.1 Questions about images

Our first task is the recently-introduced Visual Question Answering challenge (VQA) (Antol et al., 2015). The VQA dataset consists of more than 200,000 images paired with human-annotated questions and answers, as in Figure 4.

We use the VQA 1.0 release, employing the development set for model selection and hyperparameter tuning, and reporting final results from the evaluation server on the test-standard set. For the experiments described in this section, the input feature representations wi are computed by the fifth convolutional layer of a 16-layer VGGNet after pooling (Simonyan and Zisserman, 2014). Input images are scaled to 448×448 before computing their representations.

              test-dev                      test-std
            Yes/No  Number  Other   All    All

Zhou (2015)   76.6    35.0   42.6   55.7   55.9
Noh (2015)    80.7    37.2   41.7   57.2   57.4
Yang (2015)   79.3    36.6   46.1   58.7   58.9
NMN           81.2    38.0   44.0   58.6   58.7
D-NMN         81.1    38.6   45.5   59.4   59.4

Table 1: Results on the VQA test server. NMN is the parameter-tying model from Andreas et al. (2016), and D-NMN is the model described in this paper.

We found that performance on this task was best if the candidate layouts were relatively simple: only describe, and, and find modules are used, and layouts contain at most two conjuncts.

One weakness of this basic framework is a difficulty modeling prior knowledge about answers (of the form most bears are brown). This kind of linguistic “prior” is essential for the VQA task, and easily incorporated. We simply introduce an extra hidden layer for recombining the final module network output with the input sentence representation hq(x) (see Equation 8), replacing Equation 1 with:

log pz(y|w, x) = (Ahq(x) + B⟦z⟧w)y  (12)

(Now modules with output type Labels should be understood as producing an answer embedding rather than a distribution over answers.) This allows the question to influence the answer directly.
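The recombination in Equation 12 is just a linear mixture of the question encoding and the module-network output before the final normalization; a small sketch under the same hypothetical conventions:

```python
import numpy as np

def answer_log_probs(hq, module_output, A, B):
    """Equation 12: module_output is now an answer embedding, not a distribution."""
    logits = A @ hq + B @ module_output
    return logits - np.log(np.exp(logits).sum())     # log p_z(y | w, x)
```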

Results are shown in Table 1. The use of dynamic networks provides a small gain, most noticeably on “other” questions. We achieve state-of-the-art results on this task, outperforming a highly effective visual bag-of-words model (Zhou et al., 2015), a model with dynamic network parameter prediction (but fixed network structure) (Noh et al., 2015), a more conventional attentional model (Yang et al., 2015), and a previous approach using neural module networks with no structure prediction (Andreas et al., 2016).

Some examples are shown in Figure 4. In general, the model learns to focus on the correct region of the image, and tends to consider a broad window around the region. This facilitates answering questions like Where is the cat?, which requires knowledge of the surroundings as well as the object in question.


           Accuracy
Model    GeoQA   GeoQA+Q

LSP-F     48       –
LSP-W     51       –
NMN       51.7    35.7
D-NMN     54.3    42.9

Table 2: Results on the GeoQA dataset, and the GeoQA dataset with quantification. Our approach outperforms both a purely logical model (LSP-F) and a model with learned perceptual predicates (LSP-W) on the original dataset, and a fixed-structure NMN under both evaluation conditions.

5.2 Questions about geography

The next set of experiments we consider focuses on GeoQA, a geographical question-answering task first introduced by Krishnamurthy and Kollar (2013). This task was originally paired with a visual question answering task much simpler than the one just discussed, and is appealing for a number of reasons. In contrast to the VQA dataset, GeoQA is quite small, containing only 263 examples. Two baselines are available: one using a classical semantic parser backed by a database, and another which induces logical predicates using linear classifiers over both spatial and distributional features. This allows us to evaluate the quality of our model relative to other perceptually grounded logical semantics, as well as strictly logical approaches.

The GeoQA domain consists of a set of entities (e.g. states, cities, parks) which participate in various relations (e.g. north-of, capital-of). Here we take the world representation to consist of two pieces: a set of category features (used by the find module) and a different set of relational features (used by the relate module). For our experiments, we use a subset of the features originally used by Krishnamurthy et al. The original dataset includes no quantifiers, and treats the questions What cities are in Texas? and Are there any cities in Texas? identically. Because we are interested in testing the parser’s ability to predict a variety of different structures, we introduce a new version of the dataset, GeoQA+Q, which distinguishes these two cases, and expects a Boolean answer to questions of the second kind.

Results are shown in Table 2. As in the original work, we report the results of leave-one-environment-out cross-validation on the set of 10 environments.

Is Key Largo an island?
(exists (and lookup[key-largo] find[island]))
yes: correct

What national parks are in Florida?
(and find[park] (relate[in] lookup[florida]))
everglades: correct

What are some beaches in Florida?
(exists (and lookup[beach] (relate[in] lookup[florida])))
yes (daytona-beach): wrong parse

What beach city is there in Florida?
(and lookup[beach] lookup[city] (relate[in] lookup[florida]))
[none] (daytona-beach): wrong module behavior

Figure 5: Example layouts and answers selected by the model on the GeoQA dataset. For incorrect predictions, the correct answer is shown in parentheses.

Our dynamic model (D-NMN) outperforms both the logical (LSP-F) and perceptual models (LSP-W) described by Krishnamurthy and Kollar (2013), as well as a fixed-structure neural module net (NMN). This improvement is particularly notable on the dataset with quantifiers, where dynamic structure prediction produces a 20% relative improvement over the fixed baseline. A variety of predicted layouts are shown in Figure 5.

6 Conclusion

We have introduced a new model, the dynamic neural module network, for answering queries about both structured and unstructured sources of information. Given only (question, world, answer) triples as training data, the model learns to assemble neural networks on the fly from an inventory of neural modules, and simultaneously learns weights for these modules so that they can be composed into novel structures. Our approach achieves state-of-the-art results on two tasks. We believe that the success of this work derives from two factors:

Continuous representations improve the expressiveness and learnability of semantic parsers: by replacing discrete predicates with differentiable neural network fragments, we bypass the challenging combinatorial optimization problem associated with induction of a semantic lexicon. In structured world representations, neural predicate representations allow the model to invent reusable attributes and relations not expressed in the schema. Perhaps more importantly, we can extend compositional question-answering machinery to complex, continuous world representations like images.

Semantic structure prediction improves generalization in deep networks: by replacing a fixed network topology with a dynamic one, we can tailor the computation performed to each problem instance, using deeper networks for more complex questions and representing combinatorially many queries with comparatively few parameters. In practice, this results in considerable gains in speed and sample efficiency, even with very little training data.

These observations are not limited to the question answering domain, and we expect that they can be applied similarly to tasks like instruction following, game playing, and language generation.

Acknowledgments

JA is supported by a National Science Foundation Graduate Fellowship. MR is supported by a fellowship within the FIT weltweit-Program of the German Academic Exchange Service (DAAD). This work was additionally supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center.

References

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the International Conference on Computer Vision.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 7, page 92.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Leon Bottou, Yoshua Bengio, and Yann Le Cun. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 489–494.

Leon Bottou. 2014. From machine learning to machine reasoning. Machine Learning, 94(2):133–149.

Marie-Catherine De Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Proceedings of the International Conference on Computational Linguistics, pages 1–8.

Edward Grefenstette. 2013. Towards a formal distributional semantics: Simulating logical calculi with tensors. In Joint Conference on Lexical and Computational Semantics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daume III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics.

Jayant Krishnamurthy and Tom Mitchell. 2013. Vector space semantic parsing: A framework for compositional vector space models. In Proceedings of the ACL Workshop on Continuous Vector Space Models and their Compositionality.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1223–1233, Cambridge, Massachusetts.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Human Language Technology Conference of the Association for Computational Linguistics, pages 590–599, Portland, Oregon.

Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the International Conference on Computer Vision.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning.

Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. 2015. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 45, page 960.

Huijuan Xu and Kate Saenko. 2015. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2015. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274.

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2015. Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.