arXiv:2105.02486v2 [cs.CL] 20 Dec 2021


Towards General Natural Language Understanding with Probabilistic Worldbuilding

Abulhair Saparov and Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
{asaparov, tom.mitchell}@cs.cmu.edu

Abstract

We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations which greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of sentences, acquired facts about the world, and intermediate steps in reasoning are all expressed in a human-readable formal language, with the design goal of interpretability. PWM is Bayesian, designed specifically to be able to generalize to new domains and new tasks. We derive and implement an inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and evaluate it on two out-of-domain question-answering datasets: (1) ProofWriter and (2) a new dataset we call FictionalGeoQA, designed to be more representative of real language but still simple enough to focus on evaluating reasoning ability, while being robust against heuristics. Our method outperforms baselines on both, thereby demonstrating its value as a proof-of-concept.

1 Introduction

Despite recent progress in AI and NLP producing algorithms that perform well on a number of NLP tasks, it is still unclear how to move forward and develop algorithms that understand language as well as humans do. In particular, large-scale language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), XLNet (Yang et al., 2019), and others were trained on a very large amount of text and can then be applied to perform many different NLP tasks after some fine-tuning. In the case of GPT-3, some tasks require very few additional training examples to achieve state-of-the-art performance. As a result of training on text from virtually every domain, these models are domain-general. This is in contrast with NLP algorithms that are largely trained on one or a small handful of domains, and as such, are not able to perform well on new domains outside of their training. Despite this focus on domain-generality, there are still a large number of tasks on which these large-scale language models perform poorly (Dunietz et al., 2020). Many limitations of today's state-of-the-art methods become evident when comparing with the human ability to understand language (Lake et al., 2016; Tamari et al., 2020; Bender and Koller, 2020; Gardner et al., 2019; Linzen, 2020). Many cognitive scientists posit that humans create rich mental models of the world from their observations which provide superior explainability, reasoning, and generalizability to new domains and tasks. How do we, as a field, move from today's state-of-the-art to more general intelligence? What are the next steps to develop algorithms that can generalize to new tasks at the same level as humans? The lack of interpretability in many of these models makes these questions impossible to answer precisely. One promising direction is to change the evaluation metric: Brown et al. (2020), Linzen (2020), and many others have suggested zero-shot or few-shot accuracy to measure the performance of algorithms (i.e. the algorithm is evaluated with a new dataset, wholly separate from its training; or in the case of few-shot learning, save for a few examples). While this shift is welcome, it alone will not solve the above issues.

We introduce the Probabilistic Worldbuilding Model (PWM), a probabilistic generative model of reasoning and semantic parsing. Like some past approaches, PWM explicitly builds an internal mental model, which we call the theory (Tamari et al., 2020; Hogan et al., 2021; Mitchell et al., 2018; Charniak and Goldman, 1993). The theory constitutes what the algorithm believes to be true.


[Figure 1 depicts the generative process and inference in our model:
Theory: T = { mammal(alice), cat(bob), ∀x(cat(x) → mammal(x)), . . . }
Proof and logical form of the ith sentence: πi = ( Ax: cat(bob); Ax: ∀x(cat(x) → mammal(x)); ∀E: cat(bob) → mammal(bob); →E: mammal(bob) )
ith sentence: yi = "Bob is a mammal."
Generative arrows: generate proofs from the axioms in the theory; generate sentence from logical form. Inference arrows: parse sentence to logical form; infer theory and proofs from observed logical forms.]

Figure 1: The generative process and inference in our model, with an example of a theory, generating a proof of a logical form which itself generates the sentence "Bob is a mammal." During inference, only the sentences are observed, whereas the theory and proofs are latent. Given sentence yi, the language module outputs the logical form. The reasoning module then infers the proof πi of the logical form and updates the posterior of the theory T.

PWM is fully symbolic and Bayesian, using a single unified human-readable formal language to represent all meaning, and is therefore inherently interpretable. This is in contrast to systems that use subsymbolic representations of meaning for some or all of their components. Every random variable in PWM is well-defined with respect to other random variables and/or grounded primitives. Prior knowledge, such as the rules of deductive inference, the structure of English grammar, and knowledge of basic physics and mathematics, can be incorporated by modifying the prior distributions of the random variables in PWM. Incorporating prior knowledge can greatly reduce the amount of training data required to achieve sufficient generalizability, as we will demonstrate. Extensibility is key to future research that could enable more general NLU and AI, as it provides a clearer path forward for future exploration.

We present an implementation of inference under the proposed model, called Probabilistic Worldbuilding from Language (PWL). While PWM is an abstract mathematical description of the underlying distribution of axioms, proofs, logical forms, and sentences, PWL is the algorithm that reads sentences, computes logical form representations of their meaning, and updates the axioms and proofs in the theory accordingly. See figure 1 for a high-level schematic diagram of PWM and PWL. PWM describes the process depicted by the red arrows, whereas PWL is the algorithm depicted by the green arrows. We emphasize that the reasoning in PWL is not a theorem prover and is not purely deductive. Instead, PWL solves the different problem of abduction, which in some ways is computationally easier than deductive inference: Given a set of observations, work backwards to find a set of axioms that deductively explain the observations.

It is these abduced axioms that constitute the internal "mental model." Humans often rely on abductive reasoning, for example in commonsense reasoning (Bhagavatula et al., 2020; Furbach et al., 2015).

A core principle of our approach is to ensure generality by design. Simplifying assumptions often trade away generality for tractability, such as by restricting the representation of the meanings of sentences, or the number of steps during reasoning. PWM is designed to be domain- and task-general, and to this end, uses higher-order logic (i.e. lambda calculus) (Church, 1940) as the formal language, which we believe is sufficiently expressive to capture the meaning of declarative and interrogative sentences in natural language. Furthermore, PWM uses natural deduction for reasoning, which is complete in that if a logical form φ is true, there is a proof of φ (Henkin, 1950).

In section 3, we describe PWM and PWL more precisely. In section 4, as a proof-of-concept of the value of further research, we run experiments on two question-answering datasets: ProofWriter (Tafjord et al., 2021) and a new dataset, called FictionalGeoQA, which we specifically created to evaluate the ability to reason over short paragraphs while being robust against simpler heuristic strategies. Unlike ProofWriter, the text in FictionalGeoQA was not template-generated and is more realistic, but is still simple enough to focus the evaluation on reasoning rather than parsing, with many sentences having semantics that go beyond the Horn clause fragment of first-order logic. PWL outperforms the baselines with respect to zero-shot accuracy (i.e. without looking at any training examples). Our code and data are freely available at github.com/asaparov/PWL and github.com/asaparov/fictionalgeoqa.


In summary, the primary contributions of this paper are the following:
• PWM, a new model for more general NLU, and PWL, the implementation that reads sentences, computes their logical forms, and updates its theory accordingly.
• FictionalGeoQA, a new question-answering dataset designed to evaluate the ability to reason over language.
• Experiments on ProofWriter and FictionalGeoQA demonstrating that PWL outperforms baseline methods on question-answering.

2 Related work

Fully symbolic methods were commonplace in earlier AI research (Newell and Simon, 1976; Dreyfus, 1985). However, they were oftentimes brittle: A new observation would contradict the internal theory or violate an assumption, and it was not clear how to resolve the impasse in a principled manner and proceed. But they do have some key advantages: Symbolic approaches that use well-studied human-readable formal languages such as first-order logic, higher-order logic, type theory, etc. enable humans to readily inspect and understand the internal processing of these algorithms, effecting a high degree of interpretability (Dowty, 1981; Gregory, 2015; Cooper et al., 2015). Symbolic systems can be made general by design, by using a sufficiently expressive formal language and ontology. Hybrid methods have been explored to alleviate the brittleness of formal systems while engendering their strengths, such as interpretability and generalizability; for example, the recent work in neuro-symbolic methods (Yi et al., 2020; Saha et al., 2020; Tafjord et al., 2021). Neural theorem provers are in this vein (Rocktäschel and Riedel, 2017). However, the proofs considered in these approaches are based on backward chaining (Russell and Norvig, 2010), which restricts the semantics to the Horn clause fragment of first-order logic. Sun et al. (2020), Ren et al. (2020), and Arakelyan et al. (2021) extend coverage to the existential positive fragment of first-order logic. In natural language, sentences express more complex semantics such as negation, nested universal quantification, and higher-order structures. Our work explores the other side of the tradeoff between tractability and expressivity/generality. Theorem provers attempt to solve the problem of deduction: finding a proof of a given formula, given a set of axioms.

In contrast, the reasoning component of PWM is abductive, and the abduced axioms can be used in downstream tasks, such as question-answering, and to better read new sentences in the context of the world model, as we will demonstrate. We posit that abduction is sufficient for more general NLU (Hobbs, 2006; Hobbs et al., 1993). PWM combines Bayesian statistical machine learning with symbolic representations in order to handle uncertainty in a principled manner, "smoothing out" or "softening" the rigidity of a purely symbolic approach. In PWM, the internal theory is a random variable, so if a new observation is inconsistent with the theory, there may be other theories in the probability space that are consistent with the observation. The probabilistic approach provides a principled way to resolve these impasses.

PWM is certainly not the first to combine symbolic and probabilistic methods. There is a rich history of inductive logic programming (ILP) (Muggleton, 1991; Cropper and Morel, 2021) and probabilistic ILP languages (Muggleton, 1996; Cussens, 2001; Sato et al., 2005; Bellodi and Riguzzi, 2015). These languages could be used to learn a "theory" from a collection of observations, but they are typically restricted to learning rules in the form of first-order Horn clauses, for tractability. In natural language, it is easy to express semantics beyond the Horn clause fragment of first-order logic.

Knowledge bases (KBs) and cognitive architectures (Kotseruba and Tsotsos, 2020; Hogan et al., 2021; Laird et al., 1987; Mitchell et al., 2018) have attempted to explicitly model domain-general knowledge in a form amenable to reasoning. Cognitive architectures aim to more closely replicate human cognition. Some approaches use probabilistic methods to handle uncertainty (Niepert et al., 2012; Niepert and Domingos, 2015; Jain et al., 2019). However, many of these approaches make strong simplifying assumptions that restrict the expressive power of the formal language that expresses facts in the KB. For example, many knowledge bases can be characterized as graphs, where each entity corresponds to a vertex and every fact corresponds to a labeled edge. For example, the belief plays_sport(s_williams, tennis) is representable as a directed edge connecting the vertex s_williams to the vertex tennis, with the edge label plays_sport. While this assumption greatly aids tractability and scalability, allowing


many problems in reasoning to be solved by graph algorithms, it greatly hinders expressivity and generality, and there are many kinds of knowledge that simply cannot be expressed and represented in such KBs. PWM does not make such restrictions on logical forms in the theory, allowing for richer semantics, such as definitions, universally-quantified statements, conditionals, etc.

3 Model

In this section, we provide a mathematical description of PWM. At a high level, the process for generating a sentence sampled from this probability distribution is:
1. Sample the theory T from a prior distribution p(T). T is a collection of logical forms in higher-order logic that represent what PWL believes to be true.
2. For each observation i, sample a proof πi from p(πi | T). The conclusion of the proof is the logical form xi, which represents the meaning of the ith sentence.
3. Sample the ith sentence yi from p(yi | xi).
Inference is effectively the inverse of this process, and is implemented by PWL. During inference, PWL is given a collection of observed sentences y1, . . . , yn and the goal is to discern the value of the latent variables: the logical form of each sentence x ≜ {x1, . . . , xn}, the proofs for each logical form π ≜ {π1, . . . , πn}, and the underlying theory T. Both the generative process and inference algorithm naturally divide into two modules:
• Language module: During inference, this module's purpose is to infer the logical form of each observed sentence. That is, given the input sentence yi, this module outputs the k most-probable values of the logical form xi (i.e. semantic parsing).
• Reasoning module: During inference, this module's purpose is to infer the underlying theory that logically entails the observed logical forms (and their proofs thereof). That is, given an input collection of logical forms x, this module outputs the posterior distribution of the underlying theory T and the proofs π of those logical forms.
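To make the shape of this generative process concrete, the following is a minimal Python sketch of the three sampling steps; sample_theory, sample_proof, conclusion, and sample_sentence are hypothetical stand-ins for p(T), p(πi | T), the proof's concluding formula, and p(yi | xi), not PWL's actual code:

    # Hypothetical sketch of PWM's generative process (not the actual PWL code).
    def generate_corpus(n_sentences, sample_theory, sample_proof,
                        conclusion, sample_sentence):
        T = sample_theory()              # step 1: sample the theory T ~ p(T)
        sentences = []
        for i in range(n_sentences):
            proof = sample_proof(T)      # step 2: sample a proof pi_i ~ p(pi_i | T)
            x_i = conclusion(proof)      # the proof's conclusion is the logical form x_i
            y_i = sample_sentence(x_i)   # step 3: sample the sentence y_i ~ p(y_i | x_i)
            sentences.append(y_i)
        return sentences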

Note that the yi need not necessarily be sentences, and PWM can easily be generalized to other kinds of data. For example, if a generative model of images is available for p(yi | xi), then an equivalent "vision module" may be defined. This module may be used either in place of, or together with

the language module. In the above generative process, PWM assumes each sentence to be independent. A model of context is required to properly handle inter-sentential anaphora or conversational settings. This can be done by allowing the distribution on yi to depend on previous logical forms or sentences: p(yi | x1, . . . , xi) (i.e. relaxing the i.i.d. assumption). For simplicity of this proof-of-concept, this is left to future work.

There is a vast design space for symbolic representations of meaning. We are unable to comprehensively list all of our design choices, but we describe two important ones below.

Neo-Davidsonian semantics (Parsons, 1990) is used to represent meaning in all logical forms (both in the theory and during semantic parsing). As a concrete example, a straightforward way to represent the meaning of "Jason traveled to New York" could be with the logical form travel(jason, nyc). In neo-Davidsonian semantics, this would instead be represented with three distinct atoms: travel(c1), arg1(c1) = jason, and arg2(c1) = nyc. Here, c1 is a constant that represents the "traveling event," whose first argument is the constant representing Jason, and whose second argument is the constant representing New York City. This representation allows the event to be more readily modified by other logical expressions, such as in "Jason quickly traveled to NYC before nightfall."

In addition, PWM defers named entity linking to the reasoning module (it is not done during parsing). That is, the semantic parser does not parse "Jason" directly into the constant jason. Rather, named entities are parsed into existentially-quantified expressions, e.g. ∃j(name(j) = "Jason" ∧ . . .). This simplifies the parser's task and allows reasoning to aid in entity linking. Table 1 details these design options.
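As an illustration of these combined design choices (the third row of table 1), here is a hypothetical Python encoding of the parse of "Jason traveled to New York"; the tuple format and names are ours for illustration, not PWL's actual representation:

    # Illustrative only: neo-Davidsonian atoms with named-entity linking
    # deferred to reasoning. j, n, and e are existentially quantified
    # variables; the reasoning module later resolves them to constants.
    logical_form = [
        ("equals", "name", "j", '"Jason"'),     # ∃j . name(j) = "Jason"
        ("equals", "name", "n", '"New York"'),  # ∃n . name(n) = "New York"
        ("atom",   "travel", "e"),              # ∃e . travel(e): a traveling event
        ("equals", "arg1", "e", "j"),           # arg1(e) = j (the traveler)
        ("equals", "arg2", "e", "n"),           # arg2(e) = n (the destination)
    ]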

3.1 Reasoning module

3.1.1 Generative process for the theory p(T)

The theory T is a collection of axioms a1, a2, . . . represented in higher-order logic. We choose a fairly simple prior p(T) for rapid prototyping, but it is straightforward to substitute a more complex prior. Specifically, a1, a2, . . . are distributed according to a distribution Ga which is sampled from a Dirichlet process (DP) (Ferguson, 1973), an exchangeable non-parametric distribution:

    Ga ∼ DP(Ha, α),    (1)
    a1, a2, . . . ∼ Ga,    (2)


Design 1: Without neo-Davidsonian semantics, language module performs entity linking.
    Semantic parser output: ∃b(book(b) ∧ write(alex, b))
    Possible axioms in theory: book(c1), write(alex, c1).

Design 2: With neo-Davidsonian semantics, language module performs entity linking.
    Semantic parser output: ∃b(book(b) ∧ ∃w(write(w) ∧ arg1(w) = alex ∧ arg2(w) = b))
    Possible axioms in theory: book(c1), write(c2), arg1(c2) = alex, arg2(c2) = c1.

Design 3 (our approach): With neo-Davidsonian semantics, reasoning module resolves named entities.
    Semantic parser output: ∃a(name(a) = "Alex" ∧ ∃b(book(b) ∧ ∃w(write(w) ∧ arg1(w) = a ∧ arg2(w) = b)))
    Possible axioms in theory: book(c1), write(c2), name(c3) = "Alex", arg1(c2) = c3, arg2(c2) = c1.

Table 1: Design choices in the representation of the meaning of "Alex wrote a book."

where Ha is the base distribution and α = 0.1 is the concentration parameter. An equivalent perspective of the DP that better illustrates how the samples are generated is the Chinese restaurant process (Aldous, 1985):

    ai ∼ Ha with probability α / (α + i − 1),    (3)
    ai = aj with probability (∑_{k=1}^{i−1} 1{ak = aj}) / (α + i − 1).    (4)

That is, the ith sample is drawn from Ha with probability proportional to α, or it is set to a previous sample with probability proportional to the number of times that sample has been drawn.
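A minimal sketch of this Chinese restaurant process, where draw_from_base is a hypothetical sampler standing in for the base distribution Ha:

    import random

    def sample_axioms(n, draw_from_base, alpha=0.1):
        """Sample n axioms a_1, ..., a_n via the Chinese restaurant process:
        the i-th axiom is a fresh draw from H_a with probability
        alpha / (alpha + i - 1), and otherwise a copy of a uniformly chosen
        earlier axiom, so frequently repeated axioms recur proportionally."""
        axioms = []
        for i in range(1, n + 1):
            if random.random() < alpha / (alpha + i - 1):
                axioms.append(draw_from_base())       # new draw from H_a
            else:
                axioms.append(random.choice(axioms))  # reuse a previous sample
        return axioms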

The base distribution Ha recursively generates logical forms in higher-order logic. Since any formula can be written as a tree, they can be generated top-down, starting from the root. The type of each node (i.e. conjunction ∧, disjunction ∨, negation ¬, quantification ∀x, etc.) is sampled from a categorical distribution. If the type of the node is selected to be an atom (e.g. book(c1)), then its predicate is sampled from a non-parametric distribution of predicate symbols Hp. The atom's argument(s) are each sampled as follows: if nV is the number of available variables (from earlier generated quantifiers), then sample a variable uniformly at random, each with probability 1/(nV + 1); otherwise, with probability 1/(nV + 1), sample a constant from a non-parametric distribution of constant symbols Hc. For brevity, we refer the reader to our code for the specific forms of Hp and Hc.

Since PWM uses a neo-Davidsonian representation, another node type that Ha can generate is an event argument (e.g. arg1(c1) = jason). When this is selected, the event constant (c1 in the example) is sampled in the same way an atom's argument is sampled, as described above: first by trying to sample a variable, and otherwise sampling a constant from Hc. The right side of the equality (jason in the example) can either be a variable, constant, string, or number, so PWM first selects its type from a categorical distribution. If the type is chosen to be a number, string, or variable, its value is sampled uniformly. If the type is chosen to be a constant, it is sampled from Hc.
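The following is a simplified sketch of this top-down generation, covering only a few node types; the categorical weights and the samplers standing in for Hp and Hc are placeholders:

    import random

    # Placeholder node types and categorical weights (the real H_a has more
    # types, e.g. existential quantification and event arguments).
    NODE_TYPES = ["and", "or", "not", "forall", "atom"]
    WEIGHTS    = [0.15, 0.10, 0.10, 0.15, 0.50]

    def sample_formula(n_vars, sample_predicate, sample_constant):
        """Recursively sample a formula tree top-down, as H_a does; n_vars
        is the number of variables bound by enclosing quantifiers."""
        kind = random.choices(NODE_TYPES, WEIGHTS)[0]
        if kind in ("and", "or"):
            return (kind, [sample_formula(n_vars, sample_predicate, sample_constant)
                           for _ in range(2)])
        if kind == "not":
            return ("not", sample_formula(n_vars, sample_predicate, sample_constant))
        if kind == "forall":
            return ("forall", f"x{n_vars + 1}",
                    sample_formula(n_vars + 1, sample_predicate, sample_constant))
        # Atom: each available variable is chosen with probability 1/(n_vars + 1);
        # otherwise (also probability 1/(n_vars + 1)) a constant is drawn from H_c.
        r = random.randrange(n_vars + 1)
        arg = f"x{r + 1}" if r < n_vars else sample_constant()
        return ("atom", sample_predicate(), arg)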

Names of entities are treated specially in this prior: The number of names available to each entity is sampled i.i.d. according to a very light-tailed distribution: for entity ci, the number of names nN(ci) ≜ #{s : name(ci) = s} is distributed according to p(nN(ci) = k) ∝ λ^(k²). This ensures that entities tend not to have too many names.

Sets are also treated specially in this prior: One kind of axiom that can be generated is one that declares the size of a set, e.g. size(λx.planet(x)) = 8 denotes that the size of the set of planets is 8. In the prior, the size of each set is distributed according to a geometric distribution with parameter 10⁻⁴. Sets can have arity not equal to 1, in which case their elements are tuples.

Deterministic constraints: We also impose hard constraints on the theory T. Most importantly, T is required to be globally consistent. While this is a conceptually simple requirement, it is computationally expensive (consistency is generally undecidable, even in first-order logic). PWL enforces this constraint by keeping track of the known sets in the theory (i.e. a set is known if its set size axiom is used in a proof, or if the set appears as a subset/superset in a universally-quantified axiom, such as in ∀x(cat(x) → mammal(x)), where the set λx.cat(x) is a subset of λx.mammal(x)). For each set, PWL computes which elements are provably members of that set. If the number of provable members of a set is greater than its size, or if an element is both provably a member and not a member of a set, the theory is inconsistent. Relaxing this constraint would be valuable in future research, perhaps by only considering the relevant sets rather than all sets in the theory, or by deferring consistency checks altogether. We place a handful of other constraints on the theory T:


The name of an entity must be a string (and not a number or a constant). All constants are distinct; that is, ci ≠ cj for all i ≠ j. This helps to alleviate identifiability issues, as otherwise, there would be a much larger number of logically-equivalent theories. No event can be an argument of itself (e.g. there is no constant ci such that arg1(ci) = ci). If a theory T satisfies all constraints, we write "T valid."
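A small sketch of the set-consistency check described above, assuming the provable members and non-members of each known set have already been computed:

    def sets_consistent(declared_size, provable_members, provable_nonmembers):
        """Sketch of PWL's set-consistency constraint. For each known set s:
        declared_size[s] is the size asserted by a size axiom (if any), and
        provable_members[s] / provable_nonmembers[s] are the constants that
        are provably in / not in s under the current theory. The theory is
        inconsistent if a set provably contains more elements than its
        declared size, or if some element is provably both in and not in s."""
        for s, members in provable_members.items():
            if s in declared_size and len(members) > declared_size[s]:
                return False  # more provable members than the declared size
            if members & provable_nonmembers.get(s, set()):
                return False  # an element is both a member and a non-member
        return True

    # Example: if the set of planets has declared size 8, a ninth provable
    # member would make the theory invalid.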

These constraints do slightly complicate computation of the prior, since the generative process for generating T is conditioned on T being valid:

    p(T | T valid) = p(T) / p(T valid),    (5)
    where p(T valid) = ∑_{T′ : T′ valid} p(T′),    (6)

and the denominator is intractable to compute. However, we show in section 3.1.3 that for inference, it suffices to be able to efficiently compute the ratio of prior probabilities:

    p(T1 | T1 valid) / p(T2 | T2 valid) = p(T1) p(T2 valid) / (p(T2) p(T1 valid)) = p(T1) / p(T2).    (7)

Additionally, note that since the above constraints do not depend on the order of the axioms, constants, etc. (i.e. the constraints themselves are exchangeable), the distribution of T conditioned on T being valid is exchangeable.

Properties of the prior p(T): We emphasize that these distributions were chosen for simplicity and ease of implementation, and they worked well enough in experiments. However, there are likely many distributions that would work just as well. The parameters in the above distributions are not learned; they were set and fixed a priori. Nevertheless, this prior does exhibit useful properties for a domain- and task-general model of reasoning:
• Occam's razor: Smaller/simpler theories are given higher probability than larger and more complex theories.
• Consistency: Inconsistent theories are discouraged or impossible.
• Entities tend to have a unique name. Our prior above encodes one direction of this prior belief: each entity is unlikely to have many names. However, the prior does not discourage one name from referring to multiple entities.
• Entities tend to have a unique type. Note however that this does not discourage types provable by subsumption. For example, if the theory has the axioms novel(c1) and ∀x(novel(x) → book(x)), even though book(c1) is provable, it is not an axiom in this example and the prior only applies to axioms.

3.1.2 Generative process for proofs p(πi | T)

PWM uses natural deduction, a well-studied proof calculus, for the proofs (Gentzen, 1935, 1969). Pfenning (2004) provides an accessible introduction. Figure 2 illustrates a simple example of a natural deduction proof. Each horizontal line is a deduction step, with the (zero or more) formulas above the line being its premises, and the one formula below the line being its conclusion. Each deduction step has a label to the right of the line. For example, the "∧I" step denotes conjunction introduction: given that A and B are true, this step concludes that A ∧ B is true, where A and B can be any formula. A natural deduction proof can rely on axioms (denoted by "Ax").

[Figure 2 shows a natural deduction proof of ¬(A ∧ ¬A): from the axiom A ∧ ¬A (Ax), derive A by ∧E and ¬A by ∧E; combining these by ¬E yields ⊥; discharging the assumption by ¬I concludes ¬(A ∧ ¬A).]

Figure 2: An example of a proof of ¬(A ∧ ¬A).

We can write any natural deduction proof πi as a sequence of deduction steps πi ≜ (πi,1, . . . , πi,k) by traversing the proof tree in prefix order. We define a simple generative process for πi:
1. First sample the length of the proof k from a Poisson distribution with parameter 20.
2. For each j = 1, . . . , k: Select a deduction rule from the proof calculus with a categorical distribution. If the Ax rule is selected, then simply take the next available axiom from the theory T = a1, a2, . . . If the deduction rule requires premises, then each premise is selected uniformly at random from πi,1, . . . , πi,j−1.¹

The above generative process may produce a forest rather than a single proof tree. Thus, πi is sampled conditioned on πi being a valid proof. Just as with p(T) in equation 5, this conditioning causes p(πi | T) to be intractable to compute. However, only the ratio of the prior probabilities is needed for inference, which can be computed efficiently:

    p(πi | T, πi valid) / p(π′i | T, π′i valid) = p(πi | T) p(π′i valid | T) / (p(π′i | T) p(πi valid | T)) = p(πi | T) / p(π′i | T).    (8)
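A rough sketch of this proof prior appears below; the rule set and its premise counts are an illustrative subset of the real calculus, and the output is only a proof skeleton, which is why the prior must condition on validity:

    import math, random

    # Illustrative subset of deduction rules mapped to their premise counts;
    # the real calculus has many more rules, each with a categorical weight.
    RULES = {"Ax": 0, "AndIntro": 2, "AndElim": 1, "ImpliesElim": 2}

    def sample_proof_skeleton(theory_axioms, mean_length=20):
        """Sample a sequence of deduction steps as in the text: the length
        k ~ Poisson(20); each step selects a rule, where Ax consumes the
        next theory axiom and other rules cite premises chosen uniformly
        from earlier steps. The result may be a forest, so the overall
        prior conditions on the sequence forming a single valid proof."""
        # Poisson sampling via Knuth's method (no external dependencies).
        threshold, k, p = math.exp(-mean_length), 0, 1.0
        while p > threshold:
            k += 1
            p *= random.random()
        k -= 1

        steps, next_axiom = [], 0
        for _ in range(k):
            rule = random.choice(list(RULES))  # placeholder categorical draw
            if rule == "Ax":
                if next_axiom >= len(theory_axioms):
                    break  # no axioms left to introduce
                steps.append(("Ax", theory_axioms[next_axiom]))
                next_axiom += 1
            elif len(steps) >= RULES[rule]:
                premises = [random.randrange(len(steps))
                            for _ in range(RULES[rule])]
                steps.append((rule, premises))
        return steps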

While PWL was initially implemented assuming classical logic, it is easy to adapt PWL to use other logics, such as intuitionistic logic. Intuitionistic logic is identical to classical logic except that

¹Some deduction rules require additional parameters, and we refer the reader to our code for details on how these parameters are sampled.


the law of the excluded middle A ∨ ¬A is not a theorem (see figure 3 for an example where the two logics disagree). The interpretable nature of the reasoning module makes it easy to adapt it to other kinds of logic or proof calculi. PWL supports both classical and intuitionistic logic.

3.1.3 Inference

Having described the generative process for the theory T and proofs π, we now describe inference. Given logical forms x, the goal is to compute the posterior distribution of T and π such that the conclusion of each proof πi is xi. That is, PWL aims to recover the latent theory and proofs that explain/entail the given observed logical forms. To this end, PWL uses Metropolis-Hastings (MH) (Hastings, 1970; Robert and Casella, 2004). PWL performs inference in a streaming fashion, starting with the case n = 1 to obtain MH samples from p(π1, T | x1). Then, for every new logical form xn, PWL uses the last sample from p(π1, . . . , πn−1, T | x1, . . . , xn−1) as a starting point and then obtains MH samples from p(π1, . . . , πn, T | x1, . . . , xn). This warm-start initialization serves to dramatically reduce the number of iterations needed to mix the Markov chain. To obtain the MH samples, the proof of each new logical form π(0)_n is initialized using algorithm 1, whereas the proofs of previous logical forms are kept from the last MH sample. The axioms in these proofs constitute the theory sample T(0). Then, for each iteration t = 1, . . . , Niter, MH proposes a mutation to one or more proofs in π(t). The possible mutations are listed in table 2. This may change axioms in T(t). Let T′, π′_i be the newly proposed theory and proofs. Then, compute the acceptance probability:

    (p(T′) / p(T(t))) · ∏_{i=1}^{n} (p(π′_i | T′) / p(π(t)_i | T(t))) · (g(T(t), π(t) | T′, π′) / g(T′, π′ | T(t), π(t))),    (9)

where g(T′, π′ | T(t), π(t)) is the probability of proposing the mutation from T(t), π(t) to T′, π′, and g(T(t), π(t) | T′, π′) is the probability of the inverse of this mutation. Since this quantity depends only on ratios of probabilities, it can be computed efficiently (see equations 7 and 8). Once this quantity is computed, sample from a Bernoulli with this quantity as its parameter. If it succeeds, MH accepts the proposed theory and proofs as the next sample: T(t+1) = T′ and π(t+1)_i = π′_i. Otherwise, reject the proposal and keep the old sample: T(t+1) = T(t) and π(t+1)_i = π(t)_i.
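One MH iteration can be sketched as follows, with propose, prior_ratio, and proposal_ratio as hypothetical callbacks computing the quantities in equation 9:

    import random

    def mh_step(theory, proofs, propose, prior_ratio, proposal_ratio):
        """One Metropolis-Hastings iteration over (T, proofs), per equation 9.
        `propose` returns a mutated (theory', proofs'); `prior_ratio` returns
        p(T')/p(T) * prod_i p(pi'_i | T') / p(pi_i | T); `proposal_ratio`
        returns g(old | new) / g(new | old). Because only ratios are needed,
        the intractable normalizers p(T valid) and p(pi valid | T) cancel
        (equations 7 and 8)."""
        new_theory, new_proofs = propose(theory, proofs)
        accept = prior_ratio(new_theory, new_proofs, theory, proofs)
        accept *= proposal_ratio(theory, proofs, new_theory, new_proofs)
        if random.random() < accept:       # Bernoulli; values > 1 always accept
            return new_theory, new_proofs  # accept the mutation
        return theory, proofs              # reject: keep the previous sample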

Algorithm 1: Pseudocode for proof initialization. If any new axiom violates the deterministic constraints in section 3.1.1, the function returns null.

 1  function init_proof(formula A)
 2    if A is a conjunction B1 ∧ . . . ∧ Bn
 3      for i = 1 to n do
 4        φi = init_proof(Bi)
 5      return the proof concluding B1 ∧ . . . ∧ Bn from premises φ1, . . . , φn by ∧I
 6    else if A is a disjunction B1 ∨ . . . ∨ Bn
 7      I = shuffle(1, . . . , n)
 8      for i ∈ I do
 9        φi = init_proof(Bi)
10        if φi ≠ null
11          return the proof concluding B1 ∨ . . . ∨ Bn from premise φi by ∨I
12    else if A is a negation ¬B
13      return init_disproof(B)
14    else if A is an implication B1 → B2
15      if using classical logic
16        I = shuffle(1, 2)
17        for i ∈ I do
18          if i = 1
19            φ1 = init_disproof(B1)
20            if φ1 = null continue
21            return the proof concluding B1 → B2 by →I: from φ1 and the axiom B1, derive ⊥ by ¬E, then B2 by ⊥E
22          else
23            φ2 = init_proof(B2)
24            if φ2 = null continue
25            return the proof concluding B1 → B2 from premise φ2 by →I
26      else if using intuitionistic logic
27        return B1 → B2 as an axiom (Ax)
28    else if A is an existential quantification ∃x.f(x)
29      let C be the set of known constants, numbers, and strings in T, and the new constant c∗
30      I = swap²(shuffle(C))
31      for c ∈ I do
32        φc = init_proof(f(c))
33        if φc ≠ null
34          return the proof concluding ∃x.f(x) from premise φc by ∃I
35    else if A is a universal quantification ∀x.f(x)
36      return ∀x.f(x) as an axiom (Ax)
37    else if A is an equality B1 = B2
38      return B1 = B2 as an axiom (Ax)
39    else if A is an atom (e.g. book(great_gatsby))
40      return A as an axiom (Ax)
41    else return null

²swap randomly selects an element in its input list to swap with the first element. The probability of moving an element c to the front of the list is computed as follows: Recursively inspect the atoms in the formula f(c) and count the number of "matching" atoms: an atom t(c) or c(t) is considered "matching" if it is provable in T. Next, count the number of "mismatching" axioms: for each atom t(c) in the formula f(c), an axiom t′(c) is "mismatching" if t ≠ t′, and similarly for each atom c(t) in the formula f(c), an axiom c(t′) is "mismatching" if t ≠ t′. Let n be the number of "matching" atoms and m be the number of "mismatching" axioms; then the probability of moving c to the front of the list is proportional to exp{n − 2m}. This greatly increases the chance of finding a high-probability proof in the first iteration of the loop on line 31, and since this function is also used in an MH proposal, it dramatically improves the acceptance rate. This reduces the number of MH iterations needed to sufficiently mix the Markov chain.


Proposal 1 (probability of selecting: 1/N): Select a grounded atomic axiom (e.g. square(c1)) and propose to replace it with an instantiation of a universal quantification (e.g. ∀x(rectangle(x) ∧ rhombus(x) → square(x))), where the antecedent conjuncts are selected uniformly at random from the other grounded atomic axioms for the constant c1: rectangle(c1), rhombus(c1), etc.

Proposal 2 (1/N): The inverse of the above proposal: select an instantiation of a universal quantification and replace it with a grounded atomic axiom.

Proposal 3 (1/N): Select an axiom that declares the size of a set (e.g. of the form size(us_states) = 50), and propose to change the size of the set by sampling from the prior distribution, conditioned on the maximum and minimum consistent set size.

Proposal 4 (1/N): Select a node from a proof tree of type ∨I, →I, or ∃I.³ These nodes were created in algorithm 1 on lines 7, 16, and 30, respectively, where for each node, a single premise was selected out of a number of possible premises. This proposal naturally follows from the desire to explore other selections by re-sampling the proof: it simply calls init_proof again on the formula at this proof node.

Proposal 5, Merge (α/N): Select a "mergeable" event; that is, three constants (ci, cj, ck) such that arg1(ci) = cj, arg2(ci) = ck, and t(ci) for some constant t are axioms, and there also exist constants (ci′, cj′, ck′) such that i′ > i, arg1(ci′) = cj′, arg2(ci′) = ck′, and t(ci′) are axioms. Next, propose to merge ci′ with ci by replacing all instances of ci′ with ci in the proof trees, cj′ with cj, and ck′ with ck. This proposal is not necessary in that these changes are reachable with other proposals, but those proposals may have low probability, and so this can help to more easily escape local maxima.

Proposal 6, Split (β/N): The inverse of the above proposal.

Table 2: A list of the Metropolis-Hastings proposals implemented in PWL thus far. N, here, is a normalization term: N = |A| + |U| + |C| + |P| + α|M| + β|S|, where: A is the set of grounded atomic axioms in T (e.g. square(c1)), U is the set of universally-quantified axioms that can be eliminated by the second proposal, C is the set of axioms that declare the size of a set (e.g. size(A) = 4), P is the set of nodes of type ∨I, →I, or ∃I³ in the proofs π, M is the set of "mergeable" events (described above), and S is the set of "splittable" events. In our experiments, α = 2 and β = 0.001.

If every possible theory and proof is reachable from the initial theory by a sequence of mutations, then with sufficiently many iterations, the samples T(t) and π(t)_i will be distributed according to the true posterior p(T, π | x). If only a subset of possible theories and proofs are reachable from the initial theory, the MH samples will be distributed according to the true posterior conditioned on that subset. This may suffice for many applications, particularly if the theories in the subset have desirable properties such as better tractability. But the subset cannot be too small since then PWL would lose generality.

The function init_proof in algorithm 1 recursively calls init_disproof. Due to space limitations, we refer the reader to our code for this function; it closely mirrors the structure of init_proof. The purpose of init_proof is to find some proof of a given higher-order formula, or return null if none exists. Its task is abduction, which is easier than theorem proving, since it can create new axioms as needed. The returned proof need not be "optimal" since it serves as the initial state for MH, which will further refine the proof. The validity of the proofs is guaranteed by the fact that init_proof only returns valid proofs and the MH proposals preserve validity.

3.2 Language module

For the language module, PWM uses the probabilistic model of Saparov et al. (2017). The generative nature of their semantic parsing model allows it to fit seamlessly into PWM and PWL. The logical forms in their model are distributed according to a semantic prior, which we replace with our distribution of logical forms conditioned on the theory p(πi | T). Their parser is probabilistic and finds the k-best logical forms xi that maximize p(xi | yi, T) for a given input sentence. Combined with our reasoning module's ability to compute the probability of a logical form, the parser can resolve ambiguous interpretations of sentences by exploiting acquired knowledge. We will demonstrate the utility of this property in resolving lexical ambiguity.

However, the semantic grammar in Saparov et al. (2017) was designed for a DATALOG representation of logical forms. Thus, we designed and implemented a new grammar for our more domain-general formalism in higher-order logic.

³Also disproofs of conjunctions, if using classical logic.


Though their model induces preterminal production rules from data (e.g. N → "cat"), we must manually specify the nonterminal production rules (e.g. NP → ADJP NP). This allows us to encode prior knowledge of the English language into PWM, dramatically improving its statistical efficiency and obviating the need for massive training sets to learn English syntax. It is nonetheless tedious to design these rules while maintaining domain-generality. Once specified, however, these rules can be re-used in new tasks and domains with minimal or no changes. We also improved their model to generalize over inflected forms of words. In the generative process, instead of generating sentence tokens directly (e.g. "I am sleeping"), PWM generates word roots with flags indicating their inflection (e.g. "I be[1ST,SG] sleep[PRS,PTCP]"). During parsing, this has the effect of performing morphological and semantic parsing jointly. We extracted the necessary comprehensive morphology information from Wiktionary (Wikimedia Foundation, 2020).

We train this new grammar to learn the parameters that govern the conditional distributions and the preterminal production rules. To do so, we construct a small seed training set consisting of 55 labeled sentences, 47 nouns, 55 adjectives, and 20 verbs.⁴ We wrote and labeled these sentences by hand, largely in the domain of astronomy, with the aim to cover a diverse range of English syntactic constructions. This small training set was sufficient thanks to the statistical efficiency of PWM.

While PWL uses the same parsing algorithm as Saparov et al. (2017), we provide an easier-to-understand presentation. Given an input sentence yi, the parser aims to find the logical form(s) xi and derivation trees ti that maximize the posterior probability p(xi, ti | yi, T). This discrete optimization is performed using branch-and-bound (Land and Doig, 1960): the algorithm starts by considering the set of all derivation trees and partitions it into a number of subsets (the "branch" step). For each subset S, the parser computes an upper bound on the log probability of any derivation in S (the "bound" step). Having computed the bound for each subset, the parser puts them into a priority queue, prioritized by the bound. The parser then dequeues the subset with the highest bound and repeats this process, further subdividing this set, computing the bound for each subdivision, and adding them to the queue. Eventually, the parser

⁴The grammar, morphology data, code, as well as the seed training set are available in our GitHub repository.

will dequeue a subset containing a single derivation whose log probability is at least the highest priority in the queue. This derivation is optimal. The algorithm can be continued to obtain the top-k derivations/logical forms. Since this algorithm operates over sets of logical forms (where each set is possibly infinite), we implemented a data structure to sparsely represent such sets of higher-order formulas, as well as algorithms to perform set operations, such as intersection and subtraction.
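A generic sketch of this branch-and-bound loop, where branch and upper_bound are placeholders for the parser's partitioning and bounding operations, and upper_bound is assumed to be exact on singleton sets:

    import heapq, itertools

    def branch_and_bound(all_derivations, branch, upper_bound, is_singleton):
        """Dequeue the set with the highest upper bound on log probability,
        partition it, and stop when a single derivation is dequeued: since
        its bound is exact and at least every remaining bound, it is optimal.
        heapq is a min-heap, so bounds are negated; the counter breaks ties
        so heapq never compares the subsets themselves."""
        counter = itertools.count()
        queue = [(-upper_bound(all_derivations), next(counter), all_derivations)]
        while queue:
            neg_bound, _, subset = heapq.heappop(queue)
            if is_singleton(subset):
                return next(iter(subset))   # optimal derivation
            for part in branch(subset):     # the "branch" step
                heapq.heappush(queue, (-upper_bound(part), next(counter), part))  # the "bound" step
        return None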

4 Experiments

4.1 ProofWriter

To demonstrate our implementation as a proof-of-concept, we evaluate it on two question-answering tasks. The first is the ProofWriter dataset (Tafjord et al., 2021), which itself is based on the earlier RuleTaker dataset (Clark et al., 2020). To evaluate and demonstrate the out-of-domain language understanding and reasoning ability of PWL, we use the Birds-Electricity "open-world"⁵ portion of the dataset, as the authors evaluated their method on this portion zero-shot, just as we do (i.e. the algorithm did not see any example from this portion during training). This portion of the data is subdivided into 6 sections, each with varying degrees of difficulty. An example from this dataset is shown in figure 3. For each example, PWL reads the context and abduces a theory. Next, it parses the query sentence yn+1 into a logical form xn+1 and estimates its unnormalized probability:

    p(xn+1 | x) ∝ p(x1, . . . , xn+1)    (10)
    = ∑_{T, πi proof of xi} p(T) ∏_{i=1}^{n+1} p(πi | T)    (11)
    ≈ ∑_{T(t), π(t)_i ∼ T,π | x} p(T(t)) ∏_{i=1}^{n+1} p(π(t)_i | T(t)).    (12)

Here, x are the previously read logical forms (the context). Since the quantity in equation 11 is intractable to compute, PWL approximates it by sampling from the posterior T, π1, . . . , πn+1 | x1, . . . , xn+1 and summing over distinct samples. Although this approximation seems crude, the sum is dominated by a small number of the most probable theories and proofs, and MH is an effective way to find them, as we observe in experiments. MH is run for 400 iterations, and at every 100th iteration, PWL re-initializes the Markov chain by performing 20 "exploratory" MH steps

⁵The dataset comes in two flavors: one which makes the closed-world assumption, and one that does not.


SECTION         N     ProofWriter-All   ProofWriter-Iter   PWL (classical)   PWL (intuitionistic)
Electricity1    162   98.15             98.77              92.59             100.00
Electricity2    180   91.11             90.00              90.00             100.00
Electricity3    624   91.99             94.55              88.46             100.00
Electricity4    4224  91.64             99.91              94.22             100.00
Birds1          40    100.00            95.00              100.00            100.00
Birds2          40    100.00            95.00              100.00            100.00
Average         5270  91.99             98.82              93.43             100.00

Table 3: Zero-shot accuracy of PWL and baselines on the ProofWriter dataset.

(i.e. consisting of only the third and fourth proposals in table 2 and accepting every proposal). This re-initialization is analogous to a random restart and can help to escape from local maxima. However, it may be promising to explore other approaches to compute this quantity, such as Luo et al. (2020). Once PWL has computed this probability for the query sentence, it does the same for the negation of the sentence. These unnormalized probabilities are compared, and if they are within 2000 in log probability, PWL returns the label unknown. If the first probability is sufficiently larger than the second, PWL returns true, and otherwise, it returns false. The parameters in the prior were set by hand initially by choosing values which we thought were reasonable (e.g. the average length of a natural deduction proof for a sentence containing a simple subject noun phrase, object noun phrase, and transitive verb is around 20 steps, which is why the Poisson parameter for the proof length is set to 20). The values were tweaked as necessary by running the algorithm on toy examples during debugging. Note that the sentences "Bill is a bird" and "Bill is not a bird" can still both be true if each "Bill" refers to distinct entities. To avoid this, we chose an extreme value of the prior parameter such that the log prior probability of a theory with two entities having the same name is 2000 less than that of a theory where the name is unique. It is for this reason that 2000 was chosen as the threshold for determining whether a query is true/false vs unknown. This prior worked well enough in our experiments, but the goal is to have a single prior work well with any task, so further work to explore which priors work better across a wider variety of tasks is welcome. We evaluated PWL using both classical and intuitionistic logic, even though the ground truth labels in the dataset were generated using intuitionistic logic.
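The resulting decision rule can be sketched as follows, where estimate_log_prob stands in for the unnormalized estimate of equation 12:

    def answer_query(query, negated_query, estimate_log_prob, threshold=2000.0):
        """Sketch of PWL's decision rule on ProofWriter: compare the
        unnormalized log probabilities of the query and its negation
        (each estimated from MH samples via equation 12); if they are
        within `threshold` of each other, the answer is unknown."""
        log_p_true = estimate_log_prob(query)
        log_p_false = estimate_log_prob(negated_query)
        if abs(log_p_true - log_p_false) < threshold:
            return "unknown"
        return "true" if log_p_true > log_p_false else "false"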

Table 3 lists the zero-shot accuracy of PWL, compared with baselines based on the T5 transformer (Raffel et al., 2020). We emphasize here that PWL is not perfectly comparable to the baseline, since they aim to demonstrate that their method can learn to reason. We instead

Context: "The switch is on. The circuit has the bell. If the circuit has the switch and the switch is on then the circuit is complete. If the circuit does not have the switch then the circuit is complete. If the circuit is complete and the circuit has the light bulb then the light bulb is glowing. If the circuit is complete and the circuit has the bell then the bell is ringing. If the circuit is complete and the circuit has the radio then the radio is playing."

Query: "The circuit is complete." true, false, unknown?

Figure 3: An example from the Electricity1 section in the ProofWriter dataset. Its label is unknown. Under classical logic, the query is provably true from the information in the 1st, 3rd, and 4th sentences.

aim to demonstrate that PWL's ability to parse and reason end-to-end generalizes to an out-of-domain question-answering task. The baseline is trained on other portions of the ProofWriter data, whereas PWL is trained only on its seed training set. PWL performed much better using intuitionistic logic than classical logic, as expected since the ground truth labels were generated using intuitionistic semantics. However, most real-world reasoning tasks would take the law of the excluded middle to be true, and classical logic would serve as a better default. Although the task is relatively simple, it nevertheless demonstrates the proof-of-concept and the promise of further research.

4.2 FictionalGeoQA

The sentences in the ProofWriter experiment are template-generated and have simple semantics. For the sake of an evaluation more representative of real-world language, we introduce a new question-answering dataset called FictionalGeoQA.⁶ To create this dataset, we took questions from GeoQuery (Zelle and Mooney, 1996), and for each question, we wrote a paragraph context containing the information necessary to answer the question. We added distractor sentences to make the task more robust against heuristics. Whenever possible, the sentences in this paragraph were taken from Simple English Wikipedia. However, some facts, such as the lengths of rivers, are not expressed in sentences in Wikipedia (they typically appear in a table on the right side of the page), so we wrote those sentences by hand: We took

⁶Available at github.com/asaparov/fictionalgeoqa


                        all   superlative  subj. def.  obj. def.  lex. amb.  negation  large ctx.  arith.  counting  0 subsets  1 subset  2 subsets  3 subsets  4 subsets
N                       600   210          150         170        180        102       100         20      30        85         213       187        85         30
UnifiedQA               33.8  29.5         7.3         33.5       32.8       14.7      43.0        10.0    20.0      41.2       47.9      27.8       8.2        23.3
Boxer + E               9.7   0.0          12.0        11.8       0.0        15.7      14.0        10.0    0.0       7.1        17.8      5.3        4.7        0.0
Boxer + Vampire         9.7   0.0          12.0        11.8       0.0        15.7      14.0        10.0    0.0       7.1        17.8      5.3        4.7        0.0
PWL parser + E          5.0   0.0          13.3        2.9        0.0        15.7      4.0         10.0    0.0       1.2        7.0       5.3        4.7        0.0
PWL parser + Vampire    9.0   0.0          13.3        11.2       0.0        15.7      4.0         10.0    0.0       12.9       13.6      5.3        4.7        0.0
PWL                     43.1  40.5         33.3        33.5       34.4       23.5      45.0        10.0    0.0       43.5       62.9      39.0       17.6       0.0

LEGEND: superlative: the subset of the dataset with examples that require reasoning over superlatives, e.g. "longest river." subjective concept def. (subj. def.): subset with definitions of "subjective" concepts, e.g. "Every river longer than 500 km is major." objective concept def. (obj. def.): subset with definitions of "objective" concepts, e.g. the population of a location is the number of people living there. lexical ambiguity (lex. amb.): subset with lexical ambiguity, e.g. "has" means different things in "a state has a city named..." vs "a state has an area of..." negation: subset with examples that require reasoning with classical negation (negation-as-failure is insufficient). large context (large ctx.): subset of examples where there are at least 100 sentences in the context. arithmetic (arith.): subset with examples that require simple arithmetic. counting: subset with examples that require counting. n subset(s): examples that belong to exactly n of the above subsets (no example is a member of more than 4 subsets).

Table 4: Zero-shot accuracy of PWL and baselines on the FictionalGeoQA dataset.

Context: "River Giffeleney is a river in Wulstershire. River Wulstershire is a river in the state of Wulstershire. River Elsuir is a river in Wulstershire. The length of River Giffeleney is 413 kilometers. The length of River Wulstershire is 830 kilometers. The length of River Elsuir is 207 kilometers. Every river that is shorter than 400 kilometers is not major."

Query: "What rivers in Wulstershire are not major?"

Figure 4: An example from FictionalGeoQA, a new fictional geography question-answering dataset that we created to evaluate reasoning in natural language understanding.

questions from GeoQuery that expressed the desired fact in interrogative form (e.g. "What is the length of <river name>?") and converted them into declarative form (e.g. "The length of <river name> is <length>."). The resulting dataset contains 600 examples, where 67.4% of the sentences are from Simple English Wikipedia, and 90% of the examples contain at least one sentence not from Wikipedia. We replaced all place names with fictional ones to remove any confounding effects from pretraining. To keep the focus of the evaluation on reasoning ability, we chose to restrict the complexity of the language. In particular, each sentence is independent and can be understood in isolation (e.g. no cross-sentential anaphora). The sentences are more complex than those in ProofWriter, having more of the complexities of real language, such as synonymy, lexical ambiguity (e.g. what is the semantics of "has" in "a state has a city" vs "a state has an area"; or whether "largest state" refers to area or population), and syntactic ambiguity. This increased difficulty is evident in the results. This dataset is meant to evaluate out-of-domain generalizability, so we do not provide a separate training set for fine-tuning.

We compare PWL (using classical logic) with a number of baselines: (1) UnifiedQA (Khashabi et al., 2020), a QA system based on large-scale neural language models, (2) Boxer (Bos, 2015), a wide-coverage semantic parser, combined with Vampire 4.5.1 (Kovács and Voronkov, 2013), a theorem prover for full first-order logic, (3) Boxer combined with E 2.6 (Schulz et al., 2019), another theorem prover for full first-order logic, (4) the language module of PWL combined with Vampire, and (5) the language module of PWL combined with E. The results are shown in table 4, along with a breakdown across multiple subsets of the dataset. UnifiedQA performs relatively well but fares more poorly on questions with negation and subjective concept definitions (e.g. "Every river longer than 500 km is major... What are the major rivers?"). Humans are easily able to understand and utilize such definitions, and the ability to do so is instrumental in learning about new concepts or words in new domains. PWL fares better than UnifiedQA on examples with lexical ambiguity, as a result of the language module's ability to exploit acquired knowledge to resolve ambiguities. We find that Boxer has significantly higher coverage than PWL (100% vs 79.8%) but much lower precision. For instance, Boxer uses the semantic representation of the Parallel Meaning Bank (Abzianidze et al., 2017), which has a simpler representation of superlatives, and is thus unable to capture the correct semantics of superlatives in examples of this dataset. We also find that for most examples, Boxer produces different semantics for the question vs the context sentences, oftentimes predicting the incorrect semantic role for the interrogative words, which leads to the theorem provers being unable to find a proof for these extra semantic roles. We also experimented with replacing our reasoning module with a theorem


prover and found that for almost all examples, the search of the theorem prover would explode combinatorially. This was due to the fact that our semantic representation relies heavily on sets, so a number of simple set-theoretic axioms are required for the theorem provers, but this quickly causes the deduction problem to become undecidable. Our reasoning module instead performs abduction, and is able to create axioms to more quickly find an initial proof, and then refine that proof using MH. Despite our attempt to maximize the generalizability of the grammar in PWL, there are a number of linguistic phenomena that we have not yet implemented, such as interrogative subordinate clauses, wh-movement, spelling or grammatical mistakes, etc., and this led to the lower coverage on this dataset. Work remains to be done to implement these missing production rules in order to further increase the coverage of the parser.

5 Conclusions and future work

We introduced PWM, a fully-symbolic Bayesian model of semantic parsing and reasoning, which we hope serves as a compelling first step in a research program toward more domain- and task-general NLU. We derived PWL, an efficient inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and empirically demonstrated its ability to generalize to two out-of-domain question-answering tasks. To do so, we created a new question-answering dataset, FictionalGeoQA, designed specifically to evaluate reasoning ability while capturing more of the complexities of real language and being robust against heuristic strategies. PWL is able to read and understand sentences with richer semantics, such as definitions of new concepts. In contrast with past deductive reasoning approaches, PWL performs abduction, which is computationally easier. The highly underspecified nature of the abduction problem is alleviated by the probabilistic nature of PWL, as it gives a principled way to find the most probable theories. We presented an inference strategy where Metropolis-Hastings (MH) is performed on each sentence, in sequence, and the previous sample of the theory and proofs provides a warm start for inference on the next sentence, reducing the number of MH iterations.
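The warm-start strategy can be sketched as follows; the proposal and posterior here are toy placeholders, not PWL's actual distributions over theories and proofs:

```python
import math
import random

# Schematic of sequential inference with warm-starting: MH for
# sentence i starts from the final sample for sentence i-1, so the
# theory is extended rather than re-inferred from scratch.

def log_posterior(theory):
    return -float(len(theory))  # toy prior: prefer smaller theories

def propose(theory, sentence):
    new = set(theory)
    new ^= {(sentence, random.randint(0, 1))}  # toggle a toy "axiom"
    return new

def mh(theory, sentence, iterations):
    for _ in range(iterations):
        candidate = propose(theory, sentence)
        log_accept = log_posterior(candidate) - log_posterior(theory)
        if log_accept >= 0 or random.random() < math.exp(log_accept):
            theory = candidate
    return theory

theory = set()  # empty initial theory
for sentence in ["s1", "s2", "s3"]:
    # warm start: `theory` carries over from the previous sentence
    theory = mh(theory, sentence, iterations=100)
print(theory)
```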

There are many avenues for future work: a simple prior was used for proofs p(πi | T), and an alternate approach could be to use a compositional exchangeable distribution such as adaptor grammars (Johnson et al., 2006).

The first MH proposal in Table 2 is simple but restrictive: the antecedent conjuncts and the consequent are restricted to be atomic. MH would be able to explore a much larger and semantically richer set of theories if the antecedent or consequent could contain more complex formulas, including quantified formulas. In addition, the inference algorithm sometimes becomes stuck in local maxima. One way to improve the efficiency of inference is to add a new MH proposal that specifically proposes to split or merge types. For example, if the theory has the axioms cat(c1) and dog(c1), this proposal would split c1 into two concepts: cat(c1) and dog(c2). This kind of type-based Markov chain Monte Carlo is similar in principle to Liang et al. (2010).
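A minimal sketch of the split direction of such a proposal, over a toy theory of (predicate, concept-id) atoms; a real implementation would pair it with a merge move and an MH acceptance test to keep the chain reversible:

```python
import random

# Toy sketch of the split direction of a type split/merge proposal:
# pick a concept id that appears under more than one predicate and
# move one predicate's atoms to a fresh concept id.

def split_concept(axioms):
    """axioms: set of (predicate, concept_id) atoms."""
    by_concept = {}
    for pred, cid in axioms:
        by_concept.setdefault(cid, set()).add(pred)
    splittable = [c for c, preds in by_concept.items() if len(preds) > 1]
    if not splittable:
        return axioms
    cid = random.choice(splittable)
    pred = random.choice(sorted(by_concept[cid]))
    fresh = max(c for _, c in axioms) + 1
    return {(p, fresh if (c == cid and p == pred) else c)
            for p, c in axioms}

theory = {("cat", 1), ("dog", 1)}
print(split_concept(theory))  # e.g. {('cat', 1), ('dog', 2)}
```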

As mentioned earlier, a model of context is necessary in the language module to properly handle cross-sentential anaphora and conversational contexts. Real language very rarely consists of sentences that are independent of context. There are also many research questions on the issue of scalability. Although PWL is able to scale to examples in FictionalGeoQA with more than 100 sentences, there are two main bottlenecks currently preventing it from scaling to significantly larger theories: (1) the maintenance of global consistency, and (2) the unfocused nature of the current MH proposals. When checking the consistency of a new axiom, rather than considering all other axioms/sets in the theory, it would be preferable to only consider the portion of the theory relevant to the new axiom. Additionally, the current MH proposals do not take into account the goal of reasoning. For example, if the current task is to answer a question about geography, then MH proposals for proofs unrelated to geography are wasteful and would increase the number of MH steps needed. A more clever goal-aware approach for selecting proofs to mutate would help to alleviate this problem and improve scalability. PWM provides a path to incorporate information from additional modalities in a principled fashion: for example, by adding a generative model of images, which would serve as a separate “vision module.” In addition, even though PWL is fully-symbolic, non-symbolic methods could be used for expressive prior/proposal distributions or approximate inference. There are many fascinating research paths to pursue from here.
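As an illustration of such a locality restriction (not part of the current system), one could index axioms by the symbols they mention and check a new axiom only against axioms that share a symbol with it:

```python
from collections import defaultdict

# Illustration of restricting consistency checks to the relevant
# portion of the theory: axioms are indexed by the constants and
# predicates they mention, so a new axiom is checked only against
# axioms sharing a symbol with it.

class SymbolIndex:
    def __init__(self):
        self.by_symbol = defaultdict(set)

    def add(self, axiom, symbols):
        for s in symbols:
            self.by_symbol[s].add(axiom)

    def relevant(self, symbols):
        out = set()
        for s in symbols:
            out |= self.by_symbol[s]
        return out

index = SymbolIndex()
index.add("river(r1)", {"river", "r1"})
index.add("city(c1)", {"city", "c1"})
# A consistency check for a new axiom major(r1) considers only
# axioms mentioning "major" or "r1":
print(index.relevant({"major", "r1"}))  # {'river(r1)'}
```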


Acknowledgments

We thank the anonymous reviewers and the Action Editor for their invaluable feedback. We also thank Peter Clark, William W. Cohen, Rik van Noord, and Johan Bos for their insightful and helpful discussion.

References

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 242–247. Association for Computational Linguistics.

David J. Aldous. 1985. Exchangeability and related topics. In Lecture Notes in Mathematics, pages 1–198. Springer Berlin Heidelberg.

Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. 2021. Complex query answering with neural link predictors. In International Conference on Learning Representations.

Elena Bellodi and Fabrizio Riguzzi. 2015. Structure learning of probabilistic logic programs by searching the clause space. Theory Pract. Log. Program., 15(2):169–212.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: on meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5185–5198. Association for Computational Linguistics.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Johan Bos. 2015. Open-domain semantic parsing with Boxer. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, Institute of the Lithuanian Language, Vilnius, Lithuania, May 11-13, 2015, volume 109 of Linköping Electronic Conference Proceedings, pages 301–304. Linköping University Electronic Press / Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165v4.

Eugene Charniak and Robert P. Goldman. 1993. A Bayesian model of plan recognition. Artif. Intell., 64(1):53–79.

Alonzo Church. 1940. A formulation of the simple theory of types. J. Symb. Log., 5(2):56–68.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3882–3890. ijcai.org.

Robin Cooper, Simon Dobnik, Shalom Lappin, and Staffan Larsson. 2015. Probabilistic type theory and natural language semantics. In Linguistic Issues in Language Technology, Volume 10, 2015. CSLI Publications.

Andrew Cropper and Rolf Morel. 2021. Learning programs by learning from failures. Mach. Learn., 110(4):801–856.

James Cussens. 2001. Parameter estimation in stochastic logic programs. Mach. Learn., 44(3):245–271.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

D. R. Dowty. 1981. Introduction to Montague Semantics, 1st ed. 1981 edition. Studies in Linguistics and Philosophy, 11. Springer Netherlands, Dordrecht.

H. L. Dreyfus. 1985. From micro-worlds to knowledge representation: AI at an impasse. In R. J. Brachman and H. J. Levesque, editors, Readings in Knowledge Representation, pages 71–93. Kaufmann, Los Altos, CA.

Jesse Dunietz, Gregory Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and David A. Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7839–7859. Association for Computational Linguistics.

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.

Ulrich Furbach, Andrew S. Gordon, and Claudia Schon. 2015. Tackling benchmark problems of commonsense reasoning. In Proceedings of the Workshop on Bridging the Gap between Human and Automated Reasoning - A workshop of the 25th International Conference on Automated Deduction (CADE-25), Berlin, Germany, August 1, 2015, volume 1412 of CEUR Workshop Proceedings, pages 47–59. CEUR-WS.org.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, MRQA@EMNLP 2019, Hong Kong, China, November 4, 2019, pages 105–112. Association for Computational Linguistics.

G. Gentzen. 1935. Untersuchungen über das logische Schließen I. Mathematische Zeitschrift, 39:176–210.

G. Gentzen. 1969. Investigations into logical deduction. In M. E. Szabo, editor, The Collected Papers of Gerhard Gentzen, pages 68–213. North-Holland, Amsterdam.

Howard Gregory. 2015. Language and Logics: An Introduction to the Logical Foundations of Language. Edinburgh University Press.

W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Leon Henkin. 1950. Completeness in the theory of types. J. Symb. Log., 15(2):81–91.

Jerry R. Hobbs. 2006. Abduction in Natural Language Understanding, chapter 32. John Wiley & Sons, Ltd.

Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul A. Martin. 1993. Interpretation as abduction. Artif. Intell., 63(1-2):69–142.

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutiérrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, and Antoine Zimmermann. 2021. Knowledge graphs. ACM Comput. Surv., 54(4):71:1–71:37.

Arcchit Jain, Tal Friedman, Ondrej Kuzelka, Guy Van den Broeck, and Luc De Raedt. 2019. Scalable rule learning in probabilistic knowledge bases. In 1st Conference on Automated Knowledge Base Construction, AKBC 2019, Amherst, MA, USA, May 20-22, 2019.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 641–648. MIT Press.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics.

Iuliia Kotseruba and John K. Tsotsos. 2020. 40 years of cognitive architectures: core cognitive abilities and practical applications. Artif. Intell. Rev., 53(1):17–94.

Laura Kovács and Andrei Voronkov. 2013. First-order theorem proving and Vampire. In Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings, volume 8044 of Lecture Notes in Computer Science, pages 1–35. Springer.

John E. Laird, Allen Newell, and Paul S. Rosenbloom. 1987. SOAR: an architecture for general intelligence. Artif. Intell., 33(1):1–64.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2016. Building machines that learn and think like people. CoRR, abs/1604.00289v3.

A. H. Land and A. G. Doig. 1960. An automatic method of solving discrete programming problems. Econometrica, 28(3):497.

Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 573–581. The Association for Computational Linguistics.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5210–5217. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692v1.

Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P. Adams, and Ricky T. Q. Chen. 2020. SUMO: unbiased estimation of log marginal probability for latent variable models. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha P. Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matt Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil A. Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. 2018. Never-ending learning. Commun. ACM, 61(5):103–115.

Stephen Muggleton. 1991. Inductive logic programming. New Gener. Comput., 8(4):295–318.

Stephen Muggleton. 1996. Stochastic logic programs. In Advances in Inductive Logic Programming, pages 254–264.

Allen Newell and Herbert A. Simon. 1976. Computer science as empirical inquiry: Symbols and search. Commun. ACM, 19(3):113–126.

Mathias Niepert and Pedro M. Domingos. 2015. Learning and inference in tractable probabilistic knowledge bases. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands, pages 632–641. AUAI Press.

Mathias Niepert, Christian Meilicke, and Heiner Stuckenschmidt. 2012. Towards distributed MCMC inference in probabilistic knowledge bases. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, AKBC-WEKEX@NAACL-HLT 2012, Montréal, Canada, June 7-8, 2012, pages 1–6. Association for Computational Linguistics.

Terence Parsons. 1990. Events in the Semantics of English. MIT Press, Cambridge, MA.

Frank Pfenning. 2004. Natural deduction. Lecture notes in 15-815 Automated Theorem Proving.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Hongyu Ren, Weihua Hu, and Jure Leskovec. 2020. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Christian P. Robert and George Casella. 2004. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer.

Tim Rocktäschel and Sebastian Riedel. 2017. End-to-end differentiable proving. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 3788–3800.

Stuart J. Russell and Peter Norvig. 2010. Artificial Intelligence - A Modern Approach, Third International Edition. Pearson Education.

Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, and Mohit Bansal. 2020. PRover: Proof generation for interpretable reasoning over rules. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 122–136. Association for Computational Linguistics.

Abulhair Saparov, Vijay A. Saraswat, and Tom M. Mitchell. 2017. A probabilistic generative grammar for semantic parsing. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 248–259. Association for Computational Linguistics.

Taisuke Sato, Yoshitaka Kameya, and Neng-Fa Zhou. 2005. Generative modeling with failure in PRISM. In IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005, pages 847–852. Professional Book Center.

Stephan Schulz, Simon Cruanes, and Petar Vukmirovic. 2019. Faster, higher, stronger: E 2.3. In Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction, Natal, Brazil, August 27-30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science, pages 495–507. Springer.

Haitian Sun, Andrew O. Arnold, Tania Bedrax-Weiss, Fernando Pereira, and William W. Cohen. 2020. Faithful embeddings for knowledge base queries. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 3621–3634. Association for Computational Linguistics.

Ronen Tamari, Chen Shani, Tom Hope, Miriam R. L. Petruck, Omri Abend, and Dafna Shahaf. 2020. Language (re)modelling: Towards embodied language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6268–6281. Association for Computational Linguistics.

Wikimedia Foundation. 2020. Wiktionary data dumps.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754–5764.

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. 2020. CLEVRER: collision events for video representation and reasoning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 2, pages 1050–1055. AAAI Press / The MIT Press.