Natural Language Inference in the Real World
Ellie Pavlick
University of Pennsylvania
Natural Language Inference
There is no man pointing at a car.
A man is pointing at a silver sedan.
[Figure 1: derivation of "A record date". Lexical entries: A := NP/N: λq.λp.(⟨x | ⟩ ; q@x ; p@x); record := N/N: λp.λx.(⟨y | record(y), nn(y,x)⟩ ; p@x); date := N: λx.date(x). Function application [fa] and merge-reduction [merge] yield N: record date := λx.⟨y | record(y), nn(y,x), date(x)⟩ and finally NP: A record date := λp.(⟨x, y | record(y), nn(y,x), date(x)⟩ ; p@x).]
Figure 1: Derivation with λ-DRSs, including β-conversion, for "A record date". Combinatory rules are indicated by solid lines, semantic rules by dotted lines.
DRS composed by the α-operator. The merge conjoins two DRSs into a larger DRS; semantically, the merge is interpreted as (dynamic) logical conjunction. Merge-reduction is the process of eliminating the merge operation by forming a new DRS resulting from the union of the domains and conditions of the argument DRSs of a merge, respectively (obeying certain constraints). Figure 1 illustrates the syntax-semantics interface (and merge-reduction) for a derivation of a simple noun phrase.
Boxer adopts Van der Sandt's view of presupposition as anaphora (Van der Sandt, 1992), in which presuppositional expressions are either resolved to previously established discourse entities or accommodated on a suitable level of discourse. Van der Sandt's proposal is cast in DRT, and is therefore relatively easy to integrate into Boxer's semantic formalism. The α-operator indicates information that has to be resolved in the context, and is lexically introduced by anaphoric or presuppositional expressions. A DRS constructed with α resembles the proto-DRS of Van der Sandt's theory of presupposition (Van der Sandt, 1992), although they are syntactically defined in a slightly different way to overcome problems with free and bound variables, following Bos (2003). Note that the difference between anaphora and presupposition collapses in Van der Sandt's theory.
The types are the ingredients of a typed lambda calculus that is employed to construct DRSs in a bottom-up, compositional way. The language of lambda-
Bos (2008)
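The merge-reduction described above is easy to state operationally. Below is a minimal sketch, assuming a DRS is modeled as a pair of a referent set (domain) and a condition set; the class and function names are illustrative, not Boxer's actual implementation, and the constraint checks Bos mentions are omitted.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DRS:
    domain: frozenset = field(default_factory=frozenset)      # discourse referents
    conditions: frozenset = field(default_factory=frozenset)  # conditions on them

def merge(b1: DRS, b2: DRS) -> DRS:
    # Merge-reduction: union the domains and union the conditions.
    # (Constraints on free/bound variables are omitted in this sketch.)
    return DRS(b1.domain | b2.domain, b1.conditions | b2.conditions)

# "A record date", following Figure 1: the determiner introduces x,
# the noun compound introduces y with its conditions.
np_drs = merge(DRS(frozenset({"x"})),
               DRS(frozenset({"y"}), frozenset({"record(y)", "nn(y,x)", "date(x)"})))
print(sorted(np_drs.domain), sorted(np_drs.conditions))
# ['x', 'y'] ['date(x)', 'nn(y,x)', 'record(y)']
```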
[Fig. 6 schematic: the LAP (tokenizer, tagger, NER, parser, coreference resolution) feeds the EDA, which generates parse-tree derivations (derived trees, derivation steps from T to H), searches the tree space for good candidates, and classifies them; syntactic and lexical knowledge COMPONENTS supply entailment rules; input is the initial parse tree of T&H, output the entailment decision.]
Fig. 6. System decomposition for the transformation-based BIUTEE system
7 System Mapping
In this section, we show the practical usability of our architecture. We illustrate how three existing RTE systems can be mapped concretely onto our architecture: BIUTEE, a transformation-based system; EDITS, an alignment-based system; and TIE, a multi-stage classification system, which represent different approaches to RTE. We also discuss mapping for formal reasoning-based systems.
7.1 BIUTEE
BIUTEE (Stern and Dagan, 2011) is a system for recognizing textual entailment based on the transformation-based approach outlined in Section 2.2, developed at Bar-Ilan University (BIU).12 The system derives the Hypothesis (H) from the Text (T) with a sequence of rewrite steps. Since BIUTEE represents H and T as dependency trees, these rewrite steps can either involve just lexical knowledge (node rewrites) or lexico-syntactic knowledge (subtree rewrites). Additional operations (e.g., the insertion of syntactic structure or modification of POS types) ensure the existence of a derivation for any T-H pair. The likelihood that a derivation preserves entailment is finally calculated by a confidence model.
Figure 6 shows the decomposition of BIUTEE in terms of the EXCITEMENT architecture. The LAP provides all required annotations, including parse trees. The EDA encapsulates the "top level" of the algorithm. It consists of three parts: a generation part where rewrite steps are performed, i.e., entailed trees are generated; a search part which selects good candidates among the generated derivations; and a classifier. The generation part relies on entailment rules which are stored modularly in (lexico-)syntactic knowledge components and lexical knowledge components. The
12 BIUTEE 2.4.1 is downloadable from http://www.cs.biu.ac.il/~nlp/downloads/biutee/
Padó et al. (2015)
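To make the generate-search-classify loop above concrete, here is a minimal sketch of a transformation-based derivation search in the spirit of BIUTEE. Trees are simplified to token tuples and entailment rules to (lhs, rhs) rewrites; all names are illustrative assumptions, not the real BIUTEE API, and the confidence model is replaced by a simple overlap heuristic.

```python
def apply_rule(tokens, rule):
    lhs, rhs = rule
    return tuple(rhs if t == lhs else t for t in tokens)

def derive(t, h, rules, beam=10, max_steps=3):
    """Search for a sequence of rewrite steps turning T into H."""
    frontier = [(t, [])]  # (current "tree", derivation so far)
    for _ in range(max_steps):
        candidates = [(apply_rule(tokens, r), deriv + [r])
                      for tokens, deriv in frontier for r in rules]
        # "good candidates": keep derivations whose trees overlap most with H;
        # BIUTEE instead scores derivations with a learned confidence model.
        candidates.sort(key=lambda c: -len(set(c[0]) & set(h)))
        frontier = candidates[:beam]
        for tokens, deriv in frontier:
            if tokens == h:
                return deriv
    return None

t = tuple("a man is pointing at a silver sedan".split())
h = tuple("a man is pointing at a silver car".split())
print(derive(t, h, rules=[("sedan", "car"), ("man", "person")]))
# [('sedan', 'car')]
```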
[Figure 1 schematic: token inputs x1–x9 with LSTM cell states c1–c9 and output vectors h1–h9 over the pair "A wedding party taking pictures :: Someone got married" (Premise :: Hypothesis); panels (A) Conditional Encoding, (B) Attention, (C) Word-by-word Attention.]
Figure 1: Recognizing textual entailment using (A) conditional encoding via two LSTMs, one over the premise and one over the hypothesis conditioned on the representation of the premise (c5), (B) attention only based on the last output vector (h9) or (C) word-by-word attention based on all output vectors of the hypothesis (h7, h8 and h9).
state is initialized with the last cell state of the previous LSTM (c5 in the example), i.e. it is conditioned on the representation that the first LSTM built for the premise (A). We use word2vec vectors (Mikolov et al., 2013) as word representations, which we do not optimize during training. Out-of-vocabulary words in the training set are randomly initialized by sampling values uniformly from (−0.05, 0.05) and optimized during training.1 Out-of-vocabulary words encountered at inference time on the validation and test corpus are set to fixed random vectors. By not tuning representations of words for which we have word2vec vectors, we ensure that at inference time their representation stays close to unseen similar words for which we have word2vec embeddings. We use a linear layer to project word vectors to the dimensionality of the hidden size of the LSTM, yielding input vectors x_i. Finally, for classification we use a softmax layer over the output of a non-linear projection of the last output vector (h9 in the example) into the target space of the three classes (ENTAILMENT, NEUTRAL or CONTRADICTION), and train using the cross-entropy loss.
2.3 ATTENTION
Attentive neural networks have recently demonstrated success in a wide range of tasks ranging from handwriting synthesis (Graves, 2013), digit classification (Mnih et al., 2014), machine translation (Bahdanau et al., 2015), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015) and sentence summarization (Rush et al., 2015), to geometric reasoning (Vinyals et al., 2015). The idea is to allow the model to attend over past output vectors (see Figure 1 B), thereby mitigating the LSTM's cell state bottleneck. More precisely, an LSTM with attention for RTE does not need to capture the whole semantics of the premise in its cell state. Instead, it is sufficient to output vectors while reading the premise and accumulating a representation in the cell state that informs the second LSTM which of the output vectors of the premise it needs to attend over to determine the RTE class.
Let Y ∈ R^{k×L} be a matrix consisting of output vectors [h_1 ⋯ h_L] that the first LSTM produced when reading the L words of the premise, where k is a hyperparameter denoting the size of embeddings and hidden layers. Furthermore, let e_L ∈ R^L be a vector of 1s and h_N be the last output vector after the premise and hypothesis were processed by the two LSTMs respectively. The attention mechanism will produce a vector α of attention weights and a weighted representation r of the premise via

M = tanh(W^y Y + W^h h_N ⊗ e_L)    M ∈ R^{k×L}    (7)
α = softmax(w^T M)                 α ∈ R^L        (8)
r = Y α^T                          r ∈ R^k        (9)
1 We found 12.1k words in SNLI for which we could not obtain word2vec embeddings, resulting in 3.65M tunable parameters.
Rocktäschel et al. (2016)
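Equations (7)–(9) are straightforward to implement. The following NumPy sketch mirrors them directly; the parameter names W_y, W_h and w follow the paper, while the random toy inputs are, of course, hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def premise_attention(Y, h_N, W_y, W_h, w):
    """Eqs. (7)-(9): Y is (k, L), h_N is (k,), W_y and W_h are (k, k), w is (k,)."""
    k, L = Y.shape
    # Eq. (7): the outer product h_N ⊗ e_L tiles h_N across all L premise positions.
    M = np.tanh(W_y @ Y + np.outer(W_h @ h_N, np.ones(L)))
    alpha = softmax(w @ M)   # Eq. (8): attention weights over premise words, shape (L,)
    r = Y @ alpha            # Eq. (9): attention-weighted premise representation, shape (k,)
    return alpha, r

rng = np.random.default_rng(0)
k, L = 4, 6
alpha, r = premise_attention(rng.normal(size=(k, L)), rng.normal(size=k),
                             rng.normal(size=(k, k)), rng.normal(size=(k, k)),
                             rng.normal(size=k))
assert np.isclose(alpha.sum(), 1.0) and r.shape == (k,)
```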
Natural Language Inference
Natural Language Inference. (MacCartney PhD Thesis 2009)
No man is pointing at a car.
A man is pointing at a silver sedan.
A man is pointing at a silver car.   SUB(sedan, car)   ⊏   yes
A man is pointing at a car.          DEL(silver)       ⊏   yes
No man is pointing at a car.         INS(no)           ^   no
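A minimal sketch of how MacCartney-style edit-sequence inference composes these per-edit relations into a sentence-level decision. The join table below covers only the three compositions needed for this example (MacCartney's full join table covers all 7×7 relation pairs), so treat the fallback to independence as a simplifying assumption.

```python
# Relation symbols follow MacCartney's natural logic.
EQ, FWD, NEG, ALT, IND = "≡", "⊏", "^", "|", "#"

JOIN = {
    (EQ, FWD): FWD,   # ≡ then ⊏ gives ⊏
    (FWD, FWD): FWD,  # ⊏ then ⊏ gives ⊏
    (FWD, NEG): ALT,  # ⊏ then ^ gives | (alternation, i.e. contradiction)
}

def compose(edit_relations):
    """Join per-edit entailment relations left to right, starting from ≡."""
    rel = EQ
    for r in edit_relations:
        rel = JOIN.get((rel, r), IND)  # unhandled joins default to independence
    return rel

# Premise: "A man is pointing at a silver sedan."
# Hypothesis: "No man is pointing at a car."
edits = [FWD,  # SUB(sedan, car): sedan ⊏ car
         FWD,  # DEL(silver): dropping a restrictive modifier
         NEG]  # INS(no): negating the sentence
print(compose(edits))  # "|" -> the hypothesis contradicts the premise: answer "no"
```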
Natural Language Inference
A man is pointing at a silver sedan.
Jimmy Dean refused to dance without pants.
Natural Language Inference. (MacCartney PhD Thesis 2009)
Natural Language Inference
A man is pointing at a silver sedan.
Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.
Natural Language Inference
Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.
Question Answering
Did the council consider the environmental effects?
Yes/No
Natural Language Inference
Summarization
They argued that the council didn’t consider environmental effects.
Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.
Natural Language Inference
Dialogue
Remove from calendar?
I decided I don’t want to go to that party on Saturday.
Wish List
• Models of language that are descriptive rather than prescriptive, based on judgements by real people rather than linguists
• Methods that are derived from data, rather than reliant on lexicons or ontologies
• Natural language itself as a logical form, rather than requiring an intermediate representation
Wish List
• We want to get out of the "lab" and model language that people actually use
• We want models of language that are descriptive rather than prescriptive, based on judgements by real people rather than linguists
• We want methods that are derived from data, rather than reliant on lexicons or ontologies
Wish List
• We want to get out of the "lab" and model language that people actually use
• We want rules for logical composition that are descriptive rather than prescriptive, based on judgements by real people
• We want methods that are derived from data, rather than reliant on lexicons or ontologies
Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.
The council considered environmental consequences.

lexical semantics: SUB(effects, consequences)
modifiers: INS(environmental)
(non-subsective) modifiers: DEL(possible)
predicates: SUB(consider, consider)
"higher order" predicates: DEL(fail to)
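For illustration, the atomic edits above can be applied mechanically to a tokenized sentence. This sketch ignores phrase boundaries, POS and syntax, which real systems handle over aligned parse trees; the helper names and the tense-normalizing SUB at the end are hypothetical.

```python
def SUB(tokens, old, new):
    return [new if t == old else t for t in tokens]

def DEL(tokens, phrase):
    """Drop every occurrence of a (possibly multi-word) phrase."""
    out, skip, i = [], phrase.split(), 0
    while i < len(tokens):
        if tokens[i:i + len(skip)] == skip:
            i += len(skip)
        else:
            out.append(tokens[i]); i += 1
    return out

def INS(tokens, index, word):
    return tokens[:index] + [word] + tokens[index:]

s = "the council failed to consider possible effects".split()
s = SUB(s, "effects", "consequences")   # lexical semantics
s = INS(s, len(s) - 1, "environmental") # modifiers
s = DEL(s, "possible")                  # (non-subsective) modifiers
s = DEL(s, "failed to")                 # "higher order" predicates
s = SUB(s, "consider", "considered")    # predicates (tense normalization)
print(" ".join(s))
# the council considered environmental consequences
```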
Lexical Semantics

WordNet
[WordNet noun hierarchy fragment: entity > physical entity > person > male child > boy > Cub Scout; person > female child > girl; entity > abstraction > right > legal right; also cabotage, body of water.]
a boy is a male child
a Cub Scout is a boy
a girl is NOT a boy

Bilingual Pivoting (PPDB)
[Aligned bitext fragments used for pivoting: "ahogados a la playa ..." / "get washed up on beaches ..."; "... fünf Landwirte , weil ..." / "... 5 farmers were in Ireland ..."; "oder wurden , gefoltert" / "or have been , tortured"; "festgenommen" / "thrown into jail"; "festgenommen" / "imprisoned".]

Vector Space Models (word2vec)
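Bilingual pivoting, the technique behind PPDB, scores two English phrases as paraphrases when they share foreign translations: p(e2 | e1) = Σ_f p(e2 | f) p(f | e1). A toy sketch follows, with invented translation probabilities (the German pivot "festgenommen" echoes the example above); only the summation over pivots is faithful to the method.

```python
from collections import defaultdict

# Toy translation tables: p(f | e) and p(e | f). All numbers are invented.
p_f_given_e = {"thrown into jail": {"festgenommen": 0.6, "eingesperrt": 0.4}}
p_e_given_f = {"festgenommen": {"imprisoned": 0.5, "arrested": 0.3, "thrown into jail": 0.2},
               "eingesperrt": {"imprisoned": 0.7, "locked up": 0.3}}

def paraphrase_probs(e1):
    """p(e2 | e1) = sum over foreign pivot phrases f of p(e2 | f) * p(f | e1)."""
    probs = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                probs[e2] += p_e2 * p_f
    return dict(probs)

print(paraphrase_probs("thrown into jail"))
# {'imprisoned': 0.58, 'arrested': 0.18, 'locked up': 0.12}
```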
Lexical Semantics
[PPDB paraphrase candidates for "boy" (one slide-wide table): kid, girl, doggy, mother, sons, dude, pup, mommy, guys, fella, ah, father, males, laddie, infant, student, son, gentleman, pooch, person, child, male, yarn, type, boyfriend, youth, sonny, cheeky, hans, juvenile, childhood, buster, lad, toddler, husband, guy, puppy, servant, offspring, wraps, fellow, grandson, wee, gus, mate, colt, idiot, bollocks, little, darling, partner, teenager, bro, teen, toy, baby, bloke, junior, old-timer, waiter, boss, calf, men, kiddo, bit, protege, dog, apprentice, daughter, sage, buddy, brat, foal, kitty, friend, lapdog, bearer, bloodhound, does, children, shorty, homie, pops, bachelor, cub, youngster, soldier, chum, wolf, brother, sweetheart, blood, honey]

boy ⊏ kid    boy ⊏ person
boy ∣ girl    boy ∣ puppy
boy # toddler    boy # idiot

The Paraphrase Database
~100M word and phrase pairs
Natural Logic
[Figure residue: example word pairs spanning diverse relations — couch/sofa, crow/bird, unable/able, feline/carnivore/canine, hippo/food — motivating a relation inventory rich enough to capture these diverse relations.]
CHAPTER 5. ENTAILMENT RELATIONS (MacCartney, 2009)
[Figure 5.2 shows the 16 elementary set relations R0000–R1111 as Johnston diagrams.]
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Each box represents the universe U, and the two circles within the box represent the sets x and y. A region is white if it is empty, and shaded if it is non-empty. Thus in the diagram labeled R1101, only the region x − y is empty, indicating that ∅ ⊂ x ⊂ y ⊂ U.
…equivalence class in which only partition 10 is empty.) These equivalence classes are depicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of ordered pairs of sets. We will refer to these 16 set relations as the elementary set relations, and we will denote this set of 16 relations by R. By construction, the relations in R are both mutually exhaustive (every ordered pair of sets belongs to some relation in R) and mutually exclusive (no ordered pair of sets belongs to two different relations in R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
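The assignment of an ordered pair of sets to one of the 16 elementary relations can be computed directly from the emptiness pattern of the four partitions. A sketch, assuming the bit order (partition 00, 01, 10, 11) that makes the R1101 example in the caption come out right; the function name is illustrative.

```python
def elementary_relation(x, y, U):
    """Label (x, y) in universe U with its elementary set relation R_abcd.

    Each element of U falls into one partition, indexed by (in x?, in y?);
    the label records non-emptiness (1) or emptiness (0) of partitions
    00, 01, 10, 11 in turn (an assumed ordering consistent with the text's
    example: R1101 has only partition 10, i.e. x - y, empty).
    """
    partitions = [
        U - (x | y),  # 00: in neither set
        y - x,        # 01: in y only
        x - y,        # 10: in x only
        x & y,        # 11: in both
    ]
    return "R" + "".join("1" if p else "0" for p in partitions)

U = set(range(5))
print(elementary_relation({0, 1}, {0, 1, 2}, U))  # R1101: forward entailment
print(elementary_relation({0, 1}, {0, 1}, U))     # R1001: equivalence
print(elementary_relation({0, 1}, {2, 3}, U))     # R1110: alternation
```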
equivalence · forward entailment · reverse entailment · negation · alternation · independence
synonym · hypernymy (e.g. Thai ⊏ Asian) · hyponymy · antonym · shared hypernym · no path
Figure 2: Mappings of set-theoretic entailment relations onto the WordNet hierarchy. In the Venn diagrams, reproduced from (MacCartney, 2009), the left circle represents states in which x is true and the right circle states in which y is true. The shaded area represents all possible true states. E.g. when x ≡ y (equivalence), in every true state, either both x and y are true, or neither is.
3 Entailment Relations
We use the relations from Bill MacCartney's thesis on natural language inference as the basis for our categorization of relations (MacCartney, 2009). MacCartney's work focused on integrating the semantic properties previously employed by systems for question answering (Harabagiu and Hickl, 2006) and RTE (Bar-Haim et al., 2007) within the formal theory of natural logic (Lakoff, 1972). As a result, he provides a simple framework which models lexical entailment in 7 "basic entailment relationships":
Equivalence (≡): if X is true then Y is true, and if Y is true then X is true.
Forward entailment (⊏): if X is true then Y is true, but if Y is true then X may or may not be true.
Reverse entailment (⊐): if Y is true then X is true, but if X is true then Y may or may not be true.
Negation (^): if X is true then Y is false, and if Y is false then X is true; either X or Y must be true.
Alternation (|): if X is true then Y is false, but if Y is false then X may or may not be true.
Cover (⌣): if X is true then Y may or may not be true, and if Y is true then X may or may not be true; either X or Y must be true. We omit this relation, since its applicability to RTE is not clear.
Independence (#): if X is true then Y may or may not be true, and if Y is true then X may or may not be true.
4 Extracting entailment relations from WordNet
4.1 Mapping onto Natural Logic relations
We would like to train a model to automatically distinguish between the relationships described above. In order to gather labelled training data, we first look to the information available in the existing WordNet hierarchy. For roughly 2.5 million (60%) of the noun pairs in PPDB, both nouns appear in WordNet (although not necessarily in the same synset). We use the rules in Table 2 (shown graphically in Figure 2) to map a pair of nodes in the WordNet noun hierarchy onto one of the basic entailment relations described in section 3.
Other Relatedness In addition to MacCartney's relations, we define a sixth catch-all category for terms which are flagged as related by WordNet but whose relation is not built into the hierarchical structure. These noun pairs do not meet the criteria of the basic entailment relations but carry more information than do truly independent terms. We combine holonymy (part/whole relationships), attributes (adjectives closely tied to a specific noun), and derivationally related terms as "other" relations.
4.2 Shortcomings of WordNet labeling
Our definitions give the desired results for the ≡, ⊐, ⊏, and "other" categories. Table 1 shows some examples of nouns in PPDB which were assigned to each of these labels. However, whereas MacCartney's alternation is strictly contradictory (X → ¬Y), co-hyponyms of a common parent in WordNet do not necessarily have this property. Some examples behave well (e.g. lunch and dinner share the parent meal), but others are better labelled as ⊏ or # (e.g. boy and guy share the parent man, and brother and worker share the parent person). As a result, the WordNet alternation provides no information in terms of entailment, behaving instead like MacCartney's independence (i.e. "if X is true then Y may or may not be true").
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
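A rough sketch of the kind of WordNet-to-natural-logic mapping described in Section 4.1, using NLTK's WordNet interface. This approximates Pavlick et al.'s Table 2 at the word level rather than the synset level, and the co-hyponym test only checks direct shared hypernyms; both are simplifying assumptions.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def natural_logic_relation(x, y):
    """Approximately map a WordNet noun pair to a MacCartney relation."""
    sx, sy = set(wn.synsets(x, "n")), set(wn.synsets(y, "n"))
    if sx & sy:
        return "≡"   # shared synset: synonyms
    hypers = lambda s: s.hypernyms()
    if any(t in s.closure(hypers) for s in sx for t in sy):
        return "⊏"   # y is a hypernym of x
    if any(t in s.closure(hypers) for s in sy for t in sx):
        return "⊐"   # y is a hyponym of x
    x_antonyms = {a.name() for s in sx for l in s.lemmas() for a in l.antonyms()}
    if any(l.name() in x_antonyms for s in sy for l in s.lemmas()):
        return "^"   # antonyms
    if any(set(s1.hypernyms()) & set(s2.hypernyms()) for s1 in sx for s2 in sy):
        return "|"   # shared hypernym: the problematic alternation of Section 4.2
    return "#"       # no path: independence

print(natural_logic_relation("boy", "person"))  # ⊏ (boy < male child < ... < person)
```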
Natural Logic
to capture these diverse relations.
couch !
sofa
crow
bird
unableable
feline
carni vore
canine
hippo food
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
equivalence
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
forward entailment
reverse entailment
negation
alternation
independence no pathhypernymy
hyponymy shared hypernym
antonymsynonym
Thai
Asian
Figure 2: Mappings of set-theorhetic entailment relationsonto the WordNet hierarchy. In the Venn diagrams, re-produced from (MacCartney, 2009), the left circle repre-sents states in which x is true and the right circle states inwhich y is true. The shaded area represents all possibletrue states. E.g. when x ⌘ y (equivalence), in every truestate, either both x and y are true, or neither is.
3 Entailment Relations
We use the relations from Bill MacCartney’s the-sis on natural language inference as the basis forour categorization of relations (MacCartney, 2009).MacCartney’s work focused on integrating the se-mantic properties previously employed by systemsfor question answering (Harabagiu and Hickl, 2006)and RTE (Bar-Haim et al., 2007) within the formaltheory of natural logic (Lakoff, 1972). As a result,he provides a simple framework which models lexi-cal entailment in 7 “basic entailment relationships”:
Equivalence (⌘): if X is true then Y is true, andif Y is true then X is true.
Forward entailment (@): if X is true then Y istrue, but if Y is true then X may or may not be true.
Reverse entailment (A): if Y is true then X istrue, but if X is true then Y may or may not be true.
Negation (^): if X is true then Y is false, and ifY is false then X is true; either X or Y must be true.
Alternation (|): if X is true then Y is false, but ifY is false then X may or may not be true.
Cover (^): if X is true then Y may or may not betrue, and if Y is true then X may or may not be true;
either X or Y must be true. We omit this relation,since its applicability to RTE is not clear.
Independence (#): if X is true then Y may ormay not be true, and if Y is true then X may or maynot be true.
4 Extracting entailment relations fromWordNet
4.1 Mapping onto Natural Logic relations
We would like to train a model to automatically dis-tinguish between the relationships described above.In order to gather labelled training data, we first lookto the information available in the existing Word-Net hierarchy. For roughly 2.5 million (60%) of thenoun pairs in PPDB, both nouns appear in WordNet(although not necessarily in the same synset). Weuse the rules in Table 2 (shown graphically in Fig-ure 2) to map a pair of nodes in the WordNet nounhierarchy onto one of the basic entailment relationsdescribed in section 3.
Other Relatedness In addition to MacCartney’srelations, we define a sixth catch-all category forterms which are flagged as related by WordNet butwhose relation is not built into the hierarchical struc-ture. These noun pairs do not meet the criteria of thebasic entailment relations but carry more informa-tion than do truly independent terms. We combineholonymy (part/whole relationships), attributes (ad-jectives closely tied to a specific noun), and deriva-tionally related terms as “other” relations.
4.2 Shortcomings of WordNet labeling
Our definitions give the desired results for the ⌘,A, @, and “other” categories. Table 1 shows someexamples of nouns in PPDB which were assignedto each of these labels. However, whereas Mac-Cartney’s alternation is strictly contradictory (X !¬Y ), co-hyponyms of a common parent in WordNetdo not necessarily have this property. Some exam-ples behave well (e.g. lunch and dinner share theparent meal), but others are better labelled as @ or #(e.g. boy and guy share the parent man and brotherand worker share the parent person). As a result,the WordNet alternation provides no information interms of entailment, behaving instead like MacCart-ney’s independence (i.e. “if X is true then Y may or
Equivalence
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
Natural Logic
to capture these diverse relations.
couch !
sofa
crow
bird
unableable
feline
carni vore
canine
hippo food
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
equivalence
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
CHAPTER 5. ENTAILMENT RELATIONS 71
R0000 R0001 R0010 R0011
R0100 R0101 R0111
R1000 R1010 R1011
R1100 R1101 R1110 R1111
R0110
R1001
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .
equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.
Figure 2: Mappings of set-theoretic entailment relations onto the WordNet hierarchy: synonym → equivalence, hyponymy → forward entailment, hypernymy → reverse entailment, antonym → negation, shared hypernym → alternation, no path → independence (e.g. Thai/Asian as a hyponym/hypernym pair). In the Venn diagrams, reproduced from (MacCartney, 2009), the left circle represents states in which x is true and the right circle states in which y is true. The shaded area represents all possible true states. E.g. when x ≡ y (equivalence), in every true state, either both x and y are true, or neither is.
3 Entailment Relations
We use the relations from Bill MacCartney's thesis on natural language inference as the basis for our categorization of relations (MacCartney, 2009). MacCartney's work focused on integrating the semantic properties previously employed by systems for question answering (Harabagiu and Hickl, 2006) and RTE (Bar-Haim et al., 2007) within the formal theory of natural logic (Lakoff, 1972). As a result, he provides a simple framework which models lexical entailment in 7 "basic entailment relationships" (a sketch of these definitions as set tests follows the list):
Equivalence (≡): if X is true then Y is true, and if Y is true then X is true.

Forward entailment (⊏): if X is true then Y is true, but if Y is true then X may or may not be true.

Reverse entailment (⊐): if Y is true then X is true, but if X is true then Y may or may not be true.

Negation (^): if X is true then Y is false, and if Y is false then X is true; either X or Y must be true.

Alternation (|): if X is true then Y is false, but if Y is false then X may or may not be true.

Cover (⌣): if X is true then Y may or may not be true, and if Y is true then X may or may not be true; either X or Y must be true. We omit this relation, since its applicability to RTE is not clear.

Independence (#): if X is true then Y may or may not be true, and if Y is true then X may or may not be true.
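Read as sets of possible states (the Venn-diagram view in Figure 2), the seven definitions reduce to set tests. A minimal sketch, assuming x and y are non-empty proper subsets of the universe of states (MacCartney's non-vacuity assumption); the function name is mine:

```python
def basic_entailment_relation(x: set, y: set, universe: set) -> str:
    """Classify MacCartney's basic entailment relation between X and Y,
    viewed as the sets of states in which each is true."""
    if x == y:
        return "equivalence (≡)"
    if x < y:                           # every X-state is a Y-state, not vice versa
        return "forward entailment (⊏)"
    if x > y:
        return "reverse entailment (⊐)"
    disjoint = not (x & y)              # X and Y are never true together
    exhaustive = (x | y) == universe    # either X or Y must be true
    if disjoint and exhaustive:
        return "negation (^)"
    if disjoint:
        return "alternation (|)"
    if exhaustive:
        return "cover (⌣)"
    return "independence (#)"

states = set(range(10))
assert basic_entailment_relation({0, 1}, {0, 1, 2}, states) == "forward entailment (⊏)"
assert basic_entailment_relation({0, 1}, states - {0, 1}, states) == "negation (^)"
```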
4 Extracting entailment relations from WordNet
4.1 Mapping onto Natural Logic relations
We would like to train a model to automatically distinguish between the relationships described above. In order to gather labelled training data, we first look to the information available in the existing WordNet hierarchy. For roughly 2.5 million (60%) of the noun pairs in PPDB, both nouns appear in WordNet (although not necessarily in the same synset). We use the rules in Table 2 (shown graphically in Figure 2) to map a pair of nodes in the WordNet noun hierarchy onto one of the basic entailment relations described in section 3.
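A rough sketch of such a mapping using NLTK's WordNet interface; the specific rules below are an approximation of the idea, not the paper's exact Table 2:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def wordnet_relation(noun_x: str, noun_y: str) -> str:
    """Roughly map a WordNet noun pair onto a basic entailment relation."""
    sx = wn.synsets(noun_x, pos=wn.NOUN)
    sy = wn.synsets(noun_y, pos=wn.NOUN)
    if not sx or not sy:
        return "independence (#)"            # no path
    if set(sx) & set(sy):
        return "equivalence (≡)"             # synonyms: share a synset
    hypers = lambda s: set(s.closure(lambda t: t.hypernyms()))
    if any(b in hypers(a) for a in sx for b in sy):
        return "forward entailment (⊏)"      # x is a hyponym of y
    if any(a in hypers(b) for a in sx for b in sy):
        return "reverse entailment (⊐)"      # x is a hypernym of y
    antonyms = {ant.synset() for s in sx for l in s.lemmas() for ant in l.antonyms()}
    if antonyms & set(sy):
        return "negation (^)"                # antonyms
    if any(set(a.hypernyms()) & set(b.hypernyms()) for a in sx for b in sy):
        return "alternation (|)"             # co-hyponyms of a shared parent
    return "independence (#)"

print(wordnet_relation("crow", "bird"))      # expected: forward entailment (⊏)
```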
Other Relatedness In addition to MacCartney's relations, we define a sixth catch-all category for terms which are flagged as related by WordNet but whose relation is not built into the hierarchical structure. These noun pairs do not meet the criteria of the basic entailment relations but carry more information than do truly independent terms. We combine holonymy (part/whole relationships), attributes (adjectives closely tied to a specific noun), and derivationally related terms as "other" relations.
4.2 Shortcomings of WordNet labeling
Our definitions give the desired results for the ≡, ⊐, ⊏, and "other" categories. Table 1 shows some examples of nouns in PPDB which were assigned to each of these labels. However, whereas MacCartney's alternation is strictly contradictory (X → ¬Y), co-hyponyms of a common parent in WordNet do not necessarily have this property. Some examples behave well (e.g. lunch and dinner share the parent meal), but others are better labelled as ⊏ or # (e.g. boy and guy share the parent man, and brother and worker share the parent person). As a result, the WordNet alternation provides no information in terms of entailment, behaving instead like MacCartney's independence (i.e. "if X is true then Y may or may not be true").
Natural Logic to capture these diverse relations. Slide examples by category: Equivalence (couch ≡ sofa); Entailment (crow/bird, feline/carnivore); Exclusion (able/unable, feline/canine); Independent (hippo/food).

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al., ACL 2015)
Monolingual Features: symmetric and asymmetric similarities based on dependency context, e.g. cosine: 0.431, lin: 0.544, balprec: 0.396, weeds: 0.289, … Each phrase (e.g. kid, little girl) is represented as a binary vector over dependency contexts such as nsubj-draw, lost-dep, nsubj-vomit, poss-shoe, dep-brother, nsubjpass-buckled, nsubj-yell, rcmod-tired, …

Discovery of Inference Rules from Text. (Lin and Pantel, SIGKDD 2001)
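A minimal sketch of the two kinds of monolingual scores named on the slide: a symmetric cosine and an asymmetric, Weeds-precision-style inclusion score over sparse dependency-context vectors (the context counts below are invented for illustration):

```python
import math
from collections import Counter

# Toy dependency-context count vectors for two phrases.
kid         = Counter({"nsubj-draw": 3, "poss-shoe": 1, "nsubj-yell": 2, "dep-brother": 1})
little_girl = Counter({"nsubj-draw": 2, "poss-shoe": 2, "rcmod-tired": 1, "nsubj-vomit": 1})

def cosine(u: Counter, v: Counter) -> float:
    """Symmetric similarity: angle between the two context vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def weeds_precision(u: Counter, v: Counter) -> float:
    """Asymmetric: how much of u's context weight is covered by v's contexts?
    High values suggest u's contexts are included in v's (u ⊏ v)."""
    total = sum(u.values())
    shared = sum(w for f, w in u.items() if f in v)
    return shared / total if total else 0.0

print(cosine(kid, little_girl), weeds_precision(little_girl, kid))
```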
Monolingual Features: lexico-syntactic patterns. Example sentence: "…if this can happen to my little girl, it can happen to other kids…" (connected by advcl and nmod dependency arcs), yielding path features such as:

[X]<-nmod-[happen]-advcl->[happen]-nmod->[Y]:1
[X]<-conj-[Y]:1
[X]<-pobj-[as]<-prep-[know]<-rcmod-[Y]:1

Automatic acquisition of hyponyms from large text corpora. (Hearst, COLING 1992)
Semantic taxonomy induction from heterogenous evidence. (Snow et al., ACL 2006)
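For intuition, the classic surface form of such patterns is Hearst's: fixed lexico-syntactic templates that signal hyponymy. A toy regex sketch (illustrative only; the slide's dependency-path features generalize the same idea):

```python
import re

# (pattern, which capture group is the hyponym)
HEARST_PATTERNS = [
    (r"(\w+) such as (\w+)", "second"),      # "fruit such as apples"
    (r"(\w+) , including (\w+)", "second"),  # "metals , including copper"
    (r"(\w+) and other (\w+)", "first"),     # "dogs and other animals"
]

def extract_pairs(text: str):
    """Return (hyponym, hypernym) pairs matched by the templates."""
    pairs = []
    for pattern, hypo_slot in HEARST_PATTERNS:
        for a, b in re.findall(pattern, text):
            hypo, hyper = (b, a) if hypo_slot == "second" else (a, b)
            pairs.append((hypo, hyper))
    return pairs

print(extract_pairs("They grow fruit such as apples and keep dogs and other animals ."))
# [('apples', 'fruit'), ('dogs', 'animals')]
```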
Bilingual Features: phrases that share foreign translations in parallel corpora are candidate paraphrases. Slide examples: "ahogados a la playa …" / "get washed up on beaches …"; "… fünf Landwirte , weil …" / "… 5 farmers were in Ireland …"; "oder wurden , gefoltert" / "or have been , tortured"; "festgenommen" / "thrown into jail"; "festgenommen" / "imprisoned".

Paraphrasing with bilingual parallel corpora. (Bannard and Callison-Burch, ACL 2005)
PPDB: The paraphrase database. (Ganitkevitch et al., NAACL 2013)
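The bilingual signal comes from the pivot method: the probability that e2 paraphrases e1 is estimated by marginalizing over their shared foreign translations f, p(e2 | e1) = Σ_f p(e2 | f) p(f | e1). A toy sketch (the German pivots echo the slide; the probabilities are invented for illustration):

```python
from collections import defaultdict

# Toy translation tables: p(f | e1) and p(e2 | f).
p_f_given_e = {"thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4}}
p_e_given_f = {
    "festgenommen": {"imprisoned": 0.5, "arrested": 0.3, "thrown into jail": 0.2},
    "inhaftiert":   {"imprisoned": 0.7, "jailed": 0.3},
}

def paraphrase_prob(e1: str) -> dict:
    """Pivot paraphrase scores: sum over shared foreign translations."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                scores[e2] += p_f * p_e2
    return dict(scores)

print(paraphrase_prob("thrown into jail"))
# {'imprisoned': 0.58, 'arrested': 0.18, 'jailed': 0.12}
```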
Figure 4: Confusion matrices for classifiers trained using only monolingual features (distributional and path), only bilingual features (paraphrase and translation), and all features. True labels are shown along rows, predicted along columns. The matrix is normalized along rows, so that the predictions for each (true) class sum to 100%.

Predicted label (monolingual features only):
      ≡    ⊐    ¬    #    ~
≡    58%  20%   4%  15%   3%
⊐    20%  51%   3%  18%   7%
¬    26%  14%  37%  17%   6%
#     8%  13%   2%  71%   6%
~    15%  21%   5%  36%  23%

Predicted label (bilingual features only):
      ≡    ⊐    ¬    #    ~
≡    62%  21%   5%   4%   8%
⊐    27%   5%   7%   7%  54%
¬     6%  14%  30%  36%  14%
#     1%   7%   6%  78%   8%
~     8%  19%   9%  30%  35%

Predicted label (all features):
      ≡    ⊐    ¬    #    ~
≡    83%  10%   0%   2%   4%
⊐     6%  76%   2%   7%   8%
¬     2%   8%  73%  13%   3%
#     1%   4%   2%  88%   6%
~     5%  10%   3%  18%  64%

Underlying raw counts (all features; rows = true label, columns = predicted indep. / equiv. / entail. / excl. / other, last column = row total):
equiv.     9   368    46    1    19  | 443
entail.   97    83  1004   29   108  | 1321
excl.     49     9    29  275    13  | 375
indep.  1730    15    82   35   114  | 1976
other    169    48    97   33   609  | 956

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al., ACL 2015)
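As a sanity check on the reconstruction above, row-normalizing the recovered all-features counts reproduces the percentages in the third matrix. A small sketch of the normalization the caption describes:

```python
# Each true-label row is divided by its row total, so that the
# predictions for each (true) class sum to 100%.
labels = ["indep.", "equiv.", "entail.", "excl.", "other"]
counts = {
    "equiv.":  [9, 368, 46, 1, 19],
    "entail.": [97, 83, 1004, 29, 108],
    "excl.":   [49, 9, 29, 275, 13],
    "indep.":  [1730, 15, 82, 35, 114],
    "other":   [169, 48, 97, 33, 609],
}

for true_label, row in counts.items():
    total = sum(row)
    pcts = [round(100 * c / total) for c in row]
    print(true_label, dict(zip(labels, pcts)))
# e.g. the "equiv." row normalizes to 2/83/10/0/4, matching the all-features matrix.
```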
Table 5: F1 measure (×100) achieved by the entailment classifier using 10-fold cross validation on the training data, and the change in F1 (ΔF1) when excluding each feature group (Lexical, Distributional, Path, Paraphrase, Translation, WordNet):

      All   Lex.  Dist.  Path  Para.  Tran.   WN
#      79   -2.0   -0.2  -1.2   -1.7   -0.2  -0.1
≡      57   -3.5   +0.2  -0.7   -2.4   -3.7  +0.5
⊐      68   -4.6   -0.3  -0.8   -0.8   -0.7  -1.6
¬      49   -4.0   -0.8  -2.9   +0.3   -0.0  -2.2
~      51   -4.9   -0.5  -0.7   -1.2   -0.9  -0.3
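A sketch of the Table 5 protocol under stated assumptions: per-class F1 from 10-fold cross-validation predictions, with scikit-learn and a logistic-regression stand-in for the paper's classifier (whose exact model the slide does not specify); the feature matrix X and label vector y are assumed given.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

def per_class_f1(X, y, classes):
    """Per-class F1 (x100) from 10-fold cross-validation predictions."""
    clf = LogisticRegression(max_iter=1000)   # stand-in model, not the paper's
    preds = cross_val_predict(clf, X, y, cv=10)
    scores = f1_score(y, preds, labels=classes, average=None)
    return {c: round(100 * s) for c, s in zip(classes, scores)}

# Usage: per_class_f1(X, y, ["#", "≡", "⊐", "¬", "~"]) yields the "All" column.
```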
Table 3 shines some light onto the differences between monolingual and bilingual similarities. While the monolingual asymmetric metrics are good for identifying ⊐ pairs, the symmetric metrics consistently identify ¬ pairs; none of the monolingual scores we explored were effective in making the subtle distinction between ≡ pairs and the other types of paraphrase. In contrast, the bilingual similarity metric is fairly precise for identifying ≡ pairs, but provides less information for distinguishing between types of non-equivalent paraphrase. These differences are further exhibited in the confusion matrices shown in Figure 4; when the classifier is trained using only monolingual features, it misclassifies 26% of ¬ pairs as ≡, whereas the bilingual features make this error only 6% of the time. On the other hand, the bilingual features completely fail to predict the ⊐ class, calling over 80% of such pairs ≡ or ~.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsically, through its ability to reproduce the human labels for the phrase pairs from the SICK test sentences. Table 7 shows the precision and recall achieved by the classifier for each of our 5 entailment classes. The classifier is able to achieve an overall 79% accuracy, reaching >70% precision while maintaining good levels of recall on all classes.
Table 6: Example misclassifications from some of the most frequent and most interesting error categories.

True  Pred.    N   Example misclassifications
~     #      169   boy/little, an empty/the air
#     ~      114   little/toy, color/hair
⊐     ~      108   drink/juice, ocean/surf
⊐     #       97   in front of/the face of, vehicle/horse
⊐     ≡       83   cat/kitten, pavement/sidewalk
≡     ⊐       46   big/grand, a girl/a young lady
⊐     ¬       29   kid/teenager, no small/a large
¬     ⊐       29   old man/young man, a car/a window
#     ≡       15   a person/one, a crowd/a large
≡     #        9   he is/man is, photo/still
≡     ¬        1   girl is/she is
Figure 4 shows the classifier's confusion matrix and Table 6 shows some examples of common and interesting error cases. The majority of errors (26%) come from confusing the ~ class with the # class. This mistake is not too concerning from an RTE perspective since ~ can be treated as a special case of # (Section 5). There are very few cases in which the classifier makes extreme errors, e.g. confusing ≡ with ¬ or with #; some interesting examples of such errors arise when the phrases contain pronouns (e.g. girl ≡ she) or when the relation uses a highly infrequent word sense (e.g. photo ≡ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the usefulness of the automatic entailment predictions in a downstream RTE task. We run our experiments using Nutcracker, a state-of-the-art RTE system based on formal semantics (Bjerva et al., 2014). In the SemEval 2014 RTE challenge, this system
equiv.entail. excl. other indep.Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
equiv.
entail.
excl.
other
indep.
Monolingual features
only
Bilingual features
only
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
equiv.
entail.
excl.
other
indep.
equiv.entail. excl. other indep.Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
equiv.
entail.
excl.
other
indep.
Monolingual features
only
Bilingual features
only
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
equiv.
entail.
excl.
other
indep.
equiv.entail. excl. other indep.Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
equiv.
entail.
excl.
other
indep.
Monolingual features
only
Bilingual features
only
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
Table 1
mono Predicted label (using monolingual features)
Predicted label (using bilingual features)
Predicted label (using all features)
ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%
hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%
exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%
ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%
oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%
bi
ind syn hyp exl oth
syn 0 3 1 0 0 4
hyp 2 7 1 2 13 24
exl 1 0 1 1 1 4
ind 15 0 1 1 2 20
oth 3 1 2 1 3 10
both
ind syn hyp exl oth
syn 9 368 46 1 19 443
hyp 97 83 1004 29 108 1321
exl 49 9 29 275 13 375
ind 1730 15 82 35 114 1976
oth 169 48 97 33 609 956
True
labe
l
Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.
� F1 when excludingAll Lex. Dist. Path Para. Tran. WN
# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3
Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.
Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.
7 Evaluation
7.1 Intrinsic Evaluation
We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve
an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.
True Pred. N Example misclassifications⇠ # 169 boy/little, an empy/the air# ⇠ 114 little/toy, color/hairA ⇠ 108 drink/juice, ocean/surfA # 97 in front of/the face of, vehicle/horseA ⌘ 83 cat/kitten, pavement/sidewalk⌘ A 46 big/grand, a girl/a young ladyA ¬ 29 kid/teenager, no small/a large¬ A 29 old man/young man, a car/a window# ⌘ 15 a person/one, a crowd/a large⌘ # 9 he is/man is, photo/still⌘ ¬ 1 girl is/she is
Table 6: Example misclassifications from some of the mostfrequent and most interesting error categories.
Figure 4 shows the classifier’s confusion ma-trix and Table 6 shows some examples of commonand interesting error cases. The majority of errors(26%) come from confusing the ⇠ class with the# class. This mistake is not too concerning froman RTE perspective since ⇠ can be treated as aspecial case of # (Section 5). There are very fewcases in which the classifier makes extreme errors,e.g. confusing ⌘ with ¬ or with #; some interest-ing examples of such errors arise when the phrasescontain pronouns (e.g. girl ⌘ she) or when therelation uses a highly infrequent word sense (e.g.photo ⌘ still).
7.2 The Nutcracker RTE System
To further test our classifier, we evaluate the use-fulness of the automatic entailment predictions ina downstream RTE task. We run our experimentsusing Nutcracker, a state-of-the-art RTE systembased on formal semantics (Bjerva et al., 2014).In the SemEval 2014 RTE challenge, this system
Figure 4 (data): confusion matrices; rows = true label, columns = predicted label, in the order ≡, ⊐, ¬, #, ~ (equiv., entail., excl., indep., other).

Predicted label (using monolingual features):
≡   58%  20%   4%  15%   3%
⊐   20%  51%   3%  18%   7%
¬   26%  14%  37%  17%   6%
#    8%  13%   2%  71%   6%
~   15%  21%   5%  36%  23%

Predicted label (using bilingual features):
≡   62%  21%   5%   4%   8%
⊐   27%   5%   7%   7%  54%
¬    6%  14%  30%  36%  14%
#    1%   7%   6%  78%   8%
~    8%  19%   9%  30%  35%

Predicted label (using all features):
≡   83%  10%   0%   2%   4%
⊐    6%  76%   2%   7%   8%
¬    2%   8%  73%  13%   3%
#    1%   4%   2%  88%   6%
~    5%  10%   3%  18%  64%
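The row-normalized percentages above follow a standard construction. A minimal sketch of how such a matrix is computed; the label names and toy data are illustrative, not the paper's actual pipeline:

```python
from collections import Counter

LABELS = ["equiv", "entail", "excl", "indep", "other"]

def confusion_matrix(true_labels, pred_labels):
    """Count (true, predicted) label pairs over two parallel lists."""
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in LABELS] for t in LABELS]

def row_normalize(matrix):
    """Turn counts into percentages so each true-label row sums to ~100%."""
    normalized = []
    for row in matrix:
        total = sum(row) or 1  # guard against empty classes
        normalized.append([round(100 * c / total) for c in row])
    return normalized

# Tiny usage example with made-up labels:
y_true = ["equiv", "equiv", "excl", "indep"]
y_pred = ["equiv", "entail", "equiv", "indep"]
for row in row_normalize(confusion_matrix(y_true, y_pred)):
    print(row)
```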
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
Predicting entailment relations
Equivalent             Entailment         Exclusion            Other              Independent
look at/watch          little girl/girl   close/open           swim/water         girl/play
a person/someone       kuwait/country     minimal/significant  husband/marry      found/party
clean/cleanse          tower/building     boy/young girl       oil/oil price      man/talk
distant/remote         sneaker/footwear   nobody/someone       country/patriotic  profit/year
phone/telephone        heroin/drug        blue/green           drive/vehicle      holiday/series
last autumn/last fall  typhoon/storm      france/germany       playing/toy        city/south
Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)
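To make the setup concrete, here is a minimal sketch of a 5-way classifier over phrase-pair features. The feature names and values are invented placeholders for the monolingual and bilingual signals discussed above, and scikit-learn merely stands in for whatever learner was actually used:

```python
from sklearn.linear_model import LogisticRegression

CLASSES = ["Equivalent", "Entailment", "Exclusion", "Other", "Independent"]

# Each row: [distributional_sim, path_sim, paraphrase_score, translation_score]
# (toy values for one example pair per class)
X_train = [[0.9, 0.8, 0.95, 0.9],   # look at / watch
           [0.7, 0.9, 0.60, 0.4],   # little girl / girl
           [0.6, 0.2, 0.10, 0.1],   # close / open
           [0.3, 0.1, 0.05, 0.2],   # swim / water
           [0.1, 0.0, 0.01, 0.0]]   # girl / play
y_train = CLASSES

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.predict([[0.85, 0.7, 0.9, 0.8]]))  # likely "Equivalent" under this toy model
```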
Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.
The council considered environmental consequences.
modifiers
INS(environmental)
Denotational Semantics
consequences
environmental consequences
irreversible environmental consequences
Does environmental consequence entail consequence?
Does consequence entail environmental consequence?
Natural Language Inference
Natural Language Inference. (MacCartney PhD Thesis 2009)
No man is pointing at a car.
A man is pointing at a silver sedan.
A man is pointing at a silver car.
A man is pointing at a car.
SUB(sedan, car) ⊏
DEL(silver) ⊏
yes
yes
From image descriptions to visual denotations. (Young et al. TACL 2014)
Entailment above the word level in distributional semantics. (Baroni et al. EACL 2012)
BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment (Stern and Dagan. ACL 2012)
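A minimal sketch of the MacCartney-style edit composition behind the slide above: each atomic edit projects a basic entailment relation, and relations compose via a join table. Only the handful of cases needed for this example are filled in, and the helper names are ours, not NatLog's actual API:

```python
EQ, FWD, REV, IND = "≡", "⊏", "⊐", "#"

JOIN = {
    (EQ, FWD): FWD,   # starting from equivalence, a ⊏ edit yields ⊏
    (FWD, FWD): FWD,  # ⊏ composed with ⊏ stays ⊏
    (FWD, REV): IND,  # mixed directions weaken toward independence
}

def project(edit):
    """Relation introduced by a single edit on the premise."""
    kind, _ = edit
    if kind == "SUB":  # substitute a word with a hypernym, e.g. sedan -> car
        return FWD
    if kind == "DEL":  # delete a restrictive modifier, e.g. silver
        return FWD
    return IND

def compose(edits):
    rel = EQ
    for edit in edits:
        rel = JOIN.get((rel, project(edit)), IND)
    return rel

# "A man is pointing at a silver sedan." -> "A man is pointing at a car."
print(compose([("SUB", ("sedan", "car")), ("DEL", "silver")]))  # ⊏
```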
Denotational Semantics
Does consequence entail environmental consequence? ✘
The policy will have consequences.
Does consequence entail environmental consequence? ✔
The contaminated land will have consequences.
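A toy illustration of denotational entailment in the spirit of the visual denotations of Young et al. (2014): a phrase denotes the set of situations (here, arbitrary ids) it truthfully describes, and x entails y exactly when every situation described by x is also described by y. The sets below are invented for the sketch:

```python
denotation = {
    "consequences":               {1, 2, 3, 4, 5},
    "environmental consequences": {2, 3},
}

def entails(x, y):
    return denotation[x] <= denotation[y]  # subset test on denotations

print(entails("environmental consequences", "consequences"))  # True
print(entails("consequences", "environmental consequences"))  # False in general
```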
Natural Logic Entailment Relations
N does not entail AN:
- Forward Entailment (AN ⊏ N), e.g. brown dog
- Alternation (N | AN), e.g. former senator
- Independent (N # AN), e.g. alleged criminal
N does entail AN:
- Equivalence (AN ≣ N), e.g. entire world
- Reverse Entailment (AN ⊐ N), e.g. possible winner
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
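A sketch mapping set relations between AN and N denotations to the natural logic labels on the slide; the toy sets below are stand-ins for real denotations:

```python
def natlog_relation(an, n):
    """Classify the relation between two denotation sets."""
    if an == n:
        return "Equivalence (AN ≣ N)"
    if an < n:
        return "Forward Entailment (AN ⊏ N)"
    if an > n:
        return "Reverse Entailment (AN ⊐ N)"
    if not (an & n):
        return "Alternation (N | AN)"
    return "Independent (N # AN)"

dogs, brown_dogs = {1, 2, 3}, {1, 2}
print(natlog_relation(brown_dogs, dogs))      # Forward Entailment: brown dog ⊏ dog
print(natlog_relation({1, 2}, {1, 2}))        # Equivalence: entire world ≣ world
print(natlog_relation({7, 8}, {1, 2, 3}))     # Alternation: former senator | senator
```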
Experimental Design
Premise: Wellers hopes the financial system will be fully operational by 2015.
Hypothesis: Wellers hopes the system will be fully operational by 2015.
Contradiction / Entailment / Can't Tell
Premise: Wellers hopes the system will be fully operational by 2015.
Hypothesis: Wellers hopes the financial system will be fully operational by 2015.
Contradiction / Entailment / Can't Tell
~200 human annotators
~5,000 sentences
4 genres
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
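A sketch of how the two annotation directions above could be generated from a sentence containing an adjective-noun pair; the function and its string manipulation are hypothetical simplifications of the actual item construction:

```python
def make_items(sentence_with_an, adjective):
    """Return (premise, hypothesis) pairs for the two annotation directions."""
    without = sentence_with_an.replace(adjective + " ", "", 1)
    return {
        "deletion":  (sentence_with_an, without),  # does the AN sentence entail the N sentence?
        "insertion": (without, sentence_with_an),  # does the N sentence entail the AN sentence?
    }

items = make_items(
    "Wellers hopes the financial system will be fully operational by 2015 .",
    "financial",
)
for direction, (premise, hypothesis) in items.items():
    print(direction, "|", premise, "=>", hypothesis)
```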
Empirical Analysis
Distribution of adjective-noun entailment relations by genre (labels: AN < N, AN = N, AN # N, AN > N, AN | N, None). The restrictive reading (AN < N) is the largest share in every genre, but by varying margins: Image Captions 85% (remaining shares 8%, 3%, 3%), Literature 61% (29%, 5%, 4%), News 57% (26%, 7%, 7%), Forums 47% (37%, 7%, 7%).
Up to 53% error rate by assuming all adjectives are restrictive: in Forums only 47% of pairs are AN < N, so treating every adjective as restrictive can mislabel the remaining 53%.
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
When does N entail AN?
Sometimes, it is a property of the adjective and noun… (probably sandy / probably not sandy)
…or even just the adjective.
Other times, it is a property of the context + world knowledge.
Equivalence (AN ≣ N):
The [deadly] attack killed at least 12 civilians.
The [entire] bill is now subject to approval by the parliament.
Alternation (N | AN):
Red numbers spelled out their [perfect] record: 9-2.
Schilling even stayed busy after serving Epstein turkey at his [former] home on Thursday.
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
When does N NOT entail AN?
Undefined Relations: No N is an AN, but every AN is an N.
Bush travels Monday to Michigan to make remarks on the [Japanese] economy.
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
When should N NOT entail AN?
A dictionary of nonsubsective adjectives. (Nayak et al. Tech Report 2014)
fake, former, artificial, counterfeit, possible, probable, unlikely, likely, …
Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.
The council considered environmental consequences.
(non-subsective) modifiers
DEL(possible)
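A sketch of the move this slide suggests: checking a modifier against a list of non-subsective adjectives before licensing a DEL(modifier) inference. The set below is a small fragment of the kind of dictionary Nayak et al. (2014) provide, and the function is ours, not part of any released tool:

```python
NON_SUBSECTIVE = {"fake", "former", "artificial", "counterfeit",
                  "possible", "probable", "unlikely", "likely"}

def deletion_is_safe(adjective):
    """DEL(A) on 'A N' is only guaranteed truth-preserving for subsective A."""
    return adjective.lower() not in NON_SUBSECTIVE

print(deletion_is_safe("silver"))    # True: silver sedan entails sedan
print(deletion_is_safe("possible"))  # False: possible effects need not be effects
```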
Privative (e.g. fake) | Plain Non-subsective (e.g. alleged) | Subsective (e.g. red)
Figure 1: Three main classes of adjectives. If their entailment behavior is consistent with their theoretical definitions, we would expect our annotations (Section 3) to produce the insertion (blue) and deletion (red) patterns shown by the bar graphs. Bars (left to right) represent CONTRADICTION, UNKNOWN, and ENTAILMENT.
While these generalizations are intuitive, there is little experimental evidence to support them. In this paper, we collect human judgements of the validity of inferences following from the insertion and deletion of various classes of adjectives and analyze the results. Our findings suggest that, in practice, most sentences involving non-subsective ANs can be safely generalized to statements about the N. That is, non-subsective adjectives often behave like normal, subsective adjectives. On further analysis, we reveal that, when adjectives do behave non-subsectively, they often exhibit asymmetric entailment behavior in which insertion leads to contradictions (ID ⇒ ¬ fake ID) but deletion leads to entailments (fake ID ⇒ ID). We present anecdotal evidence for how the entailment associated with inserting/deleting a non-subsective adjective depends on the salient properties of the noun phrase under discussion, rather than on the adjective itself.
2 Background and Related Work
Classes of Adjectives. Adjectives are commonly classified taxonomically as either subsective or non-subsective (Kamp and Partee, 1995). Subsective adjectives are adjectives which pick out a subset of the set denoted by the unmodified noun; that is, AN ⊆ N.¹ For non-subsective adjectives, in contrast, the AN cannot be guaranteed to be a subset of N. For example, clever is subsective, and so a clever thief is always a thief. However, alleged is non-subsective, so there are many possible worlds in which an alleged thief is not in fact a thief. Of course, there may also be many possible worlds in which the alleged thief is a thief, but the word alleged, being non-subsective, does not guarantee this to hold.

¹We use the notation N and AN to refer both to the natural language expression itself (e.g. red car) as well as its denotation, e.g. {x | x is a red car}.
Non-subsective adjectives can be further divided into two classes: privative and plain. Sets denoted by privative ANs are completely disjoint from the set denoted by the head N (AN ∩ N = ∅), and this mutual exclusivity is encoded in the meaning of the A itself. For example, fake is considered to be a quintessential privative adjective since, given the usual definition of fake, a fake ID can not actually be an ID. For plain non-subsective adjectives, there may be worlds in which the AN is an N, and worlds in which the AN is not an N: neither inference is guaranteed by the meaning of the A. As mentioned above, alleged is quintessentially plain non-subsective since, for example, an alleged thief may or may not be an actual thief. In short, we can summarize the classes of adjectives in the following way: subsective adjectives entail the nouns they modify, privative adjectives contradict the nouns they modify, and plain non-subsective adjectives are compatible with (but do not entail) the nouns they modify. Figure 1 depicts these distinctions.
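Restated in set notation (a restatement of the paragraph above, using AN and N for the denotations as in the footnote):

```latex
\begin{align*}
\text{subsective } A\colon           &\quad AN \subseteq N \\
\text{privative } A\colon            &\quad AN \cap N = \emptyset \\
\text{plain non-subsective } A\colon &\quad \text{neither } AN \subseteq N \text{ nor } AN \cap N = \emptyset \text{ is guaranteed}
\end{align*}
```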
While the hierarchical classification of adjectives described above is widely accepted and often applied in NLP tasks (Amoia and Gardent, 2006; Amoia and Gardent, 2007; Boleda et al., 2012; McCrae et al., 2014), it is not undisputed. Some linguists take the position that in fact privative ad-
predicted: She was the expected winner. → She was the winner.
So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)
(a) Privative (b) Plain Non-Sub. (c) Subsective
Figure 2: Observed entailment judgements for insertion (blue) and deletion (red) of adjectives. Compare to expected distributions in Figure 1.
to receive labels of UNKNOWN in both directions. We expect the subsective adjectives to receive labels of ENTAILMENT in the deletion direction (red car ⇒ car) and labels of UNKNOWN in the insertion direction (car ⇏ red car). Figure 1 depicts these expected distributions.
Observations. The observed entailment patterns for insertion and deletion of non-subsective adjectives are shown in Figure 2. Our control sample of subsective adjectives (Figure 2c) largely produced the expected results, with 96% of deletions producing ENTAILMENTs and 73% of insertions producing UNKNOWNs.³ The entailment patterns produced by the non-subsective adjectives, however, did not match our predictions. The plain non-subsective adjectives (e.g. alleged) behave nearly identically to how we expect regular, subsective adjectives to behave (Figure 2b). That is, in 80% of cases, deleting the plain non-subsective adjective was judged to produce ENTAILMENT, rather than the expected UNKNOWN. The examples in Table 2 shed some light onto why this is the case. Often, the differences between N and AN are not relevant to the main point of the utterance. For example, while an expected surge in unemployment is not a surge in unemployment, a policy that deals with an expected surge deals with a surge.
The privative adjectives (e.g. fake) also fail to match the predicted distribution. While insertions often produce the expected CONTRADICTIONs, deletions produce a surprising number of ENTAILMENTs (Figure 2a). Such a pattern does not fit into any of the adjective classes from Figure 1. While some ANs (e.g. counterfeit money) behave in the prototypically privative way, others
³A full discussion of the 27% of insertions that deviated from the expected behavior is given in Pavlick and Callison-Burch (2016).
(1) Swiss officials on Friday said they've launched an investigation into Urs Tinner's alleged role.
(2) To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.
(3) They kept it close for a half and had a theoretical chance come the third quarter.

Table 2: Contrary to expectations, the deletion of plain non-subsective adjectives often preserves the (plausible) truth in a model. E.g. alleged role ⇏ role, but investigation into alleged role ⇒ investigation into role.
(e.g. mythical beast) have the property in which N ⇒ ¬AN, but AN ⇒ N (Figure 3). Table 3 provides some telling examples of how this AN ⇒ N inference, in the case of privative adjectives, often depends less on the adjective itself, and more on properties of the modified noun that are at issue in the given context. For example, in Table 3 Example 2(a), a mock debate probably contains enough of the relevant properties (namely, arguments) that it can entail debate, while in Example 2(b), a mock execution lacks the single most important property (the death of the executee) and so cannot entail execution. (Note that, from Example 3(b), it appears the jury is still out on whether leaps in artificial intelligence entail leaps in intelligence...)
5 Discussion
The results presented suggest a few important patterns for NLP systems. First, that while a non-subsective AN might not be an instance of the N (taxonomically speaking), statements that are true of an AN are often true of the N as well. This is relevant for IE and QA systems, and is likely to become more important as NLP systems focus more on "micro reading" tasks (Nakashole and Mitchell, 2014), where facts must be inferred from single documents or sentences, rather than by exploiting
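The asymmetric pattern described above can be stated as a crude decision rule. A sketch, where the adjective list is a tiny fragment and the rule deliberately ignores the context-dependence the paper emphasizes:

```python
PRIVATIVE = {"fake", "counterfeit", "mythical"}

def predict(direction, adjective):
    """Label an insertion/deletion inference for a non-subsective adjective."""
    if direction == "deletion":          # "A N" sentence => "N" sentence
        return "ENTAILMENT"              # the majority outcome observed above
    if adjective.lower() in PRIVATIVE:   # insertion of a privative adjective
        return "CONTRADICTION"           # ID => not a fake ID
    return "UNKNOWN"                     # plain non-subsective insertion

print(predict("deletion", "alleged"))  # ENTAILMENT: the alleged role case in Table 2
print(predict("insertion", "fake"))    # CONTRADICTION
```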
Actually behave like normal, subsective adjectives (e.g. red).
observed: She was the expected winner. → She was the winner.
So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)
To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.
She was carrying a fake gun. → She was carrying a gun.
Don’t behave symmetrically.
The 27-year-old Gazan seeks an id to get through security checkpoints and find work in Cairo.
Does he seek a fake id? ✘
ANN
Privative (e.g. fake)
Plain Non-subsective (e.g. alleged)
ANN AN
Subsective (e.g. red)
Figure 1: Three main classes of adjectives. If their entailment behavior is consistent with their theoreticaldefinitions, we would expect our annotations (Section 3) to produce the insertion (blue) and deletion(red) patterns shown by the bar graphs. Bars (left to right) represent CONTRADICTION, UNKNOWN, andENTAILMENT
While these generalizations are intuitive, thereis little experimental evidence to support them.In this paper, we collect human judgements ofthe validity of inferences following from the in-sertion and deletion of various classes of adjec-tives and analyze the results. Our findings suggestthat, in practice, most sentences involving non-subsective ANs can be safely generalized to state-ments about the N. That is, non-subsective adjec-tives often behave like normal, subsective adjec-tives. On further analysis, we reveal that, whenadjectives do behave non-subsectively, they oftenexhibit asymmetric entailment behavior in whichinsertion leads to contradictions (ID ) ¬ fake ID)but deletion leads to entailments (fake ID ) ID).We present anecdotal evidence for how the en-tailment associated with inserting/deleting a non-subsective adjective depends on the salient prop-erties of the noun phrase under discussion, ratherthan on the adjective itself.
2 Background and Related Work
Classes of Adjectives. Adjectives are com-monly classified taxonomically as either subsec-tive or non-subsective (Kamp and Partee, 1995).Subsective adjectives are adjectives which pickout a subset of the set denoted by the unmodifiednoun; that is, AN ⇢ N1. For non-subsective adjec-tives, in contrast, the AN cannot be guaranteed tobe a subset of N. For example, clever is subsective,and so a clever thief is always a thief. However,
1We use the notation N and AN to refer both the the nat-ural language expression itself (e.g. red car) as well as itsdenotation, e.g. {x|x is a red car}.
alleged is non-subsective, so there are many pos-sible worlds in which an alleged thief is not in facta thief. Of course, there may also be many possi-ble worlds in which the alleged thief is a thief, butthe word alleged, being non-subsective, does notguarantee this to hold.
Non-subsective adjectives can be further di-vided into two classes: privative and plain. Setsdenoted by privative ANs are completely disjointfrom the set denoted by the head N (AN \ N =;), and this mutual exclusivity is encoded in themeaning of the A itself. For example, fake is con-sidered to be a quintessential privative adjectivesince, given the usual definition of fake, a fake IDcan not actually be an ID. For plain non-subsectiveadjectives, there may be worlds in which the ANis and N, and worlds in which the AN is not an N:neither inference is guaranteed by the meaning ofthe A. As mentioned above, alleged is quintessen-tially plain non-subsective since, for example, analleged thief may or may not be an actual thief.In short, we can summarize the classes of adjec-tives in the following way: subsective adjectivesentail the nouns they modify, privative adjectivescontradict the nouns they modify, and plain non-subsective adjectives are compatible with (but donot entail) the nouns they modify. Figure 1 depictsthese distinctions.
While the hierarchical classification of adjectives described above is widely accepted and often applied in NLP tasks (Amoia and Gardent, 2006; Amoia and Gardent, 2007; Boleda et al., 2012; McCrae et al., 2014), it is not undisputed. Some linguists take the position that in fact privative adjectives…
Figure 2: Observed entailment judgements for insertion (blue) and deletion (red) of adjectives, for (a) privative, (b) plain non-subsective, and (c) subsective classes. Compare to expected distributions in Figure 1.
to receive labels of UNKNOWN in both directions. We expect the subsective adjectives to receive labels of ENTAILMENT in the deletion direction (red car ⇒ car) and labels of UNKNOWN in the insertion direction (car ⇏ red car). Figure 1 depicts these expected distributions.
Observations. The observed entailment patterns for insertion and deletion of non-subsective adjectives are shown in Figure 2. Our control sample of subsective adjectives (Figure 2c) largely produced the expected results, with 96% of deletions producing ENTAILMENTs and 73% of insertions producing UNKNOWNs.³ The entailment patterns produced by the non-subsective adjectives, however, did not match our predictions. The plain non-subsective adjectives (e.g. alleged) behave nearly identically to how we expect regular, subsective adjectives to behave (Figure 2b). That is, in 80% of cases, deleting the plain non-subsective adjective was judged to produce ENTAILMENT, rather than the expected UNKNOWN. The examples in Table 2 shed some light onto why this is the case. Often, the differences between N and AN are not relevant to the main point of the utterance. For example, while an expected surge in unemployment is not a surge in unemployment, a policy that deals with an expected surge deals with a surge.
The privative adjectives (e.g. fake) also fail to match the predicted distribution. While insertions often produce the expected CONTRADICTIONs, deletions produce a surprising number of ENTAILMENTs (Figure 2a). Such a pattern does not fit into any of the adjective classes from Figure 1. While some ANs (e.g. counterfeit money) behave in the prototypically privative way, others
³ A full discussion of the 27% of insertions that deviated from the expected behavior is given in Pavlick and Callison-Burch (2016).
(1) Swiss officials on Friday said they've launched an investigation into Urs Tinner's alleged role.
(2) To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.
(3) They kept it close for a half and had a theoretical chance come the third quarter.
Table 2: Contrary to expectations, the deletion of plain non-subsective adjectives often preserves the (plausible) truth in a model. E.g. alleged role ⇏ role, but investigation into alleged role ⇒ investigation into role.
(e.g. mythical beast) have the property in which N ⇒ ¬AN, but AN ⇒ N (Figure 3). Table 3 provides some telling examples of how this AN ⇒ N inference, in the case of privative adjectives, often depends less on the adjective itself, and more on properties of the modified noun that are at issue in the given context. For example, in Table 3 Example 2(a), a mock debate probably contains enough of the relevant properties (namely, arguments) that it can entail debate, while in Example 2(b), a mock execution lacks the single most important property (the death of the executee) and so cannot entail execution. (Note that, from Example 3(b), it appears the jury is still out on whether leaps in artificial intelligence entail leaps in intelligence...)
5 Discussion
The results presented suggest a few important patterns for NLP systems. First, that while a non-subsective AN might not be an instance of the N (taxonomically speaking), statements that are true of an AN are often true of the N as well. This is relevant for IE and QA systems, and is likely to become more important as NLP systems focus more on "micro reading" tasks (Nakashole and Mitchell, 2014), where facts must be inferred from single documents or sentences, rather than by exploiting…
So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)
The 27-year-old Gazan seeks a fake id to get through security checkpoints and find work in Cairo.
Does he seek an id? ✔
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
Does N entail AN?
Bush travels Monday to Michigan to make remarks on the economy.
Bush travels Monday to Michigan to make remarks on the American economy. ✔
Bush travels Monday to Michigan to make remarks on the Japanese economy. ✘
5,378 naturally-occurring sentences
4,868 train / 510 test
2,990 unique ANs
Disjoint ANs in train/test
Does N entail AN?
[Results bar chart, reconstructed] Accuracy (%):
Human: 93.3
Most Freq. Class: 85.3
MFC by Adj.: 92.2
BOW: 86, BOV: 86.6 (pretty strong baselines)
MaxEnt+LR: 85.3 (supervised model using involved NLP pipeline; Magnini et al. 2014)
LSTM: 86.6 (DNN model; Bowman et al. 2015)
LSTM Trans.: 86.8 (DNN model with transfer learning; Bowman et al. 2015)
Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
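The strength of the MFC-by-adjective baseline (92.2%) is easy to reproduce in spirit: memorize each adjective's most frequent entailment label in training, and back off to the global majority class for unseen adjectives. A minimal sketch, with hypothetical label names (the paper's labels are fine-grained entailment relations):

```python
from collections import Counter, defaultdict

def train_mfc_by_adj(examples):
    """examples: iterable of (adjective, label) pairs from the training split.
    Returns a predictor mapping an adjective to its most frequent training
    label, backing off to the global majority class for unseen adjectives."""
    by_adj = defaultdict(Counter)
    overall = Counter()
    for adj, label in examples:
        by_adj[adj][label] += 1
        overall[label] += 1
    global_mfc = overall.most_common(1)[0][0]

    def predict(adj):
        counts = by_adj.get(adj)
        return counts.most_common(1)[0][0] if counts else global_mfc

    return predict

# Hypothetical usage; the label strings here are placeholders.
predict = train_mfc_by_adj([("red", "FORWARD"), ("red", "FORWARD"),
                            ("fake", "INDEPENDENT"), ("alleged", "FORWARD")])
print(predict("red"), predict("unseen"))  # FORWARD FORWARD
```

This works as well as it does because the train/test split keeps ANs disjoint but not adjectives, so most test adjectives have been seen with other nouns in training.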
Summary
• Humans are flexible with their language; they don't abide by hard-and-fast logical rules of composition and entailment.
• We can acquire lexical entailments at scale…
• …but lexical entailments are not enough. We need composition in order to model unseen phrases and full sentences.
• Inference involving composition is too complex to capture using simple heuristics, and requires models to incorporate context and common sense when performing reasoning.
Current Work
Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.
The council considered environmental consequences.
predicates: SUB(consider, consider)
Current Work-in-progress.
Projection through Predicates
Input → Output:
in France ⊏ in Europe (Forward)
in Europe ⊐ in France (Reverse)
in France | in Germany (Alternation)
in France # in the city (Independent)

Projection through example predicates:
born(in France) ⊏ born(in Europe)
born(in Europe) ⊐ born(in France)
born(in France) | born(in Germany)
born(in France) # born(in the city)
has traveled(in France) # has traveled(in Germany)
is banned(in France) ⊐ is banned(in Europe)
Current Work-in-progress.
Projection through Predicates
Predicate classes, with example verbs:
Identity: is a, lives in, was born in, is married to
Non-Exclusive, Identity: has a, considers, has visited, likes
Non-Exclusive, Negation: lacks a, avoids, banned, dislikes
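A minimal sketch of how this projection behavior might be encoded, using MacCartney-style relation symbols. The identity row matches the born(...) examples above; cells not attested on the slides (e.g. how the negation class treats alternation) are our assumptions:

```python
# Sketch of relation projection through predicates. The 'identity' row
# follows the born(...) examples; the other rows extrapolate from the
# has-traveled / is-banned examples, so treat unattested cells as assumptions.

FORWARD, REVERSE, ALTERNATION, INDEPENDENT = "⊏", "⊐", "|", "#"

PROJECTION = {
    "identity": {FORWARD: FORWARD, REVERSE: REVERSE,
                 ALTERNATION: ALTERNATION, INDEPENDENT: INDEPENDENT},
    "non_exclusive_identity": {FORWARD: FORWARD, REVERSE: REVERSE,
                               ALTERNATION: INDEPENDENT, INDEPENDENT: INDEPENDENT},
    "non_exclusive_negation": {FORWARD: REVERSE, REVERSE: FORWARD,
                               ALTERNATION: INDEPENDENT, INDEPENDENT: INDEPENDENT},
}

def project(predicate_class: str, arg_relation: str) -> str:
    """Relation between P(x) and P(y), given the relation between x and y."""
    return PROJECTION[predicate_class][arg_relation]

# born(in France) vs born(in Europe): identity predicates preserve ⊏
assert project("identity", FORWARD) == FORWARD
# is banned(in France) vs is banned(in Europe): negation-like predicates flip ⊏ to ⊐
assert project("non_exclusive_negation", FORWARD) == REVERSE
# has traveled(in France) vs has traveled(in Germany): alternation weakens to #
assert project("non_exclusive_identity", ALTERNATION) == INDEPENDENT
```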
Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.
The council considered environmental consequences.
"higher order" predicates: DEL(fail to)
Implicative Verbs
Implicative Verbs. (Karttunen Language 1971)
She managed to fix the bug. Did she fix the bug? ✔
She wanted to fix the bug. Did she fix the bug? ✘

A tense diagnostic: an implicative's complement must share the matrix verb's time reference, while a non-implicative's need not.
She managed to fix the bug last night. ✔
She managed to fix the bug tomorrow. ✘
She wanted to fix the bug last night. ✔
She wanted to fix the bug tomorrow. ✔

venture to, forget to, manage to, bother to, happen to, get to, decide to, dare to, try to, agree to, promise to, want to, intend to, plan to, hope to

Gallager chose to accept a full scholarship to play football for Temple University.
Wilkins was allowed to leave in 1987 to join French outfit Paris Saint-Germain.

Annotation task: She managed to fix the bug. → She fixed the bug.
Contradiction / Entailment / Can't Tell

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)
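Karttunen-style signatures make the implicative/non-implicative distinction operational: each verb licenses an inference about its complement under a positive and a negated matrix clause. A minimal sketch; the manage to and fail to signatures are from Karttunen (1971), and any other assignment here is illustrative:

```python
# Karttunen-style implicative signatures: what the matrix verb licenses about
# its complement when the sentence is positive vs. negated. '+' = complement
# true, '-' = complement false, None = can't tell. manage/fail are from
# Karttunen (1971); treat other entries as illustrative assumptions.

SIGNATURES = {
    "manage to": ("+", "-"),   # managed to fix => fixed; didn't manage => didn't fix
    "fail to":   ("-", "+"),   # failed to consider => didn't consider
    "want to":   (None, None), # non-implicative: wanting to fix says nothing
    "hope to":   (None, None),
}

def complement_inference(verb: str, negated: bool):
    """Return '+', '-', or None for the truth of the complement clause."""
    pos, neg = SIGNATURES.get(verb, (None, None))
    return neg if negated else pos

# "The council failed to consider X" => "The council considered X" is false,
# so DEL(fail to) yields a contradiction rather than an entailment.
assert complement_inference("fail to", negated=False) == "-"
assert complement_inference("manage to", negated=False) == "+"
assert complement_inference("want to", negated=False) is None
```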
Implicative Verbs
(0.14) UFJ wants to merge with Mitsubishi, a combination that'd surpass Citigroup as the world's biggest bank. ⇏ The merger of Japanese Banks creates the world's biggest bank.
(0.55) After graduating, Gallager chose to accept a full scholarship to play football for Temple University. ⇒ Gallager attended Temple University.
(0.68) Wilkins was allowed to leave in 1987 to join French outfit Paris Saint-Germain. ⇒ Wilkins departed Milan in 1987.
Table 3: Examples from the RTE3 dataset (Giampiccolo et al., 2007) which require recognizing implicative behavior, even in verbs that are not implicative by definition. The tendency of certain verbs (e.g. be allowed) to behave as de facto implicatives is captured surprisingly well by the tense agreement score (shown in parentheses).
the main verb of a sentence, the truth of the complement could only be inferred 30% of the time. When a verb with high tense agreement appeared as the main verb, the truth of the complement could be inferred 91% of the time. This difference is significant at p < 0.01. That is, tense agreement provides a strong signal for identifying non-implicative verbs, and thus can help systems avoid false-positive entailment judgements, e.g. incorrectly inferring that wanting to merge ⇒ merging.
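As a rough illustration (not the paper's exact feature set), a tense agreement score can be approximated by checking, over corpus instances of "V to X", whether the complement's time reference matches the matrix verb's tense:

```python
def tense_agreement(instances):
    """Rough sketch of a tense agreement score (the paper's exact features
    differ). instances: (matrix_tense, complement_time) pairs, e.g.
    ("past", "past") from 'managed to fix the bug last night'. The score is
    the fraction of instances whose complement time matches the matrix tense.
    Implicatives like 'manage to' should score high, since 'managed to fix
    the bug tomorrow' is infelicitous, while 'want to' allows mismatches."""
    if not instances:
        return 0.0
    agree = sum(1 for matrix, comp in instances if matrix == comp)
    return agree / len(instances)

# Hypothetical counts for two verbs:
print(tense_agreement([("past", "past")] * 9 + [("past", "future")]))      # 0.9, implicative-like
print(tense_agreement([("past", "past")] * 5 + [("past", "future")] * 5))  # 0.5, non-implicative-like
```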
Figure 1: Whether or not the complement is entailed for main verbs with varying levels of tense agreement. Verbs with high tense agreement yield more definitive judgments (true/false). Each bar represents aggregated judgements over approx. 20 verbs.
Interestingly, tense agreement accurately models verbs that are not implicative by definition, but which nonetheless tend to behave implicatively in practice. For example, our method finds high tense agreement for choose to and be allowed to, which are often used to communicate, albeit indirectly, that their complements did in fact happen. To convince ourselves that treating such verbs as implicatives makes sense in practice, we manually look through the RTE3 dataset (Giampiccolo et al., 2007) for examples containing high-scoring verbs according to our method. Table 3 shows some example inferences that hinge precisely on recognizing these types of de facto implicatives.
5 Discussion and Related Work
Language understanding tasks such as RTE (Clark et al., 2007; MacCartney, 2009) and bias detection (Recasens et al., 2013) have been shown to require knowledge of implicative verbs, but such knowledge has previously come from manually-built word lists rather than from data. Nairn et al. (2006) and Martin et al. (2009) describe automatic systems to handle implicatives, but require hand-crafted rules for each unique verb that is handled. The tense agreement method we present offers a starting point for acquiring such rules from data, and is well-suited for incorporating into statistical systems. The clear next step is to explore similar data-driven means for learning the specific behaviors of individual implicative verbs, which has been well-studied from a theoretical perspective (Karttunen, 1971; Nairn et al., 2006; Amaral et al., 2012; Karttunen, 2012). Another interesting extension concerns the role of tense in word representations. While currently, tense is rarely built directly into distributional representations of words (Mikolov et al., 2013; Pennington et al., 2014), our results suggest it may offer important insights into the semantics of individual words. We leave this question as a direction for future work.
6 Conclusion
Differentiating between implicative and non-implicative verbs is important for discriminating inferences that can and cannot be made in natural language. We have presented a data-driven method that captures the implicative tendencies of verbs by exploiting the tense relationship between the verb and its complement clauses. This method effectively separates known implicatives from known non-implicatives, and, more importantly, provides good predictive signal in an entailment recognition task.
Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)
High tense agreement strongly predicts whether the truth of the complement can be inferred.
Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)
Discussion
• Humans are flexible with their language. Computers need to be flexible too.
• What aspects of meaning do we expect our semantic representations to have built in? What do we expect to handle in context, at runtime?
• What types of semantic tasks do we need to optimize for explicitly? Shouldn't some things come "for free" when we train for harder tasks?
Bibliography
Bannard and Callison-Burch. Paraphrasing with bilingual parallel corpora. ACL 2005.
Bos. Wide-coverage semantic analysis with boxer. STEP 2008.
Bowman et al. A large annotated corpus for learning natural language inference. EMNLP 2015.
Ganitkevitch et al. PPDB: The paraphrase database. NAACL 2013.
Hearst. Automatic acquisition of hyponyms from large text corpora. COLING 1992.
Karttunen. Implicative Verbs. Language 1971.
Lin and Pantel. Discovery of Inference Rules from Text. SIGKDD 2001.
MacCartney. Natural Language Inference. PhD Thesis 2009.
Padó et al. Design and realization of a modular architecture for textual entailment. NLE 2014.
Pavlick and Callison-Burch. Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns. ACL 2016.
Pavlick and Callison-Burch. So-Called Non-Subsective Adjectives. *SEM 2016.
Pavlick and Callison-Burch. Tense Manages to Predict Implicative Behavior in Verbs. EMNLP 2016.
Pavlick et al. Adding Semantics to Data-Driven Paraphrasing. ACL 2015.
Pavlick et al. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. ACL 2015.
Pavlick et al. Domain-Specific Paraphrase Extraction. ACL 2015.
Rocktäschel et al. Reasoning about Entailment with Neural Attention. ICLR 2016.
Snow et al. Semantic taxonomy induction from heterogenous evidence. ACL 2006.
Thank you!