
Page 1

Natural Language Inference in the Real World
Ellie Pavlick
University of Pennsylvania

Page 2

Natural Language Inference

Page 3

Natural Language Inference

A man is pointing at a silver sedan.

Page 4

Natural Language Inference

There is no man pointing at a car.

A man is pointing at a silver sedan.


Page 8

Natural Language Inference

There is no man pointing at a car.

A man is pointing at a silver sedan.

Figure 1 (Bos 2008): Derivation with λ-DRSs, including β-conversion, for "A record date". Combinatory rules are indicated by solid lines, semantic rules by dotted lines.

… DRS composed by the α-operator. The merge conjoins two DRSs into a larger DRS; semantically, the merge is interpreted as (dynamic) logical conjunction. Merge-reduction is the process of eliminating the merge operation by forming a new DRS resulting from the union of the domains and conditions of the argument DRSs of a merge, respectively (obeying certain constraints). Figure 1 illustrates the syntax-semantics interface (and merge-reduction) for a derivation of a simple noun phrase.

Boxer adopts Van der Sandt's view of presupposition as anaphora (Van der Sandt, 1992), in which presuppositional expressions are either resolved to previously established discourse entities or accommodated on a suitable level of discourse. Van der Sandt's proposal is cast in DRT, and is therefore relatively easy to integrate into Boxer's semantic formalism. The α-operator indicates information that has to be resolved in the context, and is lexically introduced by anaphoric or presuppositional expressions. A DRS constructed with α resembles the proto-DRS of Van der Sandt's theory of presupposition (Van der Sandt, 1992), although they are syntactically defined in a slightly different way to overcome problems with free and bound variables, following Bos (2003). Note that the difference between anaphora and presupposition collapses in Van der Sandt's theory.

The types are the ingredients of a typed lambda calculus that is employed to construct DRSs in a bottom-up, compositional way. The language of lambda- …

Bos (2008)
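To make the merge-reduction step above concrete, here is a minimal Python sketch, assuming a DRS is represented simply as a set of discourse referents plus a set of condition strings. The names (DRS, merge_reduce) are illustrative, and the sketch ignores α-DRSs, variable renaming, and the binding constraints that a real implementation such as Boxer must respect.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DRS:
    """A Discourse Representation Structure: a domain of referents plus conditions."""
    referents: frozenset = field(default_factory=frozenset)
    conditions: frozenset = field(default_factory=frozenset)

def merge_reduce(d1: DRS, d2: DRS) -> DRS:
    """Eliminate the merge by taking the union of domains and of conditions."""
    return DRS(d1.referents | d2.referents, d1.conditions | d2.conditions)

# "A record date": the determiner contributes referent x, the modified noun
# contributes referent y and the conditions shown in Figure 1.
determiner = DRS(frozenset({"x"}), frozenset())
noun_phrase = DRS(frozenset({"y"}), frozenset({"record(y)", "nn(y,x)", "date(x)"}))
print(merge_reduce(determiner, noun_phrase))
# referents {x, y}; conditions record(y), nn(y,x), date(x)
```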

Page 9

Natural Language Inference

There is no man pointing at a car.

A man is pointing at a silver sedan.


Bos (2008)

Fig. 6. System decomposition for the transformation-based BIUTEE system (components shown: LAP with tokenizer, tagger, NER, parser, and coreference resolution; EDA with parse-tree derivation generation, tree-space search over derived trees, and a classifier over good candidates from T to H; syntactic and lexical knowledge components; entailment decision).

7 System Mapping

In this section, we show the practical usability of our architecture. We illustrate how three existing RTE systems can be mapped concretely onto our architecture: BIUTEE, a transformation-based system; EDITS, an alignment-based system; and TIE, a multi-stage classification system, which represent different approaches to RTE. We also discuss mapping for formal reasoning-based systems.

7.1 BIUTEE

BIUTEE (Stern and Dagan, 2011) is a system for recognizing textual entailment based on the transformation-based approach outlined in Section 2.2, developed at Bar-Ilan University (BIU).¹² The system derives the Hypothesis (H) from the Text (T) with a sequence of rewrite steps. Since BIUTEE represents H and T as dependency trees, these rewrite steps can either involve just lexical knowledge (node rewrites) or lexico-syntactic knowledge (subtree rewrites). Additional operations (e.g., the insertion of syntactic structure or modification of POS types) ensure the existence of a derivation for any T-H pair. The likelihood that a derivation preserves entailment is finally calculated by a confidence model.

Figure 6 shows the decomposition of BIUTEE in terms of the EXCITEMENT architecture. The LAP provides all required annotations, including parse trees. The EDA encapsulates the "top level" of the algorithm. It consists of three parts: a generation part where rewrite steps are performed, i.e., entailed trees are generated; a search part which selects good candidates among the generated derivations; and a classifier. The generation part relies on entailment rules which are stored modularly in (lexico-)syntactic knowledge components and lexical knowledge components. The …

¹² BIUTEE 2.4.1 is downloadable from http://www.cs.biu.ac.il/~nlp/downloads/biutee/

Padó et al. (2015)
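As a rough illustration of the transformation-based idea described above (a toy, not BIUTEE itself): rewrite the Text step by step with lexical entailment rules and multiply per-rule confidences to score a derivation that reaches the Hypothesis. The rules, confidence values, and function names below are invented for this example.

```python
RULES = {
    ("sedan", "car"): 0.95,       # hyponym -> hypernym substitution
    ("silver car", "car"): 0.90,  # dropping a restrictive modifier
}

def derive(text: str, hypothesis: str, max_steps: int = 4):
    """Breadth-first search over rewrites; returns (derivation steps, confidence)."""
    frontier = [(text, [], 1.0)]
    for _ in range(max_steps):
        next_frontier = []
        for sentence, steps, conf in frontier:
            if sentence == hypothesis:
                return steps, conf
            for (lhs, rhs), rule_conf in RULES.items():
                if lhs in sentence:
                    next_frontier.append((sentence.replace(lhs, rhs),
                                          steps + [f"{lhs} -> {rhs}"],
                                          conf * rule_conf))
        frontier = next_frontier
    return None, 0.0

steps, conf = derive("A man is pointing at a silver sedan.",
                     "A man is pointing at a car.")
print(steps, conf)  # ['sedan -> car', 'silver car -> car'] and roughly 0.855
```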

Page 10

Natural Language Inference

There is no man pointing at a car.

A man is pointing at a silver sedan.


Bos (2008)

Figure 1 (Rocktäschel et al. 2016): Recognizing textual entailment using (A) conditional encoding via two LSTMs, one over the premise and one over the hypothesis conditioned on the representation of the premise (c5), (B) attention only based on the last output vector (h9) or (C) word-by-word attention based on all output vectors of the hypothesis (h7, h8 and h9). Example pair: "A wedding party taking pictures" :: "Someone got married".

… state is initialized with the last cell state of the previous LSTM (c5 in the example), i.e. it is conditioned on the representation that the first LSTM built for the premise (A). We use word2vec vectors (Mikolov et al., 2013) as word representations, which we do not optimize during training. Out-of-vocabulary words in the training set are randomly initialized by sampling values uniformly from (−0.05, 0.05) and optimized during training.¹ Out-of-vocabulary words encountered at inference time on the validation and test corpus are set to fixed random vectors. By not tuning representations of words for which we have word2vec vectors, we ensure that at inference time their representation stays close to unseen similar words for which we have word2vec embeddings. We use a linear layer to project word vectors to the dimensionality of the hidden size of the LSTM, yielding input vectors x_i. Finally, for classification we use a softmax layer over the output of a non-linear projection of the last output vector (h9 in the example) into the target space of the three classes (ENTAILMENT, NEUTRAL or CONTRADICTION), and train using the cross-entropy loss.
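A minimal PyTorch sketch of the conditional-encoding setup described above: one LSTM reads the premise, and a second LSTM reads the hypothesis with its cell state initialized from the premise LSTM's final cell state. The class name and dimensions are illustrative; zero-initializing the second LSTM's hidden state is my assumption, since the excerpt only specifies the cell-state initialization.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    def __init__(self, embed_dim=100, hidden_dim=100, num_classes=3):
        super().__init__()
        self.premise_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.hypothesis_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                        nn.Linear(hidden_dim, num_classes))

    def forward(self, premise_emb, hypothesis_emb):
        # Read the premise and keep its final cell state (c_N of the first LSTM).
        _, (h_p, c_p) = self.premise_lstm(premise_emb)
        # Condition the hypothesis LSTM on the premise via its cell state.
        h0 = torch.zeros_like(h_p)  # assumption: hidden state starts at zero
        out, _ = self.hypothesis_lstm(hypothesis_emb, (h0, c_p))
        last_output = out[:, -1, :]           # h_N in the excerpt's notation
        return self.classifier(last_output)   # logits over the three classes

logits = ConditionalEncoder()(torch.randn(2, 9, 100), torch.randn(2, 4, 100))
print(logits.shape)  # torch.Size([2, 3])
```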

2.3 ATTENTION

Attentive neural networks have recently demonstrated success in a wide range of tasks ranging from handwriting synthesis (Graves, 2013), digit classification (Mnih et al., 2014), machine translation (Bahdanau et al., 2015), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015) and sentence summarization (Rush et al., 2015), to geometric reasoning (Vinyals et al., 2015). The idea is to allow the model to attend over past output vectors (see Figure 1 B), thereby mitigating the LSTM's cell state bottleneck. More precisely, an LSTM with attention for RTE does not need to capture the whole semantics of the premise in its cell state. Instead, it is sufficient to output vectors while reading the premise and accumulating a representation in the cell state that informs the second LSTM which of the output vectors of the premise it needs to attend over to determine the RTE class.

Let $Y \in \mathbb{R}^{k \times L}$ be a matrix consisting of output vectors $[h_1 \cdots h_L]$ that the first LSTM produced when reading the $L$ words of the premise, where $k$ is a hyperparameter denoting the size of embeddings and hidden layers. Furthermore, let $e_L \in \mathbb{R}^L$ be a vector of 1s and $h_N$ be the last output vector after the premise and hypothesis were processed by the two LSTMs respectively. The attention mechanism will produce a vector $\alpha$ of attention weights and a weighted representation $r$ of the premise via

$M = \tanh(W^y Y + W^h h_N \otimes e_L)$, with $M \in \mathbb{R}^{k \times L}$  (7)

$\alpha = \mathrm{softmax}(w^T M)$, with $\alpha \in \mathbb{R}^{L}$  (8)

$r = Y \alpha^T$, with $r \in \mathbb{R}^{k}$  (9)

¹ We found 12.1k words in SNLI for which we could not obtain word2vec embeddings, resulting in 3.65M tunable parameters.


Rocktäschel et al. (2016)
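A small NumPy sketch of equations (7)-(9), mainly to make the shapes concrete: Y stacks the L premise output vectors as columns, h_N is the last output vector, and W_y, W_h, w stand in for the trainable attention parameters (random here, purely for illustration).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(Y, h_N, W_y, W_h, w):
    """Y: k x L premise outputs; h_N: (k,); W_y, W_h: k x k; w: (k,)."""
    L = Y.shape[1]
    # (7) M = tanh(W_y Y + W_h h_N (x) e_L): broadcast h_N across the L columns.
    M = np.tanh(W_y @ Y + np.outer(W_h @ h_N, np.ones(L)))  # k x L
    alpha = softmax(w @ M)                                   # (8) weights over premise words
    r = Y @ alpha                                            # (9) weighted premise representation
    return alpha, r

k, L = 4, 6
rng = np.random.default_rng(0)
alpha, r = attend(rng.normal(size=(k, L)), rng.normal(size=k),
                  rng.normal(size=(k, k)), rng.normal(size=(k, k)), rng.normal(size=k))
assert np.isclose(alpha.sum(), 1.0) and r.shape == (k,)
```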


Page 11

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.


Page 13

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.

A man is pointing at a silver car. SUB(sedan, car) ⊏ yes

Page 14

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.

A man is pointing at a silver car. SUB(sedan, car) ⊏ yes

A man is pointing at a car. DEL(silver) ⊏ yes

Page 15

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.

A man is pointing at a silver car. SUB(sedan, car) ⊏ yes

A man is pointing at a car. DEL(silver) ⊏ yes

INS(no) ^ no
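Read as natural logic, the derivation above joins the relation contributed by each atomic edit. The sketch below hard-codes only the entries of MacCartney's join table needed for this example (⊏ joined with ⊏ stays ⊏; ⊏ joined with ^ yields |), and it ignores projection through quantifiers and other context, which the full system handles; the function and variable names are illustrative.

```python
EQ, FWD, NEG, ALT = "≡", "⊏", "^", "|"

# A fragment of the join table: how the cumulative relation composes with the
# relation introduced by the next edit.
JOIN = {
    (EQ, FWD): FWD,   # start state joined with ⊏ gives ⊏
    (FWD, FWD): FWD,  # ⊏ then ⊏ stays ⊏ (entailment is transitive)
    (FWD, NEG): ALT,  # ⊏ then ^ gives | (the result contradicts the premise)
}

def entailment_of(edits):
    """edits: (description, relation) pairs, applied left to right."""
    relation = EQ
    for name, rel in edits:
        relation = JOIN[(relation, rel)]
        verdict = "yes" if relation in (EQ, FWD) else "no"
        print(f"{name:<16} {rel}   cumulative: {relation}   entailed: {verdict}")
    return relation

# The slide's derivation, starting from "A man is pointing at a silver sedan."
entailment_of([("SUB(sedan, car)", FWD), ("DEL(silver)", FWD), ("INS(no)", NEG)])
```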

Page 16

Natural Language Inference

A man is pointing at a silver sedan.

Page 17

Natural Language Inference

A man is pointing at a silver sedan.

Jimmy Dean refused to dance without pants.

Natural Language Inference. (MacCartney PhD Thesis 2009)

Page 18

Natural Language Inference

A man is pointing at a silver sedan.

Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.


Page 21

Natural Language Inference

Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.

Question Answering

Did the council consider the environmental effects?

Yes/No

Page 22

Natural Language Inference

Summarization

They argued that the council didn’t consider environmental effects.

Last December they had argued that the council had failed to consider possible environmental effects of contaminated land at the site.

Page 23

Natural Language Inference

Dialogue

Remove from calendar?

I decided I don’t want to go to that party on Saturday.
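One way to read the last three slides: the question answering, summarization, and dialogue decisions can each be phrased as a premise-hypothesis check. The wrappers below assume a hypothetical nli(premise, hypothesis) function standing in for any trained NLI model; none of this is an API from the talk.

```python
def nli(premise: str, hypothesis: str) -> str:
    """Placeholder: plug in any trained NLI model returning
    'entailment', 'contradiction', or 'neutral'."""
    raise NotImplementedError

def answer_yes_no(passage: str, question_as_statement: str) -> str:
    # QA: "Did the council consider the environmental effects?" becomes the
    # hypothesis "The council considered the environmental effects."
    label = nli(passage, question_as_statement)
    return {"entailment": "Yes", "contradiction": "No"}.get(label, "Unknown")

def is_faithful_summary(source: str, summary: str) -> bool:
    # Summarization: a faithful summary should be entailed by its source.
    return nli(source, summary) == "entailment"

def should_remove_from_calendar(utterance: str, event: str) -> bool:
    # Dialogue: act only if the utterance entails that the user will not attend.
    return nli(utterance, f"The user will not attend {event}.") == "entailment"
```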

Page 24

Wish List

• Models of language that are descriptive rather than prescriptive based on judgements by real people rather than linguists
• Methods that are derived from data, rather than reliant on lexicons or ontologies
• Natural language itself as a logical form, rather than requiring an intermediate representation

Page 25

Wish List

• We want to get out of the "lab" and model language that people actually use
• We want models of language that are descriptive rather than prescriptive based on judgements by real people rather than linguists
• We want methods that are derived from data, rather than reliant on lexicons or ontologies

Page 26

Wish List

• We want to get out of the "lab" and model language that people actually use
• We want rules for logical composition that are descriptive rather than prescriptive based on judgements by real people
• We want methods that are derived from data, rather than reliant on lexicons or ontologies


Page 28

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.


Page 30

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

lexical semantics

SUB(effects, consequences)

Page 31

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

modifiers

INS(environmental)

Page 32

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

(non-subsective) modifiers

DEL(possible)

Page 33

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

predicates

SUB(consider, consider)

Page 34

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

“higher order” predicates

DEL(fail to)

Page 35

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

lexical semantics

SUB(effects, consequences)

Page 36

Lexical Semantics

Page 37

Lexical Semantics

WordNet

[WordNet noun hierarchy fragment: entity; physical entity; abstraction; person; male child; boy; Cub Scout; female child; girl; body of water; right; legal right; cabotage]

Page 38

Lexical Semantics

WordNet


a boy is a male child

Page 39

Lexical Semantics

WordNet


a boy is a male child
a Cub Scout is a boy

Page 40

Lexical Semantics

WordNet


a girl is NOT a boy
a boy is a male child
a Cub Scout is a boy

Page 41

Lexical Semantics


WordNet

Page 42

Lexical Semantics

Bilingual Pivoting (PPDB)

[Bilingual pivoting example: English phrases paired with their foreign-language alignments in parallel text, e.g. "get washed up on beaches" with Spanish "ahogados a la playa", "5 farmers" with German "fünf Landwirte", "or have been tortured" with "oder wurden gefoltert", and both "thrown into jail" and "imprisoned" with "festgenommen".]

WordNet
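A toy sketch of bilingual pivoting, the technique behind PPDB: two English phrases that align to the same foreign phrase in parallel text are treated as paraphrases, scored with p(e2 | e1) = Σ_f p(e2 | f) p(f | e1) (Bannard and Callison-Burch, 2005). The alignment list below is invented for illustration; only "festgenommen", "thrown into jail", and "imprisoned" come from the slide.

```python
from collections import defaultdict

alignments = [  # (english_phrase, foreign_phrase) pairs from a parallel corpus (invented)
    ("thrown into jail", "festgenommen"),
    ("imprisoned", "festgenommen"),
    ("imprisoned", "inhaftiert"),
    ("arrested", "festgenommen"),
]

e_given_f = defaultdict(lambda: defaultdict(float))
f_given_e = defaultdict(lambda: defaultdict(float))
for e, f in alignments:
    e_given_f[f][e] += 1.0
    f_given_e[e][f] += 1.0
for table in (e_given_f, f_given_e):
    for dist in table.values():
        total = sum(dist.values())
        for key in dist:
            dist[key] /= total

def paraphrase_prob(e1: str, e2: str) -> float:
    """Sum over foreign pivot phrases f of p(e2 | f) * p(f | e1)."""
    return sum(p_f * e_given_f[f].get(e2, 0.0) for f, p_f in f_given_e[e1].items())

print(paraphrase_prob("thrown into jail", "imprisoned"))  # ~0.33, via "festgenommen"
```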

Page 43

Lexical Semantics

WordNet

Bilingual Pivoting (PPDB)

Vector Space Models (word2vec)


Page 44

Lexical Semantics

Paraphrase pairs for "boy" (four pairs per line):

boy kid boy girl boy doggy boy mother

boy sons boy dude boy pup boy mommy

boy guys boy fella boy ah boy father

boy males boy laddie boy infant boy student

boy son boy gentleman boy pooch boy person

boy child boy male boy yarn boy type

boy boyfriend boy youth boy sonny boy cheeky

boy hans boy juvenile boy childhood boy buster

boy lad boy toddler boy doggy boy husband

boy guy boy puppy boy servant boy offspring

boy wraps boy fellow boy grandson boy wee

boy gus boy mate boy colt boy idiot

boy bollocks boy little boy darling boy partner

boy teenager boy bro boy teen boy toy

boy baby boy bloke boy junior boy old-timer

boy waiter boy boss boy baby boy calf

boy men boy kiddo boy bit boy protege

boy mandog boy apprentice boy daughter boy sage

boy buddy boy brat boy foal boy kitty

boy friend boy lapdog boy bearer boy bloodhound

boy does boy children boy shorty boy homie

boy pops boy bachelor boy foal boy cub

boy youngster boy soldier boy chum boy wolf

boy brother boy sweetheart boy blood boy honey

Page 45

Lexical Semantics

boy ⊏ kid

boy ⊏ person

Page 46

Lexical Semantics

boy ∣ girl

boy ∣ puppy

Page 47

Lexical Semantics

boy # toddler

boy # idiot

Page 48

The Paraphrase Database


Page 50

The Paraphrase Database

~100M word and phrase pairs

Page 51

Natural Logic

[Example word pairs from the slide's figure: couch / sofa, crow / bird, able / unable, canine / feline (shared hypernym: carnivore), hippo / food, Thai / Asian.]

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Each box represents the universe U, and the two circles within the box represent the sets x and y. A region is white if it is empty, and shaded if it is non-empty. Thus in the diagram labeled R1101, only the region x − y is empty, indicating that ∅ ⊂ x ⊂ y ⊂ U.

… equivalence class in which only partition 10 is empty.) These equivalence classes are depicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of ordered pairs of sets. We will refer to these 16 set relations as the elementary set relations, and we will denote this set of 16 relations by R. By construction, the relations in R are both mutually exhaustive (every ordered pair of sets belongs to some relation in R) and mutually exclusive (no ordered pair of sets belongs to two different relations in R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

Relation labels from the figure: equivalence, forward entailment, reverse entailment, negation, alternation, independence; corresponding WordNet relations: synonym, hyponymy, hypernymy, antonym, shared hypernym, no path.

Figure 2: Mappings of set-theoretic entailment relations onto the WordNet hierarchy. In the Venn diagrams, reproduced from (MacCartney, 2009), the left circle represents states in which x is true and the right circle states in which y is true. The shaded area represents all possible true states. E.g. when x ≡ y (equivalence), in every true state, either both x and y are true, or neither is.

3 Entailment Relations

We use the relations from Bill MacCartney's thesis on natural language inference as the basis for our categorization of relations (MacCartney, 2009). MacCartney's work focused on integrating the semantic properties previously employed by systems for question answering (Harabagiu and Hickl, 2006) and RTE (Bar-Haim et al., 2007) within the formal theory of natural logic (Lakoff, 1972). As a result, he provides a simple framework which models lexical entailment in 7 "basic entailment relationships":

Equivalence (≡): if X is true then Y is true, and if Y is true then X is true.

Forward entailment (⊏): if X is true then Y is true, but if Y is true then X may or may not be true.

Reverse entailment (⊐): if Y is true then X is true, but if X is true then Y may or may not be true.

Negation (^): if X is true then Y is false, and if Y is false then X is true; either X or Y must be true.

Alternation (|): if X is true then Y is false, but if Y is false then X may or may not be true.

Cover (⌣): if X is true then Y may or may not be true, and if Y is true then X may or may not be true; either X or Y must be true. We omit this relation, since its applicability to RTE is not clear.

Independence (#): if X is true then Y may or may not be true, and if Y is true then X may or may not be true.

4 Extracting entailment relations from WordNet

4.1 Mapping onto Natural Logic relations

We would like to train a model to automatically distinguish between the relationships described above. In order to gather labelled training data, we first look to the information available in the existing WordNet hierarchy. For roughly 2.5 million (60%) of the noun pairs in PPDB, both nouns appear in WordNet (although not necessarily in the same synset). We use the rules in Table 2 (shown graphically in Figure 2) to map a pair of nodes in the WordNet noun hierarchy onto one of the basic entailment relations described in section 3.
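A rough sketch, using NLTK's WordNet interface, of the kind of node-pair rules being described (the actual rules are in the paper's Table 2; this is a simplified approximation, not a reimplementation). Note that a naive shared-hypernym test fires for almost any noun pair, since all nouns ultimately share "entity"; the real rules are more restrictive.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def wordnet_relation(x: str, y: str) -> str:
    pairs = [(sx, sy) for sx in wn.synsets(x, pos=wn.NOUN)
                      for sy in wn.synsets(y, pos=wn.NOUN)]
    if any(sx == sy for sx, sy in pairs):
        return "equivalence (≡)"            # same synset: synonyms
    if any(sy in sx.closure(lambda s: s.hypernyms()) for sx, sy in pairs):
        return "forward entailment (⊏)"     # x is a hyponym of y
    if any(sx in sy.closure(lambda s: s.hypernyms()) for sx, sy in pairs):
        return "reverse entailment (⊐)"     # x is a hypernym of y
    if any(a.synset() == sy for sx, sy in pairs
           for lemma in sx.lemmas() for a in lemma.antonyms()):
        return "negation (^)"               # antonyms
    if any(sx.lowest_common_hypernyms(sy) for sx, sy in pairs):
        return "alternation (|)"            # share a hypernym (too permissive here)
    return "independence (#)"

print(wordnet_relation("couch", "sofa"))   # equivalence (≡)
print(wordnet_relation("boy", "person"))   # forward entailment (⊏)
```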

Other Relatedness In addition to MacCartney's relations, we define a sixth catch-all category for terms which are flagged as related by WordNet but whose relation is not built into the hierarchical structure. These noun pairs do not meet the criteria of the basic entailment relations but carry more information than do truly independent terms. We combine holonymy (part/whole relationships), attributes (adjectives closely tied to a specific noun), and derivationally related terms as "other" relations.

4.2 Shortcomings of WordNet labeling

Our definitions give the desired results for the ≡, ⊐, ⊏, and "other" categories. Table 1 shows some examples of nouns in PPDB which were assigned to each of these labels. However, whereas MacCartney's alternation is strictly contradictory (X → ¬Y), co-hyponyms of a common parent in WordNet do not necessarily have this property. Some examples behave well (e.g. lunch and dinner share the parent meal), but others are better labelled as ⊏ or # (e.g. boy and guy share the parent man, and brother and worker share the parent person). As a result, the WordNet alternation provides no information in terms of entailment, behaving instead like MacCartney's independence (i.e. "if X is true then Y may or …

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 52

Natural Logic

to capture these diverse relations.

couch !

sofa

crow

bird

unableable

feline

carni vore

canine

hippo food

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

equivalence

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

CHAPTER 5. ENTAILMENT RELATIONS 71

R0000 R0001 R0010 R0011

R0100 R0101 R0111

R1000 R1010 R1011

R1100 R1101 R1110 R1111

R0110

R1001

Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Eachbox represents the universe U , and the two circles within the box represent the setsx and y. A region is white if it is empty, and shaded if it is non-empty. Thus in thediagram labeled R1101, only the region x� y is empty, indicating that � � x � y � U .

equivalence class in which only partition 10 is empty.) These equivalence classes aredepicted graphically in figure 5.2.

In fact, each of these equivalence classes is a set relation, that is, a set of orderedpairs of sets. We will refer to these 16 set relations as the elementary set relations,and we will denote this set of 16 relations by R. By construction, the relations in R

are both mutually exhaustive (every ordered pair of sets belongs to some relation inR) and mutually exclusive (no ordered pair of sets belongs to two different relationsin R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.

forward entailment

reverse entailment

negation

alternation

independence no pathhypernymy

hyponymy shared hypernym

antonymsynonym

Thai

Asian

Figure 2: Mappings of set-theorhetic entailment relationsonto the WordNet hierarchy. In the Venn diagrams, re-produced from (MacCartney, 2009), the left circle repre-sents states in which x is true and the right circle states inwhich y is true. The shaded area represents all possibletrue states. E.g. when x ⌘ y (equivalence), in every truestate, either both x and y are true, or neither is.

3 Entailment Relations

We use the relations from Bill MacCartney’s the-sis on natural language inference as the basis forour categorization of relations (MacCartney, 2009).MacCartney’s work focused on integrating the se-mantic properties previously employed by systemsfor question answering (Harabagiu and Hickl, 2006)and RTE (Bar-Haim et al., 2007) within the formaltheory of natural logic (Lakoff, 1972). As a result,he provides a simple framework which models lexi-cal entailment in 7 “basic entailment relationships”:

Equivalence (⌘): if X is true then Y is true, andif Y is true then X is true.

Forward entailment (@): if X is true then Y istrue, but if Y is true then X may or may not be true.

Reverse entailment (A): if Y is true then X istrue, but if X is true then Y may or may not be true.

Negation (^): if X is true then Y is false, and ifY is false then X is true; either X or Y must be true.

Alternation (|): if X is true then Y is false, but ifY is false then X may or may not be true.

Cover (^): if X is true then Y may or may not betrue, and if Y is true then X may or may not be true;

either X or Y must be true. We omit this relation,since its applicability to RTE is not clear.

Independence (#): if X is true then Y may ormay not be true, and if Y is true then X may or maynot be true.

4 Extracting entailment relations from WordNet

4.1 Mapping onto Natural Logic relations

We would like to train a model to automatically distinguish between the relationships described above. In order to gather labelled training data, we first look to the information available in the existing WordNet hierarchy. For roughly 2.5 million (60%) of the noun pairs in PPDB, both nouns appear in WordNet (although not necessarily in the same synset). We use the rules in Table 2 (shown graphically in Figure 2) to map a pair of nodes in the WordNet noun hierarchy onto one of the basic entailment relations described in section 3.

Other Relatedness In addition to MacCartney's relations, we define a sixth catch-all category for terms which are flagged as related by WordNet but whose relation is not built into the hierarchical structure. These noun pairs do not meet the criteria of the basic entailment relations but carry more information than do truly independent terms. We combine holonymy (part/whole relationships), attributes (adjectives closely tied to a specific noun), and derivationally related terms as "other" relations.
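The sketch below gives a rough, NLTK-based version of this kind of WordNet-to-relation mapping. It is my simplification, not the paper's actual Table 2 rules: it only looks at each noun's first sense, and it folds only holonymy into the "other" category.

```python
from nltk.corpus import wordnet as wn


def holonyms(s):
    """All part/member/substance holonyms of a synset."""
    return set(s.member_holonyms() + s.part_holonyms() + s.substance_holonyms())


def wordnet_relation(noun1, noun2):
    """Heuristic mapping of a WordNet noun pair onto an entailment label,
    in the spirit of (but not identical to) the paper's Table 2 rules."""
    syns1 = wn.synsets(noun1, pos=wn.NOUN)
    syns2 = wn.synsets(noun2, pos=wn.NOUN)
    if not syns1 or not syns2:
        return None                       # pair not covered by WordNet
    a, b = syns1[0], syns2[0]             # first sense only, for simplicity
    if a == b:
        return "equivalence"              # same synset (synonyms)
    if b in set(a.closure(lambda s: s.hypernyms())):
        return "forward entailment"       # noun1 is a hyponym of noun2
    if a in set(b.closure(lambda s: s.hypernyms())):
        return "reverse entailment"       # noun1 is a hypernym of noun2
    ants_a = {ant for lem in a.lemmas() for ant in lem.antonyms()}
    if any(lem in ants_a for lem in b.lemmas()):
        return "negation"                 # antonyms
    if set(a.hypernyms()) & set(b.hypernyms()):
        return "alternation"              # co-hyponyms of a common parent
    if b in holonyms(a) or a in holonyms(b):
        return "other"                    # part/whole; the paper also folds in
                                          # attributes and derivational relatives
    return "independence"                 # no connecting relation found


print(wordnet_relation("couch", "sofa"))  # expected: equivalence
print(wordnet_relation("crow", "bird"))   # expected: forward entailment
```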

4.2 Shortcomings of WordNet labeling

Our definitions give the desired results for the ≡, ⊐, ⊏, and "other" categories. Table 1 shows some examples of nouns in PPDB which were assigned to each of these labels. However, whereas MacCartney's alternation is strictly contradictory (X → ¬Y), co-hyponyms of a common parent in WordNet do not necessarily have this property. Some examples behave well (e.g. lunch and dinner share the parent meal), but others are better labelled as ⊏ or # (e.g. boy and guy share the parent man and brother and worker share the parent person). As a result, the WordNet alternation provides no information in terms of entailment, behaving instead like MacCartney's independence (i.e. "if X is true then Y may or may not be true").

Equivalence

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 53: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic

to capture these diverse relations.

couch → sofa

crow, bird

unable, able

feline, carnivore, canine

hippo, food


Equivalence

Entailment

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 54: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic

to capture these diverse relations.

couch → sofa

crow, bird

unable, able

feline, carnivore, canine

hippo, food


Entailment

Exclusion

Equivalence

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 55: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic

to capture these diverse relations.

couch → sofa

crow, bird

unable, able

feline, carnivore, canine

hippo, food


Independent

Entailment

Exclusion

Equivalence

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 56: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Monolingual Features

symmetric and asymmetric similarities based on dependency context

cosine:0.431 lin:0.544 balprec:0.396 weeds:0.289…

kid, little girl: binary feature vectors over dependency contexts
1 1 1 1 1 0 0 1 0 0 1 0
1 0 1 0 1 1 1 1 1 1 1 1
dependency contexts: nsubj-draw, lost-dep, nsubj-vomit, poss-shoe, dep-brother, nsubjpass-buckled, nsubj-yell, rcmod-tired

Discovery of Inference Rules from Text. (Lin and Pantel SIGKDD 2001)
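Each of the scores on the slide (cosine, Lin, balanced precision, Weeds) is a function of two phrases' dependency-context vectors: symmetric scores ask how much the contexts overlap, asymmetric ones ask how much of one phrase's contexts are covered by the other's. A minimal sketch with made-up binary context sets (the context names echo the slide; which contexts each phrase has is invented for illustration):

```python
import math

kid = {"nsubj-draw", "lost-dep", "nsubj-vomit", "poss-shoe",
       "dep-brother", "nsubjpass-buckled", "nsubj-yell", "rcmod-tired"}
little_girl = {"nsubj-draw", "poss-shoe", "nsubj-yell", "rcmod-tired"}


def cosine(a, b):
    """Symmetric: shared contexts over the geometric mean of the set sizes."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def coverage(a, b):
    """Asymmetric (Weeds-precision-style): fraction of a's contexts that also
    occur with b. Near 1.0 when a's contexts are nested inside b's, the kind
    of inclusion signal used for guessing that a is the narrower term."""
    return len(a & b) / len(a)


print(cosine(kid, little_girl))    # 4 / sqrt(8 * 4), about 0.707
print(coverage(little_girl, kid))  # 1.0 (little girl's contexts all occur with kid)
print(coverage(kid, little_girl))  # 0.5
```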

Page 57: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Monolingual Features

…if this can happen to my little girl , it can happen to other kids…

dependency arcs: advcl, nmod, nmod

[X]<-nmod-[happen]-advcl->[happen]-nmod->[Y]:1 [X]<-conj-[Y]:1 [X]<-pobj-[as]<-prep-[know]<-rcmod-[Y]:1

lexico-syntactic patterns

Automatic acquisition of hyponyms from large text corpora. (Hearst COLING 1992) Semantic taxonomy induction from heterogenous evidence. (Snow et al. ACL 2006)
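Alongside parse-based paths like the ones above, the Hearst-style approach matches surface patterns such as "Y such as X1, X2, and X3" to harvest hyponym/hypernym candidates. A toy, regex-only sketch (mine, not the talk's actual feature extractor):

```python
import re

# One classic Hearst-style surface pattern ("Y such as X1, X2, and X3").
# Toy illustration only; real systems use many patterns and, as on the
# slide above, parse-based paths between the two terms.
SUCH_AS = re.compile(r"(\w+) such as ([\w ,]+)")


def hearst_pairs(text):
    """Return (hyponym, hypernym) candidates matched by the pattern."""
    pairs = []
    for hypernym, hyponym_list in SUCH_AS.findall(text):
        for hyponym in re.split(r",? and |, ", hyponym_list):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


print(hearst_pairs("She likes cuisines such as Thai, Chinese, and Indian."))
# -> [('Thai', 'cuisines'), ('Chinese', 'cuisines'), ('Indian', 'cuisines')]
```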

Page 58: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bilingual Features

ahogados a la playa ...        get washed up on beaches ...
... fünf Landwirte , weil      ... 5 farmers were in Ireland ...
oder wurden , gefoltert        or have been , tortured
festgenommen                   thrown into jail
festgenommen                   imprisoned

Paraphrasing with bilingual parallel corpora. (Bannard and Callison-Burch ACL 2005) PPDB: The paraphrase database. (Ganitkevitch et al. NAACL 2013)
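The bilingual signal comes from pivoting: English phrases that align to the same foreign phrase (here, "thrown into jail" and "imprisoned" both aligning to "festgenommen") are treated as paraphrase candidates, scored by marginalizing over the pivot, p(e2 | e1) ≈ Σ_f p(f | e1) · p(e2 | f), following Bannard and Callison-Burch. A toy sketch with made-up translation probabilities (not PPDB's actual tables):

```python
from collections import defaultdict

# Toy phrase-translation probabilities (invented for illustration; real
# tables are estimated from word-aligned bilingual parallel corpora).
p_f_given_e = {  # p(foreign | english)
    "thrown into jail": {"festgenommen": 0.6, "ins Gefängnis geworfen": 0.4},
    "imprisoned":       {"festgenommen": 0.8, "inhaftiert": 0.2},
}
p_e_given_f = {  # p(english | foreign)
    "festgenommen":           {"thrown into jail": 0.3, "imprisoned": 0.5, "arrested": 0.2},
    "ins Gefängnis geworfen": {"thrown into jail": 0.9, "imprisoned": 0.1},
    "inhaftiert":             {"imprisoned": 0.7, "detained": 0.3},
}


def paraphrase_probs(e1):
    """p(e2 | e1) ~= sum over foreign pivots f of p(f | e1) * p(e2 | f)."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                scores[e2] += p_f * p_e2
    return dict(scores)


print(paraphrase_probs("thrown into jail"))
# e.g. "imprisoned" picks up 0.6 * 0.5 = 0.30 via the pivot "festgenommen"
```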

Page 59: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Table 1 (spreadsheet data for the confusion matrices in Figure 4). Rows are true labels, columns are predicted labels.

Raw counts, "mono" (columns: ind syn hyp exl oth; value in parentheses: row total):
syn:   1   3   1   0   0   (4)
hyp:   2   3   7   0   1   (13)
exl:   1   1   1   1   0   (4)
ind:  14   2   2   0   1   (20)
oth:   3   1   2   0   2   (10)

Raw counts, "bi":
syn:   0   3   1   0   0   (4)
hyp:   2   7   1   2  13   (24)
exl:   1   0   1   1   1   (4)
ind:  15   0   1   1   2   (20)
oth:   3   1   2   1   3   (10)

Raw counts, "both":
syn:     9   368    46     1    19   (443)
hyp:    97    83  1004    29   108   (1321)
exl:    49     9    29   275    13   (375)
ind:  1730    15    82    35   114   (1976)
oth:   169    48    97    33   609   (956)

Row-normalized predictions (columns: ≡ ⊐ ¬ # ~):

Predicted label (using monolingual features):
≡:  58%  20%   4%  15%   3%
⊐:  20%  51%   3%  18%   7%
¬:  26%  14%  37%  17%   6%
#:   8%  13%   2%  71%   6%
~:  15%  21%   5%  36%  23%

Predicted label (using bilingual features):
≡:  62%  21%   5%   4%   8%
⊐:  27%   5%   7%   7%  54%
¬:   6%  14%  30%  36%  14%
#:   1%   7%   6%  78%   8%
~:   8%  19%   9%  30%  35%

Predicted label (using all features):
≡:  83%  10%   0%   2%   4%
⊐:   6%  76%   2%   7%   8%
¬:   2%   8%  73%  13%   3%
#:   1%   4%   2%  88%   6%
~:   5%  10%   3%  18%  64%

Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingual features (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalized along rows, so that the predictions for each (true) class sum to 100%.
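The row normalization described in the caption is just a division of each row of the count matrix by its row sum. A quick sketch using the all-features counts from the table above, with columns reordered to syn, hyp, exl, ind, oth so they match the row order; the syn row comes out as roughly 83 / 10 / 0 / 2 / 4, matching the percentages above.

```python
import numpy as np

labels = ["syn", "hyp", "exl", "ind", "oth"]
# Raw counts (true label = row, predicted label = column), taken from the
# "both" block of the table above, columns reordered to syn/hyp/exl/ind/oth.
counts = np.array([
    [368,   46,   1,    9,  19],   # true syn
    [ 83, 1004,  29,   97, 108],   # true hyp
    [  9,   29, 275,   49,  13],   # true exl
    [ 15,   82,  35, 1730, 114],   # true ind
    [ 48,   97,  33,  169, 609],   # true oth
])

row_pct = 100 * counts / counts.sum(axis=1, keepdims=True)
for lab, row in zip(labels, row_pct):
    print(lab, " ".join(f"{v:5.1f}" for v in row))
```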

Δ F1 when excluding each feature group (columns Lex. through WN); "All" is the F1 (×100) with all features:

      All    Lex.   Dist.  Path   Para.  Tran.   WN
#      79    -2.0   -0.2   -1.2   -1.7   -0.2   -0.1
≡      57    -3.5   +0.2   -0.7   -2.4   -3.7   +0.5
⊐      68    -4.6   -0.3   -0.8   -0.8   -0.7   -1.6
¬      49    -4.0   -0.8   -2.9   +0.3   -0.0   -2.2
~      51    -4.9   -0.5   -0.7   -1.2   -0.9   -0.3

Table 5: F1 measure (×100) achieved by entailment classifier using 10-fold cross validation on the training data.
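The ablation behind a table like this is a leave-one-group-out loop: drop one feature group, retrain, and report the change in F1 under 10-fold cross-validation. The sketch below is schematic only; the classifier, the feature-group column indices, and the random data are placeholders, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                      # placeholder feature matrix
y = rng.randint(0, 5, size=200)           # placeholder 5-class labels
groups = {"Dist.": [0, 1], "Path": [2], "Para.": [3, 4], "Tran.": [5]}

clf = LogisticRegression(max_iter=1000)
full = cross_val_score(clf, X, y, cv=10, scoring="f1_macro").mean()
for name, cols in groups.items():
    keep = [c for c in range(X.shape[1]) if c not in cols]
    score = cross_val_score(clf, X[:, keep], y, cv=10, scoring="f1_macro").mean()
    print(name, round(score - full, 3))   # negative delta = the group was helping
```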

Table 3 sheds some light on the differences between monolingual and bilingual similarities. While the monolingual asymmetric metrics are good for identifying ⊐ pairs, the symmetric metrics consistently identify ¬ pairs; none of the monolingual scores we explored were effective in making the subtle distinction between ≡ pairs and the other types of paraphrase. In contrast, the bilingual similarity metric is fairly precise for identifying ≡ pairs, but provides less information for distinguishing between types of non-equivalent paraphrase. These differences are further exhibited in the confusion matrices shown in Figure 4; when the classifier is trained using only monolingual features, it misclassifies 26% of ¬ pairs as ≡, whereas the bilingual features make this error only 6% of the time. On the other hand, the bilingual features completely fail to predict the ⊐ class, calling over 80% of such pairs ≡ or ~.

7 Evaluation

7.1 Intrinsic Evaluation

We test the performance of our classifier intrinsically, through its ability to reproduce the human labels for the phrase pairs from the SICK test sentences. Table 7 shows the precision and recall achieved by the classifier for each of our 5 entailment classes. The classifier is able to achieve an overall 79% accuracy, reaching >70% precision while maintaining good levels of recall on all classes.

True | Pred. | N   | Example misclassifications
~    | #     | 169 | boy/little, an empty/the air
#    | ~     | 114 | little/toy, color/hair
⊐    | ~     | 108 | drink/juice, ocean/surf
⊐    | #     | 97  | in front of/the face of, vehicle/horse
⊐    | ≡     | 83  | cat/kitten, pavement/sidewalk
≡    | ⊐     | 46  | big/grand, a girl/a young lady
⊐    | ¬     | 29  | kid/teenager, no small/a large
¬    | ⊐     | 29  | old man/young man, a car/a window
#    | ≡     | 15  | a person/one, a crowd/a large
≡    | #     | 9   | he is/man is, photo/still
≡    | ¬     | 1   | girl is/she is

Table 6: Example misclassifications from some of the most frequent and most interesting error categories.

Figure 4 shows the classifier's confusion matrix and Table 6 shows some examples of common and interesting error cases. The majority of errors (26%) come from confusing the ~ class with the # class. This mistake is not too concerning from an RTE perspective since ~ can be treated as a special case of # (Section 5). There are very few cases in which the classifier makes extreme errors, e.g. confusing ≡ with ¬ or with #; some interesting examples of such errors arise when the phrases contain pronouns (e.g. girl ≡ she) or when the relation uses a highly infrequent word sense (e.g. photo ≡ still).

7.2 The Nutcracker RTE System

To further test our classifier, we evaluate the usefulness of the automatic entailment predictions in a downstream RTE task. We run our experiments using Nutcracker, a state-of-the-art RTE system based on formal semantics (Bjerva et al., 2014). In the SemEval 2014 RTE challenge, this system

(figure axis labels: equiv., entail., excl., other, indep.)


Monolingual features only

Bilingual features only

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 60: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania


Monolingual features only

Bilingual features only

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 61: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Table 1

mono Predicted label (using monolingual features)

Predicted label (using bilingual features)

Predicted label (using all features)

ind syn hyp exl oth ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~ ≣ ⊐ ¬ # ~syn 1 3 1 0 0 4 ≣ 58% 20% 4% 15% 3% 62% 21% 5% 4% 8% 83% 10% 0% 2% 4%

hyp 2 3 7 0 1 13 ⊐ 20% 51% 3% 18% 7% 27% 5% 7% 7% 54% 6% 76% 2% 7% 8%

exl 1 1 1 1 0 4 ¬ 26% 14% 37% 17% 6% 6% 14% 30% 36% 14% 2% 8% 73% 13% 3%

ind 14 2 2 0 1 20 # 8% 13% 2% 71% 6% 1% 7% 6% 78% 8% 1% 4% 2% 88% 6%

oth 3 1 2 0 2 10 ~ 15% 21% 5% 36% 23% 8% 19% 9% 30% 35% 5% 10% 3% 18% 64%

bi

ind syn hyp exl oth

syn 0 3 1 0 0 4

hyp 2 7 1 2 13 24

exl 1 0 1 1 1 4

ind 15 0 1 1 2 20

oth 3 1 2 1 3 10

both

ind syn hyp exl oth

syn 9 368 46 1 19 443

hyp 97 83 1004 29 108 1321

exl 49 9 29 275 13 375

ind 1730 15 82 35 114 1976

oth 169 48 97 33 609 956

True

labe

l

Figure 4: Confusion matrices for classifier trained using only monolingual features (distributional and path) versus bilingualfeatures (paraphrase and translation). True labels are shown along rows, predicted along columns. The matrix is normalizedalong rows, so that the predictions for each (true) class sum to 100%.

� F1 when excludingAll Lex. Dist. Path Para. Tran. WN

# 79 -2.0 -0.2 -1.2 -1.7 -0.2 -0.1⌘ 57 -3.5 +0.2 -0.7 -2.4 -3.7 +0.5A 68 -4.6 -0.3 -0.8 -0.8 -0.7 -1.6¬ 49 -4.0 -0.8 -2.9 +0.3 -0.0 -2.2⇠ 51 -4.9 -0.5 -0.7 -1.2 -0.9 -0.3

Table 5: F1 measure (⇥100) achieved by entailment classifierusing 10-fold cross validation on the training data.

Table 3 shines some light onto the differencesbetween monolingual and bilingual similarities.While the monolingual asymmetric metrics aregood for identifying A pairs, the symmetric met-rics consistently identify ¬ pairs; none of themonolingual scores we explored were effectivein making the subtle distinction between ⌘ pairsand the other types of paraphrase. In contrast,the bilingual similarity metric is fairly precisefor identifying ⌘ pairs, but provides less infor-mation for distinguishing between types of non-equivalent paraphrase. These differences are fur-ther exhibited in the confusion matrices shown inFigure 4; when the classifier is trained using onlymonolingual features, it misclassifies 26% of ¬pairs as ⌘, whereas the bilingual features makethis error only 6% of the time. On the other hand,the bilingual features completely fail to predict theA class, calling over 80% of such pairs ⌘ or ⇠.

7 Evaluation

7.1 Intrinsic Evaluation

We test the performance of our classifier intrinsi-cally, through its ability to reproduce the humanlabels for the phrase pairs from the SICK test sen-tences. Table 7 shows the precision and recallachieved by the classifier for each of our 5 en-tailment classes. The classifier is able to achieve

an overall 79% accuracy, reaching >70% preci-sion while maintaining good levels of recall on allclasses.

  True  Pred.    N   Example misclassifications
   ~     #     169   boy/little, an empty/the air
   #     ~     114   little/toy, color/hair
   ⊐     ~     108   drink/juice, ocean/surf
   ⊐     #      97   in front of/the face of, vehicle/horse
   ⊐     ≣      83   cat/kitten, pavement/sidewalk
   ≣     ⊐      46   big/grand, a girl/a young lady
   ⊐     ¬      29   kid/teenager, no small/a large
   ¬     ⊐      29   old man/young man, a car/a window
   #     ≣      15   a person/one, a crowd/a large
   ≣     #       9   he is/man is, photo/still
   ≣     ¬       1   girl is/she is

Table 6: Example misclassifications from some of the most frequent and most interesting error categories.

Figure 4 shows the classifier's confusion matrix and Table 6 shows some examples of common and interesting error cases. The majority of errors (26%) come from confusing the ~ class with the # class. This mistake is not too concerning from an RTE perspective since ~ can be treated as a special case of # (Section 5). There are very few cases in which the classifier makes extreme errors, e.g. confusing ≣ with ¬ or with #; some interesting examples of such errors arise when the phrases contain pronouns (e.g. girl ≣ she) or when the relation uses a highly infrequent word sense (e.g. photo ≣ still).

7.2 The Nutcracker RTE System

To further test our classifier, we evaluate the usefulness of the automatic entailment predictions in a downstream RTE task. We run our experiments using Nutcracker, a state-of-the-art RTE system based on formal semantics (Bjerva et al., 2014). In the SemEval 2014 RTE challenge, this system
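Systems like Nutcracker consume phrase-level knowledge as background axioms for a theorem prover. The snippet below only illustrates the general idea of turning predicted relations into first-order axioms; the predicate naming and the Prover9-style syntax are my assumptions, not Nutcracker's actual input format.

def axiom(lhs: str, relation: str, rhs: str) -> str:
    # Map a predicted phrase-level relation to a hypothetical background axiom.
    p, q = lhs.replace(" ", "_"), rhs.replace(" ", "_")
    if relation == "≣":   # equivalence: entailment in both directions
        return f"all x.({p}(x) <-> {q}(x))"
    if relation == "⊏":   # forward entailment
        return f"all x.({p}(x) -> {q}(x))"
    if relation == "¬":   # exclusion
        return f"all x.({p}(x) -> -{q}(x))"
    return ""             # independence / other-related: no axiom contributed

print(axiom("sedan", "⊏", "car"))    # all x.(sedan(x) -> car(x))
print(axiom("blue", "¬", "green"))   # all x.(blue(x) -> -green(x))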


Monolingual features only

Bilingual features only

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 62: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania


All features

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 63: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Predicting entailment relations

Equivalent            | Entailment        | Exclusion            | Other              | Independent
look at/watch         | little girl/girl  | close/open           | swim/water         | girl/play
a person/someone      | kuwait/country    | minimal/significant  | husband/marry      | found/party
clean/cleanse         | tower/building    | boy/young girl       | oil/oil price      | man/talk
distant/remote        | sneaker/footwear  | nobody/someone       | country/patriotic  | profit/year
phone/telephone       | heroin/drug       | blue/green           | drive/vehicle      | holiday/series
last autumn/last fall | typhoon/storm     | france/germany       | playing/toy        | city/south

Adding Semantics to Data-Driven Paraphrasing. (Pavlick et al. ACL 2015)

Page 64: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

modifiers

INS(environmental)

Page 65: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

Page 66: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

Page 67: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

irreversible environmental consequences

Page 68: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

Does environmental consequence entail consequence?

Page 69: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

Does consequence entail environmental consequence?

Page 70: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.

A man is pointing at a silver car.

A man is pointing at a car.

SUB(sedan, car) ⊏
DEL(silver) ⊏

yes

yes

Page 71: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Language Inference

Natural Language Inference. (MacCartney PhD Thesis 2009)

No man is pointing at a car.

A man is pointing at a silver sedan.

A man is pointing at a silver car.

A man is pointing at a car.

SUB(sedan, car) ⊏
DEL(silver) ⊏

yes

yes

From image descriptions to visual denotations. (Young et al. TACL 2014)

Entailment above the word level in distributional semantics. (Baroni et al. EACL 2012)

BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment (Stern and Dagan. ACL 2012)
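The proof sketched on this slide is just relation composition: each atomic edit (SUB, DEL, INS) contributes a lexical entailment relation, and the relations are joined along the edit sequence. A toy version of that bookkeeping follows; the join table covers only the relations needed for this example (MacCartney's full system uses a join table over all seven basic relations plus projectivity, e.g. to handle "No man ..."), so it is a sketch, not the algorithm itself.

# Toy MacCartney-style composition of edit relations.
EQ, FWD, REV, IND = "≡", "⊏", "⊐", "#"

JOIN = {  # partial join table; anything not listed falls back to independence
    (EQ, EQ): EQ, (EQ, FWD): FWD, (FWD, EQ): FWD, (FWD, FWD): FWD,
    (EQ, REV): REV, (REV, EQ): REV, (REV, REV): REV,
}

def compose(edits):
    relation = EQ                                   # the premise vs. itself
    for _name, rel in edits:
        relation = JOIN.get((relation, rel), IND)
    return relation

edits = [("SUB(sedan, car)", FWD), ("DEL(silver)", FWD)]
print(compose(edits))   # ⊏ : the premise entails the edited sentence ("yes")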

Page 72: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

Does consequence entail environmental consequence? ✘

The policy will have consequences.

Page 73: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Denotational Semantics

consequences

environmental consequences

Does consequence entail environmental consequence?

The contaminated land will have consequences.

Page 74: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic Entailment Relations

N does not entail AN N does entail AN

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 75: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic Entailment Relations

N does not entail AN N does entail AN

Forward Entailment (AN ⊏ N)

AN: brown dog

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 76: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic Entailment Relations

N does not entail AN N does entail AN

Forward Entailment (AN ⊏ N)

AN: brown dog

Equivalence (AN ≣ N)

AN: entire world

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 77: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Natural Logic Entailment Relations

N does not entail AN: Alternation (N | AN), Forward Entailment (AN ⊏ N), Independent (N # AN)

N does entail AN: Equivalence (AN ≣ N), Reverse Entailment (AN ⊐ N)

Example ANs: alleged criminal, former senator, brown dog, possible winner, entire world

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
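One way to make these five AN-versus-N relations concrete is denotationally: if each phrase came with the set of things (say, images) it truthfully describes, the relation would fall out of simple set comparisons. The function below is a hypothetical sketch in that spirit (real denotations are not enumerable sets, and the example sets are invented):

def an_vs_n_relation(an: set, n: set) -> str:
    # Classify the relation between an adjective-noun phrase AN and its head N
    # from toy denotation sets.
    if an == n:
        return "Equivalence (AN ≣ N)"
    if an < n:
        return "Forward Entailment (AN ⊏ N)"
    if an > n:
        return "Reverse Entailment (AN ⊐ N)"
    if an.isdisjoint(n):
        return "Alternation (N | AN)"
    return "Independent (N # AN)"

brown_dogs = {"img1", "img2"}
dogs = {"img1", "img2", "img3"}
print(an_vs_n_relation(brown_dogs, dogs))   # Forward Entailment: brown dog ⊏ dog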

Page 78: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Experimental Design

Wellers hopes the financial system will be fully operational by 2015 .

Wellers hopes the system will be fully operational by 2015 .

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 79: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Experimental Design

Wellers hopes the financial system will be fully operational by 2015 .

Wellers hopes the system will be fully operational by 2015 .

Contradiction Entailment Can’t Tell

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 80: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Experimental Design

Wellers hopes the system will be fully operational by 2015 .

Wellers hopes the financial system will be fully operational by 2015 .

Contradiction Entailment Can’t Tell

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 81: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Experimental Design

Wellers hopes the system will be fully operational by 2015 .

Wellers hopes the financial system will be fully operational by 2015 .

Contradiction Entailment Can’t Tell

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

~200 human annotators

~5,000 sentences

4 genres
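The design above amounts to pairing each naturally occurring sentence containing an AN with the same sentence minus the adjective and asking for an entailment judgment in each direction. A rough sketch of how such items could be generated is below; the naive string deletion and the field names are placeholders, not the talk's actual annotation pipeline.

def make_items(sentence: str, adjective: str):
    # Build one deletion-direction and one insertion-direction item.
    without = sentence.replace(adjective + " ", "", 1)   # naive adjective removal
    return [
        {"premise": sentence, "hypothesis": without, "direction": "deletion"},
        {"premise": without, "hypothesis": sentence, "direction": "insertion"},
    ]

for item in make_items(
    "Wellers hopes the financial system will be fully operational by 2015 .",
    "financial",
):
    print(item["direction"], "|", item["premise"], "=>", item["hypothesis"])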

Page 82: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Empirical Analysis

Distribution of AN–N relations by genre (legend: AN < N, AN = N, AN # N, AN > N, AN | N, None):
  Image Captions: 85%, 8%, 3%, 3%
  Literature:     61%, 29%, 5%, 4%
  News:           57%, 26%, 7%, 7%
  Forums:         47%, 37%, 7%, 7%
(The largest share in each genre corresponds to AN < N, i.e. the adjective behaving restrictively.)

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 83: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Empirical Analysis


Up to 53% error rate by assuming all adjectives are restrictive.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
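The 53% figure is just the complement of the share of ANs that actually behave restrictively (AN < N) in the worst genre. A quick check using the approximate shares read off the charts on the previous slide (pairing the largest slice with AN < N is my reading of the figure):

# Error rate of the "every adjective is restrictive" baseline, per genre.
restrictive_share = {          # approximate share of pairs judged AN < N
    "Image Captions": 0.85,
    "Literature": 0.61,
    "News": 0.57,
    "Forums": 0.47,
}
for genre, share in restrictive_share.items():
    print(f"{genre}: wrong {1 - share:.0%} of the time")
# Forums: 1 - 0.47 = 53%, the "up to 53%" quoted on the slide.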

Page 84: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 85: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

(probably sandy) (probably not sandy)

When does N entail AN?
Sometimes, it is a property of the adjective and noun…

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 86: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When does N entail AN?
…or even just the adjective.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 87: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

The [deadly] attack killed at least 12 civilians.

The [entire] bill is now subject to approval by the parliament.

Equivalence AN

When does N entail AN?
Other times, it is a property of the context+word knowledge.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 88: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Red numbers spelled out their [perfect] record: 9-2.

Schilling even stayed busy after serving Epstein turkey at his [former] home on Thursday.

Alternation N AN

When does N NOT entail AN?

Other times, it is a property of the context+word knowledge.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 89: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Undefined Relations

When does N NOT entail AN?

Other times, it is a property of the context+word knowledge.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 90: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

No N is an AN, but every AN is an N.

Undefined Relations

When does N NOT entail AN?

Other times, it is a property of the context+word knowledge.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 91: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Undefined Relations

When does N NOT entail AN?

Other times, it is a property of the context+word knowledge.

Bush travels Monday to Michigan to make remarks on the [Japanese] economy.

No N is an AN, but every AN is an N.

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 92: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

A dictionary of nonsubsective adjectives. (Nayak et al. Tech Report 2014)

fake former artificial

counterfeit possible probable unlikely likely

Page 93: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

(non-subsective) modifiers

DEL(possible)
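For an edit-based inference system, the practical question raised here is what relation to assign when an edit deletes an adjective. Theory says the answer depends on the adjective's class, so one obvious move is to consult a list like Nayak et al.'s. The sketch below is illustrative only: the entries are a few examples from the slide, and, as the following slides argue, the strict answer it gives for non-subsective adjectives is often too conservative in practice.

# Illustrative lookup: what does DEL(adjective) contribute in a natural-logic proof?
NON_SUBSECTIVE = {"alleged", "possible", "probable", "likely", "unlikely",
                  "former", "fake", "artificial", "counterfeit"}

def deletion_relation(adjective: str) -> str:
    if adjective.lower() in NON_SUBSECTIVE:
        return "#"   # theory: AN need not be an N, so the deletion is not licensed
    return "⊏"       # subsective default: AN ⊏ N, deletion preserves truth

print(deletion_relation("possible"))   # "#" : blocks DEL(possible) above
print(deletion_relation("silver"))     # "⊏" : licenses DEL(silver) from earlier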

Page 94: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

fake former artificial

counterfeit possible probable unlikely likely

A dictionary of nonsubsective adjectives. (Nayak et al. Tech Report 2014)

Page 95: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

Figure 1 (schematic): Privative (e.g. fake): AN and N disjoint · Plain Non-subsective (e.g. alleged): AN and N overlap · Subsective (e.g. red): AN contained in N

Figure 1: Three main classes of adjectives. If their entailment behavior is consistent with their theoretical definitions, we would expect our annotations (Section 3) to produce the insertion (blue) and deletion (red) patterns shown by the bar graphs. Bars (left to right) represent CONTRADICTION, UNKNOWN, and ENTAILMENT.

While these generalizations are intuitive, there is little experimental evidence to support them. In this paper, we collect human judgements of the validity of inferences following from the insertion and deletion of various classes of adjectives and analyze the results. Our findings suggest that, in practice, most sentences involving non-subsective ANs can be safely generalized to statements about the N. That is, non-subsective adjectives often behave like normal, subsective adjectives. On further analysis, we reveal that, when adjectives do behave non-subsectively, they often exhibit asymmetric entailment behavior in which insertion leads to contradictions (ID ⇒ ¬ fake ID) but deletion leads to entailments (fake ID ⇒ ID). We present anecdotal evidence for how the entailment associated with inserting/deleting a non-subsective adjective depends on the salient properties of the noun phrase under discussion, rather than on the adjective itself.

2 Background and Related Work

Classes of Adjectives. Adjectives are commonly classified taxonomically as either subsective or non-subsective (Kamp and Partee, 1995). Subsective adjectives are adjectives which pick out a subset of the set denoted by the unmodified noun; that is, AN ⊆ N.¹ For non-subsective adjectives, in contrast, the AN cannot be guaranteed to be a subset of N. For example, clever is subsective, and so a clever thief is always a thief. However,

¹ We use the notation N and AN to refer both to the natural language expression itself (e.g. red car) as well as its denotation, e.g. {x | x is a red car}.

alleged is non-subsective, so there are many possible worlds in which an alleged thief is not in fact a thief. Of course, there may also be many possible worlds in which the alleged thief is a thief, but the word alleged, being non-subsective, does not guarantee this to hold.

Non-subsective adjectives can be further divided into two classes: privative and plain. Sets denoted by privative ANs are completely disjoint from the set denoted by the head N (AN ∩ N = ∅), and this mutual exclusivity is encoded in the meaning of the A itself. For example, fake is considered to be a quintessential privative adjective since, given the usual definition of fake, a fake ID can not actually be an ID. For plain non-subsective adjectives, there may be worlds in which the AN is an N, and worlds in which the AN is not an N: neither inference is guaranteed by the meaning of the A. As mentioned above, alleged is quintessentially plain non-subsective since, for example, an alleged thief may or may not be an actual thief. In short, we can summarize the classes of adjectives in the following way: subsective adjectives entail the nouns they modify, privative adjectives contradict the nouns they modify, and plain non-subsective adjectives are compatible with (but do not entail) the nouns they modify. Figure 1 depicts these distinctions.

While the hierarchical classification of adjectives described above is widely accepted and often applied in NLP tasks (Amoia and Gardent, 2006; Amoia and Gardent, 2007; Boleda et al., 2012; McCrae et al., 2014), it is not undisputed. Some linguists take the position that in fact privative ad-

She was the expected winner. She was the winner.

(predicted)

So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 96: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?


(a) Privative (b) Plain Non-Sub. (c) Subsective

Figure 2: Observed entailment judgements for insertion (blue) and deletion (red) of adjectives. Compare to expected distributions in Figure 1.

We expect the plain non-subsective adjectives to receive labels of UNKNOWN in both directions. We expect the subsective adjectives to receive labels of ENTAILMENT in the deletion direction (red car ⇒ car) and labels of UNKNOWN in the insertion direction (car ⇏ red car). Figure 1 depicts these expected distributions.

Observations. The observed entailment patterns for insertion and deletion of non-subsective adjectives are shown in Figure 2. Our control sample of subsective adjectives (Figure 2c) largely produced the expected results, with 96% of deletions producing ENTAILMENTs and 73% of insertions producing UNKNOWNs.³ The entailment patterns produced by the non-subsective adjectives, however, did not match our predictions. The plain non-subsective adjectives (e.g. alleged) behave nearly identically to how we expect regular, subsective adjectives to behave (Figure 2b). That is, in 80% of cases, deleting the plain non-subsective adjective was judged to produce ENTAILMENT, rather than the expected UNKNOWN. The examples in Table 2 shed some light onto why this is the case. Often, the differences between N and AN are not relevant to the main point of the utterance. For example, while an expected surge in unemployment is not a surge in unemployment, a policy that deals with an expected surge deals with a surge.

The privative adjectives (e.g. fake) also fail to match the predicted distribution. While insertions often produce the expected CONTRADICTIONs, deletions produce a surprising number of ENTAILMENTs (Figure 2a). Such a pattern does not fit into any of the adjective classes from Figure 1. While some ANs (e.g. counterfeit money) behave in the prototypically privative way, others

³ A full discussion of the 27% of insertions that deviated from the expected behavior is given in Pavlick and Callison-Burch (2016).

(1) Swiss officials on Friday said they've launched an investigation into Urs Tinner's alleged role.

(2) To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.

(3) They kept it close for a half and had a theoretical chance come the third quarter.

Table 2: Contrary to expectations, the deletion of plain non-subsective adjectives often preserves the (plausible) truth in a model. E.g. alleged role ⇏ role, but investigation into alleged role ⇒ investigation into role.

(e.g. mythical beast) have the property in which N ⇒ ¬AN, but AN ⇒ N (Figure 3). Table 3 provides some telling examples of how this AN ⇒ N inference, in the case of privative adjectives, often depends less on the adjective itself, and more on properties of the modified noun that are at issue in the given context. For example, in Table 3 Example 2(a), a mock debate probably contains enough of the relevant properties (namely, arguments) that it can entail debate, while in Example 2(b), a mock execution lacks the single most important property (the death of the executee) and so cannot entail execution. (Note that, from Example 3(b), it appears the jury is still out on whether leaps in artificial intelligence entail leaps in intelligence...)

5 Discussion

The results presented suggest a few important patterns for NLP systems. First, that while a non-subsective AN might not be an instance of the N (taxonomically speaking), statements that are true of an AN are often true of the N as well. This is relevant for IE and QA systems, and is likely to become more important as NLP systems focus more on "micro reading" tasks (Nakashole and Mitchell, 2014), where facts must be inferred from single documents or sentences, rather than by exploiting

Actually behave like normal, subsective adjectives (e.g. red).

(observed vs. predicted)

She was the expected winner. She was the winner.

So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)
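As a compact restatement of the mismatch between theory and annotation, the table below pairs the expected label for each (adjective class, edit direction) with the majority label observed in the study; the structure is my paraphrase, and the percentages are the ones quoted in the text above.

# Expected vs. majority observed judgements (paraphrased from the study above).
BEHAVIOR = {
    ("subsective", "deletion"):            ("ENTAILMENT",    "ENTAILMENT (96%)"),
    ("subsective", "insertion"):           ("UNKNOWN",       "UNKNOWN (73%)"),
    ("plain non-subsective", "deletion"):  ("UNKNOWN",       "ENTAILMENT (80%)"),
    ("plain non-subsective", "insertion"): ("UNKNOWN",       "mostly UNKNOWN (not quantified above)"),
    ("privative", "insertion"):            ("CONTRADICTION", "often CONTRADICTION"),
    ("privative", "deletion"):             ("CONTRADICTION", "surprisingly often ENTAILMENT"),
}
for (adj_class, direction), (expected, observed) in sorted(BEHAVIOR.items()):
    print(f"{adj_class:21s} {direction:9s} expected: {expected:13s} observed: {observed}")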

Page 97: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

Actually behave like normal, subsective adjectives (e.g. red).

To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.


She was the expected winner. She was the winner.

So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 98: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

She was carrying a fake gun. She was carrying a gun.


So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 99: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

She was carrying a fake gun. She was carrying a gun.

Don’t behave symmetrically.

ANN

Privative (e.g. fake)

Plain Non-subsective (e.g. alleged)

ANN AN

Subsective (e.g. red)

Figure 1: Three main classes of adjectives. If their entailment behavior is consistent with their theoreticaldefinitions, we would expect our annotations (Section 3) to produce the insertion (blue) and deletion(red) patterns shown by the bar graphs. Bars (left to right) represent CONTRADICTION, UNKNOWN, andENTAILMENT

While these generalizations are intuitive, thereis little experimental evidence to support them.In this paper, we collect human judgements ofthe validity of inferences following from the in-sertion and deletion of various classes of adjec-tives and analyze the results. Our findings suggestthat, in practice, most sentences involving non-subsective ANs can be safely generalized to state-ments about the N. That is, non-subsective adjec-tives often behave like normal, subsective adjec-tives. On further analysis, we reveal that, whenadjectives do behave non-subsectively, they oftenexhibit asymmetric entailment behavior in whichinsertion leads to contradictions (ID ) ¬ fake ID)but deletion leads to entailments (fake ID ) ID).We present anecdotal evidence for how the en-tailment associated with inserting/deleting a non-subsective adjective depends on the salient prop-erties of the noun phrase under discussion, ratherthan on the adjective itself.

2 Background and Related Work

Classes of Adjectives. Adjectives are com-monly classified taxonomically as either subsec-tive or non-subsective (Kamp and Partee, 1995).Subsective adjectives are adjectives which pickout a subset of the set denoted by the unmodifiednoun; that is, AN ⇢ N1. For non-subsective adjec-tives, in contrast, the AN cannot be guaranteed tobe a subset of N. For example, clever is subsective,and so a clever thief is always a thief. However,

1We use the notation N and AN to refer both the the nat-ural language expression itself (e.g. red car) as well as itsdenotation, e.g. {x|x is a red car}.

alleged is non-subsective, so there are many possible worlds in which an alleged thief is not in fact a thief. Of course, there may also be many possible worlds in which the alleged thief is a thief, but the word alleged, being non-subsective, does not guarantee this to hold.

Non-subsective adjectives can be further divided into two classes: privative and plain. Sets denoted by privative ANs are completely disjoint from the set denoted by the head N (AN ∩ N = ∅), and this mutual exclusivity is encoded in the meaning of the A itself. For example, fake is considered to be a quintessential privative adjective since, given the usual definition of fake, a fake ID cannot actually be an ID. For plain non-subsective adjectives, there may be worlds in which the AN is an N, and worlds in which the AN is not an N: neither inference is guaranteed by the meaning of the A. As mentioned above, alleged is quintessentially plain non-subsective since, for example, an alleged thief may or may not be an actual thief. In short, we can summarize the classes of adjectives in the following way: subsective adjectives entail the nouns they modify, privative adjectives contradict the nouns they modify, and plain non-subsective adjectives are compatible with (but do not entail) the nouns they modify. Figure 1 depicts these distinctions.

While the hierarchical classification of adjectives described above is widely accepted and often applied in NLP tasks (Amoia and Gardent, 2006; Amoia and Gardent, 2007; Boleda et al., 2012; McCrae et al., 2014), it is not undisputed. Some linguists take the position that in fact privative adjectives…

(a) Privative (b) Plain Non-Sub. (c) Subsective

Figure 2: Observed entailment judgements for insertion (blue) and deletion (red) of adjectives. Compare to expected distributions in Figure 1.

We expect the plain non-subsective adjectives to receive labels of UNKNOWN in both directions. We expect the subsective adjectives to receive labels of ENTAILMENT in the deletion direction (red car ⇒ car) and labels of UNKNOWN in the insertion direction (car ⇏ red car). Figure 1 depicts these expected distributions.
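These expected patterns reduce to a simple lookup from adjective class and edit direction to a gold label. A minimal sketch of that mapping (the Python names are my own; only the three-way label scheme comes from the paper):

    # Label predicted by the classical taxonomy when an adjective is inserted
    # into or deleted from a noun phrase. The observed annotations (Figure 2)
    # frequently diverge from this table, which is the point of the paper.
    EXPECTED_LABEL = {
        ("subsective", "deletion"): "ENTAILMENT",          # red car => car
        ("subsective", "insertion"): "UNKNOWN",            # car ?=> red car
        ("privative", "deletion"): "CONTRADICTION",        # fake ID => not an ID (in theory)
        ("privative", "insertion"): "CONTRADICTION",       # ID => not a fake ID
        ("plain non-subsective", "deletion"): "UNKNOWN",   # alleged thief ?=> thief
        ("plain non-subsective", "insertion"): "UNKNOWN",  # thief ?=> alleged thief
    }

    def expected_label(adjective_class: str, direction: str) -> str:
        """Return the entailment label the theoretical definitions predict."""
        return EXPECTED_LABEL[(adjective_class, direction)]

    assert expected_label("subsective", "deletion") == "ENTAILMENT"
    assert expected_label("privative", "insertion") == "CONTRADICTION"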

Observations. The observed entailment patterns for insertion and deletion of non-subsective adjectives are shown in Figure 2. Our control sample of subsective adjectives (Figure 2c) largely produced the expected results, with 96% of deletions producing ENTAILMENTs and 73% of insertions producing UNKNOWNs.³ The entailment patterns produced by the non-subsective adjectives, however, did not match our predictions. The plain non-subsective adjectives (e.g. alleged) behave nearly identically to how we expect regular, subsective adjectives to behave (Figure 2b). That is, in 80% of cases, deleting the plain non-subsective adjective was judged to produce ENTAILMENT, rather than the expected UNKNOWN. The examples in Table 2 shed some light onto why this is the case. Often, the differences between N and AN are not relevant to the main point of the utterance. For example, while an expected surge in unemployment is not a surge in unemployment, a policy that deals with an expected surge deals with a surge.

The privative adjectives (e.g. fake) also fail to match the predicted distribution. While insertions often produce the expected CONTRADICTIONs, deletions produce a surprising number of ENTAILMENTs (Figure 2a). Such a pattern does not fit into any of the adjective classes from Figure 1. While some ANs (e.g. counterfeit money) behave in the prototypically privative way, others

³A full discussion of the 27% of insertions that deviated from the expected behavior is given in Pavlick and Callison-Burch (2016).

(1) Swiss officials on Friday said they've launched an investigation into Urs Tinner's alleged role.

(2) To deal with an expected surge in unemployment, the plan includes a huge temporary jobs program.

(3) They kept it close for a half and had a theoretical chance come the third quarter.

Table 2: Contrary to expectations, the deletion of plain non-subsective adjectives often preserves the (plausible) truth in a model. E.g. alleged role ⇏ role, but investigation into alleged role ⇒ investigation into role.

(e.g. mythical beast) have the property in which N ⇒ ¬AN, but AN ⇒ N (Figure 3). Table 3 provides some telling examples of how this AN ⇒ N inference, in the case of privative adjectives, often depends less on the adjective itself, and more on properties of the modified noun that are at issue in the given context. For example, in Table 3 Example 2(a), a mock debate probably contains enough of the relevant properties (namely, arguments) that it can entail debate, while in Example 2(b), a mock execution lacks the single most important property (the death of the executee) and so cannot entail execution. (Note that, from Example 3(b), it appears the jury is still out on whether leaps in artificial intelligence entail leaps in intelligence...)

5 Discussion

The results presented suggest a few important patterns for NLP systems. First, that while a non-subsective AN might not be an instance of the N (taxonomically speaking), statements that are true of an AN are often true of the N as well. This is relevant for IE and QA systems, and is likely to become more important as NLP systems focus more on "micro reading" tasks (Nakashole and Mitchell, 2014), where facts must be inferred from single documents or sentences, rather than by exploiting…


So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 100: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

She was carrying a fake gun. She was carrying a gun.

Don’t behave symmetrically.

The 27-year-old Gazan seeks an id to get through security checkpoints and find work in Cairo.

Does he seek a fake id?


So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 101: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

She was carrying a fake gun. She was carrying a gun.

Don’t behave symmetrically.

The 27-year-old Gazan seeks an id to get through security checkpoints and find work in Cairo.

Does he seek a fake id? ✘


So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 102: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

When should N NOT entail AN?

She was carrying a fake gun. She was carrying a gun.

Don’t behave symmetrically.

The 27-year-old Gazan seeks a fake id to get through security checkpoints and find work in Cairo.

Does he seek an id? ✔


So-Called Non-Subsective Adjectives. (Pavlick and Callison-Burch *SEM 2016, Best Paper!)

Page 103: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bush travels Monday to Michigan to make remarks on the economy.

Does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 104: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bush travels Monday to Michigan to make remarks on the economy.

Does N entail AN?

Bush travels Monday to Michigan to make remarks on the American economy. ✔

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 105: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bush travels Monday to Michigan to make remarks on the economy.

Does N entail AN?

Bush travels Monday to Michigan to make remarks on the American economy.

Bush travels Monday to Michigan to make remarks on the Japanese economy. ✘

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 106: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bush travels Monday to Michigan to make remarks on the economy.

Does N entail AN?

Bush travels Monday to Michigan to make remarks on the American economy.

Bush travels Monday to Michigan to make remarks on the Japanese economy. ✘

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

5,378 naturally-occurring sentences

4,868 train / 510 test

Page 107: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bush travels Monday to Michigan to make remarks on the economy.

Does N entail AN?

Bush travels Monday to Michigan to make remarks on the American economy.

Bush travels Monday to Michigan to make remarks on the Japanese economy. ✘

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

2,990 unique ANs

Disjoint ANs in train/test

5,378 naturally-occurring sentences

4,868 train / 510 test
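Keeping ANs disjoint between train and test means splitting over AN types rather than over individual sentences. A minimal sketch of such a split, assuming each example carries its adjective-noun string in a hypothetical "an" field:

    import random
    from collections import defaultdict

    def split_by_an(examples, test_fraction=0.1, seed=0):
        """Assign whole AN groups to train or test so that no adjective-noun
        pair ever appears in both splits."""
        groups = defaultdict(list)
        for ex in examples:
            groups[ex["an"]].append(ex)  # "an" is a hypothetical field name
        an_types = sorted(groups)
        random.Random(seed).shuffle(an_types)
        n_test = int(len(an_types) * test_fraction)
        test_set = [ex for an in an_types[:n_test] for ex in groups[an]]
        train_set = [ex for an in an_types[n_test:] for ex in groups[an]]
        return train_set, test_set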

Page 108: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart axes: Accuracy (20–100) by system: Human, Most Freq. Class, MFC by Adj., BOW, BOV, MaxEnt+LR, LSTM, Trans.]

Does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 109: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: Human accuracy = 93.3]

Does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 110: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds Most Freq. Class = 85.3 (Human = 93.3)]

Does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 111: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds MFC by Adj. = 92.2]

Does N entail AN?

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 112: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds BOW = 86 and BOV = 86.6]

Does N entail AN?

Pretty strong baselines

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)
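The BOW and BOV bars above are bag-of-words / bag-of-vectors baselines. A minimal sketch of a bag-of-words entailment classifier built with scikit-learn, purely to illustrate why such a baseline is cheap to set up; it is not the paper's exact feature set:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_bow_baseline(pairs, labels):
        """pairs: list of (premise, hypothesis) strings; labels: entailment labels.
        Concatenate each pair into one string and fit a linear classifier over
        unigram and bigram counts."""
        texts = [premise + " ||| " + hypothesis for premise, hypothesis in pairs]
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(texts, labels)
        return model

    # Usage (with hypothetical train_pairs / train_labels / test_pairs lists):
    # model = train_bow_baseline(train_pairs, train_labels)
    # predictions = model.predict([p + " ||| " + h for p, h in test_pairs])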

Page 113: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds MaxEnt+LR = 85.3]

Does N entail AN?

Supervised Model using involved NLP pipeline (Magnini et al. 2014)

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 114: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds LSTM = 86.6]

Does N entail AN?

Several DNN Models (Bowman et al. 2015)

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 115: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

[Bar chart: adds Trans. = 86.8]

Does N entail AN?

Several DNN Models, with transfer learning (Bowman et al. 2015)

Compositional Entailment in Adjective Nouns. (Pavlick and Callison-Burch ACL 2016)

Page 116: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Summary

• Humans are flexible with their language; they don't abide by hard-and-fast logical rules of composition and entailment.

• We can acquire lexical entailments at scale…

• …but lexical entailments are not enough. We need composition in order to model unseen phrases and full sentences.

• Inference involving composition is too complex to capture using simple heuristics, and requires models to incorporate context and common sense when performing reasoning.


Page 121: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work

Page 122: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

predicates

SUB(consider, consider)

Page 123: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates

Input                      Output
in France ⊏ in Europe      Forward
in Europe ⊐ in France      Reverse
in France | in Germany     Alternation
in France # in the city    Independent
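Read over denotation sets, the four outputs in the table have straightforward set-theoretic definitions. A toy sketch over small finite sets (real denotations are of course not finite lists; the sets below are stand-ins):

    def basic_relation(x: set, y: set) -> str:
        """Classify the relation between two denotation sets using the four
        categories from the table above (equality is ignored in this toy version)."""
        if x < y:
            return "Forward"       # x ⊏ y, e.g. in France ⊏ in Europe
        if x > y:
            return "Reverse"       # x ⊐ y, e.g. in Europe ⊐ in France
        if not (x & y):
            return "Alternation"   # disjoint, e.g. in France | in Germany
        return "Independent"       # overlap, no containment, e.g. in France # in the city

    france = {"paris", "lyon"}
    europe = {"paris", "lyon", "berlin"}
    germany = {"berlin"}
    assert basic_relation(france, europe) == "Forward"
    assert basic_relation(france, germany) == "Alternation"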

Page 124: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


born

Page 125: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


born(in France) ⊏ born(in Europe)

Page 126: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


born(in Europe) ⊐ born(in France)

Page 127: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


born(in France) | born(in Germany)

Page 128: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


born(in France) # born(in the city)

Page 129: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


has traveled(in France) # has traveled(in Germany)

Page 130: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates


is banned(in France) ⊐ is banned(in Europe)

Page 131: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Current Work-in-progress.

Projection through Predicates

Identity          Non-Exclusive, Identity    Non-Exclusive, Negation
is a              has a                      lacks a
lives in          considers                  avoids
was born in       has visited                banned
is married to     likes                      dislikes
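One way to read the three columns above is as projection functions over the basic relations, in the spirit of natural-logic projectivity (MacCartney 2009, cited in the bibliography). The sketch below encodes my reading of the slides' examples (born preserves relations, has traveled turns Alternation into Independent, is banned flips the direction of containment); it is an illustration of the in-progress idea, not a published specification:

    # Sentence-level relation produced by each predicate class, given the
    # relation between the two argument phrases.
    PROJECTION = {
        "identity": {                    # e.g. was born in, is married to
            "forward": "forward", "reverse": "reverse",
            "alternation": "alternation", "independent": "independent",
        },
        "non-exclusive, identity": {     # e.g. has visited, likes
            "forward": "forward", "reverse": "reverse",
            "alternation": "independent", "independent": "independent",
        },
        "non-exclusive, negation": {     # e.g. is banned in, avoids
            "forward": "reverse", "reverse": "forward",
            "alternation": "independent", "independent": "independent",
        },
    }

    def project(predicate_class: str, argument_relation: str) -> str:
        return PROJECTION[predicate_class][argument_relation]

    # born(in France) ⊏ born(in Europe)
    assert project("identity", "forward") == "forward"
    # has traveled(in France) # has traveled(in Germany)
    assert project("non-exclusive, identity", "alternation") == "independent"
    # is banned(in France) ⊐ is banned(in Europe)
    assert project("non-exclusive, negation", "forward") == "reverse"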

Page 132: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Last December they had argued that the council had failed to consider possible effects of contaminated land at the site.

The council considered environmental consequences.

“higher order” predicates

DEL(fail to)
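The annotations SUB(consider, consider) and DEL(fail to) treat the sentence pair as a sequence of atomic edits, each carrying its own entailment relation; the pair-level relation is then the composition of the per-edit relations, much as in MacCartney's natural logic (see the bibliography). A very rough sketch under that assumption, with a deliberately conservative toy composition table rather than the full seven-relation calculus:

    def compose(r1: str, r2: str) -> str:
        """Toy composition: equivalence is the identity, forward/reverse compose
        with themselves, and anything else falls back to 'unknown'."""
        if r1 == "equivalence":
            return r2
        if r2 == "equivalence":
            return r1
        if r1 == r2 and r1 in ("forward", "reverse"):
            return r1
        return "unknown"

    def sentence_relation(edit_relations):
        """Fold the per-edit relations into a single pair-level relation."""
        result = "equivalence"
        for relation in edit_relations:
            result = compose(result, relation)
        return result

    # SUB(consider, consider) contributes equivalence; DEL(fail to) removes a
    # negative implicative, so (illustratively) it contributes negation.
    print(sentence_relation(["equivalence", "negation"]))  # -> negation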

Page 133: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

Implicative Verbs. (Karttunen Language 1971)

Page 134: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug.

Implicative Verbs. (Karttunen Language 1971)

Page 135: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug.

She wanted to fix the bug.

Implicative Verbs. (Karttunen Language 1971)

Page 136: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug.

She wanted to fix the bug.

Did she fix the bug?

Implicative Verbs. (Karttunen Language 1971)

Page 137: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug.

She wanted to fix the bug.

Did she fix the bug?

Implicative Verbs. (Karttunen Language 1971)

Page 138: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug last night.

She managed to fix the bug tomorrow.

Implicative Verbs. (Karttunen Language 1971)

Page 139: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug last night.

She managed to fix the bug tomorrow.

Implicative Verbs. (Karttunen Language 1971)

Page 140: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She wanted to fix the bug last night.

She wanted to fix the bug tomorrow.

Implicative Verbs. (Karttunen Language 1971)
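The contrast above ("managed to fix the bug tomorrow" is odd, "wanted to fix the bug tomorrow" is fine) is the intuition behind the tense-agreement score of Pavlick and Callison-Burch (EMNLP 2016). A toy sketch of the scoring step, assuming tuples of (matrix verb, matrix tense, complement time reference) have already been extracted from parsed text; the extraction step and the paper's exact scoring details are not shown on the slides:

    from collections import defaultdict

    def tense_agreement(instances):
        """instances: iterable of (verb, matrix_tense, complement_time) tuples,
        e.g. ("manage to", "past", "past") or ("want to", "past", "future").
        Returns the fraction of instances, per verb, whose complement time
        reference matches the matrix tense. Implicatives such as 'manage to'
        should score high; non-implicatives such as 'want to' should score low."""
        agree, total = defaultdict(int), defaultdict(int)
        for verb, matrix_tense, complement_time in instances:
            total[verb] += 1
            if matrix_tense == complement_time:
                agree[verb] += 1
        return {verb: agree[verb] / total[verb] for verb in total}

    print(tense_agreement([
        ("manage to", "past", "past"),
        ("manage to", "past", "past"),
        ("want to", "past", "future"),
        ("want to", "past", "past"),
    ]))  # -> {'manage to': 1.0, 'want to': 0.5}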

Page 141: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

venture to

forget to

manage to

bother to

happen to

get to

decide to

dare to

try to

agree to

promise to

want to

intend to

plan to

hope to

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Page 142: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

venture to, forget to, manage to, bother to, happen to, get to, decide to, dare to, try to,
agree to, promise to, want to, intend to, plan to, hope to

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Page 143: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

venture to, forget to, manage to, bother to, happen to, get to, decide to, dare to, try to,
agree to, promise to, want to, intend to, plan to, hope to

Gallager chose to accept a full scholarship to play football for Temple University.

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Page 144: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

venture to, forget to, manage to, bother to, happen to, get to, decide to, dare to, try to,
agree to, promise to, want to, intend to, plan to, hope to

Gallager chose to accept a full scholarship to play football for Temple University.

Wilkins was allowed to leave in 1987 to join French outfit Paris Saint-Germain.

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Page 145: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs

She managed to fix the bug.

She fixed the bug.

Contradiction Entailment Can’t Tell

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Page 146: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs


(0.14) UFJ wants to merge with Mitsubishi, a combination that'd surpass Citigroup as the world's biggest bank. ⇏ The merger of Japanese Banks creates the world's biggest bank.

(0.55) After graduating, Gallager chose to accept a full scholarship to play football for Temple University. ⇒ Gallager attended Temple University.

(0.68) Wilkins was allowed to leave in 1987 to join French outfit Paris Saint-Germain. ⇒ Wilkins departed Milan in 1987.

Table 3: Examples from the RTE3 dataset (Giampiccolo et al., 2007) which require recognizing implicative behavior, even in verbs that are not implicative by definition. The tendency of certain verbs (e.g. be allowed) to behave as de facto implicatives is captured surprisingly well by the tense agreement score (shown in parentheses).

When a verb with low tense agreement appeared as the main verb of a sentence, the truth of the complement could only be inferred 30% of the time. When a verb with high tense agreement appeared as the main verb, the truth of the complement could be inferred 91% of the time. This difference is significant at p < 0.01. That is, tense agreement provides a strong signal for identifying non-implicative verbs, and thus can help systems avoid false-positive entailment judgements, e.g. incorrectly inferring that wanting to merge ⇒ merging.

Figure 1: Whether or not the complement is entailed, for main verbs with varying levels of tense agreement. Verbs with high tense agreement yield more definitive judgments (true/false). Each bar represents aggregated judgements over approx. 20 verbs.

Interestingly, tense agreement accurately models verbs that are not implicative by definition, but which nonetheless tend to behave implicatively in practice. For example, our method finds high tense agreement for choose to and be allowed to, which are often used to communicate, albeit indirectly, that their complements did in fact happen. To convince ourselves that treating such verbs as implicatives makes sense in practice, we manually look through the RTE3 dataset (Giampiccolo et al., 2007) for examples containing high-scoring verbs according to our method. Table 3 shows some example inferences that hinge precisely on recognizing these types of de facto implicatives.

5 Discussion and Related Work

Language understanding tasks such as RTE (Clark et al., 2007; MacCartney, 2009) and bias detection (Recasens et al., 2013) have been shown to require knowledge of implicative verbs, but such knowledge has previously come from manually-built word lists rather than from data. Nairn et al. (2006) and Martin et al. (2009) describe automatic systems to handle implicatives, but require hand-crafted rules for each unique verb that is handled. The tense agreement method we present offers a starting point for acquiring such rules from data, and is well-suited for incorporating into statistical systems. The clear next step is to explore similar data-driven means for learning the specific behaviors of individual implicative verbs, which has been well-studied from a theoretical perspective (Karttunen, 1971; Nairn et al., 2006; Amaral et al., 2012; Karttunen, 2012). Another interesting extension concerns the role of tense in word representations. While currently, tense is rarely built directly into distributional representations of words (Mikolov et al., 2013; Pennington et al., 2014), our results suggest it may offer important insights into the semantics of individual words. We leave this question as a direction for future work.

6 Conclusion

Differentiating between implicative and non-implicative verbs is important for discriminating inferences that can and cannot be made in natural language. We have presented a data-driven method that captures the implicative tendencies of verbs by exploiting the tense relationship between the verb and its complement clauses. This method effectively separates known implicatives from known non-implicatives, and, more importantly, provides good predictive signal in an entailment recognition task.

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Contradiction Entailment Can’t Tell

Page 147: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Implicative Verbs



High tense agreement strongly predicts whether the truth of the complement can be inferred.

Tense Manages to Predict Implicatives. (Pavlick and Callison-Burch EMNLP 2016)

Contradiction Entailment Can’t Tell

Page 148: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Discussion

• Humans are flexible with their language. Computers need to be flexible too.

• What aspects of meaning do we expect our semantic representations to have built-in? What do we expect to have to deal with in context, at runtime?

• What types of semantic tasks do we need to optimize for explicitly? Shouldn't some things come "for free" when we train for harder tasks?


Page 152: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Bibliography

Bannard and Callison-Burch. Paraphrasing with bilingual parallel corpora. ACL 2005.

Bos. Wide-coverage semantic analysis with boxer. STEP 2008.

Bowman et al. A large annotated corpus for learning natural language inference. EMNLP 2015.

Ganitkevitch et al. PPDB: The paraphrase database. NAACL 2013.

Hearst. Automatic acquisition of hyponyms from large text corpora. COLING 1992.

Karttunen. Implicative Verbs. Language 1971.

Lin and Pantel. Discovery of Inference Rules from Text. SIGKDD 2001.

MacCartney. Natural Language Inference. PhD Thesis 2009.

Padó et al. Design and realization of a modular architecture for textual entailment. NLE 2014.

Pavlick and Callison-Burch. Most babies are little and most problems are huge: Compositional Entailment in Adjective-Nouns. ACL 2016.

Pavlick and Callison-Burch. So-Called Non-Subsective Adjectives. *SEM 2016.

Pavlick and Callison-Burch. Tense Manages to Predict Implicative Behavior in Verbs. EMNLP 2016.

Pavlick et al. Adding Semantics to Data-Driven Paraphrasing. ACL 2015.

Pavlick et al. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. ACL 2015.

Pavlick et al. Domain-Specific Paraphrase Extraction. ACL 2015.

Rocktäschel et al. Reasoning about Entailment with Neural Attention. ICLR 2016.

Snow et al. Semantic taxonomy induction from heterogenous evidence. ACL 2006.

Page 153: Natural Language Inference in the Real World · Natural Language Inference in the Real World Ellie Pavlick University of Pennsylvania

Thank you!