
RTE @ Stanford

Rajat Raina, Aria Haghighi, Christopher Cox, Jenny Finkel, Jeff Michels, Kristina Toutanova,

Bill MacCartney, Marie-Catherine de Marneffe, Christopher D. Manning and Andrew Y. Ng

PASCAL Challenges Workshop, April 12, 2005

Our approach

• Represent using syntactic dependencies
  • But also use semantic annotations.
  • Try to handle language variability.

• Perform semantic inference over this representation
  • Use linguistic knowledge sources.
  • Compute a “cost” for inferring hypothesis from text.

Low cost ⇒ hypothesis is entailed.

Outline of this talk

• Representation of sentences
  • Syntax: Parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis

Sentence processing

• Parse with a standard PCFG parser. [Klein & Manning, 2003]
  • Al Qaeda: [Aa]l[ -]Qa’?[ie]da
  • Train on some extra sentences from recent news.

• Used a high-performing Named Entity Recognizer (next slide)
  • Force parse tree to be consistent with certain NE tags.

Example: American Ministry of Foreign Affairs announced that Russia called the United States...

(S
  (NP (NNP American_Ministry_of_Foreign_Affairs))
  (VP (VBD announced)
    (…)))
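The regular expression above appears to normalize spelling variants of an entity name before parsing. A minimal illustrative sketch of that idea (the canonical form and the helper function are assumptions, not part of the original system):

```python
import re

# Pattern from the slide: matches spelling variants such as
# "Al Qaeda", "al-Qaida", "Al Qa'ida" (optional apostrophe, space or hyphen).
AL_QAEDA = re.compile(r"[Aa]l[ -]Qa'?[ie]da")

def normalize_entities(text: str) -> str:
    """Map known spelling variants to one canonical token so that the text
    and hypothesis use identical strings (illustrative helper)."""
    return AL_QAEDA.sub("Al_Qaeda", text)

print(normalize_entities("Reports linked al-Qaida to the plot."))
# -> Reports linked Al_Qaeda to the plot.
```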

Named Entity Recognizer

• Trained a robust conditional random field model. [Finkel et al., 2003]

• Interpretation of numeric quantity statements
  Example:
  T: Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries.
  H: Kessler's team interviewed more than 60,000 adults in 14 countries. TRUE

• Annotate numerical values implied by:
  • “6.2 bn”, “more than 60000”, “around 10”, …
  • MONEY/DATE named entities
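A rough sketch of how such numeric annotations could be computed; the rules below (the “bn”/“million” multipliers, comma handling, and the “more than”/“around” modifiers) are illustrative assumptions rather than the system's actual grammar:

```python
import re

MULTIPLIERS = {"bn": 1e9, "billion": 1e9, "million": 1e6, "thousand": 1e3}

def numeric_value(phrase):
    """Return (comparison, value) for phrases like '6.2 bn',
    'more than 60,000', 'around 10'; None if no number is found."""
    comparison = "="
    if re.search(r"\bmore than\b|\bover\b", phrase, re.I):
        comparison = ">"
    elif re.search(r"\baround\b|\babout\b", phrase, re.I):
        comparison = "~"
    match = re.search(r"([\d,]*\.?\d+)\s*([a-zA-Z]+)?", phrase)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return comparison, value * MULTIPLIERS.get(unit, 1)

print(numeric_value("60,643 face-to-face interviews"))  # ('=', 60643.0)
print(numeric_value("more than 60,000 adults"))         # ('>', 60000.0)
print(numeric_value("6.2 bn"))                          # ('=', 6200000000.0)
```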

Parse tree post-processing

• Recognize collocations using WordNet

Example: Shrek 2 rang up $92 million.

(S
  (NP (NNP Shrek) (CD 2))
  (VP (VBD rang_up)
    (NP (QP ($ $) (CD 92) (CD million))))
  (. .))

MONEY annotation on the QP: 92000000

Parse tree → Dependencies

• Find syntactic dependencies
  • Transform parse tree representations into typed syntactic dependencies, including a certain amount of collapsing and normalization.

Example: Bill’s mother walked to the grocery store.
  subj(walked, mother)
  poss(mother, Bill)
  to(walked, store)
  nn(store, grocery)

• Dependencies can also be written as a logical formula:
  mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)
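A simplified illustration of this conversion, using the dependency triples from the example above. The variable-assignment scheme here is an assumption for illustration; the system's own formula additionally gives the verb an event argument together with its subject and destination, as shown in the slide.

```python
from itertools import count

def deps_to_formula(words, deps):
    """Turn typed dependencies into a conjunction of predicates:
    one unary predicate per word, one binary predicate per relation."""
    fresh = count()
    var = {w: chr(ord("A") + next(fresh)) for w in words}   # A, B, C, ...
    conjuncts = [f"{w}({var[w]})" for w in words]
    conjuncts += [f"{rel}({var[head]}, {var[dep]})" for rel, head, dep in deps]
    return " ∧ ".join(conjuncts)

words = ["mother", "Bill", "grocery", "store", "walked"]
deps = [("subj", "walked", "mother"), ("poss", "mother", "Bill"),
        ("to", "walked", "store"), ("nn", "store", "grocery")]
print(deps_to_formula(words, deps))
# mother(A) ∧ Bill(B) ∧ grocery(C) ∧ store(D) ∧ walked(E) ∧ subj(E, A) ∧ ...
```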

Basic representations

Representations

Dependency graph:
  walked —subj→ mother —poss→ Bill
  walked —to→ store —nn→ grocery

Logical formula:
  mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)

• Can make representation richer:
  • “walked” is a verb
  • “Bill” is a PERSON (named entity).
  • “store” is the location/destination of “walked”.
  • …

[Figure: the dependency graph annotated with PERSON (“Bill”), VBD (“walked”), and ARGM-LOC (“store”) labels.]

Annotations

• Parts-of-speech, named entities
  • Already computed.

• Semantic roles
  Example:
  T: C and D Technologies announced that it has closed the acquisition of Datel, Inc.
  H1: C and D Technologies acquired Datel Inc. TRUE
  H2: Datel acquired C and D Technologies. FALSE

• Use a state-of-the-art semantic role classifier to label verb arguments. [Toutanova et al. 2005]

More annotations

• Coreference
  Example:
  T: Since its formation in 1948, Israel …
  H: Israel was established in 1948. TRUE

• Use a conditional random field model for coreference detection.

• Note: Appositive “references” were previously detected.
  T: Bush, the President of USA, went to Florida.
  H: Bush is the President of USA. TRUE
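A minimal sketch of turning an appositive into an explicit copular fact from typed dependencies. The “appos” relation name follows Stanford-dependency conventions; the helper and the simplified dependency list are illustrative assumptions:

```python
def appositive_facts(deps):
    """Turn each appositive dependency into an explicit copular statement,
    e.g. appos(Bush, President) -> 'Bush is President'."""
    return [f"{head} is {appositive}"
            for rel, head, appositive in deps if rel == "appos"]

# Simplified dependencies for "Bush, the President of USA, went to Florida."
deps = [("subj", "went", "Bush"), ("appos", "Bush", "President"),
        ("of", "President", "USA"), ("to", "went", "Florida")]
print(appositive_facts(deps))   # ['Bush is President']
```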

• Other annotations
  • Word stems (very useful)
  • Word senses (no performance gain in our system)

Event nouns

• Use a heuristic to find event nouns.
• Augment text representation using WordNet derivational links.

Example:
T: … witnessed the murder of police commander ...
H: Police officer killed. TRUE

Text logical formula:
  murder(M) ∧ police_commander(P) ∧ of(M, P)

Augment with:
  murder(E, M, P)
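A small sketch of the WordNet derivational-link lookup using NLTK's WordNet interface; the choice of NLTK and the helper name are assumptions, since the slide only says that WordNet derivational links are used:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def derived_verbs(noun):
    """Verb lemmas derivationally related to an event noun,
    e.g. the noun 'murder' links to the verb 'murder'."""
    verbs = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == "v":
                    verbs.add(related.name())
    return sorted(verbs)

print(derived_verbs("murder"))   # includes the verb 'murder'
```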


Outline of this talk

• Representation of sentences
  • Syntax: Parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis

Graph Matching Approach

• Why graph matching?
  • Dependency tree has a natural graphical interpretation.
  • Successful in other domains: e.g., lossy image matching.

• Input: Hypothesis (H) and Text (T) graphs
  Toy example (below):
  • Vertices are words and phrases.
  • Edges are labeled dependencies.

• Output: Cost of matching H to T (next slide)

T: John bought a BMW. H: John purchased a car. TRUE

[Figure: dependency graphs for the toy example.
 T: bought —ARG0(Agent)→ John (PERSON), bought —ARG1(Theme)→ BMW
 H: purchased —ARG0(Agent)→ John (PERSON), purchased —ARG1(Theme)→ car]

PERSON

Graph Matching: Idea

• Idea: Align H to T so that vertices are similar and preserve relations (as in machine translation)

• A matching M is a mapping from vertices of H to vertices of T.
  Thus, for each vertex v in H, M(v) is a vertex in T.

[Figure: the matching maps each vertex of H (John, purchased, car) to a vertex of T (John, bought, BMW), preserving the ARG0(Agent) and ARG1(Theme) relations.]

Graph Matching: Costs

• The cost of a matching, MatchCost(M), measures the “quality” of a matching M
  • VertexCost(M) – compare vertices in H with matched vertices in T
  • RelationCost(M) – compare edges (relations) in H with corresponding edges (relations) in T

• MatchCost(M) = (1 − β) VertexCost(M) + β RelationCost(M)

Graph Matching: Costs

• VertexCost(M): for each vertex v in H, and vertex M(v) in T:
  • Do vertex heads share the same stem and/or POS?
  • Is the T vertex head a hypernym of the H vertex head?
  • Are vertex heads “similar” phrases? (next slide)

• RelationCost(M): for each edge (v, v′) in H, and edge (M(v), M(v′)) in T:
  • Are parent/child pairs in H parent/child in T?
  • Are parent/child pairs in H ancestor/descendant in T?
  • Do parent/child pairs in H share a common ancestor in T?
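A simplified sketch of the cost computation, under assumptions: per-vertex costs come from a small lookup table (exact / synonym / hypernym), the relation cost just counts H edges whose images are not edges of T, and β weights the two terms as in the formula above. The real system's feature set is richer than this.

```python
BETA = 0.45

def vertex_cost(h_word, t_word, pair_costs):
    """Cost of mapping an H vertex onto its matched T vertex; unseen pairs
    get the maximum cost (0.0 exact, 0.2 synonym, 0.4 hypernym, ...)."""
    return 0.0 if h_word == t_word else pair_costs.get((h_word, t_word), 1.0)

def relation_cost(h_edges, matching, t_edges):
    """Fraction of H edges whose images under the matching are not edges of T."""
    mapped = [(matching[u], matching[v]) for u, v in h_edges]
    return sum(edge not in t_edges for edge in mapped) / len(h_edges)

def match_cost(h_vertices, h_edges, matching, t_edges, pair_costs):
    v_cost = sum(vertex_cost(v, matching[v], pair_costs) for v in h_vertices) / len(h_vertices)
    r_cost = relation_cost(h_edges, matching, t_edges)
    return (1 - BETA) * v_cost + BETA * r_cost

# Toy example: H = "John purchased a car", T = "John bought a BMW".
matching = {"John": "John", "purchased": "bought", "car": "BMW"}
pair_costs = {("purchased", "bought"): 0.2, ("car", "BMW"): 0.4}
h_edges = [("purchased", "John"), ("purchased", "car")]
t_edges = [("bought", "John"), ("bought", "BMW")]
print(round(match_cost(["John", "purchased", "car"], h_edges, matching, t_edges, pair_costs), 2))
# 0.11, as in the worked example on the "Graph Matching: Example" slide
```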

Digression: Phrase similarity

• Measures based on WordNet (Resnik/Lesk).

• Distributional similarity
  • Example: “run” and “marathon” are related.
  • Latent Semantic Analysis to discover words that are distributionally similar (i.e., have common neighbors).

• Used a web-search based measure
  • Query google.com for all pages with:
    • “run”
    • “marathon”
    • both “run” and “marathon”

• Learning paraphrases. [Similar to DIRT: Lin and Pantel, 2001]

• “World knowledge” (labor intensive)
  • CEO = Chief_Executive_Officer
  • Philippines → Filipino
  • [Can add common facts: “Paris is the capital of France”, …]
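A sketch of a hit-count-based similarity in the spirit of the web-search measure above; the hit counts are passed in as plain numbers (no search API is shown) and the PMI-style formula is an illustrative assumption rather than the exact measure used:

```python
import math

def web_similarity(hits_a, hits_b, hits_both, total_pages):
    """PMI-style association from page counts: how much more often the two
    terms co-occur on a page than independence would predict."""
    if min(hits_a, hits_b, hits_both) == 0:
        return 0.0
    p_a, p_b = hits_a / total_pages, hits_b / total_pages
    p_both = hits_both / total_pages
    return math.log(p_both / (p_a * p_b))

# Hypothetical page counts for "run", "marathon", and both terms together.
print(web_similarity(hits_a=5e8, hits_b=2e7, hits_both=8e6, total_pages=1e10))
```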

Graph Matching: Costs

• VertexCost(M): for each vertex v in H, and vertex M(v) in T:
  • Do vertex heads share the same stem and/or POS?
  • Is the T vertex head a hypernym of the H vertex head?
  • Are vertex heads “similar” phrases? (previous slide)

• RelationCost(M): for each edge (v, v′) in H, and edge (M(v), M(v′)) in T:
  • Are parent/child pairs in H parent/child in T?
  • Are parent/child pairs in H ancestor/descendant in T?
  • Do parent/child pairs in H share a common ancestor in T?

Graph Matching: Example

VertexCost: (0.0 + 0.2 + 0.4) / 3 = 0.2
RelationCost: 0 (graphs isomorphic)
β = 0.45 (say)
MatchCost: 0.55 · 0.2 + 0.45 · 0.0 = 0.11

[Figure: the T graph (bought, John, BMW) matched to the H graph (purchased, John, car).
 Exact match (John–John) cost: 0.0; synonym match (purchased–bought) cost: 0.2; hypernym match (car–BMW) cost: 0.4.]

Outline of this talk

• Representation of sentences
  • Syntax: Parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis

Abductive inference

• Idea:
  • Represent text and hypothesis as logical formulae.
  • A hypothesis can be inferred from the text if and only if the hypothesis logical formula can be proved from the text logical formula.

• Toy example:

T: John bought a BMW. H: John purchased a car. TRUE

T: John(A) ∧ BMW(B) ∧ bought(E, A, B)
H: John(x) ∧ car(y) ∧ purchased(z, x, y)

Prove H from T? Allow assumptions at various “costs”:

BMW(t) + $2 => car(t)

bought(p, q, r) + $1 => purchased(p, q, r)

Abductive assumptions

• Assign costs to all assumptions of the form:

P(p1, p2, …, pm) => Q(q1, q2, …, qn)

• Build an assumption cost model

Predicate match cost:
  • Phrase similarity cost?
  • Same named entity type?
  • …

Argument match cost:
  • Same semantic role?
  • Constant similarity cost?
  • …

Abductive theorem proving

• Each assumption provides a potential proof step.

• Find the proof with the minimum total cost
  • Uniform-cost search
  • If there is a low-cost proof, the hypothesis is entailed.

• Example:
  T: John(A) ∧ BMW(B) ∧ bought(E, A, B)
  H: John(x) ∧ car(y) ∧ purchased(z, x, y)

Here is a possible proof by resolution refutation (for the earlier costs):
  $0   -John(x) -car(y) -purchased(z, x, y)   [Given: negation of hypothesis]
  $0   -car(y) -purchased(z, A, y)            [Unify with John(A)]
  $2   -purchased(z, A, B)                    [Unify with BMW(B)]
  $3   NULL                                   [Unify with purchased(E, A, B)]
  Proof cost = 3
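A heavily simplified sketch of the minimum-cost proof search. Unification and first-order arguments are omitted: each hypothesis literal is matched to a text literal either exactly (cost 0) or through a paid assumption, and uniform-cost search returns the cheapest complete proof. The literal and assumption names mirror the toy example; everything else is an assumption for illustration.

```python
import heapq

def min_proof_cost(text_literals, hyp_literals, assumptions):
    """Uniform-cost search for the cheapest way to discharge every hypothesis
    literal: an exact match is free, a paid assumption adds its cost."""
    frontier = [(0.0, 0)]            # (cost so far, number of literals proved)
    best = {}
    while frontier:
        cost, proved = heapq.heappop(frontier)
        if proved == len(hyp_literals):
            return cost              # cheapest complete proof found
        if best.get(proved, float("inf")) <= cost:
            continue
        best[proved] = cost
        goal = hyp_literals[proved]
        for fact in text_literals:
            if fact == goal:
                step = 0.0                           # exact match
            elif (fact, goal) in assumptions:
                step = assumptions[(fact, goal)]     # paid assumption, e.g. BMW => car
            else:
                continue
            heapq.heappush(frontier, (cost + step, proved + 1))
    return float("inf")              # no proof

text = ["John", "BMW", "bought"]
hypothesis = ["John", "car", "purchased"]
assumptions = {("BMW", "car"): 2.0, ("bought", "purchased"): 1.0}
print(min_proof_cost(text, hypothesis, assumptions))   # 3.0, as in the proof above
```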

Abductive theorem proving

• Can automatically learn good assumption costs
  • Start from a labeled dataset (e.g., the PASCAL development set)
  • Intuition: Find assumptions that are used in the proofs for TRUE examples, and lower their costs (by framing a log-linear model). Iterate. [Details: Raina et al., in submission]
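A very rough sketch of the intuition only, not the actual procedure from Raina et al.: fit a log-linear (logistic regression) model whose features count how often each assumption is used in an example's cheapest proof, then reuse the negated learned weights as updated costs, so assumptions that appear in proofs of TRUE examples become cheaper. sklearn and the feature encoding are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = development examples; columns = how many times each assumption was
# used in that example's cheapest proof (col 0: "BMW => car", col 1: "bought => purchased").
X = np.array([[1, 1],    # TRUE example whose proof used both assumptions
              [0, 1],    # TRUE example
              [2, 0],    # FALSE example that leaned on an implausible assumption
              [1, 0]])   # FALSE example
y = np.array([1, 1, 0, 0])          # 1 = TRUE (entailed), 0 = FALSE

model = LogisticRegression().fit(X, y)

# Assumptions that appear in proofs of TRUE examples get positive weights;
# negating the weights turns them into lower (cheaper) costs.
print(-model.coef_[0])
```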

Some interesting features

• Examples of handling “complex” constructions in graph matching/abductive inference.

• Antonyms/Negation: High cost for matching verbs if they are antonyms, or if one is negated and the other is not.

  • T: Stocks fell. H: Stocks rose. FALSE

  • T: Clinton’s book was not a hit. H: Clinton’s book was a hit. FALSE

• Non-factive verbs:

  • T: John was charged for doing X. H: John did X. FALSE
    Can detect this because “doing” in the text has the non-factive “charged” as a parent, but “did” does not have such a parent.
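A small sketch of the antonym part of this check using NLTK's WordNet interface; the helper and the idea of mapping the result to a high matching cost are assumptions about how one might implement it:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def are_antonym_verbs(verb_a, verb_b):
    """True if any WordNet verb sense of verb_a lists verb_b as an antonym."""
    for synset in wn.synsets(verb_a, pos=wn.VERB):
        for lemma in synset.lemmas():
            if any(ant.name() == verb_b for ant in lemma.antonyms()):
                return True
    return False

# Stems are compared in the pipeline, so look up the lemmas "fall"/"rise".
print(are_antonym_verbs("fall", "rise"))   # True -> assign a high matching cost
```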

Some interesting features

• “Superlative check”
  • T: This is the tallest tower in western Japan.
    H: This is the tallest tower in Japan. FALSE

Outline of this talk

• Representation of sentences
  • Syntax: Parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis

Results

• Combine the inference methods
  • Each system produces a score.
  • Separately normalize each system’s score variance. Suppose the normalized scores are s1 and s2.
  • Final score: S = w1·s1 + w2·s2

• Learn the classifier weights w1 and w2 on the development set using logistic regression. Two submissions:
  • Train one set of classifier weights for all RTE tasks. (General)
  • Train different classifier weights for each RTE task. (ByTask)
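A sketch of the combination step under assumptions: each system's scores are scaled to unit variance, and a logistic regression over the two normalized scores supplies the weights w1 and w2. sklearn and the made-up development scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def normalize(scores):
    """Scale one system's scores to unit variance."""
    return scores / scores.std()

# Made-up development-set scores from the two systems, with gold labels.
graph_scores = np.array([0.9, 0.3, 0.7, 0.2, 0.8, 0.1])    # higher = more entailed
proof_costs  = np.array([2.5, 6.0, 3.0, 7.5, 2.0, 8.0])    # lower  = more entailed
labels = np.array([1, 0, 1, 0, 1, 0])                      # 1 = TRUE, 0 = FALSE

s1, s2 = normalize(graph_scores), normalize(-proof_costs)  # negate costs so higher = better
clf = LogisticRegression().fit(np.column_stack([s1, s2]), labels)
w1, w2 = clf.coef_[0]

final_score = w1 * s1 + w2 * s2                  # S = w1*s1 + w2*s2, as on the slide
print(final_score)
print(clf.predict(np.column_stack([s1, s2])))    # thresholded entailment decisions
```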

Results

                          General              ByTask
                      Accuracy   CWS      Accuracy   CWS
DevSet 1                64.8%   0.778       65.5%   0.805
DevSet 2                52.1%   0.578       55.7%   0.661
DevSet 1 + DevSet 2     58.5%   0.679       60.8%   0.743
Test Set                56.3%   0.620       55.3%   0.686

• Balanced predictions. [55.4%, 51.2% predicted TRUE on the test set.]

• Best other results: Accuracy = 58.6%, CWS = 0.617

Results by task

              General              ByTask
Task      Accuracy   CWS      Accuracy   CWS
CD          79.3%   0.903       84.7%   0.926
IE          47.5%   0.493       55.0%   0.590
IR          56.7%   0.590       55.6%   0.604
MT          46.7%   0.480       47.5%   0.479
PP          58.0%   0.623       54.0%   0.535
QA          48.5%   0.478       43.9%   0.466
RC          52.9%   0.523       50.7%   0.480

Partial coverage results

[Figure: accuracy (50–100%) plotted against coverage for the General and ByTask systems.]

• Can also draw coverage–CWS curves. For example:
  • at 50% coverage, CWS = 0.781
  • at 25% coverage, CWS = 0.873

Task-specific optimization seems better!

Some interesting issues

• Phrase similarity
  • away from the coast → farther inland
  • won victory in presidential election → became President
  • stocks get a lift → stocks rise
  • life threatening → fatal

• Dictionary definitions
  • believe there is only one God → are monotheistic

• “World knowledge”
  • K Club, venue of the Ryder Cup, … → K Club will host the Ryder Cup

Future directions

• Need more NLP components in there:
  • Better treatment of frequent nominalizations, parenthesized material, etc.

• Need much more ability to do inference
  • Fine distinctions between meanings, and fine similarities, e.g., “reach a higher level” and “rise”. We need a high-recall, reasonable-precision similarity measure!
  • Other resources (e.g., antonyms) are also very sparse.

• More task-specific optimization.

Thanks!