RTE @ Stanford
Rajat Raina, Aria Haghighi, Christopher Cox, Jenny Finkel, Jeff Michels, Kristina Toutanova,
Bill MacCartney, Marie-Catherine de Marneffe, Christopher D. Manning and Andrew Y. Ng
PASCAL Challenges Workshop, April 12, 2005
Our approach
• Represent using syntactic dependencies
  • But also use semantic annotations.
  • Try to handle language variability.
• Perform semantic inference over this representation
  • Use linguistic knowledge sources.
  • Compute a “cost” for inferring hypothesis from text.
Low cost → hypothesis is entailed.
Outline of this talk
• Representation of sentences
  • Syntax: parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis
Sentence processing
• Parse with a standard PCFG parser. [Klein & Manning, 2003]
  • Train on some extra sentences from recent news.
• Used a high-performing named entity recognizer (next slide).
  • Force the parse tree to be consistent with certain NE tags.
  • Example normalization pattern: Al Qaeda → [Aa]l[ -]Qa'?[ie]da
Example: American Ministry of Foreign Affairs announced that Russia called the United States...
(S (NP (NNP American_Ministry_of_Foreign_Affairs))
   (VP (VBD announced)
       (…)))
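The spelling-variant pattern above can be exercised directly; a minimal check in Python (the variant list is illustrative):

```python
import re

# The slide's pattern for spelling variants of "Al Qaeda".
AL_QAEDA = re.compile(r"[Aa]l[ -]Qa'?[ie]da")

# Illustrative variants; all should match the pattern.
variants = ["Al Qaeda", "al-Qaida", "Al-Qa'ida", "al Qaeda"]
print([bool(AL_QAEDA.fullmatch(v)) for v in variants])  # [True, True, True, True]
```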
Named Entity Recognizer
• Trained a robust conditional random field model. [Finkel et al., 2003]
• Interpretation of numeric quantity statements
  Example:
  T: Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries.
  H: Kessler's team interviewed more than 60,000 adults in 14 countries. TRUE
• Annotate numerical values implied by:
  • “6.2 bn”, “more than 60000”, “around 10”, …
  • MONEY/DATE named entities
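The numeric annotation step can be sketched as follows; this is a toy normalizer covering only a few of the patterns the slide lists, with made-up multiplier and comparator tables:

```python
import re

# Toy multiplier table for suffix words (a stand-in for the real system's rules).
MULTIPLIERS = {"bn": 1e9, "billion": 1e9, "million": 1e6, "thousand": 1e3}

def numeric_value(phrase):
    """Return the (comparator, value) implied by a quantity phrase."""
    comparator = "="
    if re.match(r"(more than|over)\b", phrase):
        comparator = ">"
    elif re.match(r"(around|about|approximately)\b", phrase):
        comparator = "~"
    m = re.search(r"([\d.,]+)\s*(\w+)?", phrase)
    value = float(m.group(1).replace(",", ""))
    if m.group(2) in MULTIPLIERS:
        value *= MULTIPLIERS[m.group(2)]
    return comparator, value

# T's "60,643" satisfies H's "more than 60,000", supporting TRUE.
print(numeric_value("60,643"))            # ('=', 60643.0)
print(numeric_value("more than 60,000"))  # ('>', 60000.0)
print(numeric_value("6.2 bn"))            # ('=', 6200000000.0)
```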
Parse tree post-processing
• Recognize collocations using WordNet
Example: Shrek 2 rang up $92 million.
(S (NP (NNP Shrek) (CD 2))
   (VP (VBD rang_up)
       (NP (QP ($ $) (CD 92) (CD million))))
   (. .))
→ MONEY, 92000000
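The collocation merge can be sketched with a small hand-listed lexicon standing in for WordNet's multiword entries (the real system consults WordNet; `COLLOCATIONS` here is a placeholder):

```python
# Toy collocation lexicon: token n-grams mapped to merged tokens.
COLLOCATIONS = {
    ("rang", "up"): "rang_up",
    ("chief", "executive", "officer"): "chief_executive_officer",
}

def merge_collocations(tokens):
    """Replace known multiword expressions with single underscore-joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        merged = False
        for n in (3, 2):  # try the longest collocation first
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in COLLOCATIONS:
                out.append(COLLOCATIONS[key])
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

print(merge_collocations(["Shrek", "2", "rang", "up", "$", "92", "million"]))
# ['Shrek', '2', 'rang_up', '$', '92', 'million']
```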
Parse tree → dependencies
• Find syntactic dependencies
• Transform parse tree representations into typed syntactic dependencies, including a certain amount of collapsing and normalization.
Example: Bill’s mother walked to the grocery store.
  subj(walked, mother)
  poss(mother, Bill)
  to(walked, store)
  nn(store, grocery)
• Dependencies can also be written as a logical formula:
  mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)
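The conversion from dependency triples to a logical formula can be sketched as below. The constant naming and the lack of an explicit event variable are simplifications relative to the slide's walked(E, A, C):

```python
# Sketch: each content word becomes a unary predicate over a fresh constant,
# each dependency a binary relation between the constants of its words.
def to_logic(deps):
    """deps: list of (relation, head, dependent) triples."""
    const = {}
    def c(word):
        if word not in const:
            const[word] = chr(ord("A") + len(const))
        return const[word]
    relations = [f"{rel}({c(head)}, {c(dep)})" for rel, head, dep in deps]
    unary = [f"{w}({v})" for w, v in const.items()]
    return " ∧ ".join(unary + relations)

deps = [("subj", "walked", "mother"), ("poss", "mother", "Bill"),
        ("to", "walked", "store"), ("nn", "store", "grocery")]
formula = to_logic(deps)
print(formula)
```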
Basic representations
The same sentence can be represented as a dependency graph or a logical formula:
  mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)
[Figure: dependency graph with vertices walked, mother, Bill, grocery, store and edges subj, poss, to, nn]
• Can make the representation richer:
  • “walked” is a verb (VBD).
  • “Bill” is a PERSON (named entity).
  • “store” is the location/destination of “walked” (ARGM-LOC).
  • …
Annotations
• Parts of speech, named entities
  • Already computed.
• Semantic roles
  Example:
  T: C and D Technologies announced that it has closed the acquisition of Datel, Inc.
  H1: C and D Technologies acquired Datel Inc. TRUE
  H2: Datel acquired C and D Technologies. FALSE
• Use a state-of-the-art semantic role classifier to label verb arguments. [Toutanova et al., 2005]
More annotations
• Coreference
  Example:
  T: Since its formation in 1948, Israel …
  H: Israel was established in 1948. TRUE
  • Use a conditional random field model for coreference detection.
  • Note: appositive “references” were previously detected.
    T: Bush, the President of the USA, went to Florida.
    H: Bush is the President of the USA. TRUE
• Other annotations
  • Word stems (very useful)
  • Word senses (no performance gain in our system)
Event nouns
• Use a heuristic to find event nouns.
• Augment the text representation using WordNet derivational links (NOUN → VERB).
Example:
T: … witnessed the murder of police commander ...
H: Police officer killed. TRUE
Text logical formula: murder(M) ∧ police_commander(P) ∧ of(M, P)
Augment with: murder(E, M, P)
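A sketch of the augmentation step, with a tiny noun-to-verb table standing in for WordNet's derivational links:

```python
# Toy derivational-link table (placeholder for WordNet lookups).
DERIVATIONAL = {"murder": "murder", "acquisition": "acquire", "formation": "form"}

def augment(literals):
    """literals: dicts like {'pred': 'murder', 'args': ['M']} plus of-relations.
    Adds a verbal event literal for each event noun with an of-argument."""
    preds = {l["args"][0]: l["pred"] for l in literals if len(l["args"]) == 1}
    extra = []
    for l in literals:
        if l["pred"] == "of" and preds.get(l["args"][0]) in DERIVATIONAL:
            noun_var, obj_var = l["args"]
            verb = DERIVATIONAL[preds[noun_var]]
            extra.append({"pred": verb, "args": ["E", noun_var, obj_var]})
    return literals + extra

text = [{"pred": "murder", "args": ["M"]},
        {"pred": "police_commander", "args": ["P"]},
        {"pred": "of", "args": ["M", "P"]}]
print(augment(text)[-1])  # {'pred': 'murder', 'args': ['E', 'M', 'P']}
```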
Outline of this talk
• Representation of sentences
  • Syntax: parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis
Graph Matching Approach
• Why graph matching?
  • The dependency tree has a natural graphical interpretation.
  • Successful in other domains, e.g., lossy image matching.
• Input: hypothesis (H) and text (T) graphs
  Toy example: T: John bought a BMW. H: John purchased a car. TRUE
  • Vertices are words and phrases.
  • Edges are labeled dependencies.
• Output: cost of matching H to T (next slide)
[Figure: T graph — “bought” with ARG0 (Agent) “John” (PERSON) and ARG1 (Theme) “BMW”; H graph — “purchased” with ARG0 (Agent) “John” (PERSON) and ARG1 (Theme) “car”]
Graph Matching: Idea
• Idea: align H to T so that vertices are similar and relations are preserved (as in machine translation).
• A matching M is a mapping from the vertices of H to the vertices of T.
  Thus, for each vertex v in H, M(v) is a vertex in T.
[Figure: matching from the H graph (purchased, John, car) to the T graph (bought, John, BMW)]
Graph Matching: Costs
• The cost of a matching, MatchCost(M), measures the “quality” of the matching:
  • VertexCost(M) – compare vertices in H with matched vertices in T.
  • RelationCost(M) – compare edges (relations) in H with corresponding edges (relations) in T.
• MatchCost(M) = (1 − β) · VertexCost(M) + β · RelationCost(M)
Graph Matching: Costs
• VertexCost(M): for each vertex v in H and vertex M(v) in T:
  • Do the vertex heads share the same stem and/or POS?
  • Is the T vertex head a hypernym of the H vertex head?
  • Are the vertex heads “similar” phrases? (next slide)
• RelationCost(M): for each edge (v, v′) in H and edge (M(v), M(v′)) in T:
  • Are parent/child pairs in H parent/child in T?
  • Are parent/child pairs in H ancestor/descendant in T?
  • Do parent/child pairs in H share a common ancestor in T?
Digression: Phrase similarity
• Measures based on WordNet (Resnik/Lesk).
• Distributional similarity
  • Example: “run” and “marathon” are related.
  • Latent semantic analysis to discover words that are distributionally similar (i.e., have common neighbors).
• Used a web-search-based measure
  • Query google.com for all pages with:
    • “run”
    • “marathon”
    • both “run” and “marathon”
• Learning paraphrases. [Similar to DIRT: Lin and Pantel, 2001]
• “World knowledge” (labor intensive)
  • CEO = Chief_Executive_Officer
  • Philippines → Filipino
  • [Can add common facts: “Paris is the capital of France”, …]
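The web-count measure can be framed as pointwise mutual information over page counts; the counts and web size below are made-up placeholders, not real search results:

```python
import math

# Assumed size of the indexed web, in pages (a placeholder value).
N = 8e9

def pmi(count_x, count_y, count_xy, n=N):
    """Pointwise mutual information of two terms from their page counts."""
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts for "run", "marathon", and pages containing both.
score = pmi(count_x=500e6, count_y=20e6, count_xy=8e6)
print(round(score, 2))  # positive => the terms co-occur more than chance
```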
Graph Matching: Example
T: John bought a BMW. H: John purchased a car. TRUE
Vertex substitution costs:
  • John → John: exact match, cost 0.0
  • purchased → bought: synonym match, cost 0.2
  • car → BMW: hypernym match, cost 0.4
VertexCost: (0.0 + 0.2 + 0.4) / 3 = 0.2
RelationCost: 0 (graphs isomorphic)
β = 0.45 (say)
MatchCost: 0.55 × 0.2 + 0.45 × 0.0 = 0.11
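The slide's arithmetic can be reproduced directly, assuming the per-vertex substitution costs shown:

```python
# Combine vertex and relation costs as on the slide:
# MatchCost(M) = (1 - beta) * VertexCost(M) + beta * RelationCost(M)
def match_cost(vertex_costs, relation_cost, beta=0.45):
    vertex_cost = sum(vertex_costs) / len(vertex_costs)
    return (1 - beta) * vertex_cost + beta * relation_cost

# John->John exact (0.0), purchased->bought synonym (0.2),
# car->BMW hypernym (0.4); graphs isomorphic, so RelationCost = 0.
cost = match_cost([0.0, 0.2, 0.4], relation_cost=0.0)
print(round(cost, 2))  # 0.11
```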
Outline of this talk
• Representation of sentences
  • Syntax: parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis
Abductive inference
• Idea:
  • Represent the text and hypothesis as logical formulae.
  • The hypothesis can be inferred from the text if and only if the hypothesis formula can be proved from the text formula.
• Toy example:
  T: John bought a BMW. H: John purchased a car. TRUE
  Text: John(A) ∧ BMW(B) ∧ bought(E, A, B)
  Hypothesis: John(x) ∧ car(y) ∧ purchased(z, x, y)
  Prove? Allow assumptions at various “costs”:
    BMW(t) + $2 => car(t)
    bought(p, q, r) + $1 => purchased(p, q, r)
Abductive assumptions
• Assign costs to all assumptions of the form:
  P(p1, p2, …, pm) => Q(q1, q2, …, qn)
• Build an assumption cost model:
  • Predicate match cost
    • Phrase similarity cost?
    • Same named entity type?
    • …
  • Argument match cost
    • Same semantic role?
    • Constant similarity cost?
    • …
Abductive theorem proving
• Each assumption provides a potential proof step.
• Find the proof with the minimum total cost.
  • Uniform cost search.
  • If there is a low-cost proof, the hypothesis is entailed.
• Example:
  T: John(A) ∧ BMW(B) ∧ bought(E, A, B)
  H: John(x) ∧ car(y) ∧ purchased(z, x, y)
  A possible proof by resolution refutation (for the earlier costs):
  $0  -John(x) -car(y) -purchased(z, x, y)   [Given: negation of hypothesis]
  $0  -car(y) -purchased(z, A, y)            [Unify with John(A)]
  $2  -purchased(z, A, B)                    [Unify with BMW(B)]
  $3  NULL                                   [Unify with purchased(E, A, B)]
  Proof cost = 3
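A minimal sketch of the cost-minimizing search, simplified so that each hypothesis predicate is matched against a text predicate and argument unification is ignored; the assumption costs are the ones from the toy example:

```python
import heapq

# Text and hypothesis literals as (predicate, arguments) pairs.
TEXT = [("John", ("A",)), ("BMW", ("B",)), ("bought", ("E", "A", "B"))]
HYP = [("John", ("x",)), ("car", ("y",)), ("purchased", ("z", "x", "y"))]

# Assumption costs from the slide: BMW => car at $2, bought => purchased at $1.
ASSUMPTIONS = {("BMW", "car"): 2, ("bought", "purchased"): 1}

def step_cost(text_pred, hyp_pred):
    """Cost of deriving a hypothesis predicate from a text predicate."""
    if text_pred == hyp_pred:
        return 0
    return ASSUMPTIONS.get((text_pred, hyp_pred))  # None: no assumption exists

def min_proof_cost():
    """Uniform-cost search over which text literal proves each hypothesis literal."""
    frontier = [(0, 0)]  # (cost so far, index of next unproved hypothesis literal)
    seen = {}
    while frontier:
        cost, i = heapq.heappop(frontier)
        if i == len(HYP):
            return cost  # all hypothesis literals proved
        if seen.get(i, float("inf")) <= cost:
            continue
        seen[i] = cost
        hyp_pred = HYP[i][0]
        for text_pred, _ in TEXT:
            c = step_cost(text_pred, hyp_pred)
            if c is not None:
                heapq.heappush(frontier, (cost + c, i + 1))
    return None  # no proof exists

print(min_proof_cost())  # 3, matching the slide's proof cost
```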
Abductive theorem proving
• Can automatically learn good assumption costs.
  • Start from a labeled dataset (e.g., the PASCAL development set).
  • Intuition: find assumptions that are used in the proofs for TRUE examples and lower their costs (by framing a log-linear model). Iterate. [Details: Raina et al., in submission]
Some interesting features
• Examples of handling “complex” constructions in graph matching / abductive inference:
• Antonyms/negation: high cost for matching verbs if they are antonyms, or if one is negated and the other is not.
  • T: Stocks fell. H: Stocks rose. FALSE
  • T: Clinton’s book was not a hit. H: Clinton’s book was a hit. FALSE
• Non-factive verbs:
  • T: John was charged for doing X. H: John did X. FALSE
    Can detect because “doing” in the text has the non-factive “charged” as a parent, but “did” has no such parent.
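The antonym/negation penalty can be sketched as a cost function; the antonym list and cost values are illustrative:

```python
# Toy antonym lexicon and penalty (placeholders for the real resources).
ANTONYMS = {frozenset(["fall", "rise"]), frozenset(["win", "lose"])}
HIGH_COST = 5.0

def verb_match_cost(t_verb, h_verb, t_negated=False, h_negated=False):
    """High cost for antonymous verbs or a negation mismatch, else low."""
    if frozenset([t_verb, h_verb]) in ANTONYMS:
        return HIGH_COST
    if t_negated != h_negated:
        return HIGH_COST
    return 0.0 if t_verb == h_verb else 1.0

print(verb_match_cost("fall", "rise"))              # 5.0 (antonyms)
print(verb_match_cost("be", "be", t_negated=True))  # 5.0 ("not a hit" vs "a hit")
print(verb_match_cost("fall", "fall"))              # 0.0
```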
Some interesting features
• “Superlative check”
  • T: This is the tallest tower in western Japan.
    H: This is the tallest tower in Japan. FALSE
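One way to sketch the check: the superlative in H is only entailed if H keeps every restricting modifier that T imposes (the restrictor sets here are toy inputs):

```python
def superlative_ok(t_restrictors, h_restrictors):
    """H's domain must not be broader than T's: every restrictor that T
    places on the superlative must also appear in H."""
    return t_restrictors <= h_restrictors

# T: "tallest tower in western Japan"   H: "tallest tower in Japan"
# H drops "western", claiming the superlative over a broader domain.
print(superlative_ok({"western", "Japan"}, {"Japan"}))  # False -> not entailed
```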
Outline of this talk
• Representation of sentences
  • Syntax: parsing and post-processing
  • Adding annotations on representation (e.g., semantic roles)
• Inference by graph matching
• Inference by abductive theorem proving
• A combined system
• Results and error analysis
Results
• Combine inference methods
  • Each system produces a score.
  • Separately normalize each system’s score variance. Suppose the normalized scores are s1 and s2.
  • Final score: S = w1·s1 + w2·s2
• Learn the classifier weights w1 and w2 on the development set using logistic regression. Two submissions:
  • Train one classifier weight for all RTE tasks. (General)
  • Train different classifier weights for each RTE task. (ByTask)
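The combination step can be sketched as follows; the scores and learned weights are made-up placeholders:

```python
import statistics

def normalize(scores):
    """Scale a system's scores to unit variance."""
    sd = statistics.pstdev(scores)
    return [s / sd for s in scores]

# Hypothetical per-example costs from the two systems.
graph_scores = [0.11, 0.40, 0.25, 0.90]   # graph-matching costs (toy)
abduct_scores = [3.0, 7.0, 5.0, 12.0]     # abductive proof costs (toy)
s1, s2 = normalize(graph_scores), normalize(abduct_scores)

# Hypothetical weights; the real system learns these by logistic regression.
w1, w2 = 0.6, 0.4
combined = [w1 * a + w2 * b for a, b in zip(s1, s2)]
print([round(c, 2) for c in combined])  # lower combined score => entailment more likely
```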
Results

                     General            ByTask
                     Accuracy   CWS     Accuracy   CWS
DevSet 1             64.8%      0.778   65.5%      0.805
DevSet 2             52.1%      0.578   55.7%      0.661
DevSet 1 + DevSet 2  58.5%      0.679   60.8%      0.743
Test Set             56.3%      0.620   55.3%      0.686

• Balanced predictions. [55.4% and 51.2% predicted TRUE on the test set.]
• Best other results: Accuracy = 58.6%, CWS = 0.617
Results by task

      General            ByTask
      Accuracy   CWS     Accuracy   CWS
CD    79.3%      0.903   84.7%      0.926
IE    47.5%      0.493   55.0%      0.590
IR    56.7%      0.590   55.6%      0.604
MT    46.7%      0.480   47.5%      0.479
PP    58.0%      0.623   54.0%      0.535
QA    48.5%      0.478   43.9%      0.466
RC    52.9%      0.523   50.7%      0.480
Partial coverage results
[Figure: accuracy (50–100%) vs. coverage (0–1) curves for the General and ByTask systems]
• Can also draw coverage–CWS curves. For example:
  • at 50% coverage, CWS = 0.781
  • at 25% coverage, CWS = 0.873
• Task-specific optimization seems better!
Some interesting issues
• Phrase similarity
  • away from the coast → farther inland
  • won victory in presidential election → became President
  • stocks get a lift → stocks rise
  • life threatening → fatal
• Dictionary definitions
  • believe there is only one God → are monotheistic
• “World knowledge”
  • K Club, venue of the Ryder Cup, … → K Club will host the Ryder Cup
Future directions
• Need more NLP components:
  • Better treatment of frequent nominalizations, parenthesized material, etc.
• Need much more ability to do inference:
  • Fine distinctions between meanings, and fine similarities, e.g., “reach a higher level” and “rise”.
  • We need a high-recall, reasonable-precision similarity measure!
  • Other resources (e.g., antonyms) are also very sparse.
• More task-specific optimization.