Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
Overview
Unsupervised algorithm
  Applied here to noun compound bracketing, but promising for structural ambiguity generally
Features
  n-grams, χ², MI
  Beyond the n-gram: surface features, paraphrases
State-of-the-art accuracy
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [cell line] ] (right bracketing)
In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
Related Work
Marcus (1980), Pustejovsky & al. (1993), Resnik (1993): adjacency model, Pr(w1|w2) vs. Pr(w2|w3)
Lauer (1995): dependency model, Pr(w1|w2) vs. Pr(w1|w3)
  Pr(w1|w2) is the probability that w1 precedes w2
Keller & Lapata (2004): use the Web; unigrams and bigrams
Girju & al. (2005): supervised model; bracketing in context; requires WordNet senses to be given
This work: χ², Web, n-grams, paraphrases, surface features
Adjacency & Dependency (1)
Right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1): home health care
  w1 and w2 independently modify w3: adult male rat
Left bracketing: [[w1 w2] w3]
  only one modificational choice possible: law enforcement officer
Adjacency & Dependency (2)
Right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
    adjacency model: is w2 w3 a compound (vs. w1 w2 being a compound)?
  w1 and w2 independently modify w3
    dependency model: does w1 modify w3 (vs. w1 modifying w2)?
Frequencies
#(wi,wj) is the frequency of wi wj
Adjacency model: compare #(w1,w2) to #(w2,w3)
Dependency model: compare #(w1,w2) to #(w1,w3)
In both models, a larger #(w1,w2) favors left bracketing; otherwise right
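The two frequency comparisons can be sketched in a few lines. This is only an illustration, assuming the counts are already available in a dictionary: the function name, the `counts` lookup, and the choice to return `None` on ties are hypothetical, and the numbers are invented for the demo rather than real page hits.

```python
from typing import Mapping, Optional, Tuple

Counts = Mapping[Tuple[str, str], int]

def bracket_by_frequency(w1: str, w2: str, w3: str,
                         counts: Counts,
                         model: str = "dependency") -> Optional[str]:
    """Compare #(w1,w2) with #(w2,w3) (adjacency) or #(w1,w3) (dependency)."""
    left_evidence = counts.get((w1, w2), 0)
    if model == "adjacency":
        right_evidence = counts.get((w2, w3), 0)
    elif model == "dependency":
        right_evidence = counts.get((w1, w3), 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_evidence > right_evidence:
        return "left"
    if right_evidence > left_evidence:
        return "right"
    return None  # assumption: no decision on ties

# Invented counts, for illustration only.
toy_counts = {
    ("law", "enforcement"): 900,
    ("enforcement", "officer"): 400,
    ("law", "officer"): 50,
}
print(bracket_by_frequency("law", "enforcement", "officer", toy_counts, "adjacency"))
print(bracket_by_frequency("law", "enforcement", "officer", toy_counts, "dependency"))
```

The two models differ only in which bigram supplies the evidence for right bracketing, which is why a single function with a `model` switch is enough here.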
Probabilities
Pr(w1→w2|w2) is the probability that w1 modifies w2
Adjacency model: compare Pr(w1→w2|w2) to Pr(w2→w3|w3)
Dependency model: compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
In both models, a larger Pr(w1→w2|w2) favors left bracketing; otherwise right
Probabilities: Dependency
Dependency model:
  Pr(left) = Pr(w1→w2|w2) Pr(w2→w3|w3)
  Pr(right) = Pr(w1→w3|w3) Pr(w2→w3|w3)
The common factor Pr(w2→w3|w3) cancels, so we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
BUT: there is no such cancellation in Lauer’s model
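A small numeric check may make the cancellation concrete. The sketch below assumes the three probabilities are given as plain floats; the function names and the rule that ties go to "right" are hypothetical choices made only for this illustration.

```python
def dependency_decision(p12: float, p13: float, p23: float) -> str:
    """Compare Pr(left) = p12 * p23 with Pr(right) = p13 * p23."""
    pr_left = p12 * p23
    pr_right = p13 * p23
    return "left" if pr_left > pr_right else "right"  # assumption: ties go right

def simplified_decision(p12: float, p13: float) -> str:
    """Same comparison after cancelling the shared positive factor p23."""
    return "left" if p12 > p13 else "right"

# Illustrative values only: p23 changes both products by the same factor.
for p12, p13, p23 in [(0.2, 0.1, 0.5), (0.1, 0.3, 0.9), (0.4, 0.25, 0.01)]:
    print(dependency_decision(p12, p13, p23), simplified_decision(p12, p13))
```

As long as the shared factor is positive, both functions agree, so only the two conditional probabilities involving w1 need to be estimated.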
Probabilities: Estimation
Using page hits as a proxy for n-gram counts:
  Pr(w1→w2|w2) = #(w1,w2) / #(w2)
  #(w2): word frequency; query for “w2”
  #(w1,w2): bigram frequency; query for “w1 w2”
  counts are smoothed by 0.5
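The estimate can be written as a one-line function. This is a sketch under one stated assumption: the slide only says counts are "smoothed by 0.5", and here that is read as adding 0.5 to both the bigram count and the word count. The function name and the sample hit counts are hypothetical.

```python
def conditional_probability(bigram_hits: float, word_hits: float,
                            smoothing: float = 0.5) -> float:
    """Estimate Pr(w1 -> w2 | w2) = #(w1,w2) / #(w2) from page hits.

    Assumption: smoothing adds 0.5 to each count, so an unseen bigram
    gets a small non-zero probability and the denominator is never zero.
    """
    return (bigram_hits + smoothing) / (word_hits + smoothing)

# Invented page-hit numbers, for illustration only.
print(conditional_probability(bigram_hits=1200, word_hits=50000))
print(conditional_probability(bigram_hits=0, word_hits=50000))
```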
Probabilities: Why? (1)
Why should we use (a) Pr(w1→w2|w2) rather than (b) Pr(w2→w1|w1)?
Keller & Lapata (2004) calculate:
  AltaVista queries: (a) 70.49%, (b) 68.85%
  British National Corpus: (a) 63.11%, (b) 65.57%
Probabilities: Why? (2)
Why should we use (a) Pr(w1→w2|w2) rather than (b) Pr(w2→w1|w1)?
Maybe to introduce a bracketing prior, just as Lauer (1995) did
Otherwise, there is no reason to prefer either one
  Do we need probabilities? (association is OK)
  Do we need a directed model? (symmetry is OK)
Association Models: χ² (Chi Squared)
A = #(wi,wj)
B = #(wi) – #(wi,wj)
C = #(wj) – #(wi,wj)
D = N – (A+B+C)
N = A+B+C+D = 8 trillion (8 billion Web pages × 1,000 words)
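The χ² formula itself did not survive in this transcript, so the sketch below uses the standard χ² statistic for a 2×2 contingency table, built from the A, B, C, D cells defined on the slide. The function name and the zero-denominator fallback are assumptions; the default N is the slide's 8 trillion.

```python
N_WEB = 8_000_000_000 * 1_000  # 8 billion pages x 1,000 words, as on the slide

def chi_squared(n_ij: int, n_i: int, n_j: int, n_total: int = N_WEB) -> float:
    """Standard 2x2 chi-squared from #(wi,wj), #(wi), #(wj) and N."""
    a = n_ij                    # wi and wj together
    b = n_i - n_ij              # wi without wj
    c = n_j - n_ij              # wj without wi
    d = n_total - (a + b + c)   # neither
    numerator = n_total * (a * d - b * c) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: treat a degenerate table as no association
    return numerator / denominator

# Tiny tables where the result is easy to check by hand.
print(chi_squared(1, 2, 2, n_total=4))  # independent cells
print(chi_squared(2, 2, 2, n_total=4))  # perfect association
```

Because the measure is symmetric in wi and wj, it sidesteps the question of which direction to condition on.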
Web-derived Surface Features
Authors often disambiguate noun compounds using surface markers, e.g.:
  amino-acid sequence → left
  brain stem’s cell → left
  brain’s stem cell → right
The enormous size of the Web makes these markers frequent enough to be useful.
Web-derived Surface Features: Dash (hyphen)
Left dash: cell-cycle analysis → left
Right dash: donor T-cell → right
  but fiber optics-system should be left
Double dash: T-cell-depletion → unusable
Web-derived Surface Features: Possessive Marker
Attached to the first word: brain’s stem cell → right
Attached to the second word: brain stem’s cell → left
Combined features: brain’s stem-cell → right
Web-derived Surface Features: Capitalization
don’t-care – lowercase – uppercase → left
  Plasmodium vivax Malaria, plasmodium vivax Malaria
lowercase – uppercase – don’t-care → right
  brain Stem cell, brain Stem Cell
Disabled on:
  Roman numerals
  single-letter words, e.g. vitamin D deficiency
Web-derived Surface Features: Embedded Slash
Left embedded slash: leukemia/lymphoma cell → right
Web-derived Surface Features: Parentheses
Single-word: growth factor (beta) → left; (brain) stem cell → right
Two-word: (growth factor) beta → left; brain (stem cell) → right
Web-derived Surface Features: Comma, dot, colon, semicolon
Following the first word: home. health care → right; adult, male rat → right
Following the second word: health care, provider → left; lung cancer: patients → left
Web-derived Surface Features: Dash to External Word
External word to the left: mouse-brain stem cell → right
External word to the right: tumor necrosis factor-alpha → left
Web-derived Surface Features: Problems & Solutions
Problem: search engines ignore punctuation, so a query for “brain-stem cell” does not work
Solution:
  query for “brain stem cell”
  obtain 1,000 document summaries
  look for the features in these summaries
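Looking for the features in the returned summaries can be sketched with regular expressions. The example below covers only two of the features described on these slides (dash and possessive marker) and assumes the summaries are already available as plain strings; the function name and the sample snippets are hypothetical, and the authors' actual matching rules are not given in this transcript.

```python
import re
from collections import Counter
from typing import Iterable

def surface_votes(w1: str, w2: str, w3: str, snippets: Iterable[str]) -> Counter:
    """Count left/right evidence from dashes and possessives in text snippets."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",          # left dash: w1-w2 w3
            rf"\b{a}\s+{b}['’]s\s+{c}\b",   # possessive on the second word
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",          # right dash: w1 w2-w3
            rf"\b{a}['’]s\s+{b}\s+{c}\b",   # possessive on the first word
        ],
    }
    votes = Counter()
    for text in snippets:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes

# Invented snippets standing in for search-engine summaries.
summaries = [
    "The brain-stem cell was studied.",
    "Each brain stem's cell type differs.",
    "The brain's stem cell population.",
]
print(surface_votes("brain", "stem", "cell", summaries))
```

Each match counts as one vote; how the authors weight or combine features is not stated here, so plain counting is used only as a placeholder.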
Other Web-derived Features: Abbreviation
After the second word: tumor necrosis (TN) factor → left
After the third word: tumor necrosis factor (NF) → right
We query for e.g. “tumor necrosis tn factor”
Problems:
  Roman numerals: IV, VI
  States: CA
  Short words: me
Other Web-derived Features: Concatenation
Consider health care reform:
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812
Adjacency model: healthcare vs. carereform
Dependency model: healthcare vs. healthreform
Triples: “healthcare reform” vs. “health carereform”
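Using the three counts from the slide, the adjacency and dependency comparisons for concatenations can be sketched as follows. The function name and the dictionary lookup are hypothetical, and ties are left undecided as an assumption.

```python
from typing import Mapping, Optional

# Page hits for the concatenated forms, as given on the slide.
concat_hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}

def concatenation_vote(w1: str, w2: str, w3: str,
                       hits: Mapping[str, int],
                       model: str = "dependency") -> Optional[str]:
    """Compare #(w1w2) with #(w2w3) (adjacency) or #(w1w3) (dependency)."""
    left_evidence = hits.get(w1 + w2, 0)
    if model == "adjacency":
        right_evidence = hits.get(w2 + w3, 0)
    elif model == "dependency":
        right_evidence = hits.get(w1 + w3, 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_evidence == right_evidence:
        return None  # assumption: no decision on ties
    return "left" if left_evidence > right_evidence else "right"

print(concatenation_vote("health", "care", "reform", concat_hits, "adjacency"))   # left
print(concatenation_vote("health", "care", "reform", concat_hits, "dependency"))  # left
```

With these counts both models favor left bracketing, since healthcare is far more frequent than either carereform or healthreform.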
Other Web-derived Features: Using Google’s *
Each * is a one-word wildcard
Single star: “health care * reform” → left; “health * care reform” → right
More stars and/or reverse order: “care reform * * health” → right
Adjacency model
Other Web-derived Features: Reorder
Reorders for “health care reform”:
  “care reform health” → right
  “reform health care” → left
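The wildcard and reorder queries can be generated mechanically. The sketch below produces only the query shapes that appear on these two slides, each paired with the bracketing it supports; the function name, the `max_stars` parameter, and the use of straight double quotes for exact-phrase queries are assumptions.

```python
from typing import List, Tuple

def wildcard_and_reorder_queries(w1: str, w2: str, w3: str,
                                 max_stars: int = 2) -> List[Tuple[str, str]]:
    """Return (exact-phrase query, predicted bracketing) pairs."""
    queries = [
        (f'"{w1} {w2} * {w3}"', "left"),   # single star after w1 w2
        (f'"{w1} * {w2} {w3}"', "right"),  # single star before w2 w3
        (f'"{w3} {w1} {w2}"', "left"),     # reorder: w1 w2 stay together
    ]
    # Reverse order keeping w2 w3 together, with 0..max_stars wildcards.
    for n in range(max_stars + 1):
        gap = " ".join(["*"] * n)
        middle = f" {gap} " if gap else " "
        queries.append((f'"{w2} {w3}{middle}{w1}"', "right"))
    return queries

for query, label in wildcard_and_reorder_queries("health", "care", "reform"):
    print(query, label)
```

The page hits for the generated queries would then be summed per label, in the same way as for the other count-based features.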
Other Web-derived Features: Internal Inflection Variability
First word: ???
Second word: tyrosine kinase activation, tyrosine kinases activation
Other Web-derived Features: Switch the First Two Words
Predict right if we can reorder adult male rat as male adult rat
Paraphrases (1)
The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)
Prepositional:
  stem cells in the brain → right
  cells from the brain stem → left
Verbal:
  virus causing human immunodeficiency → left
  pain associated with arthritis migraine → right
Copula:
  office building that is a skyscraper → right
Paraphrases (2)
Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition: of, for, in, at, on, from, with, about, (like)
This can be problematic when more than one preposition is possible
In contrast:
  we try to predict syntax, not semantics
  we do not disambiguate, just add up all counts
    cells in (the) bone marrow → left
    cells from (the) bone marrow → left
Paraphrases (3)
Prepositional paraphrases: we use ~150 prepositions
Verbal paraphrases: we use associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
Copula paraphrases: we use is/was and that/which/who
Optional elements:
  articles: a, an, the
  quantifiers: some, every, etc.
  pronouns: this, these, etc.
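Generating prepositional paraphrase queries and adding up their counts can be sketched as below. This is a simplification under several assumptions: it uses only the eight prepositions listed on the earlier slide rather than the ~150 mentioned here, only articles as optional elements, and no inflection handling. The function names and the `hit_count` callable (standing in for a search-engine lookup) are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

PREPOSITIONS = ("of", "for", "in", "at", "on", "from", "with", "about")
OPTIONAL_ARTICLES = ("", "a", "an", "the")

def prepositional_paraphrases(w1: str, w2: str, w3: str) -> Dict[str, List[str]]:
    """Build exact-phrase queries supporting left or right bracketing."""
    queries: Dict[str, List[str]] = {"left": [], "right": []}
    for prep, article in product(PREPOSITIONS, OPTIONAL_ARTICLES):
        filler = f"{prep} {article}".strip()
        queries["left"].append(f'"{w3} {filler} {w1} {w2}"')   # w1 w2 stay together
        queries["right"].append(f'"{w2} {w3} {filler} {w1}"')  # w2 w3 stay together
    return queries

def paraphrase_vote(w1: str, w2: str, w3: str,
                    hit_count: Callable[[str], int]) -> Optional[str]:
    """Sum hits over all paraphrases per side, without picking one preposition."""
    totals = {label: sum(hit_count(q) for q in qs)
              for label, qs in prepositional_paraphrases(w1, w2, w3).items()}
    if totals["left"] == totals["right"]:
        return None  # assumption: no decision on ties
    return "left" if totals["left"] > totals["right"] else "right"

# Fake lookup for the demo: pretend a single paraphrase was found on the Web.
fake_hits = lambda q: 5 if q == '"cells from the bone marrow"' else 0
print(paraphrase_vote("bone", "marrow", "cells", fake_hits))
```

Summing over every preposition reflects the point made on the previous slide: the goal is the bracketing, so there is no need to decide which preposition is the right one.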
Evaluation: Datasets
Lauer Set
  244 noun compounds (NCs) from Grolier’s encyclopedia
  inter-annotator agreement: 81.5%
Biomedical Set
  430 NCs from MEDLINE
  inter-annotator agreement: 88% (κ = .606)
Evaluation: Experiments
Exact phrase queries
Limited to English
Inflections:
  Lauer Set: Carroll’s morphological tools
  Biomedical Set: UMLS Specialist Lexicon
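Results: Lauer (1) [chart not reproduced; legend: correct, N/A, wrong]
Results: Lauer (2) [chart not reproduced; legend: correct, N/A, wrong]
Results: Lauer (3) [chart not reproduced]
Results: Bio (1) [chart not reproduced; legend: correct, N/A, wrong]
Results: Bio (2) [chart not reproduced; legend: correct, N/A, wrong]
Individual Surface Features Performance: Bio [chart not reproduced]
Paraphrase and Surface Features Performance: Lauer Set, Biomedical Set [charts not reproduced]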
Discussion (Lauer and Bio sets)
Adjacency vs. dependency
χ² vs. frequencies vs. probabilities
Conclusion
Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)
  surface features
  paraphrases
Obtained new state-of-the-art results on NC bracketing
  more robust than Lauer (1995)
  more accurate than Keller & Lapata (2004)
Future Work
Recognize ambiguous cases
Bracket more than 3 nouns
Not just bracketing but dependencies, e.g. growth factor alpha
Bracket NPs in general (other POS)
  augment the Penn Treebank with NP-internal dependencies
Apply to other structural ambiguity problems:
  prepositional phrase attachment
  noun phrase coordination
The End
Thank you!
Web Counts: Problems
Page hits are inaccurate
  This may be OK (Keller & Lapata, 2003)
The Web lacks linguistic annotation
  Pr(health|care) = #(“health care”) / #(care)
    health: noun
    care: both verb and noun
    the two words can be adjacent by chance
    the two words can come from different sentences
Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis’ inhibition