search engine statistics beyond the n-gram: application to noun compound bracketing

Search Engine Statistics Beyond the n-gram:

Application to Noun Compound Bracketing


Preslav Nakov and Marti Hearst

Computer Science Division and SIMS

University of California, Berkeley

Supported by NSF DBI-0317510 and a gift from Genentech

Overview

Unsupervised algorithm
  Applied here to noun compound bracketing, but promising for structural ambiguity generally

Features
  n-grams, χ², MI
  Beyond the n-gram: surface features, paraphrases

State-of-the-art accuracy

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [cell line] ] (right bracketing)

In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.


Related Work

Marcus (1980), Pustejovsky & al. (1993), Resnik (1993)
  adjacency model: Pr(w1|w2) vs. Pr(w2|w3)

Lauer (1995)
  dependency model: Pr(w1|w2) vs. Pr(w1|w3)

Keller & Lapata (2004)
  use the Web
  unigrams and bigrams

Girju & al. (2005)
  supervised model
  bracketing in context
  requires WordNet senses to be given

Here Pr(w1|w2) is the probability that w1 precedes w2.

This work: χ², Web, n-grams, paraphrases, surface features

Adjacency & Dependency (1)

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1): home health care
  w1 and w2 independently modify w3: adult male rat

left bracketing: [[w1 w2] w3]
  only 1 modificational choice possible: law enforcement officer

Adjacency & Dependency (2)

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
  w1 and w2 independently modify w3

adjacency model
  Is w2 w3 a compound? (vs. w1 w2 being a compound)

dependency model
  Does w1 modify w3? (vs. w1 modifying w2)

Frequencies

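Adjacency model
  Compare #(w1,w2) to #(w2,w3)

Dependency model
  Compare #(w1,w2) to #(w1,w3)

#(w1,w2) is the frequency of w1w2. In both models, left bracketing is predicted when #(w1,w2) is the larger count, and right bracketing otherwise.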
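The two frequency comparisons can be written as a small decision function. This is only a sketch: the function name and the example counts are hypothetical, and returning None on a tie (no decision) is an assumption, not something the slides specify.

```python
def bracket_by_frequency(count_w1w2, count_w2w3, count_w1w3, model="adjacency"):
    """Bracket a three-word compound w1 w2 w3 from raw co-occurrence counts.

    Returns "left", "right", or None when the two counts are equal
    (the tie handling is an assumption made for this sketch).
    """
    if model == "adjacency":
        rival = count_w2w3      # adjacency: #(w1,w2) vs. #(w2,w3)
    elif model == "dependency":
        rival = count_w1w3      # dependency: #(w1,w2) vs. #(w1,w3)
    else:
        raise ValueError(f"unknown model: {model!r}")

    if count_w1w2 > rival:
        return "left"
    if count_w1w2 < rival:
        return "right"
    return None


# Hypothetical counts, for illustration only.
print(bracket_by_frequency(120, 40, 15, model="adjacency"))
print(bracket_by_frequency(120, 40, 300, model="dependency"))
```

The two models differ only in which count competes with #(w1,w2), so a single function with a model switch keeps that contrast visible.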

Probabilities

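Adjacency model
  Compare Pr(w1w2|w2) to Pr(w2w3|w3)

Dependency model
  Compare Pr(w1w2|w2) to Pr(w1w3|w3)

Pr(w1w2|w2) is the probability that w1 modifies w2. In each comparison, the first probability supports left bracketing and the second supports right bracketing.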

Probabilities: Estimation

Using page hits as a proxy for n-gram counts

Pr(w1w2|w2) = #(w1,w2) / #(w2)
  #(w2): word frequency; query for “w2”
  #(w1,w2): bigram frequency; query for “w1 w2”
  smoothed by 0.5
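A minimal sketch of the estimate, assuming page hits are already available in a dictionary keyed by query string. The dictionary, the function names and the example hit counts are hypothetical. The slide says only "smoothed by 0.5"; adding 0.5 to both the bigram count and the word count is one plausible reading and is an assumption here.

```python
def smoothed_probability(bigram_hits, word_hits, smoothing=0.5):
    """Estimate Pr(wi wj | wj) = #(wi,wj) / #(wj) from page hits.

    Assumption: smoothing adds 0.5 to numerator and denominator.
    """
    return (bigram_hits + smoothing) / (word_hits + smoothing)


def bracket_by_probability(hits, w1, w2, w3, model="adjacency"):
    """Compare the left-supporting and right-supporting probabilities.

    `hits` maps a query string ("w2" or "w1 w2") to its page-hit count.
    """
    left = smoothed_probability(hits.get(f"{w1} {w2}", 0), hits.get(w2, 0))
    if model == "adjacency":
        right = smoothed_probability(hits.get(f"{w2} {w3}", 0), hits.get(w3, 0))
    elif model == "dependency":
        right = smoothed_probability(hits.get(f"{w1} {w3}", 0), hits.get(w3, 0))
    else:
        raise ValueError(f"unknown model: {model!r}")
    return "left" if left > right else "right"


# Hypothetical page hits, for illustration only.
hits = {
    "health": 1000, "care": 500, "reform": 100,
    "health care": 400, "care reform": 20, "health reform": 10,
}
print(bracket_by_probability(hits, "health", "care", "reform", "adjacency"))
print(bracket_by_probability(hits, "health", "care", "reform", "dependency"))
```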

Probabilities: Why? (1)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Keller & Lapata (2004) calculate:
  AltaVista queries: (a) 70.49%, (b) 68.85%
  British National Corpus: (a) 63.11%, (b) 65.57%

Probabilities: Why? (2)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Maybe to introduce a bracketing prior, just as Lauer (1995) did.

But otherwise, there is no reason to prefer either one.
  Do we need probabilities? (association is OK)
  Do we need a directed model? (symmetry is OK)

Association Models: χ² (Chi Squared)

A = #(wi,wj)

B = #(wi) – #(wi,wj)

C = #(wj) – #(wi,wj)

D = N – (A+B+C)

N = 8 trillion (= A+B+C+D)
  8 billion Web pages x 1,000 words
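The χ² formula itself does not survive in the slide text. The sketch below assumes the standard statistic for a 2×2 contingency table, χ² = N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)), with A, B, C, D and N defined as above. The function name and the toy counts are hypothetical.

```python
WEB_SIZE = 8_000_000_000 * 1_000  # N: 8 billion pages x 1,000 words = 8 trillion


def chi_squared(count_ij, count_i, count_j, total=WEB_SIZE):
    """Chi-squared association between wi and wj from a 2x2 table.

    Assumes the standard formula N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)).
    """
    a = count_ij               # A = #(wi,wj)
    b = count_i - count_ij     # B = #(wi) - #(wi,wj)
    c = count_j - count_ij     # C = #(wj) - #(wi,wj)
    d = total - (a + b + c)    # D = N - (A+B+C)

    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0             # degenerate table: no evidence either way
    return total * (a * d - b * c) ** 2 / denominator


# Toy tables with N = 4, for illustration only.
print(chi_squared(1, 2, 2, total=4))  # independent words
print(chi_squared(2, 2, 2, total=4))  # words that always co-occur
```

For bracketing, the adjacency model would compare chi_squared for (w1,w2) against (w2,w3), and the dependency model (w1,w2) against (w1,w3), in the same way as the frequency and probability comparisons.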

Web-derived Surface Features

Authors often disambiguate noun compounds using surface markers, e.g.:
  amino-acid sequence → left
  brain stem’s cell → left
  brain’s stem cell → right

The enormous size of the Web makes them frequent enough to be useful.

Web-derived Surface Features: Dash (hyphen)

Left dash
  cell-cycle analysis → left

Right dash
  donor T-cell → right
  fiber optics-system → right (but should be left)

Double dash
  T-cell-depletion → unusable

Web-derived Surface Features: Possessive Marker

Attached to the first word
  brain’s stem cell → right

Attached to the second word
  brain stem’s cell → left

Combined features
  brain’s stem-cell → right

Web-derived Surface Features: Capitalization

don’t-care – lowercase – uppercase
  Plasmodium vivax Malaria → left
  plasmodium vivax Malaria → left

lowercase – uppercase – don’t-care
  brain Stem cell → right
  brain Stem Cell → right

Disabled on:
  Roman digits
  Single-letter words, e.g. vitamin D deficiency

Web-derived Surface Features: Embedded Slash

Left embedded slash
  leukemia/lymphoma cell → right

Web-derived Surface Features: Parentheses

Single-word
  growth factor (beta) → left
  (brain) stem cell → right

Two-word
  (growth factor) beta → left
  brain (stem cell) → right

Web-derived Surface Features: Colon, dot, semicolon

Following the first word
  home. health care → right
  adult, male rat → right

Following the second word
  health care, provider → left
  lung cancer: patients → left

Web-derived Surface Features: Dash to External Word

External word to the left
  mouse-brain stem cell → right

External word to the right
  tumor necrosis factor-alpha → left

Web-derived Surface Features: Problems & Solutions

Problem: search engines ignore punctuation
  “brain-stem cell” does not work

Solution:
  query for “brain stem cell”
  obtain 1,000 document summaries
  look for the features in these summaries
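The last step, looking for the features in the returned summaries, can be sketched with regular expressions. The sketch covers only three features (dash, possessive marker, two-word parentheses). The function name, the pattern set and the example summaries are assumptions, and retrieving the summaries from a search engine is left out.

```python
import re


def surface_feature_counts(summaries, w1, w2, w3):
    """Count left- and right-supporting surface features in text summaries."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    apostrophe = "['’]"  # accept straight and typographic apostrophes
    patterns = {
        "left": [
            rf"\b{a}-{b} {c}\b",                 # left dash: w1-w2 w3
            rf"\b{a} {b}{apostrophe}s {c}\b",    # possessive on w2
            rf"\({a} {b}\) {c}\b",               # (w1 w2) w3
        ],
        "right": [
            rf"\b{a} {b}-{c}\b",                 # right dash: w1 w2-w3
            rf"\b{a}{apostrophe}s {b} {c}\b",    # possessive on w1
            rf"\b{a} \({b} {c}\)",               # w1 (w2 w3)
        ],
    }
    counts = {"left": 0, "right": 0}
    for side, side_patterns in patterns.items():
        for pattern in side_patterns:
            regex = re.compile(pattern, re.IGNORECASE)
            counts[side] += sum(len(regex.findall(text)) for text in summaries)
    return counts


# Made-up summaries, for illustration only.
summaries = [
    "The brain-stem cell count rose.",
    "A brain's stem cell niche was described.",
    "Results for brain (stem cell) therapy.",
    "Plain brain stem cell mention with no marker.",
]
print(surface_feature_counts(summaries, "brain", "stem", "cell"))
```

Because the dash patterns require a space on the other side, a double-dash form such as "brain-stem-cell" matches neither of them, in line with treating double dashes as unusable.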

Other Web-derived Features: Abbreviation

After the second word
  tumor necrosis (TN) factor → left

After the third word
  tumor necrosis factor (NF) → right

We query for e.g. “tumor necrosis tn factor”

Problems:
  Roman digits: IV, VI
  States: CA
  Short words: me

Other Web-derived Features: Concatenation

Consider health care reform
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812

Adjacency model
  healthcare vs. carereform

Dependency model
  healthcare vs. healthreform

Triples
  “healthcare reform” vs. “health carereform”
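With the counts quoted on this slide, the adjacency and dependency comparisons for concatenated forms work out as follows. The function name and the dictionary layout are hypothetical, and treating a tie as "right" is an assumption.

```python
def concatenation_vote(counts, w1, w2, w3, model="adjacency"):
    """Compare page hits for concatenated word pairs.

    `counts` maps a concatenated form (e.g. "healthcare") to its page hits.
    """
    left = counts.get(w1 + w2, 0)
    if model == "adjacency":
        right = counts.get(w2 + w3, 0)   # e.g. carereform
    elif model == "dependency":
        right = counts.get(w1 + w3, 0)   # e.g. healthreform
    else:
        raise ValueError(f"unknown model: {model!r}")
    return "left" if left > right else "right"


# Page hits as given on the slide.
counts = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}

print(concatenation_vote(counts, "health", "care", "reform", "adjacency"))
print(concatenation_vote(counts, "health", "care", "reform", "dependency"))
```

Both models favor left bracketing here, since healthcare is far more frequent than either carereform or healthreform.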

Other Web-derived Features: Using Google’s *

Each * allows a one-word wildcard

Single star
  “health care * reform” → left
  “health * care reform” → right

More stars and/or reverse order
  “care reform * * health” → right

Adjacency model

Other Web-derived Features: Reorder

Reorders for “health care reform”
  “care reform health” → right
  “reform health care” → left
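The star and reorder queries follow a regular pattern, so they can be generated mechanically. This is a sketch only: the function name and the max_stars parameter are hypothetical, and the reversed left-supporting pattern with stars ("reform * * health care") is assumed by symmetry, since the slides show only the right-supporting one.

```python
def star_and_reorder_queries(w1, w2, w3, max_stars=2):
    """Build exact-phrase queries that support left or right bracketing."""
    queries = {
        "left": [f'"{w3} {w1} {w2}"'],    # reorder: w3 w1 w2
        "right": [f'"{w2} {w3} {w1}"'],   # reorder: w2 w3 w1
    }
    for n in range(1, max_stars + 1):
        stars = " ".join("*" * n)         # "*", "* *", ...
        queries["left"].append(f'"{w1} {w2} {stars} {w3}"')
        queries["right"].append(f'"{w1} {stars} {w2} {w3}"')
        queries["right"].append(f'"{w2} {w3} {stars} {w1}"')
        # Assumed symmetric counterpart of the reversed pattern:
        queries["left"].append(f'"{w3} {stars} {w1} {w2}"')
    return queries


queries = star_and_reorder_queries("health", "care", "reform")
for side, side_queries in queries.items():
    for query in side_queries:
        print(side, query)
```

Page hits for the queries on each side would then be compared in the same way as for the other count-based features.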

Other Web-derived Features: Internal Inflection Variability

First word
  ???

Second word
  tyrosine kinase activation
  tyrosine kinases activation

Other Web-derived Features: Switch The First Two Words

Predict right, if we can reorder adult male rat as male adult rat

Paraphrases (1)

The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

Prepositional
  stem cells in the brain → right
  cells from the brain stem → left

Verbal
  virus causing human immunodeficiency → left
  pain associated with arthritis migraine → right

Copula
  office building that is a skyscraper → right

Paraphrases (2)

Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition:
  of, for, in, at, on, from, with, about, (like)

This can be problematic when more than one preposition is possible

In contrast:
  we try to predict syntax, not semantics
  we do not disambiguate, just add up all counts

  cells in (the) bone marrow → left
  cells from (the) bone marrow → left

Paraphrases (3)

prepositional paraphrases:
  We use: ~150 prepositions

verbal paraphrases:
  We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

copula paraphrases:
  We use: is/was and that/which/who

optional elements:
  articles: a, an, the
  quantifiers: some, every, etc.
  pronouns: this, these, etc.
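Prepositional paraphrase queries can be generated from a preposition list and the optional elements, and their counts simply added up per side. The sketch below uses only the eight prepositions named on the previous slide (not the ~150 actually used) and only the articles as optional elements. The function names and the example hit counts are hypothetical.

```python
PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]
OPTIONAL = ["", "a", "an", "the"]  # "" means no optional element


def prepositional_paraphrases(w1, w2, w3):
    """Generate paraphrase phrases for the compound w1 w2 w3.

    left:  w3 PREP [article] w1 w2   (e.g. "cells in the bone marrow")
    right: w2 w3 PREP [article] w1   (e.g. "stem cells in the brain")
    """
    phrases = {"left": [], "right": []}
    for prep in PREPOSITIONS:
        for article in OPTIONAL:
            left_parts = [w3, prep, article, w1, w2]
            right_parts = [w2, w3, prep, article, w1]
            # Drop the empty string so spacing stays clean.
            phrases["left"].append(" ".join(p for p in left_parts if p))
            phrases["right"].append(" ".join(p for p in right_parts if p))
    return phrases


def paraphrase_vote(phrases, hits):
    """Add up all counts per side, without picking a single preposition."""
    totals = {side: sum(hits.get(p, 0) for p in ps) for side, ps in phrases.items()}
    return "left" if totals["left"] > totals["right"] else "right"


phrases = prepositional_paraphrases("bone", "marrow", "cells")
# Hypothetical page hits, for illustration only.
hits = {"cells in the bone marrow": 900, "cells from bone marrow": 400,
        "marrow cells of the bone": 3}
print(paraphrase_vote(phrases, hits))
```

Summing over all prepositions sidesteps the problem noted earlier: no single most likely preposition has to be chosen when several are possible.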

Evaluation: Datasets

Lauer Set
  244 noun compounds (NCs)
  from Grolier’s encyclopedia
  inter-annotator agreement: 81.5%

Biomedical Set
  430 NCs
  from MEDLINE
  inter-annotator agreement: 88% (κ = .606)

Evaluation: Experiments

Exact phrase queries
  Limited to English

Inflections:
  Lauer Set: Carroll’s morphological tools
  Biomedical Set: UMLS Specialist Lexicon

Results: Lauer (1)

[Chart; legend: correct, N/A, wrong]

Results: Lauer (2)

[Chart; legend: correct, N/A, wrong]

Results: Lauer (3)

Results: Bio (1)

[Chart; legend: correct, N/A, wrong]

Results: Bio (2)

[Chart; legend: correct, N/A, wrong]

Individual Surface Features Performance: Bio

Paraphrase and Surface Features Performance

Lauer Set

Biomedical Set

Discussion

[Panels: Lauer, Bio]

Adjacency vs. dependency

χ² vs. frequencies vs. probabilities

Conclusion

Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)
  surface features
  paraphrases

Obtained new state-of-the-art results on NC bracketing
  more robust than Lauer (1995)
  more accurate than Keller & Lapata (2004)

Future Work

Recognize ambiguous cases

Bracket more than 3 nouns

Not just bracketing but dependencies: e.g. growth factor alpha

Bracket NPs in general (other POS)
  augment Penn Treebank with NP-internal dependencies

Application to other structural ambiguity problems:
  Prepositional phrase attachment
  Noun phrase coordination

The End

Thank you!

Web Counts: Problems

Page hits are inaccurate
  This may be OK (Keller & Lapata, 2003)

The Web lacks linguistic annotation
  Pr(health|care) = #(“health care”) / #(care)
    health: noun
    care: both verb and noun
    can be adjacent by chance
    can come from different sentences

Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis’ inhibition
