Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech

TRANSCRIPT

Page 1: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Search Engine Statistics Beyond the n-gram:

Application to Noun Compound Bracketing

Preslav Nakov and Marti Hearst

Computer Science Division and SIMS

University of California, Berkeley

Supported by NSF DBI-0317510 and a gift from Genentech

Page 2: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Overview

Unsupervised algorithm
  Applied here to noun compound bracketing, but promising for structural ambiguity generally

Features
  n-grams, χ², MI
  Beyond the n-gram: surface features, paraphrases

State-of-the-art accuracy

Page 3: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [cell line] ] (right bracketing)

In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.


Page 4: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Related Work

Marcus (1980), Pustejovsky & al. (1993), Resnik (1993)
  adjacency model: Pr(w1|w2) vs. Pr(w2|w3)

Lauer (1995)
  dependency model: Pr(w1|w2) vs. Pr(w1|w3)

Keller & Lapata (2004)
  use the Web: unigrams and bigrams

Girju & al. (2005)
  supervised model
  bracketing in context
  requires WordNet senses to be given

Note: Pr(w1|w2) is the probability that w1 precedes w2.

This work:
• χ²
• Web
• n-grams
• paraphrases
• surface features

Page 5: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Adjacency & Dependency (1)

right bracketing: [ w1 [ w2 w3 ] ]
  w2 w3 is a compound (modified by w1): home health care
  w1 and w2 independently modify w3: adult male rat

left bracketing: [ [ w1 w2 ] w3 ]
  only 1 modificational choice possible: law enforcement officer

Page 6: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Adjacency & Dependency (2)

right bracketing: [ w1 [ w2 w3 ] ]
  w2 w3 is a compound (modified by w1)
  w1 and w2 independently modify w3

adjacency model
  Is w2 w3 a compound? (vs. w1 w2 being a compound)

dependency model
  Does w1 modify w3? (vs. w1 modifying w2)

Page 7: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Frequencies

Adjacency model
  Compare #(w1,w2) to #(w2,w3)

Dependency model
  Compare #(w1,w2) to #(w1,w3)

#(w1,w2) is the frequency of w1 w2. In both models, predict left if #(w1,w2) is larger and right if the other count is larger.
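
The two frequency comparisons can be sketched in a few lines. This is only an illustration: the function name, the `count` callable, and the toy counts are assumptions made for the example, not part of the original slides, and the numbers are invented rather than real Web hits.

```python
def bracket_by_frequency(count, w1, w2, w3, model="dependency"):
    """Return 'left', 'right', or None (tie) for the compound w1 w2 w3.

    `count` is any callable mapping a word pair to its frequency.
    """
    left_evidence = count(w1, w2)          # #(w1,w2) supports [[w1 w2] w3]
    if model == "adjacency":
        right_evidence = count(w2, w3)     # #(w2,w3)
    elif model == "dependency":
        right_evidence = count(w1, w3)     # #(w1,w3)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_evidence > right_evidence:
        return "left"
    if right_evidence > left_evidence:
        return "right"
    return None


# Toy counts, invented purely for illustration.
TOY_COUNTS = {("liver", "cell"): 900, ("cell", "line"): 2000, ("liver", "line"): 1500}


def toy_count(a, b):
    return TOY_COUNTS.get((a, b), 0)


print(bracket_by_frequency(toy_count, "liver", "cell", "line", model="adjacency"))
print(bracket_by_frequency(toy_count, "liver", "cell", "line", model="dependency"))
```

With real counts the two models need not agree, which is why the slides treat them as separate predictors.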

Page 8: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Probabilities

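Adjacency model
  Compare Pr(w1w2|w2) to Pr(w2w3|w3)

Dependency model
  Compare Pr(w1w2|w2) to Pr(w1w3|w3)

Pr(w1w2|w2) is the probability that w1 modifies w2. In both models, predict left if Pr(w1w2|w2) is larger and right if the other probability is larger.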

Page 9: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Probabilities: Dependency

Dependency model
  Pr(left) = Pr(w1w2|w2) Pr(w2w3|w3)
  Pr(right) = Pr(w1w3|w3) Pr(w2w3|w3)

So we compare Pr(w1w2|w2) to Pr(w1w3|w3).

BUT! No cancellation in Lauer's model.

[Figure: left and right dependency structures over w1 w2 w3]
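
The cancellation behind this comparison can be written out explicitly. Using the slide's own notation, the factor Pr(w2w3|w3) appears in both products and drops out of the ratio:

```latex
\frac{\Pr(\text{left})}{\Pr(\text{right})}
  = \frac{\Pr(w_1 w_2 \mid w_2)\,\Pr(w_2 w_3 \mid w_3)}
         {\Pr(w_1 w_3 \mid w_3)\,\Pr(w_2 w_3 \mid w_3)}
  = \frac{\Pr(w_1 w_2 \mid w_2)}{\Pr(w_1 w_3 \mid w_3)}
```

A ratio above 1 therefore favours left bracketing, and a ratio below 1 favours right bracketing.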

Page 10: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Probabilities: Estimation

Using page hits as a proxy for n-gram counts

Pr(w1w2|w2) = #(w1,w2) / #(w2)
  #(w2): word frequency; query for “w2”
  #(w1,w2): bigram frequency; query for “w1 w2”
  smoothed by 0.5
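
A minimal sketch of this estimate follows. The slide says only "smoothed by 0.5"; adding 0.5 to both counts is an assumption made here for illustration, as are the function and parameter names.

```python
def conditional_prob(bigram_hits, head_hits, smoothing=0.5):
    """Estimate Pr(w1w2|w2) = #(w1,w2) / #(w2) from page-hit counts.

    Assumption: `smoothing` is added to both counts so that zero hits
    do not produce a zero or undefined ratio.
    """
    return (bigram_hits + smoothing) / (head_hits + smoothing)


def dependency_by_probability(hits_w1w2, hits_w2, hits_w1w3, hits_w3):
    """Compare Pr(w1w2|w2) with Pr(w1w3|w3), as in the dependency model."""
    p_left = conditional_prob(hits_w1w2, hits_w2)
    p_right = conditional_prob(hits_w1w3, hits_w3)
    if p_left == p_right:
        return None
    return "left" if p_left > p_right else "right"
```

In practice the four arguments would come from exact-phrase search queries; how those hits are fetched is outside this sketch.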

Page 11: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Probabilities: Why? (1)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Keller & Lapata (2004) calculate:

                            (a)       (b)
  AltaVista queries         70.49%    68.85%
  British National Corpus   63.11%    65.57%

Page 12: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Probabilities: Why? (2)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Maybe to introduce a bracketing prior, just like Lauer (1995) did.

But otherwise, there is no reason to prefer either one.
  Do we need probabilities? (association is OK)
  Do we need a directed model? (symmetry is OK)

Page 13: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Association Models: χ² (Chi Squared)

A = #(wi,wj)

B = #(wi) - #(wi,wj)

C = #(wj) - #(wi,wj)

D = N - (A+B+C)

N = 8 trillion (= A+B+C+D)
  8 billion Web pages x 1,000 words
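
The χ² formula itself did not survive in this transcript. The sketch below assumes the standard Pearson χ² statistic for a 2×2 contingency table, χ² = N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)), applied to the cells A, B, C, D defined above. The function and parameter names are illustrative.

```python
def chi_squared(hits_ij, hits_i, hits_j, total=8 * 10**12):
    """Pearson chi-squared association between wi and wj from a 2x2 table.

    hits_ij = #(wi,wj), hits_i = #(wi), hits_j = #(wj); `total` is N
    (8 trillion on the slide: 8 billion pages x 1,000 words).
    """
    a = hits_ij
    b = hits_i - hits_ij
    c = hits_j - hits_ij
    d = total - (a + b + c)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence of association
    return total * (a * d - b * c) ** 2 / denominator


# Small hand-checkable table: A=10, B=20, C=30, D=40, N=100.
print(chi_squared(10, 30, 40, total=100))
```

To bracket a compound, the score for (w1, w2) would be compared with the score for (w2, w3) in the adjacency model, or for (w1, w3) in the dependency model.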

Page 14: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features

Authors often disambiguate noun compounds using surface markers, e.g.:
  amino-acid sequence → left
  brain stem’s cell → left
  brain’s stem cell → right

The enormous size of the Web makes them frequent enough to be useful.

Page 15: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Dash (hyphen)

Left dash
  cell-cycle analysis → left

Right dash
  donor T-cell → right
  fiber optics-system → should be left

Double dash
  T-cell-depletion → unusable

Page 16: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Possessive Marker

Attached to the first word
  brain’s stem cell → right

Attached to the second word
  brain stem’s cell → left

Combined features
  brain’s stem-cell → right

Page 17: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Capitalization

don’t-care – lowercase – uppercase
  Plasmodium vivax Malaria → left
  plasmodium vivax Malaria → left

lowercase – uppercase – don’t-care
  brain Stem cell → right
  brain Stem Cell → right

Disabled on:
  Roman numerals
  Single-letter words: e.g. vitamin D deficiency

Page 18: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Embedded Slash

Left embedded slash
  leukemia/lymphoma cell → right

Page 19: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Parentheses

Single-word
  growth factor (beta) → left
  (brain) stem cell → right

Two-word
  (growth factor) beta → left
  brain (stem cell) → right

Page 20: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Colon, dot, semicolon

Following the first word
  home. health care → right
  adult, male rat → right

Following the second word
  health care, provider → left
  lung cancer: patients → left

Page 21: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Dash to External Word

External word to the left
  mouse-brain stem cell → right

External word to the right
  tumor necrosis factor-alpha → left

Page 22: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web-derived Surface Features: Problems & Solutions

Problem: search engines ignore punctuation
  “brain-stem cell” does not work

Solution:
  query for “brain stem cell”
  obtain 1,000 document summaries
  look for the features in these summaries
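
Scanning the retrieved summaries can be sketched with regular expressions. The example covers only two of the surface features (dash and possessive marker); the function name and the sample snippets are invented for illustration, and how the summaries are downloaded is left out.

```python
import re


def surface_votes(snippets, w1, w2, w3):
    """Count left/right votes from dash and possessive markers in snippets."""
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{e1}-{e2}\s+{e3}\b",          # left dash: cell-cycle analysis
            rf"\b{e1}\s+{e2}['’]s\s+{e3}\b",   # possessive on 2nd word: brain stem's cell
        ],
        "right": [
            rf"\b{e1}\s+{e2}-{e3}\b",          # right dash: donor T-cell
            rf"\b{e1}['’]s\s+{e2}\s+{e3}\b",   # possessive on 1st word: brain's stem cell
        ],
    }
    votes = {"left": 0, "right": 0}
    for text in snippets:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Invented snippets standing in for search-result summaries.
sample = [
    "The brain-stem cell count rose.",
    "Brain's stem cell research",
    "brain stem-cell therapy",
]
print(surface_votes(sample, "brain", "stem", "cell"))
```

Because each pattern requires whitespace on one side of the compound, a double-dashed form such as T-cell-depletion matches neither side, in line with the slides treating it as unusable.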

Page 23: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Abbreviation

After the second word
  tumor necrosis (TN) factor → left

After the third word
  tumor necrosis factor (NF) → right

We query for e.g. “tumor necrosis tn factor”

Problems:
  Roman numerals: IV, VI
  States: CA
  Short words: me

Page 24: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Concatenation

Consider health care reform
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812

Adjacency model
  healthcare vs. carereform

Dependency model
  healthcare vs. healthreform

Triples
  “healthcare reform” vs. “health carereform”
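
Using the hit counts quoted on this slide, the concatenation comparison can be sketched as below. The function name and the dictionary layout are assumptions for illustration; only the three counts come from the slide.

```python
# Hit counts quoted on the slide for "health care reform".
CONCAT_HITS = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}


def concatenation_vote(hits, w1, w2, w3, model="adjacency"):
    """Compare the concatenation w1w2 with w2w3 (adjacency) or w1w3 (dependency)."""
    left = hits.get(w1 + w2, 0)
    if model == "adjacency":
        right = hits.get(w2 + w3, 0)
    elif model == "dependency":
        right = hits.get(w1 + w3, 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return None
    return "left" if left > right else "right"


print(concatenation_vote(CONCAT_HITS, "health", "care", "reform", "adjacency"))
print(concatenation_vote(CONCAT_HITS, "health", "care", "reform", "dependency"))
```

Both models favour left bracketing here, since healthcare is far more frequent than either carereform or healthreform.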

Page 25: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Using Google’s *

Each * acts as a one-word wildcard

Single star
  “health care * reform” → left
  “health * care reform” → right

More stars and/or reverse order
  “care reform * * health” → right

Adjacency model
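
The wildcard queries can be generated mechanically. The sketch below builds only the three query shapes shown on this slide, for one up to `max_stars` wildcards; the function name and the `max_stars` parameter are illustrative assumptions, and submitting the queries is left out.

```python
def wildcard_queries(w1, w2, w3, max_stars=2):
    """Return (query, predicted bracketing) pairs using one-word * wildcards."""
    queries = []
    for n in range(1, max_stars + 1):
        stars = " ".join(["*"] * n)
        queries.append((f'"{w1} {w2} {stars} {w3}"', "left"))   # w1 w2 kept together
        queries.append((f'"{w1} {stars} {w2} {w3}"', "right"))  # w2 w3 kept together
        queries.append((f'"{w2} {w3} {stars} {w1}"', "right"))  # reverse order
    return queries


for query, label in wildcard_queries("health", "care", "reform"):
    print(label, query)
```

Hits for the queries voting left would be compared with hits for those voting right.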

Page 26: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Reorder

Reorders for “health care reform”
  “care reform health” → right
  “reform health care” → left

Page 27: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Internal Inflection Variability

First word
  ???

Second word
  tyrosine kinase activation
  tyrosine kinases activation

Page 28: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Other Web-derived Features: Switch The First Two Words

Predict right if we can reorder “adult male rat” as “male adult rat”

Page 29: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Paraphrases (1)

The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

Prepositional
  stem cells in the brain → right
  cells from the brain stem → left

Verbal
  virus causing human immunodeficiency → left
  pain associated with arthritis migraine → left

Copula
  office building that is a skyscraper → right

Page 30: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Paraphrases (2)

Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition:
  of, for, in, at, on, from, with, about, (like)

This could be problematic when more than one preposition is possible

In contrast:
  we try to predict syntax, not semantics
  we do not disambiguate, just add up all counts

  cells in (the) bone marrow → left
  cells from (the) bone marrow → left

Page 31: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Paraphrases (3)

prepositional paraphrases:
  We use: ~150 prepositions

verbal paraphrases:
  We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

copula paraphrases:
  We use: is/was and that/which/who

optional elements:
  articles: a, an, the
  quantifiers: some, every, etc.
  pronouns: this, these, etc.
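
Paraphrase queries of this kind can be generated from templates and their counts simply added up per bracketing. The sketch below is an illustration under stated assumptions: it uses a three-preposition subset and a single optional article, takes the words exactly as given (no inflection handling), and `hit_count` is a hypothetical callable standing in for a search-engine lookup.

```python
from itertools import product

PREPOSITIONS = ["in", "from", "of"]  # small subset; the slides mention ~150
OPTIONAL = ["", "the"]               # optional article


def paraphrase_queries(w1, w2, w3, links=PREPOSITIONS, optional=OPTIONAL):
    """Build exact-phrase paraphrase queries for the compound w1 w2 w3."""
    queries = {"left": [], "right": []}
    for link, det in product(links, optional):
        middle = f"{link} {det}".strip()
        queries["left"].append(f'"{w3} {middle} {w1} {w2}"')   # cells in the bone marrow
        queries["right"].append(f'"{w2} {w3} {middle} {w1}"')  # stem cells in the brain
    return queries


def paraphrase_vote(hit_count, w1, w2, w3):
    """Sum hits over all paraphrases of each type; no disambiguation."""
    queries = paraphrase_queries(w1, w2, w3)
    left = sum(hit_count(q) for q in queries["left"])
    right = sum(hit_count(q) for q in queries["right"])
    if left == right:
        return None
    return "left" if left > right else "right"
```

Verbal and copula paraphrases would follow the same pattern with a different list of linking phrases.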

Page 32: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Evaluation: Datasets

Lauer Set
  244 noun compounds (NCs)
  from Grolier’s encyclopedia
  inter-annotator agreement: 81.5%

Biomedical Set
  430 NCs
  from MEDLINE
  inter-annotator agreement: 88% (κ = .606)

Page 33: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Evaluation: Experiments

Exact phrase queries
  Limited to English

Inflections:
  Lauer Set: Carroll’s morphological tools
  Biomedical Set: UMLS Specialist Lexicon

Page 34: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Results: Lauer (1)

[Chart not preserved in transcript; legend: correct, N/A, wrong]

Page 35: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Results: Lauer (2)

[Chart not preserved in transcript; legend: correct, N/A, wrong]

Page 36: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Results: Lauer (3)

Page 37: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Results: Bio (1)

[Chart not preserved in transcript; legend: correct, N/A, wrong]

Page 38: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Results: Bio (2)

[Chart not preserved in transcript; legend: correct, N/A, wrong]

Page 39: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Individual Surface Features Performance: Bio

Page 40: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Paraphrase and Surface Features Performance

Lauer Set

Biomedical Set

Page 41: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Discussion

[Chart panels: Lauer, Bio]

Adjacency vs. Dependency

χ² vs. frequencies vs. probabilities

Page 42: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Conclusion

Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)
  surface features
  paraphrases

Obtained new state-of-the-art results on NC bracketing
  more robust than Lauer (1995)
  more accurate than Keller & Lapata (2004)

Page 43: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Future Work

Recognize ambiguous cases

Bracket more than 3 nouns

Not just bracketing but dependencies:
  e.g. growth factor alpha

Bracket NPs in general (other POS)
  augment Penn Treebank with NP-internal dependencies

Application to other structural ambiguity problems:
  Prepositional phrase attachment
  Noun phrase coordination

Page 44: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

The End

Thank you!

Page 45: Search Engine Statistics Beyond the n-gram:  Application to Noun Compound Bracketing

Web Counts: Problems

Page hits are inaccurate
  This may be OK (Keller & Lapata, 2003)

The Web lacks linguistic annotation
  Pr(health|care) = #(“health care”) / #(care)
  health: noun
  care: both verb and noun
  can be adjacent by chance
  can come from different sentences

Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis’ inhibition