The Web as a Corpus: Going Beyond the n-gram Preslav Nakov National University of Singapore (joint work with Marti Hearst, UC Berkeley) University of Heidelberg February 3, 2011


The Web as a Corpus: Going Beyond the n-gram

Preslav Nakov
National University of Singapore

(joint work with Marti Hearst, UC Berkeley)

University of Heidelberg
February 3, 2011


Plan

Introduction

Surface Features & Paraphrases

Syntactic Tasks

Semantic Tasks

Application to Machine Translation


Introduction


NLP: The Dream

Dave Bowman: “Open the pod bay doors, HAL”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

This is too hard!

So, we tackle sub-problems instead.


How to Tackle the Problem?

The field was stuck for quite some time.
  e.g., CYC: manually annotate all semantic concepts and relations

A new statistical approach started in the 90s
  Get large text collections.
  Compute statistics (over the words).


Size Matters

Banko & Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL 

Spelling correction: Which word should we use? <principal> or <principle>

Use context:

I am in my third year as the principal of Anamosa High School.

Power without principle is barren, but principle without power is futile. (Tony Blair)
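To make "use context" concrete, here is a minimal sketch of confusion-set disambiguation. It is a toy context-word voting scheme trained on the two example sentences above, not one of the learners evaluated by Banko & Brill; the function names `train` and `choose` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_sentences):
    """Map each candidate word to a Counter of the words seen around it."""
    model = {}
    for target, sentence in labeled_sentences:
        context = [w for w in tokenize(sentence) if w != target]
        model.setdefault(target, Counter()).update(context)
    return model

def choose(model, sentence_with_gap):
    """Pick the candidate whose training contexts overlap most with the sentence."""
    words = tokenize(sentence_with_gap)
    scores = {t: sum(ctx[w] for w in words) for t, ctx in model.items()}
    return max(scores, key=scores.get)

# Training data: the two example sentences from the slide.
model = train([
    ("principal", "I am in my third year as the principal of Anamosa High School."),
    ("principle", "Power without principle is barren, but principle without power is futile."),
])
print(choose(model, "She is the ___ of the high school"))
print(choose(model, "Power without ___ is futile"))
```

With only two training sentences this is purely illustrative; the point of the next slide is that the same kind of learner keeps improving as the training set grows.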


Banko & Brill ’01

Log-linear improvement even to a billion words!
Getting more data is better than fine-tuning algorithms!

Bigger is better than smarter!

Great idea! Can it be extended to other tasks?

For this problem, one can get a lot of training data.


Web as a Baseline

“Web as a baseline” (Lapata & Keller 04;05): applied simple n-gram models to
  machine translation candidate selection
  article generation
  noun compound interpretation
  noun compound bracketing
  adjective ordering
  spelling correction
  countability detection
  prepositional phrase attachment

Their conclusion: => Web n-grams should be used as a baseline.

Significantly better than the best supervised algorithm.

Not significantly different than the best supervised algorithm.

These are all UNSUPERVISED!


Contribution

New features
  paraphrases
  surface features

The ultimate goal
  Use the Web as a corpus, and not just as a source of page hit frequencies!


Noun Compound Bracketing


Noun Compound Bracketing: The Problem

(a) [ liver [ cell line ] ]       (right bracketing)
(b) [ [ liver cell ] antibody ]   (left bracketing)

In (a), the cell line is derived from the liver.
In (b), the antibody targets the liver cell.


Measuring Word Associations

Using n-gram Statistics

Frequencies
  Dependency: #(w1,w2) vs. #(w1,w3)
  Adjacency:  #(w1,w2) vs. #(w2,w3)

Probabilities
  Dependency: Pr(w1→w2|w2) vs. Pr(w1→w3|w3)
  Adjacency:  Pr(w1→w2|w2) vs. Pr(w2→w3|w3)

Also: Pointwise Mutual Information, Chi Square, etc.

[Figure: the words w1, w2, w3 with arcs labeled "adjacency" and "dependency"]
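The frequency-based comparison can be sketched as below. This is a minimal illustration: the function name `bracket`, the tie handling ("undecided") and the bigram counts are assumptions made for the example, not figures from the talk.

```python
def bracket(counts, w1, w2, w3, model="adjacency"):
    """Predict 'left' or 'right' bracketing for the compound w1 w2 w3.

    counts maps (word, word) pairs to co-occurrence frequencies.
    Adjacency compares  #(w1,w2) vs. #(w2,w3).
    Dependency compares #(w1,w2) vs. #(w1,w3).
    """
    left_evidence = counts.get((w1, w2), 0)
    if model == "adjacency":
        right_evidence = counts.get((w2, w3), 0)
    elif model == "dependency":
        right_evidence = counts.get((w1, w3), 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_evidence > right_evidence:
        return "left"
    if right_evidence > left_evidence:
        return "right"
    return "undecided"  # assumption: the slides do not say how ties are broken

# Hypothetical counts, chosen only to exercise both models.
counts = {("liver", "cell"): 100, ("cell", "line"): 5000, ("liver", "line"): 300}
print(bracket(counts, "liver", "cell", "line", model="adjacency"))
print(bracket(counts, "liver", "cell", "line", model="dependency"))
```

The probability-based variants would replace the raw counts with conditional probability estimates, but the comparison logic stays the same.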


Web-derived Surface Features

Observations
  Authors often disambiguate noun compounds using surface markers.
  The enormous size of the Web makes them frequent enough to be useful.

Idea
  Look for instances of the target noun compound where it occurs with suitable surface markers.

Here starts the new work…


Web-derived Surface Features: Dash (hyphen)

Left dash
  cell-cycle analysis → left

Right dash
  donor T-cell → right


Web-derived Surface Features: Possessive Marker

Attached to the first word
  brain’s stem cell → right

Attached to the second word
  brain stem’s cell → left


Web-derived Surface Features: Capitalization

don’t-care – lowercase – uppercase
  Plasmodium vivax Malaria → left
  plasmodium vivax Malaria → left

lowercase – uppercase – don’t-care
  brain Stem cell → right
  brain Stem Cell → right


Web-derived Surface Features: Embedded Slash

Left embedded slash
  leukemia/lymphoma cell → right


Web-derived Surface Features: Parentheses

Single word
  growth factor (beta) → left
  (brain) stem cell → right

Two words
  (growth factor) beta → left
  brain (stem cell) → right


Web-derived Surface Features: Comma, dot, colon, semicolon, …

Following the second word
  lung cancer: patients → left
  health care, provider → left

Following the first word
  home. health care → right
  adult, male rat → right


Web-derived Surface Features: Dash to External Word

External word to the left
  mouse-brain stem cell → right

External word to the right
  tumor necrosis factor-alpha → left


Web-derived Surface Features: Problems & Solutions

Problem: search engines ignore punctuation
  “brain-stem cell” does not work

Solution:
  query for “brain stem cell”
  obtain 1,000 document summaries
  scan for the features in these summaries

One can get many more than 1,000 results using the “*” operator and inflections.
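The "scan for the features in these summaries" step might look roughly like the sketch below. It covers only three of the markers from the preceding slides (dash, possessive, two-word parentheses); the function name `surface_votes`, the regular expressions and the sample snippets are assumptions made for illustration, and fetching the summaries from a search engine is left out.

```python
import re
from collections import Counter

def surface_votes(w1, w2, w3, snippets):
    """Count left/right bracketing votes for 'w1 w2 w3' from surface markers."""
    a, b, c = (re.escape(w.lower()) for w in (w1, w2, w3))
    patterns = {
        # Markers that group w1 with w2 suggest left bracketing.
        "left": [
            rf"\b{a}-{b} {c}\b",      # left dash, e.g. cell-cycle analysis
            rf"\b{a} {b}'s {c}\b",    # possessive on the second word
            rf"\({a} {b}\) {c}\b",    # first two words in parentheses
        ],
        # Markers that group w2 with w3 suggest right bracketing.
        "right": [
            rf"\b{a} {b}-{c}\b",      # right dash, e.g. donor T-cell
            rf"\b{a}'s {b} {c}\b",    # possessive on the first word
            rf"\b{a} \({b} {c}\)",    # last two words in parentheses
        ],
    }
    votes = Counter()
    for snippet in snippets:
        # Lowercase and normalize the typographic apostrophe before matching.
        text = snippet.lower().replace("\u2019", "'")
        for direction, regexes in patterns.items():
            for regex in regexes:
                votes[direction] += len(re.findall(regex, text))
    return votes

# Hypothetical summaries standing in for search-engine results.
snippets = [
    "The brain-stem cell line was studied.",
    "The brain's stem cell population",
    "brain (stem cell) research",
    "brain stem cell",
]
votes = surface_votes("brain", "stem", "cell", snippets)
print(votes["left"], votes["right"])
```

Lowercasing the snippets discards the capitalization feature; a fuller version would test that marker on the original text before normalizing.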


Other Web-derived Features: Abbreviation

After the second word
  tumor necrosis (TN) factor → left

After the third word
  tumor necrosis factor (NF) → right

Query for, e.g.,
  “tumor necrosis tn factor”
  “tumor necrosis factor nf”


Other Web-derived Features: Concatenation

Consider “health care reform”
  healthcare:   79,500,000
  carereform:   269
  healthreform: 812

Adjacency model
  healthcare vs. carereform

Dependency model
  healthcare vs. healthreform

Triples
  “healthcare reform” vs. “health carereform”

Tests for lexicalization

[Figure: the words w1, w2, w3 with arcs labeled "adjacency" and "dependency"]


Other Web-derived Features: Using the star operator “*”

Single star
  “health care * reform” → left
  “health * care reform” → right

More stars and/or reverse order
  “care reform * * health” → right
  “reform * * * health care” → left
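The query patterns above can be generated mechanically. The sketch below is one possible reading of the slide: the function name `star_queries` and the choice to try every star count from 1 to `max_stars` in the reversed patterns are assumptions, since the slide shows only one example of each.

```python
def star_queries(w1, w2, w3, max_stars=3):
    """Map each exact-phrase query to the bracketing its hits would support."""
    queries = {
        f'"{w1} {w2} * {w3}"': "left",   # gap after the first two words
        f'"{w1} * {w2} {w3}"': "right",  # gap before the last two words
    }
    for n in range(1, max_stars + 1):
        stars = " ".join("*" * n)
        # Reverse order: the pair that stays together is the bracketed unit.
        queries[f'"{w2} {w3} {stars} {w1}"'] = "right"
        queries[f'"{w3} {stars} {w1} {w2}"'] = "left"
    return queries

queries = star_queries("health", "care", "reform")
for query, direction in queries.items():
    print(query, "->", direction)
```

Each query's hit count would then be added to the evidence for the direction it maps to.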


Other Web-derived Features: Reorder

Reorders for “health care reform”
  “care reform health” → right
  “reform health care” → left


Other Web-derived Features: Internal Inflection Variability

First word → right
  bone mineral density
  bones mineral density

Second word → left
  bone mineral density
  bone minerals density


Other Web-derived Features: Switch The First Two Words

Predict right, if we can reorder
  adult male rat
as
  male adult rat


Paraphrases

“bone marrow cell”: left- or right-bracketed?

Prepositional
  cells in (the) bone marrow → left (61,700)
  cells from (the) bone marrow → left (16,500)
  marrow cells from (the) bone → right (12)

Verbal
  cells extracted from (the) bone marrow → left (17)
  marrow cells found in (the) bone → right (1)

Copula
  cells that are bone marrow → left (3)
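One simple way to turn these paraphrase counts into a decision is to sum the hits per direction. The counts below are the ones on the slide; summing them and taking the larger total is an assumption made for illustration, as the slide does not say how the counts are combined.

```python
from collections import Counter

# (paraphrase, predicted bracketing, hit count) as listed on the slide.
evidence = [
    ("cells in (the) bone marrow", "left", 61700),
    ("cells from (the) bone marrow", "left", 16500),
    ("marrow cells from (the) bone", "right", 12),
    ("cells extracted from (the) bone marrow", "left", 17),
    ("marrow cells found in (the) bone", "right", 1),
    ("cells that are bone marrow", "left", 3),
]

def paraphrase_vote(evidence):
    """Sum hit counts per direction and return (winner, totals)."""
    totals = Counter()
    for _paraphrase, direction, hits in evidence:
        totals[direction] += hits
    winner = max(totals, key=totals.get)
    return winner, totals

winner, totals = paraphrase_vote(evidence)
print(winner, dict(totals))
```

For “bone marrow cell” the left-bracketing paraphrases dominate by several orders of magnitude, so any reasonable combination rule would give the same answer here.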


Evaluation

Method: Exact phrase queries limited to English

Dataset: Lauer’s Dataset
  244 noun compounds from Grolier’s encyclopedia


Evaluation Results (1)

Co-occurrences


Evaluation Results (2)

Paraphrases, surface features, majority vote


Comparison to Others


Application: Query Segmentation

Segmentation
  [ used car parts ]
  [ used car ] [ parts ]
  [ used ] [ car parts ]
  [ used ] [ car ] [ parts ]

Bracketing
  [ [ used car ] parts ]
  [ used [ car parts ] ]

S. Bergsma, Q. Wang. Learning Noun Phrase Query Segmentation. EMNLP'07, pp. 819-826.

David Vadas and James R. Curran. Adding Noun Phrase Structure to the Penn Treebank. ACL’07.
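The four segmentations listed for “used car parts” are all the ways of cutting, or not cutting, between adjacent words. A minimal sketch that enumerates them follows; the function name `segmentations` is hypothetical, and scoring the candidates is outside its scope.

```python
from itertools import product

def segmentations(words):
    """Enumerate every way to split a word sequence into contiguous segments."""
    results = []
    # One boolean per gap between adjacent words: True means "cut here".
    for cuts in product([False, True], repeat=len(words) - 1):
        segments, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                segments.append(current)
                current = [word]
            else:
                current.append(word)
        segments.append(current)
        results.append(segments)
    return results

all_segs = segmentations(["used", "car", "parts"])
for seg in all_segs:
    print(" ".join("[ " + " ".join(s) + " ]" for s in seg))
```

A query of n words has 2^(n-1) segmentations, so exhaustive enumeration is only practical for short queries.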


Prepositional Phrase Attachment


PP attachment

(a) Peter spent millions of dollars. (noun)
    PP combines with the NP to form another NP
(b) Peter spent time with his family. (verb)
    PP is an indirect object of the verb

quadruple: (v, n1, p, n2)
(a) (spent, millions, of, dollars)
(b) (spent, time, with, family)


Results

Simpler but not significantly different from 84.3% (Pantel & Lin, 00).


Noun Phrase Coordination


NP Coordination: Ellipsis

Ellipsis
  car and truck production
  means car production and truck production

No ellipsis
  president and chief executive


NP Coordination: Ellipsis

Penn Treebank annotations

ellipsis:
  (NP car/NN and/CC truck/NN production/NN)

no ellipsis:
  (NP (NP president/NN) and/CC (NP chief/NN executive/NN))


Results

428 examples from Penn TB

Comparable to other researchers (but no standard dataset).


Paraphrasing Noun Compounds


Noun Compound Semantics

Traditionally – choose one abstract relation
  Fixed set of abstract relations (Girju & al., 2005)
    malaria mosquito → CAUSE
    olive oil → SOURCE
  Prepositions (Lauer, 1995)
    malaria mosquito → WITH
    olive oil → FROM
  Recoverably Deletable Predicates (Levi, 1978)
    malaria mosquito → CAUSE
    olive oil → FROM

Our approach: use multiple paraphrasing verbs
  Paraphrasing verbs
    malaria mosquito → carries, spreads, causes, transmits, brings, has
    olive oil → comes from, is obtained from, is extracted from


NC Semantics: Method

For NC “noun1 noun2”, query for:

  “noun2 THAT * noun1”

THAT can be that, which or who; up to 8 “*”s.

POS tag the snippets.
Extract verbal paraphrases.

post-modifier

pre-modifier
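Generating the query set described above is straightforward. The sketch below is a minimal illustration; the function name `paraphrase_queries` is hypothetical, and it covers only query construction, not snippet retrieval or POS tagging.

```python
def paraphrase_queries(noun1, noun2, max_stars=8):
    """Build exact-phrase queries of the form "noun2 THAT * ... * noun1"."""
    queries = []
    for that in ("that", "which", "who"):
        for n in range(1, max_stars + 1):
            stars = " ".join("*" * n)
            queries.append(f'"{noun2} {that} {stars} {noun1}"')
    return queries

nc_queries = paraphrase_queries("malaria", "mosquito")
print(len(nc_queries))
print(nc_queries[0])
print(nc_queries[-1])
```

With three relative pronouns and up to eight stars, each noun compound yields 24 queries.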


NC Semantics: Sample Verbal Paraphrases

Verbs+prepositions for migraine treatment

  7 prevent
  3 be given for
  3 be for
  2 reduce
  2 benefit
  1 relieve


Dynamic componential analysis 

Classic componential analysis

Example: Treatments


Comparing to (Girju&al.,05)

14 out of 21 relations are shown.


Amazon’s Mechanical Turk: Malaria Mosquito

Five judges:
  5 carries
  3 causes
  2 transmits
  2 infects with
  1 has
  1 supplies

The program:
  23 carry
  16 spread
  12 cause
  9 transmit
  7 bring
  4 have
  3 be infected with
  3 be responsible for
  …


MTurk: Comparison to 30 Humans


Average cosine correlation

Average cosine correlation (in %) between human- and program-generated verbs for the Levi-250 dataset.
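As an illustration of the measure, the sketch below computes the cosine between the human and program verb vectors for “malaria mosquito”, using the frequencies from the Mechanical Turk slide. Two caveats: the human verbs are lemmatized by hand here so that they can match the program's lemmas (an assumption), and the program's list is truncated on the slide, so the value is indicative only.

```python
import math

def cosine(a, b):
    """Cosine between two sparse frequency vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Human-proposed verbs (five judges), lemmatized by hand for this example.
human = {"carry": 5, "cause": 3, "transmit": 2, "infect with": 2,
         "have": 1, "supply": 1}
# Program-extracted verbs as far as the slide lists them.
program = {"carry": 23, "spread": 16, "cause": 12, "transmit": 9, "bring": 7,
           "have": 4, "be infected with": 3, "be responsible for": 3}

print(f"{100 * cosine(human, program):.1f}%")
```

Averaging this value over all noun compounds in a dataset gives the kind of figure the slide reports.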


Levi’s Recoverably Deletable Predicates


MTurk (human) vs. Web (program): Aggregated by Levi’s RDP

Cosine correlation (in %) between the human- and the program-generated verbs by Levi’s RDP: using all human-proposed verbs vs. using the first verb from each worker only.


Average Cosine Correlation

Left: calculated for each noun compound
Right: aggregated by relation


Predicting Abstract Semantic Relations


Levi’s Recoverably Deletable Predicates


Search Engine Queries

Given noun1 and noun2, query for:

  “noun2 * noun1”
  “noun1 * noun2”

Use up to 8 “*”s.

POS tag the snippets.
Extract: verbs, prep, verb+prep, coordinations.
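The extraction step could be sketched as follows, assuming the snippet has already been POS-tagged with Penn Treebank-style tags (VB* for verbs, IN for prepositions, CC for coordinating conjunctions). The function name `extract_features`, the feature labels and the hand-tagged sample are assumptions; the slides do not specify the tagger, lemmatization or matching rules.

```python
def extract_features(tagged, noun1, noun2):
    """Collect verbs, prepositions, verb+prep pairs and coordinations
    found between the two nouns in a tagged snippet.

    tagged is a list of (token, POS tag) pairs.
    """
    tokens = [tok.lower() for tok, _tag in tagged]
    if noun1 not in tokens or noun2 not in tokens:
        return []
    start, end = sorted((tokens.index(noun1), tokens.index(noun2)))
    between = tagged[start + 1:end]
    features = []
    for i, (tok, tag) in enumerate(between):
        tok = tok.lower()
        if tag.startswith("VB"):
            features.append(("v", tok))
            # A verb directly followed by a preposition also yields verb+prep.
            if i + 1 < len(between) and between[i + 1][1] == "IN":
                features.append(("v+p", f"{tok} {between[i + 1][0].lower()}"))
        elif tag == "IN":
            features.append(("p", tok))
        elif tag == "CC":
            features.append(("c", tok))
    return features

# Hand-tagged snippet standing in for tagger output.
snippet = [("member", "NN"), ("appointed", "VBN"), ("by", "IN"),
           ("the", "DT"), ("committee", "NN")]
feats = extract_features(snippet, "committee", "member")
print(feats)
```

Counting these features over all snippets for a noun pair gives the frequency vector used to represent that pair.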


Most Frequent Features for committee member


Predicting Semantic Relations: Levi’s RDPs

v – verb
p – preposition
c – coordinating conjunction

Vector-space model
kNN Classifier
Dice coefficient (freqs)
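A nearest-neighbour classifier over frequency vectors with a Dice coefficient might be sketched as below. The frequency-weighted form used here, 2·Σ min(a_i, b_i) / (Σ a_i + Σ b_i), is an assumption about what "Dice coefficient (freqs)" means on the slide, and the training vectors are hypothetical.

```python
from collections import Counter

def dice(a, b):
    """Frequency-weighted Dice coefficient between two sparse vectors."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

def knn_predict(query, labeled, k=1):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(labeled, key=lambda ex: dice(query, ex[0]), reverse=True)
    top_labels = [label for _vec, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical training examples: (feature vector, relation label).
labeled = [
    ({"v:cause": 5, "v:carry": 3}, "CAUSE"),
    ({"p:from": 6, "v:come from": 2}, "FROM"),
]
query = {"v:cause": 2, "v:spread": 1}
print(knn_predict(query, labeled))
```

The feature keys combine the feature type (v, p, c) with the extracted word, mirroring the legend on the slide.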


Relations Between Complex Nominals


SemEval’07: Data

There are seven such relations.


SemEval’07: Results

Using up to 10 stars: 67.0

kNN classifier with the Dice coefficient


SemEval’07: Results

Using up to 10 stars: 68.1

kNN classifier with the Dice coefficient


SAT Analogy Questions


SAT Analogy Questions


SAT: Nouns Only


Head-Modifier Relations in Noun-Noun Compounds


30 Relations from (Nastase & Szpakowicz, 2003)


Noun-Modifier Relations: 30 classes

v – verb
p – preposition
c – coordinating conjunction


Application to Machine Translation


MT: Parallel Text


Paraphrasing the Phrase Table (1)

Phrase Table Entry
  , spain 's economy ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718

Paraphrased Entries
  , economy of spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
  , the economy of spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
  , spain economy ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
  , economy of a spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
  , economy of an spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
  , economy of the spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718

Web-based filtering
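The expansion shown above keeps the target side and the scores and swaps in each paraphrase of the source side. A minimal sketch under stated assumptions: the source paraphrases are supplied by the caller (generating them is not shown), and the Web-based filter is modelled as a hypothetical `web_count` lookup with a minimum-count threshold, which the slide does not specify.

```python
def paraphrase_entry(entry, source_paraphrases, web_count=None, min_count=1):
    """Create new phrase-table entries that reuse the target side and scores.

    entry has the form "source ||| target ||| scores".
    web_count, if given, maps a phrase to a frequency; paraphrases whose
    frequency is below min_count are dropped.
    """
    _source, target, scores = (field.strip() for field in entry.split("|||"))
    kept = [p for p in source_paraphrases
            if web_count is None or web_count(p) >= min_count]
    return [f"{p} ||| {target} ||| {scores}" for p in kept]

entry = ", spain 's economy ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718"
candidates = [", economy of spain", ", the economy of spain", ", economy of an spain"]

# Hypothetical frequencies standing in for Web counts.
fake_counts = {", economy of spain": 120, ", the economy of spain": 450}
new_entries = paraphrase_entry(entry, candidates,
                               web_count=lambda p: fake_counts.get(p, 0))
for line in new_entries:
    print(line)
```

Here the ungrammatical “economy of an spain” is dropped because it has no support in the stand-in counts, which is the intended effect of the filtering step.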


Paraphrasing the Phrase Table (2)


Paraphrasing the Training Corpus


Paraphrasing a Sentence


Paraphrasing NPs/NCs

purely syntactic

use Web stats


Results


Conclusion


Conclusion

Tapped the potential of very large corpora for unsupervised algorithms:
  Go beyond n-grams
    Surface features
    Paraphrases
  Results
    competitive with the best unsupervised algorithms
    can rival supervised algorithms


Resume

Surface Features & Paraphrases

Syntactic Tasks
  Noun Compound Bracketing
  Prepositional Phrase Attachment
  Noun Compound Coordination

Semantic Tasks
  Paraphrasing Noun Compounds
  Predicting Abstract Semantic Relations
  Relations Between Complex Nominals
  SAT Analogy Questions
  Head-Modifier Relations

Application
  Machine Translation


Future Work

New exciting features

Other problems

Use fewer queries

Use the Web as a corpus, and not just as a source of page hit frequencies!


Thank You

Questions?