Autonomous Web-scale Information Extraction
Doug Downey. Advisor: Oren Etzioni. Department of Computer Science and Engineering, Turing Center, University of Washington.


Page 1:

Autonomous Web-scale Information Extraction

Doug Downey. Advisor: Oren Etzioni. Department of Computer Science and Engineering, Turing Center, University of Washington

Page 2:

Web Information Extraction

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Edison invented the light bulb… => (Edison, light bulb) ∈ Invented

x V y => (x, y) ∈ V

e.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]
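The pattern-matching idea can be sketched with a toy regex extractor. This is an illustration only, not KnowItAll's or TextRunner's actual machinery; real systems use part-of-speech tagging and noun-phrase chunking rather than a bare regex, and the sentence here is hypothetical.

```python
import re

# Toy Hearst-style extractor: "C such as x" yields the candidate x ∈ C.
# Simplification: x is assumed to be a run of capitalized words, so
# multi-word lowercase phrases and determiners ("the United States") are missed.
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+ ?)+)")

def extract(sentence):
    """Return (class, instance) candidates found in one sentence."""
    return [(cls, inst.strip()) for cls, inst in PATTERN.findall(sentence)]

print(extract("We visited cities such as Chicago last year."))
# [('cities', 'Chicago')]
```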

Page 3:

Identifying correct extractions

…mayors of major cities such as Giuliani… => Giuliani ∈ City

Supervised IE: hand-label examples of each concept

Not possible on the Web (far too many concepts)

=> Unsupervised IE (UIE)

How can we automatically identify correct extractions for any concept without hand-labeled data?

Page 4:

KnowItAll Hypothesis (KH)

Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct.

Repetitions of the same error are relatively rare

…mayors of major cities such as Giuliani…
…hotels in popular cities such as Marriot…

Misinformation is the exception rather than the rule

“Elvis killed JFK” – 200 hits
“Oswald killed JFK” – 3000 hits

Page 5:

Redundancy

KH can identify many correct statements because the Web is highly redundant

– same facts repeated many times, in many ways – e.g., “Edison invented the light bulb” – 10,000 hits

(but leveraging the KH is a little tricky => probabilistic model)

Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.

Page 6:

Outline

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Page 7:

Classical Supervised Learning

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

[Figure: labeled points in (x1, x2) space]

Page 8:

Semi-Supervised Learning (SSL)

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x)

[Figure: a few labeled points among many unlabeled points in (x1, x2) space]

Page 9:

Monotonic Features

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1 and unlabeled examples (x)

[Figure: unlabeled points in (x1, x2) space]

Page 10:

Monotonic Features

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1 and unlabeled examples (x)

P(y = 1 | x1) increases with x1

[Figure: unlabeled points in (x1, x2) space]

Page 11:

Common Structure

Task: Monotonic Feature
• UIE: “C such as x” [Etzioni et al., 2005]
• Word Sense Disambiguation: “plant and animal species” [Yarowsky, 1995]
• Information Retrieval: search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
• Document Classification: topic word, e.g., “politics” [McCallum & Nigam, 1999; Gliozzo, 2005]
• Named Entity Recognition: contains(“Mr.”) [Collins & Singer, 1998]

Page 12:

Isn’t this just ___ ?

The MF model is provably distinct from standard smoothness assumptions in SSL (the cluster assumption and the manifold assumption) => MFs can complement other methods.

Unlike co-training, the MF model doesn’t require labeled data or pre-defined “views”.

Page 13:

Theoretical Results

One MF implies PAC-learnability without labeled data, when the MF is conditionally independent of the other features and is minimally informative (a corollary to the co-training theorem [Blum and Mitchell, 1998]).

MFs provide more information (vs. labels) about unlabeled examples as the feature space grows: as the number of features increases, the information gain due to MFs stays constant, while the information gain due to labeled examples falls (under assumptions).

Page 14:

Classification with the MF Model

MFA: Given MFs and unlabeled data,
1) Use the MFs to produce noisy labels
2) Train any classifier
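The MFA recipe can be sketched in a few lines. Everything below is hypothetical illustration: the synthetic data, the median threshold for producing noisy labels, and the nearest-centroid stand-in for "any classifier" are all my assumptions, not the thesis's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: x1 is a monotonic feature, i.e.,
# P(y = 1 | x1) increases with x1; the remaining features carry weaker signal.
n = 1000
y = rng.integers(0, 2, n)
x1 = rng.poisson(1 + 4 * y)                  # monotonic feature
rest = rng.normal(y[:, None], 2.0, (n, 3))   # other, noisier features

# Step 1: use the MF alone to produce noisy labels -- no hand labels involved.
noisy = (x1 > np.median(x1)).astype(int)

# Step 2: train "any classifier" on the other features against the noisy
# labels; a nearest-centroid classifier keeps the sketch dependency-free.
c0 = rest[noisy == 0].mean(axis=0)
c1 = rest[noisy == 1].mean(axis=0)
pred = (np.linalg.norm(rest - c1, axis=1) <
        np.linalg.norm(rest - c0, axis=1)).astype(int)

print("agreement with true labels:", (pred == y).mean())
```

Note that the true labels y are used only to evaluate the result, never during training.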

Page 15:

Experimental Results

20 Newsgroups dataset (MF: newsgroup name)

vs. two SSL baselines (NB + EM, LP)

Without labeled data:

Page 16:

Experimental Results

MFA-SSL provides a 15% error reduction for 100-400 labeled examples.

MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Page 17:

Bad News: confusable MFs

For more complex tasks, monotonicity is insufficient.

Example: City extractions

MF: extraction frequency with, e.g., “cities such as x”

…but this is also an MF for: has skyscrapers, has an opera house, located on Earth, …

Extraction    MF value
New York      1488
Chicago       999
Los Angeles   859
…             …
Twisp         1
Northeast     1

Page 18:

Performance of MFA in UIE

Page 19:

MFA for SSL in UIE

Page 20:

Outline

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Page 21:

Redundancy: Single Pattern

Consider a single pattern suggesting C, e.g.,

countries such as x

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Page 22:

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

Page 23:

Naïve Model: Noisy-Or

Pnoisy-or(x ∈ C | x seen k times) = 1 − (1 − p)^k

p = probability the pattern yields a correct extraction; here p = 0.9

[Agichtein & Gravano, 2000; Lin et al., 2003]

C = Country, n = 10

Extraction       k   Pnoisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C
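The noisy-or estimate is a one-liner; a quick sketch using the slide's numbers:

```python
def p_noisy_or(k, p=0.9):
    """P(x in C | x seen k times) under the noisy-or model: 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

print(p_noisy_or(2))  # ≈ 0.99, as for Saudi Arabia and Japan (k = 2)
print(p_noisy_or(1))  # ≈ 0.9, as for the k = 1 extractions
```

Because the estimate depends only on k, every extraction seen once gets probability p regardless of sample size or class, which is exactly the weakness the next slides illustrate.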

Page 24:

Needed in Model: Sample Size

C = Country, n = 10

Extraction       k   Pnoisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

C = Country, n ≈ 50,000

Extraction           k      Pnoisy-or
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

As sample size increases, noisy-or becomes inaccurate.

Page 25:

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n)

C = Country, n ≈ 50,000

Extraction           k      Pnoisy-or
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

Page 26:

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n)

C = Country, n ≈ 50,000

Extraction           k      Pfreq
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Page 27:

Needed in Model: Distribution of C

C = City, n ≈ 50,000

Extraction       k      Pfreq
New York         1488   0.9999…
Chicago          999    0.9999…
…                …      …
El Estor         1      0.05
Nikki            1      0.05
Ragaz            1      0.05
Villegas         1      0.05
Northeastwards   1      0.05

C = Country, n ≈ 50,000

Extraction           k      Pfreq
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Probability x ∈ C depends on the distribution of C.

Page 28:

My solution: URNS Model

…cities such as Tokyo…

Urn for C = City: Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Page 29:

Urn – Formal Definition

C – set of unique target labels
E – set of unique error labels
num(C) – distribution of target labels
num(E) – distribution of error labels

Page 30:

Urn Example

Urn for C = City: U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

distribution of target labels: num(C) = {2, 2, 1, 1, 1}
distribution of error labels: num(E) = {2, 1}

Page 31:

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

Page 32:

Computing Probabilities

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

P(x ∈ C | x appears k of n) = [ Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) ] / [ Σ_{r′ ∈ num(C) ∪ num(E)} (r′/s)^k (1 − r′/s)^(n−k) ]

where s is the total number of balls in the urn
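The urn probability can also be checked by simulation. This is a hedged sketch, not the thesis's derivation: it uses the toy urn from the example slide (num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}) and Monte Carlo sampling in place of the closed-form expression.

```python
import random
from collections import Counter

random.seed(0)

# Toy urn from the example: target labels with frequencies {2, 2, 1, 1, 1},
# error labels (U.K., Utah) with frequencies {2, 1}; s = 10 balls in total.
balls = (["Tokyo"] * 2 + ["Atlanta"] * 2 + ["Sydney", "Cairo", "Yakima"] +
         ["U.K."] * 2 + ["Utah"])
targets = {"Tokyo", "Atlanta", "Sydney", "Cairo", "Yakima"}

def p_correct_given_k(k, n, trials=20000):
    """Monte Carlo estimate of P(x in C | x appears k times in n draws
    with replacement), averaged over labels that achieve count k."""
    hits = total = 0
    for _ in range(trials):
        counts = Counter(random.choices(balls, k=n))
        for label, c in counts.items():
            if c == k:
                total += 1
                hits += label in targets
    return hits / total if total else float("nan")

print(p_correct_given_k(k=2, n=10))
```

Labels drawn twice in ten draws are more likely to be targets than errors here, but far from certain, since the error label U.K. is just as frequent in the urn as Tokyo.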

Page 33:

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf: frequency of the ith element ∝ i^(−z)

With assumptions, learn the Zipfian parameters for any class C from unlabeled data alone
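The Zipf assumption (frequency of the ith element proportional to i^(−z)) can be sketched directly; the element count, exponent, and total below are hypothetical values for illustration.

```python
import numpy as np

def zipf_frequencies(num_elements, z, total):
    """Frequencies proportional to i^(-z), scaled to sum to `total`."""
    ranks = np.arange(1, num_elements + 1)
    weights = ranks ** (-float(z))
    return total * weights / weights.sum()

freqs = zipf_frequencies(num_elements=5, z=1.0, total=100)
print(np.round(freqs, 1))  # the head ranks get most of the mass
```

Larger z concentrates more of the mass in the head of the distribution, which is what makes the head/tail split exploitable when separating num(C) from num(E).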

Page 34:

URNS without labeled data

The observed frequency distribution is a mixture: with probability p a draw comes from the target distribution num(C) (Zipf), and with probability 1 − p from the error distribution num(E) (Zipf).

p and num(E) are roughly constant across classes C, for a given pattern => learn num(C) from unlabeled data!

Page 35:

Probabilities Assigned by URNS

C = City, n ≈ 50,000

Extraction       k      PURNS
New York         1488   0.9999…
Chicago          999    0.9999…
…                …      …
El Estor         1      0.63
Nikki            1      0.63
Ragaz            1      0.63
Villegas         1      0.63
Cres             1      0.63
Northeastwards   1      0.63

C = Country, n ≈ 50,000

Extraction           k      PURNS
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.03
Religion Paraguay    1      0.03
Chicken Mole         1      0.03
Republics of Kenya   1      0.03
Atlantic Ocean       1      0.03
New Zeland           1      0.03

Page 36:

Probability Accuracy

[Figure: deviation from ideal log likelihood (0–5) for urns, noisy-or, and pmi on City, Film, Country, and MayorOf]

URNS’s probabilities are 15-22x closer to optimal.

Page 37:

Sensitivity Analysis

URNS assumes num(E), p are constant

If we alter the parameter choices substantially, URNS still outperforms noisy-or and PMI by at least 8x

Most sensitive to p

p ≈ 0.85 is relatively consistent across randomly selected classes from WordNet (solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …)

Page 38:

Multiple Extraction Patterns

Multiple urns: target label frequencies are correlated across urns, while error label frequencies can be uncorrelated.

Phrase                        Hits
“Omaha and other cities”      950
“Illinois and other cities”   24,400
“cities such as Omaha”        930
“cities such as Illinois”     6

Page 39:

Benefits from Multiple Urns

Precision at K
K     Single   Multiple
10    1.0      1.0
20    0.9875   1.0
50    0.925    0.955
100   0.8375   0.845
200   0.7075   0.71

Using multiple urns reduces error by 29%.
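Precision at K, the metric used above, is simply the fraction of the top-K ranked extractions that are correct; a minimal sketch with hypothetical labels:

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked extractions that are correct.
    ranked_labels: booleans ordered by model confidence, best first."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

# Hypothetical ranking: 9 correct in the top 10, quality degrading afterward.
ranked = [True] * 9 + [False] + [True] * 5 + [False] * 5
print(precision_at_k(ranked, 10))  # 0.9
```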

Page 40:

URNS vs. MFA

Page 41:

URNS + MFA in SSL

MFA-ssl (urns) reduces error by 6%, on average.

Page 42:

URNS: Learnable from unlabeled data

All URNS parameters can be learned from unlabeled data alone (with assumptions) [Theorem 20]

URNS implies PAC learnability from unlabeled data alone, even with confusable MFs (i.e., even without conditional independence) [Theorem 21]

Page 43:

Parameters Learnable (1)

We can express the URNS model as a Compound Poisson Process.

The mixture gC(λ) + gE(λ) can be learned, given enough samples [Loh, 1993].

Task: learn the power-law distributions gC(λ), gE(λ) from their sum.

Page 44:

Parameters Learnable (2)

Assume:
• Sufficiently high frequency => only target elements
• Sufficiently low frequency => only errors

Then:

gC(λ) + gE(λ) =

Page 45:

Outline

1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models

Page 46:

Challenge: the “long tail”

[Figure: number of times an extraction appears in the pattern (0–500) vs. frequency rank of the extraction (0–100,000)]

Head of the distribution: tends to be correct, e.g., (Bloomberg, New York City)

Long tail: a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

Page 47:

Mayor McCheese

Page 48:

Assessing Sparse Extractions

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to the model

Page 49:

The Distributional Hypothesis

Terms in the same class tend to appear in similar contexts.

Context                 Hits with Chicago   Hits with Twisp
“__ hotels”             2,000,000           1,670
“mayor of __”           657,000             82
“cities including __”   42,000              1
“__ and other cities”   37,900              0

Page 50:

Precomputed – scalable

Handle sparsity

Unsupervised Language Models

Page 51:

Baseline: context vectors

“…cities such as Chicago , Boston ,…”
“But Chicago isn’t the best…”
“…Los Angeles and Chicago .”

Chicago: < … 1 2 1 … >, a vector of counts over context dimensions such as “such as x , Boston”, “But x isn’t the”, “Angeles and x .”

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]
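The dot-product comparison can be sketched in a few lines. The context snippets and counts below are hypothetical, and sparse vectors are represented as Counters rather than the dense arrays a real system would precompute.

```python
from collections import Counter

def context_vector(snippets):
    """Count the contexts a term occurs in; each snippet is the surrounding
    text with the term replaced by the placeholder x."""
    return Counter(snippets)

def dot(u, v):
    """Dot product of two sparse count vectors (the slide's similarity measure)."""
    return sum(u[c] * v[c] for c in u.keys() & v.keys())

# A common extraction (Chicago), a sparse one (Twisp), and a non-city (Utah).
chicago = context_vector(["cities such as x ,", "x hotels", "mayor of x", "x hotels"])
twisp = context_vector(["x hotels", "mayor of x"])
utah = context_vector(["state of x", "x senate"])

print(dot(chicago, twisp), dot(chicago, utah))  # Twisp looks more city-like
```

Sparse extractions like Twisp share few contexts with anything, which is why the dense HMM summary on the next slide improves on raw context vectors.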

Page 52:

HMM Compresses Context Vectors

Twisp: < … 0 0 0 1 … > (sparse context vector)

HMM(Twisp): (0.14, 0.01, …, 0.06), a distribution over hidden states t = 1 … N

The HMM provides a “distributional summary”:
• Compact (efficient: 10-50x less data retrieved)
• Dense (accurate: 23-46% error reduction)

Page 53:

Experimental Results

Task: ranking sparse TextRunner extractions.

Metric: area under the precision-recall curve.

            Headquartered   Merged   …   Average
Frequency   0.710           0.784    …   0.713
PL          0.651           0.851    …   0.785
LM          0.810           0.908    …   0.851

Language models reduce the missing area by 39% over the nearest competitor.

Page 54:

Summary of Thesis

Formalization of Monotonic Features (MFs)
• One MF enables PAC learnability from unlabeled data alone [Corollary 4.1]
• MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]
• The MF model is formally distinct from other SSL approaches [Theorems 9 and 10]
• The MF model is insufficient when “subconcepts” are present [Proposition 12]

Page 55:

Summary: MFs (Continued)

MFA: General SSL algorithm for MFs
• Given MFs, MFA performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples [Table 2.1]
• Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16% [Figure 2.5]
• MFA is not effective for UIE [Table 2.2 & Figure 2.6]

Page 56:

Summary: URNS

URNS: Formal model of redundancy in IE
• Describes how probability increases with MF value [Proposition 13]
• Models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14]

Page 57:

URNS Theoretical Results

Uniform Special Case (USC)
• Odds in the USC increase exponentially with repetition [Theorem 15]
• Error decreases exponentially when the parameters are known [Theorem 16]

Zipfian Case (ZC)
• Closed-form expression for the ZC probability given the parameters, and for the odds given repetitions [Theorem 17]
• Error in the ZC is bounded above by K / n^(1−ε) for any ε > 0 when the parameters are known [Theorem 19]

Page 58:

URNS Theoretical Results (cont.)

Zipfian Case (ZC)
• In the ZC, with probability 1 − δ, the parameters of URNS can be estimated with error < ε, for all ε, δ > 0, given sufficient data [Theorem 20]
• In the ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a “separability” criterion is met in the concept space [Theorem 21]

Page 59:

URNS Experimental Results

Supervised Learning [Table 3.3]
• 19% error reduction over noisy-or
• 10% error reduction over logistic regression
• Comparable performance to SVM

Semi-supervised IE [Figure 3.4]
• 6% error reduction over LP

Unsupervised IE [Figure 3.2]
• 1500% error reduction over noisy-or
• 2200% error reduction over PMI

Improved Efficiency [Table 3.2]
• 8x faster than PMI

Page 60:

Other Applications of URNS

Estimating extraction precision and recall [Table 3.7]

Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]

Identifying functional relations in text [Ritter et al., 2008]

Page 61:

Assessing Sparse Extractions

Hidden Markov Model assessor (HMM-T):
• Error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]
• Error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]
• 10-50x more efficient than context vectors

Sparse extraction assessment with language models:

Error reduction of 39% over previous work [Table 4.3]

Massively more scalable than previous techniques

Page 62:

Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Alan Ritter, Stef Schoenmackers, Stephen Soderland, Dan Weld

Page 63:

Page 64:

Page 65:

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

Extracting Atlanta ∈ City requires:
• Syntactic parsing (Atlanta -> is -> metropolis)
• Subclass discovery (metropolis(x) => city(x))

Challenging & difficult to scale, e.g., [Collins, 1997; Snow & Ng, 2006]

Web IE without labeled examples

Page 66:

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

We walked the tree-lined streets of the bustling metropolis that is Atlanta.

“cities such as Atlanta” – 21,600 Hits

Web IE without labeled examples

Page 67:

Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City

C such as x => x ∈ C [Hearst, 1992]

…Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor

x, C of y => (x, y) ∈ C

The scale and redundancy of the Web makes a multitude of facts “easy” to extract.

Page 68:

http://www.cs.washington.edu/research/textrunner/

[Banko et al., 2007]

TextRunner Search

Page 69:

Extraction patterns make errors:

“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”


But…

Task: Assess which extractions are correct Without hand-labeled examples At Web-scale

Thesis: “We can assess extraction correctness by leveraging redundancy and probabilistic models.”

Page 70:

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness URNS model of redundancy

[Downey et al., IJCAI 2005]

(Distinguished Paper Award)

4) Challenge: The “long tail”

5) Machine learning generalization

Outline

Page 71:

Redundancy – Two Intuitions

1) Repetition
2) Multiple patterns

Phrase                        Hits
“Chicago and other cities”    94,400
“Illinois and other cities”   23,100
“cities such as Chicago”      42,500
“cities such as Illinois”     7

Goal: a formal model of these intuitions.

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

Page 72: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University

72

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x C?

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

Page 73: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University

73

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

Page 74: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University

74

C = Country

n = 10

Saudi Arabia

Japan

United States

Africa

United Kingdom

Iraq

Afghanistan

Australia

k2

2

1

1

1

1

1

1

p = probability pattern yields a correct extraction, i.e.,

p = 0.9

0.99

0.99

0.9

0.9

0.9

0.9

0.9

0.9 Noisy-or ignores: –Sample size (n) –Distribution of C

Naïve Model: Noisy-Or

Pnoisy-orPnoisy-or(xC | x seen k times)

= 1 – (1 – p)k

[Agichtein & Gravano, 2000; Lin et al. 2003]

Page 75: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University

75

United States

China

. . .

OilWatch Africa

Religion Paraguay

Chicken Mole

Republics of Kenya

Atlantic Ocean

C = Country

n ~50,000

3899

1999

1

1

1

1

1

0.9999…

0.9999…

0.9

0.9

0.9

0.9

0.9

C = Country

n = 10

Saudi Arabia

Japan

United States

Africa

United Kingdom

Iraq

Afghanistan

Australia

2

2

1

1

1

1

1

1

0.99

0.99

0.9

0.9

0.9

0.9

0.9

0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

Pnoisy-or Pnoisy-ork k

Page 76

C = Country, n ~ 50,000

Extraction           k      P_noisy-or
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

Needed in Model: Distribution of C

P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n)

Page 77

C = Country, n ~ 50,000

Extraction           k      P_freq
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Needed in Model: Distribution of C

P_freq(x ∈ C | x seen k times) = 1 − (1 − p)^(k/n)

Page 78

C = City, n ~ 50,000

Extraction       k      P_freq
New York         1488   0.9999…
Chicago          999    0.9999…
. . .
El Estor         1      0.05
Nikki            1      0.05
Ragaz            1      0.05
Villegas         1      0.05
Northeastwards   1      0.05

C = Country, n ~ 50,000

Extraction           k      P_freq
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

The probability that x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C

Page 79

[Urn for C = City, containing balls: Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.]

“…cities such as Tokyo…”

My solution: URNS Model

Page 80

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition

Page 81

Urn for C = City: U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

distribution of target labels: num(C) = {2, 2, 1, 1, 1}
distribution of error labels: num(E) = {2, 1}

Urn Example

Page 82

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

Page 83

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities
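The probability expression itself survives only as an image in the original slides. Under the stated urn model (n draws with replacement, s total balls) it can be sketched as the ratio below. This is a reconstruction for illustration, not code from URNS; `p_correct` is a hypothetical helper, and the binomial coefficient C(n, k) cancels between numerator and denominator so it is omitted.

```python
def p_correct(k, n, num_C, num_E):
    """P(x in C | x appears exactly k times in n draws with replacement).

    num_C / num_E: repetition counts of target / error labels in the urn.
    A label with r of the s balls accounts for exactly k of n draws with
    probability proportional to (r/s)**k * (1 - r/s)**(n - k); summing this
    weight over target labels and normalizing over all labels gives the
    posterior that x is a target label.
    """
    s = sum(num_C) + sum(num_E)  # total number of balls in the urn
    weight = lambda r: (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    total = target + sum(weight(r) for r in num_E)
    return target / total

# The urn from the example slide: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}
prob = p_correct(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1])
```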

Page 84

Multiple urns: target label frequencies are correlated across urns; error label frequencies can be uncorrelated.

Phrase                          Hits
“Chicago and other cities”      94,400
“Illinois and other cities”     23,100
“cities such as Chicago”        42,500
“cities such as Illinois”       7

Multiple Extraction Patterns

Page 85

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of ith element ∝ i^(-z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone
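For illustration, Zipf frequencies with exponent z are simple to generate (the z and class size below are hypothetical placeholders, not the parameters URNS actually learns):

```python
def zipf_frequencies(num_labels: int, z: float = 1.0):
    """Unnormalized Zipf frequencies: the i-th most frequent label has
    frequency proportional to i**(-z)."""
    return [i ** (-z) for i in range(1, num_labels + 1)]

freqs = zipf_frequencies(5, z=1.0)
# With z = 1, the most frequent label is twice as frequent as the second.
```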

Page 86

[Mixture diagram: with probability p a draw comes from the target Zipf distribution num(C); with probability 1 − p, from the error Zipf distribution num(E); together they produce the observed frequency distribution]

p is constant across C for a given pattern, and num(E) is constant across C => learn num(C) from unlabeled data!

URNS without labeled data

Page 87

C = City, n ~ 50,000

Extraction       k      P_URNS
New York         1488   0.9999…
Chicago          999    0.9999…
. . .
El Estor         1      0.63
Nikki            1      0.63
Ragaz            1      0.63
Villegas         1      0.63
Cres             1      0.63
Northeastwards   1      0.63

C = Country, n ~ 50,000

Extraction           k      P_URNS
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.03
Religion Paraguay    1      0.03
Chicken Mole         1      0.03
Republics of Kenya   1      0.03
Atlantic Ocean       1      0.03
New Zeland           1      0.03

Probabilities Assigned by URNS

Page 88

[Bar chart: deviation from ideal log likelihood (0-5) for City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy

Page 89

Computation is efficient: continuous Zipf & Poisson approximations yield a closed-form expression for P(x ∈ C | evidence)

vs. Pointwise Mutual Information (PMI) [Etzioni et al. 2005]: PMI is computed with search engine hit counts (inspired by [Turney, 2000]); URNS requires no hit-count queries (~8x faster)

Scalability

Page 90

Probabilistic model of redundancy

Accurate without hand-labeled examples: 15-22x improvement in accuracy

Scalable: 8x faster

[Downey et al., IJCAI 2005]

URNS: Contributions

Page 91

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail”: language models to the rescue

[Downey et al., ACL 2007]

5) Machine learning generalization

Outline

Page 92

[Plot: number of times an extraction appears in the pattern (y-axis, 0-500) vs. frequency rank of the extraction (x-axis, 0-100,000)]

A mixture of correct and incorrect

e.g., (Dave Shaver, Pickerington)(Ronald McDonald, McDonaldland)

Tend to be correct

e.g., (Bloomberg, New York City)

Challenge: the “long tail”

Page 93

Mayor McCheese

Page 94

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to model

Unsupervised language models: precomputed (scalable), handle sparsity

Assessing Sparse Extractions

Page 95

The “distributional hypothesis”: instances of the same relationship tend to appear in similar contexts.

…David B. Shaver was elected as the new mayor of Pickerington, Ohio.

http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp

…Mike Bloomberg was elected as the new mayor of New York City.

http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm

Assessing Sparse Extractions

Page 96

Type errors are common:

Alexander the Great conquered Egypt… => (Great, Egypt) ∈ Conquered

Locally acquired malaria is now uncommon… => (Locally, malaria) ∈ Acquired

Type checking

Page 97

“cities such as Chicago , Boston ,”
“But Chicago isn’t the best”
“cities such as Chicago , Boston ,”
“Los Angeles and Chicago .”

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Context vector entries < … 1 2 1 … > over contexts “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Baseline: context vectors (1)
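The baseline can be sketched directly: build a context-count vector per extraction and compare by dot product. The corpus, placeholder token, and helper names below are hypothetical simplifications of the n-gram contexts on the slide.

```python
from collections import Counter

def context_vector(sentences, term):
    """Context vector of a term: counts of the contexts it appears in,
    with the term replaced by a placeholder slot 'x'."""
    vec = Counter()
    for s in sentences:
        if term in s.split():
            vec[s.replace(term, "x")] += 1
    return vec

def dot(u, v):
    """Dot product of two sparse count vectors."""
    return sum(u[c] * v[c] for c in u)

corpus = [
    "cities such as Chicago , Boston ,",
    "cities such as Seattle , Boston ,",
    "But Chicago isn't the best",
]
# Chicago and Seattle share the "cities such as x , Boston ," context.
sim = dot(context_vector(corpus, "Chicago"), context_vector(corpus, "Seattle"))
```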

Page 98

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >

(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems: vectors are large; intersections are sparse

Baseline: context vectors (2)

Page 99

[Diagram: hidden states t_i, t_i+1, t_i+2, t_i+3 emitting words w_i, w_i+1, w_i+2, w_i+3, e.g. “cities such as Seattle”]

Hidden Markov Model (HMM)

States – unobserved; words – observed

Hidden states t_i ∈ {1, …, N} (N fairly small)

Train on unlabeled data – P(t_i | w_i = w) is an N-dimensional distributional summary of w

– Compare extractions using KL divergence
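The comparison step can be sketched with a smoothed KL divergence over the N-dimensional summaries P(t | w); the three-state distributions below are made-up numbers for illustration, not learned HMM parameters.

```python
from math import log

def kl(p, q, eps=1e-9):
    """KL divergence D(p || q) between discrete distributions,
    smoothed with eps so zero entries don't blow up."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical 3-state summaries P(t | w):
seattle  = [0.7, 0.2, 0.1]
tokyo    = [0.6, 0.3, 0.1]
non_city = [0.1, 0.1, 0.8]

# A candidate whose summary is close to the seeds' (low KL) typechecks.
```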

Page 100

Twisp: < . . . 0 0 0 1 . . . >

P(t | Twisp): < 0.14 0.01 … 0.06 > for t = 1, 2, …, N

Distributional summary P(t | w): compact (efficient – 10-50x less data retrieved), dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors

Page 101

Is Pickerington of the same type as Chicago?

“Chicago , Illinois”    “Pickerington , Ohio”

Chicago:      “<x> , Illinois” = 291, “<x> , Ohio” = 0
Pickerington: “<x> , Illinois” = 0,   “<x> , Ohio” = 1

=> Context vectors say no, dot product is 0!

Example

Page 102

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example

Page 103

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

            Headquartered   Merged   …   Average
Frequency   0.710           0.784    …   0.713
PL          0.651           0.851    …   0.785
LM          0.810           0.908    …   0.851

Page 104

No hand-labeled data

Scalability: language models precomputed => can be queried at interactive speed

Improved accuracy over previous work [Downey et al., ACL 2007]

REALM: Contributions

Page 105

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail”

5) Machine learning generalization: Monotonic Features

[Downey et al., 2008 (submitted)]

Outline

Page 106

Common Structure

Task                        Hint                            Bootstrap
Web IE                      “x, C of y”                     Distributional Hypothesis
Word Sense Disambiguation   “plant and animal species”      One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval       search query                    Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification     Topic word, e.g.: “politics”    Semi-supervised Learning [McCallum & Nigam, 1999; Gliozzo, 2005]

Page 107

Common Structure

Task                        Hint                            Bootstrap
Web IE                      “x, C of y”                     Distributional Hypothesis
Word Sense Disambiguation   “plant and animal species”      One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval       search query                    Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification     Topic word, e.g.: “politics”    Bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]

Classification of examples x = (x1, …, xd) into classes y ∈ {0, 1}

Identity of a monotonic feature x_i such that: P(y = 1 | x_i) increases strictly monotonically with x_i

Page 108

Classical Supervised Learning

[Scatter plot over features x1, x2; labeled points, one marked “?”]

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

Page 109

Semi-Supervised Learning (SSL)

[Scatter plot over features x1, x2]

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x)

Page 110

Monotonic Features

[Scatter plot over features x1, x2]

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1 and unlabeled examples (x)

Page 111


Page 112


Page 113

1. No labeled data, MFs given (MA): with noisy labels from MFs, train any classifier

2. Labeled data, no MFs given (MA-SSL): detect MFs from labeled data, run MA

3. Labeled data and MFs given (MA-BOTH): run MA with given & detected MFs

Exploiting MF Structure
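Step 1 (“MA”) can be sketched as follows: since P(y = 1 | x_i) increases monotonically with a monotonic feature, the examples with the largest x_i serve as noisy positives and the smallest as noisy negatives, and any classifier can then be trained on those labels. The data and the `top_frac` threshold below are hypothetical choices for illustration, not values from the paper.

```python
def mf_bootstrap_labels(examples, mf_index, top_frac=0.2):
    """Produce noisy labels from a monotonic feature: top examples by
    the MF become positives (y=1), bottom examples negatives (y=0)."""
    ranked = sorted(examples, key=lambda x: x[mf_index])
    cut = max(1, int(len(ranked) * top_frac))
    return [(x, 1) for x in ranked[-cut:]] + [(x, 0) for x in ranked[:cut]]

# Five 2-feature examples; feature 0 is the monotonic feature.
data = [(0.10, 5.0), (0.90, 4.2), (0.50, 1.0), (0.05, 0.3), (0.95, 2.2)]
noisy = mf_bootstrap_labels(data, mf_index=0)
```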

Page 114

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results

Page 115

MA-SSL provides a 15% error reduction for 100-400 labeled examples.

MA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results

Page 116

Co-training: requires labeled examples and known views

Semi-supervised smoothness assumptions: cluster assumption, manifold assumption; both provably distinct from MF structure

Relationship to other approaches

Page 117

Best known methods for IE without labeled data:
- Probabilities of correctness (URNS): massive improvements in accuracy (15-22x)
- Handling sparse data (language models): vastly more scalable than previous work; accuracy wins (39% error reduction)

Generalization beyond IE: Monotonic Feature abstraction, widely applicable; accuracy wins in document classification

Summary of Results

Page 118

IE => Web IE. But still need:

A coherent knowledge base: is the “Chicago” in MayorOf(Chicago, Daley) the same “Chicago” as in Starred-in(Chicago, Zeta-Jones)? Future work: entity resolution, schema discovery.

Improved accuracy and coverage: we currently ignore character/document features, recursive structure, etc. Future work: more sophisticated language models (e.g. PCFGs).

Conclusions and Future Work

Page 119

Thanks!

Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Stef Schoenmackers, Dan Weld

Page 120

Self-Supervised Learning

                 Input Examples        Output
Supervised       Labeled               Classifier
Semi-supervised  Labeled & Unlabeled   Classifier
Self-supervised  Unlabeled             Classifier
Unsupervised     Unlabeled             Clustering

Page 121

Language Modeling for IE: REALM is simple, ignores:
- Character- or document-level features
- Web structure
- Recursive structure (PCFGs)

Goal: “x won an Oscar for playing a villain…” What is P(x)?

From facts to knowledge: entity resolution and inference

Future Work

Page 122

Named Entity Location: lexical statistics improve state of the art [Downey et al., IJCAI 2007]

Modeling Web Search: characterizing user behavior [Downey et al., SIGIR 2007 (poster); Liebling et al., 2008 (submitted)]; predictive models [Downey et al., IJCAI 2007]

Other Work

Page 123

Web Fact-Finding

Who has won three or more Academy Awards?

Page 124

Web Fact-Finding

Problems: the user has to pick the right words, often a tedious process:

“world foosball champion in 1998” – 0 hits
“world foosball champion” 1998 – 2 hits, no answer

What if I could just ask for P(x) in“x was world foosball champion in 1998?”

How far can language modeling and the distributional hypothesis take us?

Page 125

Miami:     < . . . 98 0 20 250 30 513 . . . >
Twisp:     < . . . 5 0 1 2 1 1 . . . >
Star Wars: < . . . 1 1000 0 2 1 1 . . . >

(contexts include: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis

Page 126


Page 127

[TextRunner demo screenshot: query “invent”, assessed in real time; extractions ranked by frequency]

REALM improves precision of the top 20 extractions by an average of 90%.

Page 128

Improving TextRunner: Example (1)

“headquartered” Top 10 (TextRunner, ranked by frequency):
company, Palo Alto
held company, Santa Cruz
storage hardware and software, Hopkinton
Northwestern Mutual, Tacoma
1997, New York City
Google, Mountain View
PBS, Alexandria
Linux provider, Raleigh
Red Hat, Raleigh
TI, Dallas

TR Precision: 40%

“headquartered” Top 10 (REALM):
Tarantella, Santa Cruz
International Business Machines Corporation, Armonk
Mirapoint, Sunnyvale
ALD, Sunnyvale
PBS, Alexandria
General Dynamics, Falls Church
Jupitermedia Corporation, Darien
Allegro, Worcester
Trolltech, Oslo
Corbis, Seattle

REALM Precision: 100%

Page 129

Improving TextRunner: Example (2)

“conquered” Top 10 (TextRunner, ranked by frequency):
Great, Egypt
conquistador, Mexico
Normans, England
Arabs, North Africa
Great, Persia
Romans, part
Romans, Greeks
Rome, Greece
Napoleon, Egypt
Visigoths, Suevi Kingdom

TR Precision: 60%

“conquered” Top 10 (REALM):
Arabs, Rhodes
Arabs, Istanbul
Assyrians, Mesopotamia
Great, Egypt
Assyrians, Kassites
Arabs, Samarkand
Manchus, Outer Mongolia
Vandals, North Africa
Arabs, Persia
Moors, Lagos

REALM Precision: 90%

Page 130

Previous n-gram technique (1)

1) Form a context vector for each extracted argument:

“cities such as Chicago , Boston ,”
“But Chicago isn’t the best”
“cities such as Chicago , Boston ,”
“Los Angeles and Chicago .”

2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].

Context vector entries < … 1 2 1 … > over contexts “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”

Page 131

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >

(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems: vectors are large; intersections are sparse

Previous n-gram technique (2)

Page 132

Miami: < . . . 71 25 1 513 . . . >

P(t | Miami): < 0.14 0.01 … 0.06 > for t = 1, 2, …, N

Latent state distribution P(t | w): compact (efficient – 10-50x less data retrieved), dense (accurate – 23-46% error reduction)

Compressing Context Vectors

Page 133

Example: N-Grams on Sparse Data

Is Pickerington of the same type as Chicago?

“Chicago , Illinois”    “Pickerington , Ohio”

Chicago:      “<x> , Illinois” = 291, “<x> , Ohio” = 0
Pickerington: “<x> , Illinois” = 0,   “<x> , Ohio” = 1

=> N-grams says no, dot product is 0!

Page 134

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example: HMM-T on Sparse Data

Page 135

HMM-T Limitations

Learning iterations take time proportional to (corpus size × T^(k+1))

T = number of latent states

k = HMM order

We use limited values T = 20, k = 3: sufficient for typechecking (Santa Clara is a city), too coarse for relation assessment (Santa Clara is where Intel is headquartered)

Page 136

The REALM Architecture

Two steps for assessing R(arg1, arg2):

Typechecking: ensure arg1 and arg2 are of the proper type for R (catches errors like MayorOf(Intel, Santa Clara)); leverages all occurrences of each arg

Relation Assessment: ensure R actually holds between arg1 and arg2 (catches errors like MayorOf(Giuliani, Seattle))

Both steps use pre-computed language models => scales to Open IE

Page 137

Type checking isn’t enough: “NY Mayor Giuliani toured downtown Seattle.”

Want: how do arguments behave in relation to each other?

Relation Assessment

Page 138

N-gram language model:

P(w_i, w_i-1, …, w_i-k)

arg1, arg2 often far apart => large k (inaccurate)

REL-GRAMS (1)

Page 139

Relational Language Model (REL-GRAMS):

For any two arguments e1, e2:

P(w_i, w_i-1, …, w_i-k | w_i = e1, e1 near e2)

k can be small – REL-GRAMS still captures entity relationships

Mitigate sparsity with the BM25 metric (from IR)

Combine with HMM-T by multiplying ranks.

REL-GRAMS (2)
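The final combination (“multiplying ranks”) can be sketched like so; the scores below are made-up illustrations, and ties are broken by sort stability rather than anything principled:

```python
def rank_product_order(score_a, score_b):
    """Order items by the product of their ranks under two scorers
    (rank 1 = best under each; lower product = better overall)."""
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {item: i + 1 for i, item in enumerate(ordered)}
    ra, rb = ranks(score_a), ranks(score_b)
    return sorted(score_a, key=lambda item: ra[item] * rb[item])

typecheck = {"(Google, Mountain View)": 0.9, "(PBS, Alexandria)": 0.8, "(1997, New York City)": 0.2}
relation  = {"(Google, Mountain View)": 0.7, "(PBS, Alexandria)": 0.6, "(1997, New York City)": 0.4}
ranking = rank_product_order(typecheck, relation)
```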

Page 140

Experiments

Task: re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged.

REALM vs.:
- TextRunner (TR): frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005])
- Pattern Learning (PL): based on Snowball [Agichtein 2000]
- HMM-T and REL-GRAMS in isolation

Page 141

Learning num(C) and num(E)

From untagged data this is an ill-posed problem: num(C) can vary wildly with C (e.g., countries vs. cities vs. mayors)

Assume:
1) Consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase “C such as x”, x ∈ C about p of the time (0.9 for [Etzioni, 2005])
2) num(E) is constant for all C
3) num(C) is Zipf

Estimate num(C) from untagged data using EM [Downey et al. 2005] (also: multiple contexts)

Page 142

URNS without labeled data

[Three log-log plots of frequency vs. frequency rank]

P(x ∈ C) in “C such as x”: assumed ~0.9

Error distribution: assumed large, with Zipf parameter 1.0

Page 143

URNS without labeled data

[Three log-log plots of frequency vs. frequency rank]

num(C) can vary wildly (e.g. cities vs. countries); learned from unlabeled data using EM

Page 144

Distributional Similarity

Naïve approach – find sentences containing seed1 & seed2 or arg1 & arg2:

w_b … w_h seed1 w_h+2 … w_i seed2 w_i+2 … w_e
w_b … w_h arg1 w_h+2 … w_i arg2 w_i+2 … w_e

Compare context distributions:

P(w_b, …, w_e | seed1, seed2)
P(w_b, …, w_e | arg1, arg2)

But e − b can be large: many parameters, sparse data => inaccuracy

Page 145

http://www.cs.washington.edu/research/textrunner/

TextRunner Search

Page 146

Large textual corpora are redundant, and we can use this observation to bootstrap extraction and classification models from minimally labeled, or even completely unlabeled, data.

Thesis

Page 147

Supervised classification task:
- Feature space X of d-tuples x = (x1, …, xd)
- Binary output space Y = {0, 1}

Inputs:
- Labeled examples DL = {(x, y)} ~ P(x, y)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Page 148

Semi-supervised classification task:
- Feature space X of d-tuples x = (x1, …, xd)
- Binary output space Y = {0, 1}

Inputs:
- Labeled examples DL = {(x, y)} ~ P(x, y)   (smaller)
- Unlabeled examples DU = {(x)} ~ P(x)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Page 149

Semi-supervised classification task:
- Feature space X of d-tuples x = (x1, …, xd)
- Binary output space Y = {0, 1}

Inputs:
- Labeled examples DL = {(x, y)} ~ P(x, y)   (potentially empty!)
- Unlabeled examples DU = {(x)} ~ P(x)
- Monotonic features M ⊆ {1, …, d} such that: P(y=1 | x_i) increases strictly monotonically with x_i for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Page 150

Problem: num(C) can vary wildly (e.g. cities vs. countries)

Assume: num(C), num(E) Zipf distributed; freq. of ith element ∝ i^(-z); p and num(E) independent of C

Learn num(C) from unlabeled data alone with Expectation Maximization

URNS without labeled data

Page 151


Page 152

Typecheck each arg by comparing the HMM’s distributional summaries:

f(arg) = (1/|seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )

Rank arguments in ascending order of f(arg)

HMM Type-checking

Page 153

Classical Supervised Learning

[Scatter plot over features x1, x2; labeled points, one marked “?”]

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

Page 154

Semi-supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

[Figure: labeled and unlabeled points in a 2-D feature space (x1, x2)]

Page 155

Self-supervised Learning

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x)

[Figure: unlabeled points in a 2-D feature space (x1, x2)]

Page 156

Self-supervised Learning

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x); the system labels its own examples

[Figure: points in a 2-D feature space (x1, x2), with labels assigned by the system]
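One way to sketch the self-labeling step (assumed details, not the talk's exact algorithm): use a monotonic feature such as extraction count to label the system's own confident examples, then train on a different feature and evaluate against the hidden truth.

```python
import random

random.seed(0)

# Synthetic pool: a hidden label y drives both a monotonic feature
# (count) and a second feature the classifier will actually use.
pool = []
for _ in range(500):
    y = random.random() < 0.5
    count = random.randint(5, 30) if y else random.randint(1, 8)
    feat = random.gauss(1, 1) if y else random.gauss(-1, 1)
    pool.append((count, feat, y))

# Self-label using the monotonic feature alone (true y stays hidden)
pos = [feat for count, feat, _ in pool if count >= 20]  # confident positives
neg = [feat for count, feat, _ in pool if count <= 2]   # confident negatives

# Train a threshold classifier on the *other* feature
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Evaluate against the hidden labels
acc = sum((feat > threshold) == y for _, feat, y in pool) / len(pool)
print(f"accuracy: {acc:.2f}")
```

No hand-labeled example is ever consulted: the monotonic feature supplies the labels, and the classifier generalizes them to the rest of the pool, which is the essence of self-supervised learning as contrasted with the supervised and semi-supervised settings above.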

Page 157

Self-supervised Learning

                 Input Examples       Output
Supervised       Labeled              Classifier
Semi-supervised  Labeled & Unlabeled  Classifier
Self-supervised  Unlabeled            Classifier
Unsupervised     Unlabeled            Clustering

Page 158

Supervised classification task:

Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}

Inputs:

Labeled examples DL = {(x, y)} ~ P(x, y)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Page 159

Semi-supervised classification task:

Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}

Inputs:

Labeled examples DL = {(x, y)} ~ P(x, y)

Unlabeled examples DU = {(x)} ~ P(x)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Smaller

Page 160

Semi-supervised classification task:

Feature space X of d-tuples x = (x1, …, xd)
Binary output space Y = {0, 1}

Inputs:

Labeled examples DL = {(x, y)} ~ P(x, y)

Unlabeled examples DU = {(x)} ~ P(x)

Monotonic features M ⊆ {1, …, d} such that:

P(y = 1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features