Empirical Development of an Exponential Probabilistic Model Using Textual Analysis to Build a Better Model Jaime Teevan & David R. Karger CSAIL (LCS+AI), MIT

Post on 18-Dec-2015


Page 1:

Empirical Development of an Exponential Probabilistic Model

Using Textual Analysis to Build a Better Model

Jaime Teevan & David R. Karger
CSAIL (LCS+AI), MIT

Page 2:

Goal: Better Generative Model

• Generative v. discriminative model
• Applies to many applications
  - Information retrieval (IR)
  - Relevance feedback
  - Using unlabeled data
  - Classification
• Assumptions explicit

Page 3:

Using a Model for IR

1. Define model
2. Learn parameters from query
3. Rank documents

Hyper-learn

• Better model improves applications
  - Trickle down to improve retrieval
  - Classification, relevance feedback, …
• Corpus specific models

Page 4:

Overview

• Related work
• Probabilistic models
  - Example: Poisson Model
  - Compare model to text
• Hyper-learning the model
  - Exponential framework
  - Investigate retrieval performance
• Conclusion and future work

Page 5:

Related Work

Using text for retrieval algorithm [Jones, 1972], [Greiff, 1998]

Using text to model text [Church & Gale, 1995], [Katz, 1996]

Learning model parameters [Zhai & Lafferty, 2002]

Hyper-learn the model from text!

Page 7:

Probabilistic Models

Rank documents by RV = Pr(rel|d)

Naïve Bayesian models

$$RV = \Pr(rel \mid d), \qquad \Pr(d \mid rel) = \prod_{\text{features } t} \Pr(d_t \mid rel)$$

Open assumptions
• Feature definition: words
• Feature distribution family, over d_t = # occs in doc

Defines the model!

Page 8:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Page 10:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Poisson Model

$$\Pr(d_t \mid rel) = \frac{\theta^{d_t} e^{-\theta}}{d_t!}$$

θ: specifies term distribution
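
As a minimal sketch, the Poisson model can be evaluated directly for a single term. The function name `poisson_pmf` is chosen here for illustration, and the value θ = 0.0006 is borrowed from the example plot elsewhere in the deck.

```python
import math


def poisson_pmf(d_t, theta):
    """Probability that a term with Poisson parameter theta occurs exactly d_t times."""
    return theta ** d_t * math.exp(-theta) / math.factorial(d_t)


theta = 0.0006  # a rare term: about 0.0006 expected occurrences per document
for d_t in range(6):
    print(d_t, poisson_pmf(d_t, theta))
```

The printed probabilities fall off very quickly as d_t grows, which is the behaviour the deck later compares against real text.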

Page 11:

Example Poisson Distribution

[Figure: Poisson curve for θ = 0.0006. x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Callout: Pr(d_t|rel) ≈ 1E-15.]

Page 12:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Learn a θ for each term

• Maximum likelihood θ
  - Term's average number of occurrences
• Incorporate prior expectations
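
A small sketch of this learning step for the Poisson model. The maximum likelihood θ is the term's average count, as stated above. The slide does not say which prior is used, so the Gamma prior below (and the names `prior_mean` and `prior_strength`) is an assumption made purely for illustration.

```python
def mle_theta(counts):
    """Maximum likelihood Poisson parameter: the average number of occurrences."""
    return sum(counts) / len(counts)


def smoothed_theta(counts, prior_mean, prior_strength):
    """Posterior mean of theta under an assumed Gamma prior.

    The prior behaves like `prior_strength` extra documents in which the term
    occurs `prior_mean` times on average (hypothetical parameterization).
    """
    return (prior_mean * prior_strength + sum(counts)) / (prior_strength + len(counts))


counts = [0, 0, 1, 3]  # occurrences of one term in four relevant documents
print(mle_theta(counts))
print(smoothed_theta(counts, prior_mean=0.5, prior_strength=4.0))
```

With few labeled documents the prior dominates; as more counts arrive the estimate moves toward the plain average.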

Page 13:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Page 15:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

For each document, find RV

$$RV = \prod_{\text{words } t} \Pr(d_t \mid rel)$$

Sort documents by RV

Which step goes wrong?
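
The ranking step can be sketched as follows, under the Poisson model. The product is computed as a sum of logs to avoid underflow, and it runs over the terms that have a learned θ; the toy documents, term names, and θ values are invented for illustration.

```python
import math
from collections import Counter


def log_poisson(d_t, theta):
    """log Pr(d_t | rel) under a Poisson model with parameter theta."""
    return d_t * math.log(theta) - theta - math.lgamma(d_t + 1)


def relevance_value(tokens, thetas):
    """log RV: sum over terms t of log Pr(d_t | rel)."""
    counts = Counter(tokens)
    return sum(log_poisson(counts.get(term, 0), theta) for term, theta in thetas.items())


def rank(docs, thetas):
    """Document names sorted from highest to lowest RV."""
    return sorted(docs, key=lambda name: relevance_value(docs[name], thetas), reverse=True)


# Hypothetical per-term parameters learned from relevant documents.
thetas = {"poisson": 2.0, "model": 1.5, "banana": 0.01}
docs = {
    "doc_a": ["poisson", "model", "text"],
    "doc_b": ["banana", "banana"],
    "doc_c": [],
}
print(rank(docs, thetas))
```

Note that a term's absence also contributes to RV: a document missing a term that relevant documents usually contain is penalized.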

Page 16:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Page 17:

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

$$\Pr(d_t \mid rel) = \frac{\theta^{d_t} e^{-\theta}}{d_t!}$$

Page 19:

How Good is the Model?

[Figure: observed data versus the Poisson curve for θ = 0.0006. x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson. Callout: "15 times".]

Misfit!

Page 20:

Hyper-learning a Better Fit Through Textual Analysis

Using an Exponential Framework

Page 22:

Hyper-Learning Framework

Need framework for hyper-learning

Goal: Same benefits as Poisson Model
• One parameter
• Easy to work with (e.g., prior)

[Diagram labels: Bernoulli, Poisson, Normal, Mixtures; One parameter exponential families]

Page 23:

Exponential Framework

Well understood, learning easy [Bernardo & Smith, 1994], [Gous, 1998]

$$\Pr(d_t \mid rel) = f(d_t)\, g(\theta)\, e^{\theta\, h(d_t)}$$

• Functions f(d_t) and h(d_t) specify family
  - E.g., Poisson: f(d_t) = (d_t!)^{-1}, h(d_t) = d_t
• Parameter θ: term's specific distribution
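
As a worked check, the Poisson model fits this framework. The step below assumes that θ in the exponential form is the natural parameter, i.e. the log of the Poisson mean (written λ here to avoid a clash); the slides do not spell out this reparameterization.

$$\frac{\lambda^{d_t} e^{-\lambda}}{d_t!} = \underbrace{\frac{1}{d_t!}}_{f(d_t)} \cdot \underbrace{e^{-e^{\theta}}}_{g(\theta)} \cdot e^{\theta\, d_t}, \qquad \theta = \ln \lambda, \quad h(d_t) = d_t$$

The factor g(θ) only normalizes the distribution, which is why f and h alone identify the family.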

Page 24:

Using a Hyper-learned Model

1. Define model
2. Learn parameters from query
3. Rank documents

Page 25:

Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Page 26:

Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Want "best" f(d_t) and h(d_t)

• Iterative hill climbing
  - Local maximum
  - Poisson starting point
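
The following is an illustrative sketch of hill climbing from a Poisson starting point, not the authors' actual procedure. It makes several assumptions of its own: counts are capped at K = 5, f stays fixed at the Poisson choice while only h is adjusted, each term's θ is picked from a fixed grid, and the toy histograms are invented.

```python
import math
import numpy as np

K = 5                                   # assumed cap on occurrence counts
X = np.arange(K + 1)
LOG_F = np.array([-math.lgamma(x + 1) for x in X])   # log f(d_t) with f = 1/d_t!
THETAS = np.linspace(-12.0, 4.0, 321)   # assumed search grid for each term's theta


def term_loglik(hist, h):
    """Best log-likelihood of one term's count histogram over the theta grid."""
    scores = LOG_F[None, :] + THETAS[:, None] * h[None, :]   # log f + theta * h
    top = scores.max(axis=1, keepdims=True)
    log_z = top[:, 0] + np.log(np.exp(scores - top).sum(axis=1))  # equals -log g(theta)
    return float((scores @ hist - hist.sum() * log_z).max())


def total_loglik(hists, h):
    return sum(term_loglik(hist, h) for hist in hists)


def hill_climb(hists, rounds=200, delta=0.05):
    """Coordinate-wise hill climbing on h, starting from Poisson h(d_t) = d_t."""
    h = X.astype(float)
    best = total_loglik(hists, h)
    for _ in range(rounds):
        improved = False
        # h(0) and h(1) stay fixed: shifting or rescaling h is absorbed by g and theta.
        for i in range(2, K + 1):
            for step in (delta, -delta):
                candidate = h.copy()
                candidate[i] += step
                score = total_loglik(hists, candidate)
                if score > best:
                    h, best, improved = candidate, score, True
        if not improved:
            break   # no single step helps: a local maximum
    return h, best


# Toy data: hists[j][k] = number of documents in which term j occurs k times.
hists = [np.array([9000, 600, 200, 100, 60, 40]),
         np.array([9900, 60, 20, 10, 6, 4])]
h_learned, ll_learned = hill_climb(hists)
ll_poisson = total_loglik(hists, X.astype(float))
print(h_learned, ll_learned, ll_poisson)
```

Because every accepted step strictly increases the likelihood, the result can only match or beat the Poisson start, but it is only guaranteed to be a local maximum.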

Page 27:

Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

• Data: TREC query result sets
  - Past queries to learn about future queries
• Hyper-learn and test with different sets

Page 28:

Recall the Poisson Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "15 times".]

Page 29:

Poisson Starting Point - h(d_t)

$$\Pr(d_t \mid rel) = f(d_t)\, g(\theta)\, e^{\theta\, h(d_t)}$$

[Figure: x-axis: d_t (0 to 5); y-axis: h(d_t), from -2 to 6. Series: Poisson, Learned.]

Page 30:

Hyper-learned Model - h(d_t)

$$\Pr(d_t \mid rel) = f(d_t)\, g(\theta)\, e^{\theta\, h(d_t)}$$

[Figure: x-axis: d_t (0 to 5); y-axis: h(d_t), from -2 to 6. Series: Poisson, Learned.]

Page 31:

Poisson Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "15 times".]

Page 32:

Hyper-learned Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "15 times".]

Page 33:

Hyper-learned Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "5 times".]

Page 34:

Hyper-learned Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "30 times".]

Page 35:

Hyper-learned Distribution

[Figure: x-axis: number of times the term occurs, d_t (0 to 5); y-axis: Pr(d_t|rel), log scale from 1E-19 to 0.1. Series: Data, Poisson, New Model. Callout: "300 times".]

Page 36:

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Page 37:

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

$$\Pr(d_t \mid rel) = f(d_t)\, g(\theta)\, e^{\theta\, h(d_t)}$$

Learn θ for each term

Labeled docs

Page 38:

Learning θ

• Sufficient statistics
  - Summarize all observed data
  - τ1: # of observations
  - τ2: Σ over observations d of h(d_t)
• Incorporating prior easy
• Map τ1 and τ2 → θ

20 labeled documents
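
A minimal sketch of mapping the sufficient statistics to θ. It relies on the standard exponential-family property that the fitted θ makes the model's expected h(d_t) equal τ2/τ1. The truncation of counts at K, the bisection search, and the treatment of the prior as pseudo-observations (`prior_tau1`, `prior_tau2`) are assumptions made here, and the Poisson f and h stand in for the hyper-learned functions.

```python
import math

K = 50  # assumed cap on occurrence counts, so sums over d_t are finite


def expected_h(theta, f, h):
    """E[h(d_t)] under Pr(d_t) proportional to f(d_t) * exp(theta * h(d_t))."""
    weights = [f(x) * math.exp(theta * h(x)) for x in range(K + 1)]
    return sum(w * h(x) for x, w in enumerate(weights)) / sum(weights)


def fit_theta(tau1, tau2, f, h, prior_tau1=0.0, prior_tau2=0.0, lo=-30.0, hi=5.0):
    """Bisection for the theta whose expected h matches the (smoothed) average."""
    target = (tau2 + prior_tau2) / (tau1 + prior_tau1)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_h(mid, f, h) < target:   # E[h] increases with theta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_f(x):
    return 1.0 / math.factorial(x)


def poisson_h(x):
    return float(x)


counts = [0] * 18 + [1, 2]                   # one term's counts in 20 labeled documents
tau1 = len(counts)                           # number of observations
tau2 = sum(poisson_h(c) for c in counts)     # sum of h(d_t) over observations
theta = fit_theta(tau1, tau2, poisson_f, poisson_h)
theta_with_prior = fit_theta(tau1, tau2, poisson_f, poisson_h,
                             prior_tau1=20.0, prior_tau2=0.0)
print(math.exp(theta), math.exp(theta_with_prior))
```

For the Poisson choice, exp(θ) recovers the average count, and the pseudo-observations pull it toward the prior's average.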

Page 39:

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Page 40:

Results: Labeled Documents

[Figure: precision-recall plot. x-axis: Recall; y-axis: Precision (0 to 0.8). Series: Poisson, New Model.]

Page 42:

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Short query

Page 43:

Retrieval: Query

• Query = single labeled document
• Vector space-like equation

$$RV = \sum_{t \text{ in doc}} a(t, d) + \sum_{q \text{ in query}} b(q, d)$$

• Problem: Document dominates
• Solution: Use only query portion
• Another solution: Normalize

Page 44:

Retrieval: Query

[Figure: precision-recall plot. x-axis: Recall; y-axis: Precision (0 to 0.6). Series: Poisson, New Model, TF.IDF.]

Page 47:

Conclusion

• Probabilistic models
  - Example: Poisson Model
• Hyper-learning the model
  - Exponential framework
  - Learned a better model
  - Investigate retrieval performance

- Easy to work with

- Better …

- Bad text model

- Heavy tailed!

Page 48:

Future Work

• Use model better
  - Use for other applications: other IR applications, classification
  - Correct for document length
• Hyper-learn on different corpora
  - Test if learned model generalizes
  - Different for genre? Language? People?
• Hyper-learn model better

Page 49:

Questions?

Contact us with questions:

Jaime [email protected]

David [email protected]