
Page 1

Supervised IR

• Incremental user feedback (Relevance Feedback)

OR

• Initial fixed training sets

– User tags relevant/irrelevant

– Routing problem: initial class

Big open question: How do we obtain feedback automatically with minimal effort?

Refined computation of the relevant set based on:

Page 2

"Unsupervised" IR

Predicting relevance without user feedback

Pattern matching:

– Query vector/set

– Document vector/set

– Co-occurrence of terms assumed to be an indication of relevance

Page 3

Relevance Feedback

Incremental Feedback in vector model

Refer to Rocchio, 1971.

Q0 = initial query

Q1 = Q0 + (1/N_Rel) * Σ(i = 1..N_Rel) R_i  -  (1/N_Irrel) * Σ(i = 1..N_Irrel) S_i

where the R_i are the vectors of the documents tagged relevant and the S_i are the vectors of the documents tagged irrelevant.
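The update can be sketched in a few lines of Python. This is an illustration only, not code from the lecture; the toy query and document vectors are invented.

```python
# Minimal sketch of the Rocchio (1971) relevance-feedback update shown above.
# Queries and documents are dense term-weight vectors (plain Python lists).

def rocchio_update(q0, relevant_docs, irrelevant_docs):
    """Q1 = Q0 + mean(relevant doc vectors) - mean(irrelevant doc vectors)."""
    dims = len(q0)
    q1 = list(q0)
    if relevant_docs:
        for d in range(dims):
            q1[d] += sum(doc[d] for doc in relevant_docs) / len(relevant_docs)
    if irrelevant_docs:
        for d in range(dims):
            q1[d] -= sum(doc[d] for doc in irrelevant_docs) / len(irrelevant_docs)
    return q1

# Example: 4-term vocabulary, one document judged relevant, one judged irrelevant.
q0 = [1.0, 0.0, 0.5, 0.0]
rel = [[0.8, 0.2, 0.9, 0.0]]
irrel = [[0.0, 0.7, 0.0, 0.6]]
print(rocchio_update(q0, rel, irrel))
```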

Page 4

Probabilistic IR/Text Classification

Document Retrieval

If P(Rel|Doci) > P(Irrel|Doci)

Then Doci is “relevant”

Else Doci is “not relevant”

-OR-

If P(Rel|Doci) / P(Irrel|Doci) > 1

Then Doci is "relevant"…

The magnitude of the ratio indicates our confidence.
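A minimal sketch of this decision rule in Python, assuming the two posterior probabilities have already been estimated somewhere upstream; the numbers below are arbitrary.

```python
# Decision rule from the slide: compare P(Rel|Doc) with P(Irrel|Doc),
# or equivalently test whether their ratio exceeds 1.

def is_relevant(p_rel, p_irrel):
    ratio = p_rel / p_irrel          # magnitude of the ratio reflects confidence
    return ratio > 1, ratio

decision, confidence = is_relevant(p_rel=0.95, p_irrel=0.05)
print("relevant" if decision else "not relevant", "ratio =", confidence)
```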

Page 5

Text Classification

Select Classj (e.g. Bowling, DogBreeding, etc.) such that:

P(Classj | Doci) is maximized, where Doci is an incoming mail message

Alternately, select Classj such that

P(Classj | Doci) / P(NOT Classj | Doci)

is maximized

Page 6

General Formulation

Compute:

P(Classj | Evidence)

where Classj is one of a fixed K *disjoint* classes (an item cannot be a member of more than one) and the evidence is a set of feature values (e.g. words in a language, medical test results, etc.)

Uses:

• REL/IRREL: Document Retrieval

• Work/Bowling/Dog Breeding: Text Classification/Routing

• Spanish/Italian/English: Language ID

• Sick/Well: Medical Diagnosis

• Herpes/Not Herpes: Medical Diagnosis

Page 7

Feature Set

Goal: to compute P(Classj | Doci)  (abstract formulation)

P(Classj | Representation of Doci)  (probability given a representation of Doci)

P(Classj | W1, W2, …, Wk)  (one representation: the vector of words in the document)

-OR-

P(Classj | F1, F2, …, Fk)  (more general: a list of document features)
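For concreteness, a small sketch of the word-vector representation: turning a raw document into the features W1…Wk as a bag of words. The whitespace/lowercase tokenization is a simplifying assumption, not something the slides specify.

```python
# Represent Doc_i as a bag of word features W1..Wk.
# Tokenization by whitespace and lowercasing is an assumption for illustration.
from collections import Counter

def bag_of_words(text):
    tokens = text.lower().split()
    return Counter(tokens)           # feature -> count

doc = "Collie show pup fur collie groom"
print(bag_of_words(doc))             # Counter({'collie': 2, 'show': 1, ...})
```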

Page 8

Problem – Sparse Feature Set

In medical diagnosis there are few enough features that it is worth considering all possible feature combinations:

Class     Test 1   Test 2   Test 3   F(H) / F(Not H)
Herpes    T        T        T        30 / 1
-Herpes   T        T        F        12 / 120
Herpes    T        F        T        17 / 3
-Herpes   T        F        F        4 / 186
-Herpes   F        T        T        100 / 32

Can compute P(Evidence|Classi) directly from data for all evidence patterns, e.g. P(T,T,F|Herpes) = 12 / (total Herpes cases)

In IR:

Class     Word 17   Word 24   Word 38   Word 54
Work      C++       Compile   Run       486
Personal  Collie    Show      Pup       Fur
Work      …         …         …         …
Personal  Akita     Show      Pup       Groom

Too many combinations of feature values to estimate the class distribution over all combinations.
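The direct-counting idea can be made concrete with a short sketch; the counts are the ones in the table above (only the rows shown), and the closing comment restates the sparsity problem the slide is pointing at.

```python
# Estimate P(evidence pattern | class) directly from counts of complete
# feature patterns, as in the medical-diagnosis table above.
counts = {
    # (Test1, Test2, Test3): (count with Herpes, count without Herpes)
    (True, True, True):   (30, 1),
    (True, True, False):  (12, 120),
    (True, False, True):  (17, 3),
    (True, False, False): (4, 186),
    (False, True, True):  (100, 32),
}

total_herpes = sum(h for h, _ in counts.values())   # total Herpes cases in the rows shown
p_ttf_given_herpes = counts[(True, True, False)][0] / total_herpes
print(p_ttf_given_herpes)                            # P(T,T,F | Herpes) = 12 / total Herpes

# With 3 binary tests there are only 2**3 = 8 possible patterns, so counting
# every pattern is feasible.  With a vocabulary of tens of thousands of words
# the number of patterns explodes, which is why IR cannot estimate the class
# distribution for every full combination of feature values.
```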

Page 9

Bayes Rule

P(Classi | Evidence) = P(Evidence | Classi) * P(Classi) / P(Evidence)

where P(Classi | Evidence) is the posterior probability of the class given the evidence and P(Classi) is the prior probability of the class.

Uninformative prior: P(Classi) = 1 / (total # of classes)
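A minimal sketch of Bayes' rule as written above, with P(Evidence) obtained by summing over the classes and the uninformative prior of 1/(number of classes) used when no prior is supplied; the likelihood numbers are invented.

```python
# Bayes rule: P(Class_i | Evidence) = P(Evidence | Class_i) * P(Class_i) / P(Evidence),
# where P(Evidence) = sum over classes of P(Evidence | Class_j) * P(Class_j).

def posterior(likelihoods, priors=None):
    classes = list(likelihoods)
    if priors is None:                      # uninformative prior: 1 / (number of classes)
        priors = {c: 1.0 / len(classes) for c in classes}
    p_evidence = sum(likelihoods[c] * priors[c] for c in classes)
    return {c: likelihoods[c] * priors[c] / p_evidence for c in classes}

# Hypothetical likelihoods P(Evidence | Class) for two classes.
print(posterior({"Herpes": 0.8, "NotHerpes": 0.05}))
```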

Page 10

Example in Medical Diagnosis

A single blood test:

P(Herpes | Evidence) = P(Evidence | Herpes) * P(Herpes) / P(Evidence)

where P(Herpes | Evidence) is the probability of herpes given a test result, P(Evidence | Herpes) is the probability of the test result if the patient has herpes, P(Herpes) is the prior probability of the patient having herpes, and P(Evidence) is the probability of a (pos/neg) test result.

P(Herpes | Positive Test) = .9

P(Herpes | Negative Test) = .001

P(Not Herpes | Positive Test) = .1

P(Not Herpes | Negative Test) = .999

Page 11

Evidence Decomposition

P(Classj | Evidence), where the evidence is a given combination of feature values

Medical Diagnosis:

Class      Blood Test   Visible Sores   Fever   Blood Test 2
HERPES     POS          T               T       F
NOT HERP   NEG          F               T       F
NOT HERP   NEG          F               F       F
HERPES     NEG          T               F       T

Text Classification / Routing:

Class          W13        W27       W34        W49            Wi
Work           Compiler   C++       YK486      Disassembler   …
Bowling        …          …         …          …              …
Dog Breeding   Collie     Show      Grooming   Sire           …
Personal       date       Tonight   movie      love           …

Page 12

Example in Text Classification / Routing

P(Classi | Evidence) = P(Evidence | Classi) * P(Classi) / P(Evidence)

where Classi might be Dog Breeding, the evidence is the words in the message (collie, groom, show), and P(Classi) is the prior chance that a mail message is about dog breeding. P(Evidence | Classi) is observed directly from training data.

Training data, Class 1 – Dog Breeding: Fur, Collie, Collie, Groom, Show, Poodle, Sire, Breed, Akita, Pup

Training data, Class 2 – Work: Compiler, X86, C++, Lex, YACC, Java, Computer
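A rough sketch of the whole pipeline on this slide: estimate P(word | class) from the two training word lists and score a new message with Bayes' rule under an independence assumption over words (made explicit a few slides later). Add-one smoothing and the prior values are my assumptions; the slides do not specify either.

```python
# Estimate P(word | class) from the small training lists above, then score a
# new message under the word-independence (naive Bayes) assumption.
from collections import Counter

training = {
    "DogBreeding": ["fur", "collie", "collie", "groom", "show", "poodle",
                    "sire", "breed", "akita", "pup"],
    "Work": ["compiler", "x86", "c++", "lex", "yacc", "java", "computer"],
}
vocab = {w for words in training.values() for w in words}

def word_prob(word, cls):
    counts = Counter(training[cls])
    # Add-one smoothing (an assumption) so unseen words do not get probability 0.
    return (counts[word] + 1) / (len(training[cls]) + len(vocab))

def score(message, prior):
    """Unnormalized P(class | message) ~ P(class) * product of P(word | class)."""
    scores = {}
    for cls in training:
        p = prior[cls]
        for word in message:
            p *= word_prob(word, cls)
        scores[cls] = p
    return scores

print(score(["collie", "groom", "show"], prior={"DogBreeding": 0.3, "Work": 0.7}))
```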

Page 13

Probabilistic IR

Target/Goal:

Document Retrieval (for a given model of relevance to the user's needs):

Evidence          P(Rel|Doci)   P(Irrel|Doci)
(Words in) Doc1   .95           .05
(Words in) Doc2   .80           .20
(Words in) Doc3   .01           .99

Document Routing / Classification:

Evidence          P(Work1)   P(Work2)   P(Dog Breeding)   P(Bowling)   P(other)
(Words in) Doc1   .91        .01        .07               .02          .01
(Words in) Doc2   .45        .45        .03               .05          .02
(Words in) Doc3   .01        .03        .94               .01          .01

Page 14

Multiple Binary Splits

[Diagram: a query Q1 is classified by successive binary splits – first A vs. B, then A1 vs. A2 or B1 vs. B2.]

Flat K-Way Classification

[Diagram: Q1 is classified directly into one of the classes A, B, C, D, E, F, G.]

Page 15

Likelihood Ratios

P(Class1 | Evidence) = P(Evidence | Class1) * P(Class1) / P(Evidence)

P(Class2 | Evidence) = P(Evidence | Class2) * P(Class2) / P(Evidence)

Dividing the two, P(Evidence) cancels:

P(Class1 | Evidence) / P(Class2 | Evidence) = [P(Evidence | Class1) / P(Evidence | Class2)] * [P(Class1) / P(Class2)]

Page 16

Likelihood Ratios

Binary Classifications

1. P(Rel|Doci) / P(Irrel|Doci) – document retrieval; the options are Rel and Irrel

2. P(Work|Doci) / P(Personal|Doci) – binary routing task (2 possible classes)

3. P(Classj|Doci) / P(NOT Classj|Doci) – can treat K-way classification as a series of binary classifications:

• Compute this ratio for all classes

• Choose the class j for which this ratio is greatest

Page 17

Independence Assumption

Evidence = w1, w2, w3, …, wk

P(Class1|Evidence) / P(Class2|Evidence) = [P(Class1) / P(Class2)] * [P(Evidence|Class1) / P(Evidence|Class2)]

= [P(Class1) / P(Class2)] * Π(i = 1..k) [P(wi|Class1) / P(wi|Class2)]

(final odds = initial odds * product of the per-word likelihood ratios)

Page 18

Using Independence Assumption

P(Personal | Akita, pup, fur, show) / P(Work | Akita, pup, fur, show)

= [P(Personal)/P(Work)] * [P(Akita|Personal)/P(Akita|Work)] * [P(pup|Personal)/P(pup|Work)] * [P(fur|Personal)/P(fur|Work)] * [P(show|Personal)/P(show|Work)]

With the counts from the training data:

P(Personal|Evidence) / P(Work|Evidence) = (1/9) * (27/2) * (18/0) * (36/2) * (3/5) = some constant

(product of the likelihood ratios for each word)
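A short sketch of this odds computation: the final odds are the initial odds times the product of the per-word likelihood ratios. The slide's P(Akita|Work) entry is zero, which would make the ratio blow up, so a small floor value stands in for smoothing here; both the floor and the per-word probabilities are invented for illustration, not the slide's raw counts.

```python
# Final odds = initial odds * product of per-word likelihood ratios.
# A tiny floor value replaces zero probabilities so a P(word|class) = 0 entry
# (like P(Akita|Work) on the slide) does not cause division by zero.

def odds(initial_odds, word_ratios, floor=1e-6):
    result = initial_odds
    for p_word_class1, p_word_class2 in word_ratios:
        result *= max(p_word_class1, floor) / max(p_word_class2, floor)
    return result

# Illustrative (P(w|Personal), P(w|Work)) pairs for the four words.
ratios = [(0.0027, 0.0002), (0.0018, 0.0), (0.0036, 0.0002), (0.0003, 0.0005)]
print(odds(initial_odds=1/9, word_ratios=ratios))
```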

Page 19

Note: Ratios Are (Partially) Self-Weighting

e.g. P(The|Personal) / P(The|Work) ≈ 1   (5137/100,000 vs. 5238/100,000)

e.g. P(Akita|Personal) / P(Akita|Work) = 37   (37/100,000 vs. 1/100,000)

Page 20

Bayesian Model Applications

Authorship Identification:

P(Hamilton|Evidence) / P(Madison|Evidence) = [P(Evidence|Hamilton) / P(Evidence|Madison)] * [P(Hamilton) / P(Madison)]

Sense Disambiguation:

P(Tank-Container|Evidence) / P(Tank-Vehicle|Evidence) = [P(Evidence|Tank-Container) / P(Evidence|Tank-Vehicle)] * [P(Tank-Container) / P(Tank-Vehicle)]

Page 21

Dependence Trees (Hierarchical Bayesian Models)

P(w1, w2, …, w6) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w2) * P(w5|w2) * P(w6|w5)

[Diagram: dependence tree with arrows showing the direction of dependence – w1 → w2; w2 → w3, w4, w5; w5 → w6.]
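To make the tree factorization concrete, here is a small sketch that evaluates the joint probability of binary variables along a dependence tree with the structure in the figure; the conditional probability tables are invented.

```python
# Evaluate the joint probability of binary variables factorized along a
# dependence tree: P(w1,...,w6) = P(w1) * product over non-root nodes of P(wi | parent(wi)).
# Tree structure follows the figure (w1 -> w2; w2 -> w3, w4, w5; w5 -> w6);
# the probability numbers are invented for illustration.

parent = {"w2": "w1", "w3": "w2", "w4": "w2", "w5": "w2", "w6": "w5"}
p_root = {"w1": 0.6}                                  # P(w1 = True)
p_given = {                                           # P(node = True | parent value)
    "w2": {True: 0.7, False: 0.2},
    "w3": {True: 0.5, False: 0.1},
    "w4": {True: 0.4, False: 0.3},
    "w5": {True: 0.8, False: 0.1},
    "w6": {True: 0.6, False: 0.05},
}

def joint(assignment):
    p = p_root["w1"] if assignment["w1"] else 1 - p_root["w1"]
    for node, par in parent.items():
        cond = p_given[node][assignment[par]]
        p *= cond if assignment[node] else 1 - cond
    return p

print(joint({"w1": True, "w2": True, "w3": False, "w4": True, "w5": True, "w6": False}))
```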

Page 22

Full Probability Decomposition

P(w) = P(w1) * P(w2|w1) * P(w3|w2,w1) * P(w4|w3,w2,w1) * …

Using Simplifying (Markov) Assumptions

P(w) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3) * …
(Assume each word's probability is conditioned only on the previous word)

Assumption of Full Independence

P(w) = P(w1) * P(w2) * P(w3) * P(w4) * …

Graphical Models – Partial Decomposition into Dependence Trees

P(w) = P(w1) * P(w2|w1) * P(w3) * P(w4|w1,w3) * …
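As a small numeric contrast between the Markov decomposition and full independence, here is a toy sketch with invented unigram and bigram probabilities for a three-word sequence.

```python
# Toy comparison of the Markov (bigram) decomposition with full independence
# for a short word sequence; all probabilities are invented for illustration.

unigram = {"the": 0.05, "dog": 0.01, "barks": 0.005}
bigram = {("the", "dog"): 0.2, ("dog", "barks"): 0.1}   # P(w_i | w_{i-1})

sequence = ["the", "dog", "barks"]

# Markov assumption: P(w) = P(w1) * P(w2|w1) * P(w3|w2) * ...
p_markov = unigram[sequence[0]]
for prev, cur in zip(sequence, sequence[1:]):
    p_markov *= bigram[(prev, cur)]

# Full independence: P(w) = P(w1) * P(w2) * P(w3) * ...
p_indep = 1.0
for w in sequence:
    p_indep *= unigram[w]

print(p_markov, p_indep)
```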