Post on 01-Jan-2016
Supervised IR
• Incremental user feedback (Relevance Feedback)
OR
• Initial fixed training sets – user tags documents as relevant/irrelevant
– Routing problem: initial class labels
Big open question: How do we obtain feedback automatically with minimal effort?
Refined computation of the relevant set is based on this feedback.
“Unsupervised” IR
Predicting relevance without user feedback
Pattern matching:
– Query vector/set
– Document vector/set
– Co-occurrence of terms is assumed to be an indication of relevance
Relevance Feedback
Incremental Feedback in vector model
Refer to Rocchio, 71
Q0 = initial query

Q1 = Q0 + (1/NRel) * Σ(i=1..NRel) Ri - (1/NIrrel) * Σ(i=1..NIrrel) Si

where the Ri are the vectors of documents judged relevant and the Si are the vectors of documents judged irrelevant.
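As a concrete sketch, the Rocchio update above can be written in a few lines of NumPy. The unit weights on the three terms follow the formula as given; practical systems often add tunable weights on each term.

```python
import numpy as np

def rocchio_update(q0, relevant, irrelevant):
    """One round of Rocchio relevance feedback: move the query toward the
    centroid of the relevant documents and away from the irrelevant centroid."""
    r = np.mean(relevant, axis=0) if len(relevant) else np.zeros_like(q0)
    s = np.mean(irrelevant, axis=0) if len(irrelevant) else np.zeros_like(q0)
    return q0 + r - s

# Toy term-weight vectors over a 3-word vocabulary.
q0 = np.array([1.0, 0.0, 1.0])
q1 = rocchio_update(q0,
                    relevant=[np.array([1.0, 1.0, 0.0])],
                    irrelevant=[np.array([0.0, 1.0, 1.0])])
# q1 is now [2.0, 0.0, 0.0]
```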
Probabilistic IR/Text Classification
Document Retrieval
If P(Rel|Doci) > P(Irrel|Doci)
Then Doci is “relevant”
Else Doci is “not relevant”
-OR-
If P(Rel|Doci) / P(Irrel|Doci) > 1
Then Doci is “relevant”
The magnitude of the ratio indicates our confidence.
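The ratio rule above is trivial to code; a minimal sketch (the function name is mine) that returns both the label and the ratio as a confidence score:

```python
def retrieval_decision(p_rel, p_irrel):
    """Classify a document by the ratio P(Rel|Doc) / P(Irrel|Doc):
    ratio > 1 means relevant, and its magnitude reflects confidence."""
    ratio = p_rel / p_irrel
    label = "relevant" if ratio > 1 else "not relevant"
    return label, ratio

label, confidence = retrieval_decision(0.95, 0.05)  # "relevant", ratio ~ 19
```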
Text Classification
Select Classj such that:
P(Classj | Doci) is maximized
– Classj: e.g. Bowling, DogBreeding, etc.
– Doci: e.g. an incoming mail message
Alternately, select Classj such that:
P(Classj | Doci) / P(NOT Classj | Doci) is maximized
General Formulation
Compute: P(Classj | Evidence)
– Classj: one of a fixed set of K *disjoint classes* – can’t be a member of more than 1
– Evidence: a set of feature values (e.g. words in a language, medical test results, etc.)
Uses:
• Rel/Irrel – Document Retrieval
• Work/Bowling/Dog Breeding – Text Classification/Routing
• Spanish/Italian/English – Language ID
• Sick/Well – Medical Diagnosis
• Herpes/Not Herpes – Medical Diagnosis
Feature Set
Goal: to compute:
P(Classj | Doci)                      – abstract formulation
P(Classj | representation of Doci)    – probability given a representation of Doci
P(Classj | W1, W2, … Wk)              – one representation: a vector of words in the document
-OR-
P(Classj | F1, F2, … Fk)              – more general: a list of document features
Problem – Sparse Feature Set
In medical diagnosis there are few enough features that it is worth considering all possible feature combinations:

Class        Test 1   Test 2   Test 3   F(H) / F(Not H)
Herpes         T        T        T         30 / 1
Not Herpes     T        T        F         12 / 120
Herpes         T        F        T         17 / 3
Not Herpes     T        F        F          4 / 186
Not Herpes     F        T        T        100 / 32

Can compute P(Evidence|Classi) directly from the data for all evidence patterns,
e.g. P(T,T,F | Herpes) = 12 / (total Herpes cases)
In IR, by contrast:

Class      Word 17   Word 24   Word 38   Word 54
Work       C++       Compile   Run       486
Personal   Collie    Show      Pup       Fur
Work       …
Personal   Akita     Show      Pup       Groom

There are too many combinations of feature values to estimate the class distribution for every combination.
Bayes Rule

P(Classi|Evidence) = P(Evidence|Classi) * P(Classi) / P(Evidence)

P(Classi|Evidence): posterior probability of the class given the evidence
P(Classi): prior probability of the class

Uninformative prior: P(Classi) = 1 / (total # of classes)
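A worked instance of the rule, with hypothetical numbers (the 1% prior and 5% false-positive rate below are my assumptions, not from the slides); P(Evidence) is expanded as the total probability of a positive test:

```python
def posterior(likelihood, prior, evidence_prob):
    """Bayes rule: P(Class|Evidence) = P(Evidence|Class) * P(Class) / P(Evidence)."""
    return likelihood * prior / evidence_prob

# Hypothetical numbers: P(pos|Herpes) = 0.9, P(Herpes) = 0.01,
# P(pos|Not Herpes) = 0.05, so P(pos) = 0.9*0.01 + 0.05*0.99.
p_pos = 0.9 * 0.01 + 0.05 * 0.99
p = posterior(0.9, 0.01, p_pos)  # ~ 0.154: a positive test alone is weak evidence
```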
Example in Medical Diagnosis – a single blood test

P(Herpes|Evidence) = P(Evidence|Herpes) * P(Herpes) / P(Evidence)

P(Herpes|Evidence): probability of herpes given a test result
P(Evidence|Herpes): probability of the test result if the patient has herpes
P(Herpes): prior probability of the patient having herpes
P(Evidence): probability of a (pos/neg) test result

P(Herpes|Positive Test)     = .9
P(Herpes|Negative Test)     = .001
P(Not Herpes|Positive Test) = .1
P(Not Herpes|Negative Test) = .999
Evidence Decomposition
P(Classj | Evidence), where Evidence is a given combination of feature values

Medical Diagnosis:

Class      Blood Test   Visible Sores   Fever   Blood Test 2
HERPES     POS          T               T       F
NOT HERP   NEG          F               T       F
NOT HERP   NEG          F               F       F
HERPES     NEG          T               F       T
Text Classification / Routing:

Class          W13        W27       W34        W49            Wi
Work           Compiler   C++       YK486      Disassembler   …
Bowling        …
Dog Breeding   Collie     Show      Grooming   Sire           …
Personal       date       Tonight   movie      love           …
Example in Text Classification / Routing

P(Classi|Evidence) = P(Evidence|Classi) * P(Classi) / P(Evidence)

P(Evidence|Classi): observed directly through training data
  (e.g. Dog Breeding evidence: collie, groom, show)
P(Classi): prior chance that a mail message is about dog breeding

Training data:

Class 1 – Dog Breeding: Fur, Collie, Collie, Groom, Show, Poodle, Sire, Breed, Akita, Pup
Class 2 – Work: Compiler, X86, C++, Lex, YACC, Java, Computer
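The two training lists can drive a tiny naive Bayes classifier. A minimal sketch; the Laplace smoothing and the uniform prior are my additions (the slides’ raw counts would give zero probability to any unseen word):

```python
from collections import Counter

# Training vocabularies from the slide's two classes.
training = {
    "DogBreeding": ["fur", "collie", "collie", "groom", "show",
                    "poodle", "sire", "breed", "akita", "pup"],
    "Work": ["compiler", "x86", "c++", "lex", "yacc", "java", "computer"],
}
counts = {c: Counter(ws) for c, ws in training.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
vocab = {w for ws in training.values() for w in ws}

def p_word_given_class(word, cls, alpha=1.0):
    # Add-one (Laplace) smoothing keeps unseen words from zeroing the product.
    return (counts[cls][word] + alpha) / (totals[cls] + alpha * len(vocab))

def classify(words, prior=0.5):
    """Choose the class maximizing P(Class) * prod_i P(w_i | Class)."""
    scores = {}
    for cls in training:
        score = prior
        for w in words:
            score *= p_word_given_class(w, cls)
        scores[cls] = score
    return max(scores, key=scores.get)

# classify(["collie", "pup"])     -> "DogBreeding"
# classify(["compiler", "java"])  -> "Work"
```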
Probabilistic IR
Target/Goal:

Document Retrieval – for a given model of relevance to the user’s needs:

Evidence          P(Rel|Doci)   P(Irrel|Doci)
(Words in) Doc1   .95           .05
(Words in) Doc2   .80           .20
(Words in) Doc3   .01           .99

Document Routing / Classification:

Evidence          P(Work1)   P(Work2)   P(Dog Breeding)   P(Bowling)   P(other)
(Words in) Doc1   .91        .01        .07               .02          .01
(Words in) Doc2   .45        .45        .03               .05          .02
(Words in) Doc3   .01        .03        .94               .01          .01
Multiple Binary Splits
[Diagram: question Q1 first splits into A and B; A then splits into A1 and A2, and B into B1 and B2]

Flat K-Way Classification
[Diagram: a single question Q1 chooses directly among classes A, B, C, D, E, F, G]
Likelihood Ratios

P(Class1|Evidence) = P(Evidence|Class1) * P(Class1) / P(Evidence)
P(Class2|Evidence) = P(Evidence|Class2) * P(Class2) / P(Evidence)

Dividing the two, P(Evidence) cancels:

P(Class1|Evidence)     P(Evidence|Class1)     P(Class1)
------------------  =  ------------------  *  ---------
P(Class2|Evidence)     P(Evidence|Class2)     P(Class2)
Likelihood Ratios
Binary Classifications

1. P(Rel|Doci) / P(Irrel|Doci)
   Document retrieval: the options are Rel and Irrel
2. P(Work|Doci) / P(Personal|Doci)
   Binary routing task (2 possible classes)
3. P(Classj|Doci) / P(NOT Classj|Doci)
   Can treat K-way classification as a series of binary classifications:
   • Compute this ratio for all classes
   • Choose the class j for which this ratio is greatest
Independence Assumption

Evidence = w1, w2, w3, … wk

P(Class1|Evidence)     P(Class1)     P(Evidence|Class1)
------------------  =  ---------  *  ------------------
P(Class2|Evidence)     P(Class2)     P(Evidence|Class2)

                       P(Class1)      k    P(wi|Class1)
                    =  ---------  *   Π    ------------
                       P(Class2)     i=1   P(wi|Class2)

Final odds = initial odds * product of per-word likelihood ratios
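In code, the odds update is just a running product (in practice one sums logarithms instead, to avoid underflow). The likelihood ratios below are hypothetical; the slide’s own example even contains a zero count, which smoothing would have to handle:

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """Final odds = initial odds * product of per-word likelihood ratios,
    under the naive independence assumption."""
    odds = prior_odds
    for r in likelihood_ratios:
        odds *= r
    return odds

# Hypothetical ratios P(w|Personal) / P(w|Work) for three observed words.
odds = posterior_odds(1 / 9, [27 / 2, 36 / 2, 3 / 5])  # = 16.2: favors Personal
```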
Using the Independence Assumption

P(Personal|Akita,pup,fur,show)     P(Personal)   P(Akita|Personal)   P(pup|Personal)   P(fur|Personal)   P(show|Personal)
------------------------------  =  ----------- * ----------------- * --------------- * --------------- * ----------------
P(Work|Akita,pup,fur,show)         P(Work)       P(Akita|Work)       P(pup|Work)       P(fur|Work)       P(show|Work)

With counts from the training data:

P(Personal|Evidence)     1     27     18     36     3
--------------------  =  -  *  --  *  --  *  --  *  -  =  some constant
P(Work|Evidence)         9     2      0      2      5

(the product of the likelihood ratios for each word; note the zero count in one denominator)

Note: the ratios are (partially) self-weighting, e.g.:

P(The|Personal)     5137/100,000
---------------  =  ------------  ≈  1
P(The|Work)         5238/100,000

P(Akita|Personal)     37/100,000
-----------------  =  ----------  =  37
P(Akita|Work)          1/100,000
Bayesian Model Applications

Authorship identification:

P(Hamilton|Evidence)     P(Evidence|Hamilton)     P(Hamilton)
--------------------  =  --------------------  *  -----------
P(Madison|Evidence)      P(Evidence|Madison)      P(Madison)

Sense disambiguation:

P(Tank-Container|Evidence)     P(Evidence|Tank-Container)     P(Tank-Container)
--------------------------  =  --------------------------  *  -----------------
P(Tank-Vehicle|Evidence)       P(Evidence|Tank-Vehicle)       P(Tank-Vehicle)
Dependence Trees (Hierarchical Bayesian Models)

P(w1,w2,…,w6) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w2) * P(w5|w2) * P(w6|w5)

[Diagram: dependence tree with arrows giving the direction of dependence – w1 → w2; w2 → w3, w4, w5; w5 → w6]
Full Probability Decomposition
P(w) = P(w1) * P(w2|w1) * P(w3|w2,w1) * P(w4|w3,w2,w1) * …

Using Simplifying (Markov) Assumptions
P(w) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3) * …
(assume each word is conditioned only on the previous word)

Assumption of Full Independence
P(w) = P(w1) * P(w2) * P(w3) * P(w4) * …

Graphical Models – Partial Decomposition into Dependence Trees
P(w) = P(w1) * P(w2|w1) * P(w3) * P(w4|w1,w3) * …
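The Markov (bigram) decomposition above can be estimated from raw counts; a minimal sketch with a toy corpus (no smoothing, so an unseen bigram would get probability zero):

```python
from collections import Counter

def markov_prob(sentence, unigrams, bigrams, total):
    """P(w1..wn) ~ P(w1) * prod_i P(w_i | w_{i-1}) under the Markov assumption."""
    words = sentence.split()
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = "the dog ran the dog sat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
p = markov_prob("the dog ran", unigrams, bigrams, len(corpus))
# p = (2/6) * (2/2) * (1/2) = 1/6
```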