A Review of Information Filtering
Part I: Adaptive Filtering
Chengxiang Zhai
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Outline
• The Problem of Adaptive Information
Filtering (AIF)
• The TREC Work on AIF
– Evaluation Setup
– Main Approaches
– Sample Results
• The Importance of Learning
• Summary & Research Directions

Adaptive Information Filtering (AIF)
• Dynamic information stream
• (Relatively) stable user interest
• System “blocks” non-relevant information according
to user’s interest
• User provides feedback on the received items
• System learns from user’s feedback
• Performance measured by the utility of the filtering
decisions

A Typical AIF Application: News Filtering
• Given a news stream and users
• Each user expresses interest by a text “query”
• For each news article, system makes a yes/no
filtering decision for each user interest
• User provides feedback on the received news
• System learns from feedback
• Utility = 3*|Good| - 2 *|Bad|

AIF vs. Retrieval, Categorization, Topic tracking etc.
• AIF is like retrieval over a dynamic stream of
information items, but ranking is impossible
• AIF is like online binary categorization without
initial training data and with limited feedback
• AIF is like tracking user interest over a news
stream

Evaluation of AIF
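• Primary measure: linear utility (-> prob. cut)
$U = \lambda_1 R^+ + \lambda_2 N^+ + \lambda_3 R^- + \lambda_4 N^-$
($R^+$/$N^+$: relevant/non-relevant docs delivered; $R^-$/$N^-$: relevant/non-relevant docs rejected)
• E.g., $LF1 = 3R^+ - 2N^+$, used in TREC7 & 8
$T9U = \max(2R^+ - N^+, MinU)$, used in TREC9
• Problems with the linear utility
– Unbounded
– Not comparable across topics/profiles
– Average utility may be dominated by one topic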
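
The link between a linear utility and a probability cutoff can be sketched in a few lines. This is a minimal illustration, not the TREC scoring code: the function names are hypothetical, only the delivered-document terms (a credit per relevant doc, a penalty per non-relevant doc) are modeled, and the `min_u` value used in any call is an assumption of the caller.

```python
def linear_utility(r_plus, n_plus, credit=3.0, penalty=2.0):
    """Utility of a set of delivered docs: credit per relevant, penalty per non-relevant."""
    return credit * r_plus - penalty * n_plus


def bounded_utility(r_plus, n_plus, min_u, credit=2.0, penalty=1.0):
    """Linear utility with a floor, in the style of max(2*R+ - N+, MinU)."""
    return max(credit * r_plus - penalty * n_plus, min_u)


def probability_cutoff(credit, penalty):
    """Delivering a doc pays off in expectation when
    credit * p - penalty * (1 - p) > 0, i.e. p > penalty / (credit + penalty)."""
    return penalty / (credit + penalty)
```

Under these assumptions, a utility of 3 per relevant and -2 per non-relevant doc gives a cutoff of 0.4, a 4/-1 utility gives 0.2, and a 2/-1 utility gives about 0.33.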

Other Measures
• Nonlinear utility (e.g., “early” relevant doc is
worth more)
• Normalized utility
– More meaningful for averaging
– But can be inversely correlated with precision/recall!
• Other measures that reflect a trade-off between
precision and recall

A Typical AIF System
[Diagram components: Doc Source, Binary Classifier, Accepted Docs, User, User profile text, Initialization, User Interest Profile, Feedback, Accumulated Docs, Learning, utility func]

Three Basic Problems in AIF
• Making filtering decision (Binary classifier)
– Doc text, profile text yes/no
• Initialization
– Initialize the filter based on only the profile text or very few
examples
• Learning from
– Limited relevance judgments (only on “yes” docs)
– Accumulated documents
• All trying to maximize the utility
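
The three problems above can be seen together in a small filtering loop. This is only a sketch under stated assumptions: documents and the profile are hypothetical term-weight dictionaries, the score is a plain dot product, `judge` stands in for the user, and no learning step is included. The point is that feedback is collected only for delivered ("yes") documents, so the training data is censored.

```python
def run_filter(stream, profile, threshold, judge, credit=3.0, penalty=2.0):
    """Filter a stream of docs against one profile with a fixed score threshold.

    stream:  iterable of docs, each a dict mapping term -> weight
    profile: dict mapping term -> weight
    judge:   callable(doc) -> bool, asked only about delivered docs
    Returns (utility, feedback), where feedback holds (score, is_relevant)
    pairs for delivered docs only; rejected docs are never judged.
    """
    utility = 0.0
    feedback = []
    for doc in stream:
        score = sum(w * profile.get(term, 0.0) for term, w in doc.items())
        if score >= threshold:          # "yes" decision: deliver the doc
            relevant = judge(doc)       # feedback exists only for delivered docs
            feedback.append((score, relevant))
            utility += credit if relevant else -penalty
    return utility, feedback
```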

The TREC Work on AIF
• The Filtering Track of TREC
• Major Approaches to AIF
• Sample Results

The Filtering Track (TREC7, 8, & 9) (Hull 99, Hull & Robertson 00, Robertson & Hull 01)
• Encourage development and evaluation of
techniques for text filtering
• Tasks
– Adaptive filtering (start with little or no training data, online
filtering with limited feedback)
– Batch filtering (start with many training examples, online
filtering with limited feedback)
– Routing (start with many training examples, ranking test
documents)

AIF Evaluation Setup
• TREC7: LF1, LF3 utility functions
– AP88--90 + 50 topics
– No training initially
• TREC8: LF1, LF2 utility functions
– Financial Times 92-94 + 50 topics
– No training initially
• TREC9: T9U, Precision@50, etc
– OHSUMED + 63 original topics + 4903 MeSH topics
– 2? initial (positive) training examples available

Major Approaches to AIF
• “Extended” retrieval systems
– “Reuse” retrieval techniques to score documents
– Use a score threshold for filtering decision
– Learn to improve scoring with traditional feedback
– New approaches to threshold setting and learning
• “Modified” categorization systems
– Adapt to binary, unbalanced categorization
– New approaches to initialization
– Train with “censored” training examples

A General Vector-Space AIF Approach
[Diagram: the doc vector and the profile vector feed Scoring, followed by Thresholding against the threshold (yes/no). Feedback Information drives Vector Learning and Threshold Learning, together with Utility Evaluation.]

Extended Retrieval Systems
• City Univ./Microsoft (Okapi): Prob. IR
• Univ. of Massachusetts (Inquery): Infer. Net.
• Queens College, CUNY (Pirc): Prob. IR
• Clairvoyance Corp. (Clarit): Vector Space
• Univ. of Nijmegen (KUN): Vector Space
• Univ. of Twente (TNO): Language Model
• And many others … ...

Threshold Setting in Extended Retrieval Systems
• Utility-independent approaches (generally not
working well, not covered in this talk)
• Indirect (linear) utility optimization
– Logistic regression (score->prob. of relevance)
• Direct utility optimization
– Empirical utility optimization
– Expected utility optimization given score distributions
• All try to learn the optimal threshold

Difficulties in Threshold Learning
Score   Judgment
36.5    R
33.4    N
32.1    R
----- threshold θ = 30.0 -----
29.9    ?
27.3    ?
...
• Censored data
• Little or no labeled data
• Scoring bias due to vector learning

Logistic Regression
• General idea: convert score of D to p(R|D)
$\log O(R|D) = \alpha + \beta\, s(D)$
• Fit the model using feedback data
• Linear utility is optimized with a fixed prob. cutoff
• But,
– Possibly incorrect parametric assumptions
– No positive examples initially
– Censored data and limited positive feedback
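
As a rough sketch of this idea, the code below fits a two-parameter logistic model (log-odds of relevance linear in the score) to judged scores and converts a probability cutoff into a score threshold. The parameter names `alpha` and `beta`, the plain batch gradient ascent, and its step size are illustrative assumptions; this is not the Okapi procedure, and it ignores the censoring problem listed above.

```python
import math


def fit_logistic(scores, labels, lr=0.1, epochs=5000):
    """Fit log-odds(R | D) = alpha + beta * score by batch gradient ascent
    on the log-likelihood. labels are 1 (relevant) or 0 (non-relevant)."""
    alpha, beta = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * s)))
            grad_a += y - p
            grad_b += (y - p) * s
        alpha += lr * grad_a / n
        beta += lr * grad_b / n
    return alpha, beta


def score_threshold(alpha, beta, p_cut):
    """Score at which the fitted p(R | D) equals p_cut (assumes beta > 0)."""
    return (math.log(p_cut / (1.0 - p_cut)) - alpha) / beta
```

With a 3/-2 linear utility the cutoff is 0.4, so documents scoring above `score_threshold(alpha, beta, 0.4)` would be delivered.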

Logistic Regression in Okapi (Robertson & Walker 2000)
• Motivation: Recover probability of relevance
from the original prob. IR model
$\log O(R|D) = \beta + \gamma\,\frac{s(D)}{ast1}$
• Need to estimate β, γ, and ast1 (avg. score of
top 1% docs)
• All topics share the same γ, which is initially
set and never updated

Logistic Regression in Okapi (cont.)
• Initially, all topics share the same β, γ, and ast1 is
estimated with a linear regression ast1 =
a1 + a2 * maxscore
• After one week, ast1 is estimated based on the
documents available from the week.
• Threshold learning
– γ is fixed all the time
– β is updated with gradient descent
– heuristic “ladder” is used to allow “exploration”

Logistic Regression in Okapi (cont.)
• Pros
– Well-motivated method for the Okapi system
– Based on principled approach
• Cons
– Limited adaptation
– Exploration is ad hoc (over-explore initially)
– Some nonlinear utility may not correspond to a
fixed probability cutoff

Direct Utility Optimization
• Given
– A utility function $U(C_{R^+}, C_{R^-}, C_{N^+}, C_{N^-})$
– Training data $D = \{\langle s_i, \{R, N, ?\}\rangle\}$
• Formulate utility as a function of the threshold and training data: $U = F(\theta, D)$
• Choose the threshold by optimizing $F(\theta, D)$, i.e.,
$\theta^* = \arg\max_{\theta} F(\theta, D)$

Empirical Utility Optimization
• Basic idea
– Compute the utility on the training data for each
candidate threshold (score of a training doc)
– Choose the threshold that gives the maximum utility
• Difficulty: Biased training sample!
– We can only get an upper bound for the true optimal
threshold.
• Solutions:
– Heuristic adjustment (lowering) of threshold
– Lead to “beta-gamma threshold learning”
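
The basic idea can be sketched as a single pass over the judged documents sorted by score. This is a minimal illustration, assuming a linear utility with a credit per relevant and a penalty per non-relevant doc, distinct scores, and delivery of every doc scoring at or above the cutoff; the function name is hypothetical.

```python
def empirical_threshold(judged, credit=3.0, penalty=2.0):
    """Pick the training score that maximizes utility on the training data.

    judged: list of (score, is_relevant) pairs for judged docs.
    Returns (threshold, utility). The threshold is None when no cutoff
    beats delivering nothing (utility 0).
    """
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    best_theta, best_utility = None, 0.0
    running = 0.0
    for score, relevant in ranked:
        # Utility of delivering every doc with score >= this candidate cutoff.
        running += credit if relevant else -penalty
        if running > best_utility:
            best_theta, best_utility = score, running
    return best_theta, best_utility
```

Because only delivered docs are judged, the cutoff found this way can never fall below the threshold that was in force when the data was collected, which is why it is only an upper bound on the true optimum.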

The Beta-Gamma Threshold Learning Method in CLARIT (Zhai et al. 00)
• Basic idea
– Extend the empirical utility optimization method by putting a lower bound on the threshold.
– β is to correct score bias
– γ is to control exploration
– β, γ are relatively stable and can be tuned based on independent data
• Can optimize any utility function (with appropriate “zero” utility threshold)
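Illustration of Beta-Gamma Threshold Learning
[Figure: utility as a function of cutoff position (0, 1, 2, 3, …, K, ...), marking $\theta_{optimal}$ and $\theta_{zero}$]
$\theta = \alpha\,\theta_{zero} + (1-\alpha)\,\theta_{optimal}$
$\alpha = \beta + (1-\beta)\,e^{-N\gamma}$, with $\beta, \gamma \in [0,1]$ and $N$ = # training examples
Encourage exploration up to $\theta_{zero}$
The more examples, the less exploration (closer to optimal)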
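
The interpolation can be written directly as code. This is a sketch that takes the zero-utility and optimal thresholds as given inputs and uses the form alpha = beta + (1 - beta) * exp(-N * gamma); the default `beta` and `gamma` values are placeholders chosen for illustration, not tuned settings.

```python
import math


def beta_gamma_threshold(theta_zero, theta_optimal, n_examples, beta=0.1, gamma=0.05):
    """Interpolate between the zero-utility threshold and the empirically
    optimal threshold.

    alpha starts at 1 (threshold at theta_zero, maximal exploration) and
    decays toward beta as the number of training examples grows.
    """
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal
```

With no training examples the result equals `theta_zero`; with many examples it settles at `beta * theta_zero + (1 - beta) * theta_optimal`, so `beta` sets the permanent downward correction and `gamma` sets how quickly exploration is withdrawn.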

Beta-Gamma Threshold Learning (cont.)
• Pros
– Explicitly addresses exploration-exploitation
tradeoff (“Safe” exploration)
– Arbitrary utility (with appropriate lower bound)
– Empirically effective and robust
• Cons
– Purely heuristic
– Zero utility lower bound often too conservative

Score Distribution Approaches (Arampatzis & van Hameren 01; Zhang & Callan 01)
• Assume generative model of scores p(s|R), p(s|N)
• Estimate the model with training data
• Find the threshold by optimizing the expected utility under the estimated model
• Specific methods differ in the way of defining and estimating the scoring distributions

A General Formulation of Score Distribution Approaches
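• Given p(R), p(s|R), and p(s|N), E[U] for sample size n is a function of θ and n, i.e., E[U] = F(n, θ), through the expected counts
$C_{R^+} = n\,p(R)\int_{\theta}^{\infty} p(s|R)\,ds$, $C_{R^-} = n\,p(R)\int_{-\infty}^{\theta} p(s|R)\,ds$
$C_{N^+} = n\,(1-p(R))\int_{\theta}^{\infty} p(s|N)\,ds$, $C_{N^-} = n\,(1-p(R))\int_{-\infty}^{\theta} p(s|N)\,ds$
• The optimal threshold for sample size n is
$\theta^*(n) = \arg\max_{\theta} F(n, \theta)$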

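Solution for Linear Utility & Continuous p(s|R) & p(s|N)
• Linear utility
$U = \lambda_1 R^+ + \lambda_2 N^+ + \lambda_3 R^- + \lambda_4 N^-$
• The optimal threshold is the solution to the following equation (independent of n)
$(\lambda_1 - \lambda_3)\,p(R)\,p(\theta|R) + (\lambda_2 - \lambda_4)\,(1 - p(R))\,p(\theta|N) = 0$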

Gaussian-Exponential Distributions
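• $p(s|R) \sim N(\mu, \sigma^2)$, $p(s - s_0|N) \sim E(\lambda)$
$p(s|R) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(s-\mu)^2}{2\sigma^2}}$
$p(s - s_0|N) = \lambda\, e^{-\lambda (s - s_0)}$
(From Zhang & Callan 2001)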

Optimal Threshold for Gaussian-Exp. Distributions
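$\theta^* = \begin{cases} \dfrac{b - \sqrt{b^2 - ac}}{a} & \text{if } b^2 - ac \ge 0 \\ +\infty & \text{if } b^2 - ac < 0 \end{cases}$
where $a = \frac{1}{\sigma^2}$, $b = \frac{\mu}{\sigma^2} + \lambda$, and
$c = \frac{\mu^2}{\sigma^2} + 2\lambda s_0 - 2\ln\left[\frac{(\lambda_1 - \lambda_3)\,p(R)}{(\lambda_4 - \lambda_2)\,(1 - p(R))\,\lambda\,\sigma\sqrt{2\pi}}\right]$
(λ without a subscript is the rate of the exponential; $\lambda_1, \ldots, \lambda_4$ are the coefficients of the linear utility.)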
Parameter Estimation in KUN (Arampatzis & van Hameren 01)
• μ, σ² estimated using ML on rel. docs
• λ estimated using top 50 non-rel. docs
• Some recent “improvement”:
– Compute p(s) based on p(w_i), with $s = q \cdot d = \sum_{i=1}^{m} q_i w_i$, where $q = (q_1, \ldots, q_m)$, $d = (w_1, \ldots, w_m)$
– Initial distribution: q as the only rel doc.
– Soft probabilistic threshold, sampling with p(R|s)
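
A minimal sketch of these estimates follows. The Gaussian parameters are the usual maximum-likelihood mean and variance of the relevant scores. For the exponential, the code assumes a shifted exponential fitted to the top-k non-relevant scores, with the shift taken as the smallest of those scores; that shift choice and all function names are assumptions made for illustration, not details taken from the KUN system.

```python
def dot_score(q, d):
    """Score of a doc vector d against a query/profile vector q: sum of q_i * w_i."""
    return sum(qi * wi for qi, wi in zip(q, d))


def fit_gaussian_ml(rel_scores):
    """Maximum-likelihood mean and variance (dividing by n) of relevant scores."""
    n = len(rel_scores)
    mu = sum(rel_scores) / n
    var = sum((s - mu) ** 2 for s in rel_scores) / n
    return mu, var


def fit_exponential_top_k(nonrel_scores, k=50):
    """Fit a shifted exponential to the k highest non-relevant scores.

    Returns (lam, s0) with s0 the smallest selected score and
    lam = 1 / mean(score - s0). Requires at least two distinct scores.
    """
    top = sorted(nonrel_scores, reverse=True)[:k]
    s0 = min(top)
    mean_excess = sum(s - s0 for s in top) / len(top)
    return 1.0 / mean_excess, s0
```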

Maximum Conditional Likelihood (Zhang & Callan 01)
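• Explicit modeling of censored data
• Data: $\{\langle s_i, r_i, \theta_i\rangle\}$, $r_i \in \{R, N\}$, $s_i > \theta_i$
• Maximizing the conditional likelihood $\prod_i p(s_i, r_i \mid s_i > \theta_i)$ over the model parameters
• Conjugate Gradient Descent
• Prior is introduced for smoothing (making it
Bayesian?)
• Minimum “delivery ratio” used to ensure
exploration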

Score Distribution Approaches (cont.)
• Pros
– Principled approach
– Arbitrary utility
– Empirically effective
• Cons
– May be sensitive to the scoring function
– Exploration not addressed

“Modified” Categorization Methods
• Mostly applied to batch filtering or routing, and sometimes combined with Rocchio
– K-Nearest Neighbor (CMU)
– Naïve Bayes (Seoul)
– Neural Network (ICDC, DSO, IRIT)
– Decision Tree (NTT)
• Only K-Nearest Neighbor was applied to AIF (CMU)
– With special thresholding strategies

The State of the Art Performance
• For high-precision utilities, systems can
hardly beat the zero-return baseline! (I.e.,
negative utility)
• Direct/indirect utility optimization methods
generally performed much better than utility-independent tuning of threshold
• Hard to compare different threshold learning methods, due to too many other factors (e.g., scoring, etc.)

• TREC7
– No initial example
– No system beats the zero-return baseline for F1 (pr>=0.4)
– Several systems beat the zero-return baseline for F3 (pr>=0.2)
(from Hull 99)

• TREC7
– Learning effect is clear in some systems
– But, stream is not “long” enough for systems to benefit from learning
(from Hull 99)

• TREC8
– Again, learning effect is clear
– But, systems still couldn’t beat the zero-return baseline!
(from Hull & Robertson 00)

• TREC9
– 2 initial examples
– Amplifying learning effect
– T9U (prob >=0.33)
– Systems clearly beat the zero-return baseline!
(from Robertson & Hull 01)

The Importance of Learning in AIF(Results from Zhai et al. 00)
• Learning and initial inaccuracies: Learning
compensates for initial inaccuracies
• Exploitation vs. exploration: Exploration (lowering
threshold) pays off in the long run
[Figure: score over time, with curves for ideal adaptive, ideal fixed, actual adaptive, and actual fixed]

Learning Effect 1: Correction of Inappropriate Initial Threshold Setting
Average LF1 by delivery ratio:

Delivery ratio         0.002    0.001   0.0005  0.00025  0.000125
fix-init             -114.84   -47.22   -16.86    -9.86     -2.68
update (gamma=0.1)      1.86     0.78     1.18     1.38      0.44
fix-optimal            18.24    18.24    18.24    18.24     18.24

fix-init: bad initial threshold without updating
update: bad initial threshold with updating

Learning Effect 2: Early Exploration Pays Off
[Figure: “Effect of delivery ratio over time”. x-axis: time period (unit = 10,000 docs), 1 to 8; y-axis: Average-F1, -3 to 5; one curve per delivery ratio (0.000125, 0.00025, 0.0005, 0.001, 0.002) plus Optimal]

Learning Effect 3: Regular Exploration Pays Off Later
[Figure: “Gamma value & Learning effect”. x-axis: time period (unit = 5,000 docs), 1 to 20; y-axis: Average LF1, -3 to 1.5; curves for gamma = 0.5, 0.1, 0.01, 0.001]

Tradeoff between Exploration and Exploitation
[Figure: “Threshold updating over time (topic 21)”. x-axis: time on stream (n-th doc), 0 to 70000; y-axis: threshold value, 0 to 0.5; curves: initial, optimal, update(gamma=0.01), update(gamma=0.1), update(gamma=0.5); annotations: over-explore, under-explore]

Summary
• AIF is a very interesting and challenging
online learning problem
• As a learning task, it has extremely sparse
training data
– Initially no training data
– Later, limited and censored training examples
• Practically, learning must also be efficient

Summary(cont.)
• Evaluation of AIF is challenging
• Good performance (utility) is achieved by
– Direct/indirect utility optimization
– Learning the optimal score threshold from feedback
– Appropriate tradeoff between exploration and
exploitation
• Several different threshold methods can all be
effective

Research Directions
• Threshold learning
– Non-parametric score density estimation?
– Controlled comparison of threshold methods
• Integrated AIF model
– Bayesian decision theory + EM?
• Exploration-exploitation tradeoff
– Reinforcement learning?
• User model & evaluation measures
– Users care about more factors than the linear utility
– Users’ interest may drift over time
– Redundancy reduction & novelty detection

References
General papers on TREC filtering evaluation
D. Hull, The TREC-7 Filtering Track: Description and Analysis, TREC-7 Proceedings.
D. Hull and S. Robertson, The TREC-8 Filtering Track Final Report, TREC-8 Proceedings.
S. Robertson and D. Hull, The TREC-9 Filtering Track Final Report, TREC-9 Proceedings.
Papers on specific adaptive filtering methods
Stephen Robertson and Stephen Walker, Threshold Setting in Adaptive Filtering, Journal of Documentation, 56:312-331, 2000.
Chengxiang Zhai, Peter Jansen, and David A. Evans, Exploration of a heuristic approach to threshold learning in adaptive filtering, 2000 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00), 2000. Poster presentation.
Avi Arampatzis and Andre van Hameren, The Score-Distributional Threshold Optimization for Adaptive Binary Classification Tasks, SIGIR'2001.
Yi Zhang and Jamie Callan, Maximum Likelihood Estimation for Filtering Threshold, SIGIR 2001.