A Review of
Information Filtering
Part II: Collaborative Filtering
Chengxiang Zhai
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Outline
• A Conceptual Framework for Collaborative
Filtering (CF)
• Rating-based Methods (Breese et al. 98)
– Memory-based methods
– Model-based methods
• Preference-based Methods (Cohen et al. 99 & Freund et al. 98)
• Summary & Research Directions
What is Collaborative Filtering (CF)?
• Making filtering decisions for an individual user
based on the judgments of other users
• Inferring an individual's interests/preferences from those of other, similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of
u1, …, um
CF: Applications
• Recommender Systems: books, CDs, Videos,
Movies, potentially anything!
• Can be combined with content-based filtering
• Example (commercial) systems
– GroupLens (Resnick et al. 94): usenet news rating
– Amazon: book recommendation
– Firefly (purchased by Microsoft?): music
recommendation
– Alexa: web page recommendation
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share
the same interest
• Examples
– “interest is IR” => “read SIGIR papers”
– “read SIGIR papers” => “interest is IR”
• A sufficiently large number of user preferences are available
CF: Intuitions
• User similarity
– If Jamie liked the paper, I'll like the paper
– ? If Jamie liked the movie, I'll like the movie
– Suppose Jamie and I viewed similar movies in the past six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked Independence Day, and you liked Star Wars
– You may also like Independence Day
Collaborative Filtering vs. Content-based Filtering
• Basic filtering question: Will user U like item
X?
• Two different ways of answering it
– Look at what U likes => characterize X => content-based filtering
– Look at who likes X => characterize U => collaborative filtering
• Can be combined
Rating-based vs. Preference-based
• Rating-based: User’s preferences are encoded
using numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preference-based: User's preferences are represented by a partial ordering of items
– Partial ordering
– Easier to exploit implicit preferences
A Formal Framework for Rating
Users U: u1, u2, …, ui, …, um (rows)
Objects O: o1, o2, …, oj, …, on (columns)
Ratings form a partially observed matrix: entry X_ij = f(u_i, o_j); some entries are known (e.g., 3, 1.5, 2), the rest ("?") must be predicted
The task
Unknown function f: U × O → R
• Assume known f values for some (u,o)’s
• Predict f values for other (u,o)’s
• Essentially function approximation, like
other learning problems
Where are the intuitions?
• Similar users have similar preferences
– If u ≈ u', then for all o's, f(u,o) ≈ f(u',o)
• Similar objects have similar user preferences
– If o ≈ o', then for all u's, f(u,o) ≈ f(u,o')
• In general, f is "locally constant"
– If u ≈ u' and o ≈ o', then f(u,o) ≈ f(u',o')
– "Local smoothness" makes it possible to predict unknown values by interpolation or extrapolation
• What does “local” mean?
Two Groups of Approaches
• Memory-based approaches
– f(u,o) = g(u)(o) ≈ g(u')(o) if u ≈ u'
– Find “neighbors” of u and combine g(u’)(o)’s
• Model-based approaches
– Assume structures/model: object cluster, user cluster, f’
defined on clusters
– f(u,o) = f'(c_u, c_o)
– Estimation & Probabilistic inference
Memory-based Approaches (Breese et al. 98)
• General ideas:
– Xij: rating of object j by user i
– ni: average rating of all objects by user i
– Normalized ratings: Vij = Xij - ni
– Memory-based prediction:

   v̂_aj = K Σ_{i=1}^m w(a,i) · v_ij,  where K = 1 / Σ_{i=1}^m |w(a,i)|
   (predicted rating: x̂_aj = n_a + v̂_aj)

• Specific approaches differ in w(a,i) -- the distance/similarity between user a and i
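A minimal Python sketch of this prediction rule, with an illustrative data layout (a list of per-user rating dicts) and the similarity function w(a,i) passed in as a parameter:

```python
def predict_rating(a, j, ratings, w):
    """Memory-based CF prediction (Breese et al. 98 style).
    ratings: list of dicts, ratings[i][j] = user i's rating of item j
    w: function w(a, i) giving the similarity of users a and i."""
    def mean(i):  # n_i: user i's average rating over observed items
        return sum(ratings[i].values()) / len(ratings[i])

    num = denom = 0.0
    for i in range(len(ratings)):
        if i == a or j not in ratings[i]:
            continue
        wi = w(a, i)
        num += wi * (ratings[i][j] - mean(i))  # weighted normalized rating v_ij
        denom += abs(wi)                       # K = 1 / sum of |w(a,i)|
    if denom == 0:
        return mean(a)            # no informative neighbors: fall back to n_a
    return mean(a) + num / denom  # x_aj = n_a + v_aj
```

With identical-taste neighbors the prediction is just user a's mean shifted by the neighbor's normalized rating.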
User Similarity Measures
• Pearson correlation coefficient (sum over commonly rated items)
• Cosine measure
• Many other possibilities!
w_p(a,i) = Σ_j (x_aj - n_a)(x_ij - n_i) / sqrt( Σ_j (x_aj - n_a)² · Σ_j (x_ij - n_i)² )

w_c(a,i) = Σ_{j=1}^n x_aj · x_ij / ( sqrt(Σ_{j=1}^n x_aj²) · sqrt(Σ_{j=1}^n x_ij²) )
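A small sketch of both similarity measures; `None` marks unrated items (an assumption of this example, not the paper's notation), and the Pearson means are taken over the commonly rated items:

```python
import math

def pearson_w(xa, xi):
    """Pearson correlation over commonly rated items (None = unrated)."""
    common = [j for j in range(len(xa)) if xa[j] is not None and xi[j] is not None]
    na = sum(xa[j] for j in common) / len(common)
    ni = sum(xi[j] for j in common) / len(common)
    num = sum((xa[j] - na) * (xi[j] - ni) for j in common)
    den = math.sqrt(sum((xa[j] - na) ** 2 for j in common)
                    * sum((xi[j] - ni) ** 2 for j in common))
    return num / den if den else 0.0

def cosine_w(xa, xi):
    """Cosine similarity, treating unrated items as 0."""
    a = [0.0 if v is None else v for v in xa]
    b = [0.0 if v is None else v for v in xi]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den if den else 0.0
```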
Improving User Similarity Measures (Breese et al. 98)
• Dealing with missing values: default
ratings
• Inverse User Frequency (IUF): similar to
IDF
• Case Amplification: use w(a,i)^p, e.g., p = 2.5
Model-based Approaches (Breese et al. 98)
• General ideas
– Assume that data/ratings are explained by a probabilistic model with parameter θ
– Estimate/learn the model parameter θ based on the data
– Predict an unknown rating using E[x_{k+1} | x_1, …, x_k], which is computed using the estimated model
• Specific methods differ in the model used and how the model is estimated

   E[x_{k+1} | x_1, …, x_k] = Σ_r r · p(x_{k+1} = r | x_1, …, x_k)
Probabilistic Clustering
• Clustering users based on their ratings
– Assume ratings are observations of a multinomial mixture model with parameters p(C), p(x_i|C)
– Model estimated using standard EM
• Predict ratings using E[x_{k+1} | x_1, …, x_k]

   E[x_{k+1} | x_1, …, x_k] = Σ_r r · p(x_{k+1} = r | x_1, …, x_k)
   p(x_{k+1} = r | x_1, …, x_k) = Σ_c p(C = c | x_1, …, x_k) · p(x_{k+1} = r | C = c)
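Given an already-estimated mixture model, the prediction step can be sketched as follows (the dict-based parameter layout is illustrative; EM estimation is omitted):

```python
def cluster_predict(history, pC, px_given_C, ratings):
    """E[x_{k+1} | x_1..x_k] under a user-cluster mixture model.
    history: dict item -> observed rating
    pC: list of priors p(C = c)
    px_given_C[c][item][r]: p(x_item = r | C = c)
    ratings: the possible rating values r"""
    # posterior p(C = c | x_1..x_k) proportional to p(c) * product of p(x_j | c)
    post = []
    for c, prior in enumerate(pC):
        lik = prior
        for item, r in history.items():
            lik *= px_given_C[c][item][r]
        post.append(lik)
    z = sum(post)
    post = [p / z for p in post]

    def expected(item):
        # E[x_item] = sum_r r * sum_c p(c | history) * p(x_item = r | c)
        return sum(r * sum(post[c] * px_given_C[c][item][r]
                           for c in range(len(pC)))
                   for r in ratings)
    return expected
```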
Bayesian Network
• Use BN to capture object/item dependency
– Each item/object is a node
– (Dependency) structure is learned from all data
– Model parameters: p(x_{k+1} | pa(x_{k+1})), where pa(x_{k+1}) is the set of parents/predictors of x_{k+1} (represented as a decision tree)
• Predict ratings using E[x_{k+1} | x_1, …, x_k]

   E[x_{k+1} | x_1, …, x_k] = Σ_r r · p(x_{k+1} = r | x_1, …, x_k),
   where p(x_{k+1} = r | x_1, …, x_k) is given by the decision tree at node x_{k+1}
Three-way Aspect Model (Popescul et al. 2001)
• CF + content-based
• Generative model
• (u,d,w) as observations
• z as hidden variable
• Standard EM
• Essentially clustering the joint data
• Evaluation on ResearchIndex data
• Found it’s better to treat (u,w) as observations
Evaluation Criteria (Breese et al. 98)
• Rating accuracy
– Average absolute deviation, with P_a = set of items predicted for user a:

   S_a = (1/|P_a|) Σ_{j∈P_a} |x̂_aj - x_aj|

• Ranking accuracy
– Expected utility of a ranked list:

   R_a = Σ_j max(x_aj - d, 0) / 2^{(j-1)/(α-1)}

– Exponentially decaying viewing probability; α ("halflife") = the rank where the viewing probability = 0.5
– d = neutral rating
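Both criteria are direct to compute; a small sketch with illustrative function names (`pred`/`actual` are dicts from item to rating):

```python
def avg_abs_deviation(pred, actual):
    """Average absolute deviation over the predicted items P_a."""
    items = pred.keys()
    return sum(abs(pred[j] - actual[j]) for j in items) / len(items)

def expected_utility(ranked, actual, d=3.0, halflife=5.0):
    """Breese-style expected utility of a ranked item list.
    Viewing probability halves every `halflife` ranks; d is the neutral rating."""
    return sum(max(actual[j] - d, 0.0) / 2 ** ((rank - 1) / (halflife - 1))
               for rank, j in enumerate(ranked, start=1))
```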
Datasets
Results
- BN & CR+ are generally better than VSIM & BC
- BN is best with more training data
- VSIM is better with little training data
- Inverse User Frequency is effective
- Case amplification is mostly effective
Summary of Rating-based Methods
• Effectiveness
– Both memory-based and model-based methods can
be effective
– The correlation method appears to be robust
– Bayesian network works well with plenty of training
data, but not very well with little training data
– The cosine similarity method works well with little
training data
Summary of Rating-based Methods (cont.)
• Efficiency
– Memory based methods are slower than model-
based methods in predicting
– Learning can be extremely slow for model-based
methods
Preference-based Methods (Cohen et al. 99, Freund et al. 98)
• Motivation
– Explicit ratings are not always available, but
implicit orderings/preferences might be available
– Only relative ratings are meaningful, even when ratings are available
– Combining preferences has other applications, e.g.,
• Merging results from different search engines
A Formal Model of Preferences
• Instances: O={o1,…, on}
• Ranking function: R: (U ×) O × O → [0,1]
– R(u,v) = 1 means u is strongly preferred to v
– R(u,v) = 0 means v is strongly preferred to u
– R(u,v) = 0.5 means no preference
• Feedback: F = {(u,v)}, meaning u is preferred to v
• Minimize loss over a hypothesis space H:

   L(R,F) = (1/|F|) Σ_{(u,v)∈F} (1 - R(u,v)),   R* = argmin_{R∈H} L(R,F)
The Hypothesis Space H
• Without constraints on H, the loss is minimized by any R that agrees with F
• Appropriate constraint for collaborative filtering: R_a is a weighted combination of the other users' ranking functions

   R_a(u,v) = Σ_{i∈U\{a}} w_i · R_i(u,v),   Σ_i w_i = 1

• Compare this with the memory-based rating prediction

   v̂_aj = K Σ_{i=1}^m w(a,i) · v_ij
The Hedge Algorithm for Combining Preferences
• Iterative updating of w_1, w_2, …, w_n
• Initialization: w_i^1 is uniform
• Updating, with β ∈ [0,1]:

   w_i^{t+1} = w_i^t · β^{L(R_i, F^t)} / Z_t

• L = 0 => weight stays
• L is large => weight is decreased
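One round of the Hedge update is a one-liner per expert; a minimal sketch (losses assumed to lie in [0,1], Z_t realized as renormalization):

```python
def hedge_update(weights, losses, beta=0.5):
    """One Hedge round: w_i <- w_i * beta**loss_i, then renormalize (Z_t).
    An expert with zero loss keeps its weight; high loss shrinks it."""
    updated = [w * beta ** l for w, l in zip(weights, losses)]
    z = sum(updated)
    return [w / z for w in updated]
```

Repeated application concentrates weight on the experts with the smallest cumulative loss.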
Some Theoretical Results
• The cumulative loss of R_a will not be much worse than that of the best ranking expert/feature
• Preferences R_a => an ordering ρ => a ranking function R
– L(R,F) <= DISAGREE(ρ, R_a)/|F| + L(R_a, F)
• Need to find the ordering ρ that minimizes disagreement
– General case: NP-complete
A Greedy Ordering Algorithm
• Use weighted graph to represent preferences R
• For each node v, compute its potential value, i.e., total outgoing weight minus total incoming weight:

   π(v) = Σ_{u∈O} R(v,u) - Σ_{u∈O} R(u,v)

• Rank the node with the highest potential value above all others
• Remove this node and its edges; repeat
• At least half of the optimal agreement is guaranteed
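The greedy ordering loop can be sketched directly from the potential definition (edge weights stored in a dict keyed by pairs; missing pairs count as 0):

```python
def greedy_order(nodes, R):
    """Greedy ordering: repeatedly place the remaining node with the highest
    potential pi(v) = sum_u R(v,u) - sum_u R(u,v).
    R: dict (u, v) -> preference weight of ranking u above v."""
    remaining = set(nodes)
    order = []
    while remaining:
        def potential(v):
            return (sum(R.get((v, u), 0.0) for u in remaining)
                    - sum(R.get((u, v), 0.0) for u in remaining))
        best = max(remaining, key=potential)
        order.append(best)      # rank it above everything still remaining
        remaining.remove(best)  # then drop its edges and repeat
    return order
```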
Improvement
• Identify all the strongly connected
components
• Rank the components consistently with the
edges between them
• Rank the nodes within a component using
the basic greedy algorithm
Evaluation of Ordering Algorithms
• Measure: “weight coverage”
• Datasets = randomly generated small graphs
• Observations
– The basic greedy algorithm works better than a
random permutation baseline
– Improved version is generally better, but the
improvement is insignificant for large graphs
Metasearch Experiments
• Task: known-item search
– Search for an ML researcher's homepage
– Search for a university homepage
• Search expert = variant of query
• Learn to merge results of all search experts
• Feedback
– Complete : known item preferred to all others
– Click data : known item preferred to all above it
• Leave-one-out testing
Metasearch Results
• Measures: compare combined preferences with individual ranking function
– sign test: to see which system tends to rank the known relevant article higher.
– #queries with the known relevant item ranked above k.
– average rank of the known relevant item
• Learned system is better than any individual expert by all measures (not surprising, why?)
Metasearch Results (cont.)
Direct Learning of an Ordering Function
• Each expert is treated as a ranking feature f_i: O → R ∪ {⊥} (⊥ marks unranked instances, allowing partial ranking)
• Given preference feedback Φ: X × X → R
• Goal: to learn H that minimizes the ranking loss
• D(x_0, x_1): a distribution over X × X (actually a uniform distribution over pairs with feedback order): D(x_0, x_1) = c · max{0, Φ(x_0, x_1)}

   rloss_D(H) = Σ_{(x_0,x_1)} D(x_0, x_1) [[H(x_1) <= H(x_0)]] = Pr_{(x_0,x_1)~D}[H(x_1) <= H(x_0)]
The RankBoost Algorithm
• Iterative updating of D(x_0, x_1)
• Initialization: D_1 = D
• For t = 1, …, T:
– Train weak learner using D_t
– Get weak hypothesis h_t: X → R
– Choose α_t > 0
– Update:

   D_{t+1}(x_0, x_1) = D_t(x_0, x_1) · exp(α_t (h_t(x_0) - h_t(x_1))) / Z_t

• Final hypothesis: H(x) = Σ_{t=1}^T α_t h_t(x)
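A toy sketch of the loop above. The weak learner here is a simple grid search over candidate hypotheses and α values (an illustrative stand-in, not the paper's weak learner); pairs (x_0, x_1) mean x_1 should rank above x_0:

```python
import math

def rankboost(pairs, weak_hyps, T):
    """Minimal RankBoost sketch. pairs: feedback pairs (x0, x1) with x1
    preferred; weak_hyps: candidate functions h: x -> float."""
    D = {p: 1.0 / len(pairs) for p in pairs}   # D_1 = uniform over feedback
    combined = []                               # list of (alpha_t, h_t)
    for _ in range(T):
        # pick (h, alpha) minimizing Z_t = sum D * exp(alpha * (h(x0) - h(x1)))
        best = None
        for h in weak_hyps:
            for alpha in (0.1, 0.5, 1.0):
                z = sum(d * math.exp(alpha * (h(x0) - h(x1)))
                        for (x0, x1), d in D.items())
                if best is None or z < best[0]:
                    best = (z, alpha, h)
        z, alpha, h = best
        combined.append((alpha, h))
        # reweight: mis-ordered pairs gain mass, correctly ordered pairs lose it
        D = {(x0, x1): d * math.exp(alpha * (h(x0) - h(x1))) / z
             for (x0, x1), d in D.items()}
    return lambda x: sum(a * h(x) for a, h in combined)  # H = sum alpha_t h_t
```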
How to Choose α_t and Design h_t?
• Bound on the ranking loss:

   rloss_D(H) <= Π_{t=1}^T Z_t

• Thus, we should choose α_t to minimize the bound
• Three approaches:
– Numerical search
– Special case: h is either 0 or 1
– Approximation of Z, then find an analytic solution
Efficient RankBoost for Bipartite Feedback
• Bipartite feedback: every x_1 ∈ X_1 is preferred to every x_0 ∈ X_0 -- essentially binary classification
• The pair distribution factors as D_t(x_0, x_1) = v_t(x_0) · v_t(x_1), so only per-instance weights need updating:

   v_{t+1}(x_0) = v_t(x_0) · exp(α_t h_t(x_0)) / Z_t^0,   x_0 ∈ X_0
   v_{t+1}(x_1) = v_t(x_1) · exp(-α_t h_t(x_1)) / Z_t^1,  x_1 ∈ X_1
   where Z_t = Z_t^0 · Z_t^1

• Complexity at each round: reduced from O(|X_0|·|X_1|) to O(|X_0| + |X_1|)
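A sketch of one factored round under these assumptions (dict-based per-instance weights; `h` is any weak hypothesis), showing why only |X_0| + |X_1| weights are touched:

```python
import math

def bipartite_round(v0, v1, h, alpha):
    """One round of the factored (bipartite) RankBoost update.
    v0, v1: dicts instance -> weight for X0 and X1; h: instance -> float.
    The implicit pair distribution is D(x0, x1) = v0[x0] * v1[x1]."""
    new0 = {x: w * math.exp(alpha * h(x)) for x, w in v0.items()}
    new1 = {x: w * math.exp(-alpha * h(x)) for x, w in v1.items()}
    z0, z1 = sum(new0.values()), sum(new1.values())  # Z_t = z0 * z1
    return ({x: w / z0 for x, w in new0.items()},
            {x: w / z1 for x, w in new1.items()})
```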
Evaluation of RankBoost
• Meta-search: Same as in (Cohen et al 99)
• Perfect feedback
• 4-fold cross validation
EachMovie Evaluation
(table omitted: # users, # movies/user, # feedback movies)
Performance Comparison: Cohen et al. 99 vs. Freund et al. 99
Summary
• CF is “easy”
– The user’s expectation is low
– Any recommendation is better than none
– Making it practically useful
• CF is “hard”
– Data sparseness
– Scalability
– Domain-dependent
Summary (cont.)
• CF as a Learning Task
– Rating-based formulation
• Learn f: U x O -> R
• Algorithms
– Instance-based/memory-based (k-nearest neighbors)
– Model-based (probabilistic clustering)
– Preference-based formulation
• Learn PREF: U x O x O -> R
• Algorithms
– General preference combination (Hedge), greedy ordering
– Efficient restricted preference combination (RankBoost)
Summary (cont.)
• Evaluation
– Rating-based methods
• Simple methods seem to be reasonably effective
• Advantage of sophisticated methods seems to be limited
– Preference-based methods
• More effective than rating-based methods according to
one evaluation
• Evaluation on meta-search is weak
Research Directions
• Exploiting complete information
– CF + content-based filtering + domain knowledge +
user model …
• More “localized” kernels for instance-based
methods
– Predicting movies need different “neighbor users”
than predicting books
– Suggestion: use items similar to the target item as features to find neighbors
Research Directions (cont.)
• Modeling time
– There might be sequential patterns on the items a
user purchased (e.g., bread machine -> bread
machine mix)
• Probabilistic model of preferences
– Making the preference function a probability function, e.g., P(A>B|U)
– Clustering items and users
– Minimizing preference disagreements
References
• Cohen, W.W., Schapire, R.E., and Singer, Y. (1999). "Learning to Order Things", Journal of AI Research, Volume 10, pages 243-270.
• Freund, Y., Iyer, R., Schapire, R.E., and Singer, Y. (1999). "An Efficient Boosting Algorithm for Combining Preferences", Machine Learning Journal.
• Breese, J.S., Heckerman, D., and Kadie, C. (1998). "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52.
• Popescul, A. and Ungar, L.H. (2001). "Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments", UAI 2001.
• Good, N., Schafer, J.B., Konstan, J., Borchers, A., Sarwar, B., Herlocker, J., and Riedl, J. (1999). "Combining Collaborative Filtering with Personal Agents for Better Recommendations", Proceedings of AAAI-99, pp. 439-446.