CoBaFi: Collaborative Bayesian Filtering

Alex Beutel
Joint work with Kenton Murray, Christos Faloutsos, Alex Smola
April 9, 2014, Seoul, South Korea



Online Recommendation

[Figure: a Users x Movies grid of star ratings, with entries such as 5, 3, and 2.]

Online Rating Models

Normal Collaborative Filtering: Fit a Gaussian - Minimize the error
Reality: Minimizing error isn't good enough - Understanding the shape matters!
Our Model: [Figure: our model's fit next to the Gaussian fit of normal collaborative filtering.]
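To make the point about shape concrete, here is a minimal sketch, not taken from the slides, that scores a Gaussian fit against the observed histogram on a polarized item. The rating counts are hypothetical, and the discretized Gaussian is only an assumed stand-in for what a squared-error model implies.

```python
import math

def discretized_gaussian(mean, std, levels=(1, 2, 3, 4, 5)):
    """Gaussian density evaluated at each star level, renormalized to sum to 1."""
    weights = [math.exp(-0.5 * ((r - mean) / std) ** 2) for r in levels]
    total = sum(weights)
    return [w / total for w in weights]

def avg_log_likelihood(probs, counts):
    """Average log-probability per rating under a distribution over star levels."""
    n = sum(counts)
    return sum(c * math.log(p) for c, p in zip(counts, probs)) / n

# Hypothetical polarized item: mostly 1-star and 5-star ratings.
levels = [1, 2, 3, 4, 5]
counts = [40, 5, 5, 10, 40]
n = sum(counts)

# Moment-matched Gaussian: the mean and variance that minimize squared error.
mean = sum(r * c for r, c in zip(levels, counts)) / n
var = sum(c * (r - mean) ** 2 for r, c in zip(levels, counts)) / n
gaussian_fit = discretized_gaussian(mean, math.sqrt(var))

# The observed histogram itself, as a reference for the true shape.
empirical = [c / n for c in counts]

gauss_ll = avg_log_likelihood(gaussian_fit, counts)
empirical_ll = avg_log_likelihood(empirical, counts)
print("Gaussian fit:", round(gauss_ll, 3), "observed shape:", round(empirical_ll, 3))
```

The Gaussian fit puts its peak at 3 stars, exactly where this item has the fewest ratings, so its average log-likelihood is lower than that of the observed shape.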

Our Goals and Challenges

Given: A matrix of user ratings
Find: A model that best fits and predicts user preferences

Goals:
G1. Fit the recommender distribution
G2. Understand users who rate few items
G3. Detect abnormal spam behavior

Outline
1. Background
2. Model Formulation
3. Inference
4. Catching Spam
5. Experiments

Collaborative Filtering

[Figure: the Users x Movies rating matrix X written as the product of U (Users x Genres) and V (Movies x Genres), with a worked example for a single rating of 5.]
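The factorization of X into U and V can be sketched in a few lines. This is a plain squared-error matrix factorization trained by stochastic gradient descent, offered as an illustration only: the ratings, rank, learning rate, and regularization below are all assumptions, and this is not the Bayesian model the talk goes on to build.

```python
import random

def factorize(ratings, n_users, n_movies, rank=2, epochs=2000,
              lr=0.02, reg=0.02, seed=0):
    """Fit r_ij ~= U[i] . V[j] on observed (user, movie, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_movies)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(U[i][k] * V[j][k] for k in range(rank))
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v_jk - reg * u_ik)
                V[j][k] += lr * (err * u_ik - reg * v_jk)
    return U, V

def predict(U, V, i, j):
    """Predicted rating: inner product of user i's and movie j's factors."""
    return sum(a * b for a, b in zip(U[i], V[j]))

# Hypothetical observed ratings for 4 users and 3 movies.
ratings = [(0, 0, 5), (0, 1, 5), (1, 0, 5), (1, 2, 2),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 1, 2), (3, 2, 5)]
U, V = factorize(ratings, n_users=4, n_movies=3)
rmse = (sum((r - predict(U, V, i, j)) ** 2 for i, j, r in ratings)
        / len(ratings)) ** 0.5
print("training RMSE:", round(rmse, 3))
print("user 0, unseen movie 2:", round(predict(U, V, 0, 2), 2))
```

Each row of U holds one user's weights on the latent "genres", and each row of V holds one movie's weights, so an unobserved rating is predicted from the same inner product.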

Matrix Factorization

[Figure: X (Users x Movies) factored into U and V over Genres.]

Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008)

[Figure: prior distribution on the user factors U.]

Our Model

Use user preferences to predict ratings
Cluster users (& items)
Share preferences within clusters

The Recommender Distribution
First introduced by Tan et al, 2013

[Figure: the recommender distribution, annotated with its linear term, quadratic term, and normalization. Plots fix parameter 1 at 0 and vary parameter 2, with values including -1.0 and 0.4.]

The Recommender Distribution

[Figure: a user vector u_i split into genre preferences, a general leaning, and how polarized the user is.]

Goal 1: Fit the recommender distribution
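The slides name a linear term, a quadratic term, and a normalization, but the exact formula does not survive in this transcript. The sketch below is therefore one assumed form, not the authors' definition: an exponential-family distribution over star levels whose log-weight is linear plus quadratic in a rating rescaled to [-1, 1]. The names `linear_weight` and `quadratic_weight` and the rescaling are hypothetical.

```python
import math

def recommender_pmf(linear_weight, quadratic_weight, levels=(1, 2, 3, 4, 5)):
    """Assumed form: p(r) proportional to exp(linear * t + quadratic * t^2),
    where t rescales the rating r to the interval [-1, 1]."""
    lo, hi = levels[0], levels[-1]
    center, half_range = (hi + lo) / 2, (hi - lo) / 2
    scores = []
    for r in levels:
        t = (r - center) / half_range
        scores.append(math.exp(linear_weight * t + quadratic_weight * t * t))
    normalizer = sum(scores)  # the normalization term
    return [s / normalizer for s in scores]

# Linear weight fixed at 0, quadratic weight varied as on the slide.
peaked = recommender_pmf(0.0, -1.0)     # mass concentrated on the middle rating
polarized = recommender_pmf(0.0, 0.4)   # mass pushed toward 1 and 5 stars
leaning = recommender_pmf(1.0, 0.0)     # general leaning toward high ratings

for name, pmf in [("peaked", peaked), ("polarized", polarized), ("leaning", leaning)]:
    print(name, [round(p, 3) for p in pmf])
```

Under this assumed form, a negative quadratic weight gives a Gaussian-like bump, a positive one gives a U-shaped polarized distribution, and the linear weight shifts mass toward high or low ratings, which matches the "general leaning" and "how polarized" labels.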

Understanding varying preferences

[Figure: rating matrix with sample star ratings.]

Resulting Co-clustering

[Figure: co-clustering of users (U) and movies (V).]

Finding User Preferences

[Figure: user preference vectors in U.]

Goal 2: Understand users who rate few items

Chinese Restaurant Process

[Figure: Chinese Restaurant Process with clusters 1, 2, and 3.]
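As a reminder of how the Chinese Restaurant Process assigns clusters, here is a minimal sketch of the standard process: each new customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The parameter name `alpha` and its value are assumptions for illustration.

```python
import random

def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    rng = random.Random(seed)
    assignments = []
    table_sizes = []
    for _ in range(n_customers):
        # Existing tables are weighted by size; the last weight opens a new table.
        weights = table_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
        assignments.append(choice)
    return assignments, table_sizes

assignments, table_sizes = crp_assignments(n_customers=50, alpha=1.0)
print("number of clusters:", len(table_sizes))
print("cluster sizes:", table_sizes)
```

Because large tables attract more customers, the number of clusters grows slowly with the data and does not need to be fixed in advance.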

Gibbs Sampling - Clusters

Probability of picking a cluster = Probability of a cluster based on size (CRP) x Probability u_i would come from the cluster

Sampling user parameters

Probability of user preferences u_i = Probability of preferences u_i given cluster parameters x Probability of predicting ratings r_i,j using new preferences

Recommender distribution is non-conjugate - Can't sample directly!

Review Spam and Fraud

[Figure: fraudulent reviewers posting runs of 1-star and 5-star ratings. Image from http://sinovera.deviantart.com/art/Cute-Devil-117932337]

Clustering Fraudsters

[Figure: clusters 1 to 3, showing a previous real cluster and a new spam cluster.]

Too much spam gets separated into a fraud cluster.

Trying to hide just means (a) very little spam or (b) camouflage reinforcing realistic reviews.

Clustering Fraudsters

[Figure: clusters 1 to 5, labeled Naïve Spammers, Spam + Noise, and Hijacked Accounts.]

Goal 3: Detect abnormal spam behavior

Does it work?

Better Fit

Catching Naïve Spammers
83% are clustered together
[Figure: Injection]

Clustered Hijacked Accounts
Clustered hijacked accounts
Clustered attacked movies
[Figure: Injection]

Real world clusters

Shape of real world data

Shape of Netflix reviews

Most Gaussian | Most skewed
The Rookie | The O.C. Season 2
The Fan | Samurai X: Trust and Betrayal
Cadet Kelly | Aqua Teen Hunger Force: Vol. 2
Money Train | Sealab 2001: Season 1
Alice Doesn't Live Here | Aqua Teen Hunger Force: Vol. 2
Sea of Love | Gilmore Girls: Season 3
Boiling Point | Felicity: Season 4
True Believer | The O.C. Season 1
Stakeout | The Shield Season 3
The Package | Queer as Folk Season 4

[Figure axis: More Gaussian to More Skewed]

Shape of Amazon Clothing reviews

Amazon Clothing Most Skewed Reviews
Bra Disc Nipple Covers
Vanity Fair Women's String Bikini Panty
Lee Men's Relaxed Fit Tapered Jean
Carhartt Men's Dungaree Jean
Wrangler Men's Cowboy Cut Slim Fit Jean
Nearly all are heavily polarized!

Shape of Amazon Electronics reviews

Amazon Electronics Most Skewed Reviews
Sony CD-R 50 Pack Spindle
Olympus Stylus Epic Zoom Camera
Sony AC Adapter Laptop Charger
Apricorn Hard Drive Upgrade Kit
Corsair 1GB Desktop Memory
Nearly all are heavily polarized!

Shape of BeerAdvocate reviews

BeerAdvocate Most Gaussian Reviews
Weizenbock (Sierra Nevada)
Ovila Abbey Saison (Sierra Nevada)
Stoudts Abbey Double Ale
Stoudts Fat Dog Stout
Juniper Black Ale
Nearly all are Gaussian!

Hypotheses on shape of data

Hard to evaluate beyond binary

Selection bias: Only committed viewers watch Season 4 of a TV series

Hard to compare value across very different items.
Lots of beers and movies to compare
Fewer TV shows
Even fewer jeans or hard drives

Key Points
Modeling: Fit real data with flexible recommender distribution
Prediction: Predict user preferences
Anomaly Detection: When does a user not match the normal model?

Questions?
Alex Beutel
[email protected]
alexbeutel.com

[Figure: users u_5 and u_6 assigned to cluster a.]

Sampling Cluster Parameters

Hyperparameters [symbol], [symbol], W; priors on [symbol], [symbol], W

Gibbs Sampling - Clusters
Probability of a cluster (CRP) x Probability u_i would be sampled from cluster a
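The cluster step multiplies a CRP prior by the probability that u_i came from each cluster. A minimal sketch of that product follows, under loud assumptions: each cluster is modeled as an isotropic Gaussian over u_i, a broad Gaussian stands in for the prior predictive of a new cluster, and `alpha`, `prior_mean`, and `prior_var` are hypothetical names. The talk's actual cluster model is not recoverable from the transcript.

```python
import math
import random

def gaussian_logpdf(x, mean, var):
    """Log density of an isotropic Gaussian (assumed stand-in for the cluster model)."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq_dist / var)

def cluster_posterior(u_i, clusters, alpha, prior_mean, prior_var):
    """clusters: list of (size, mean, var) with user i removed.
    Returns one probability per existing cluster, then one for a new cluster."""
    log_w = [math.log(size) + gaussian_logpdf(u_i, mean, var)
             for size, mean, var in clusters]
    log_w.append(math.log(alpha) + gaussian_logpdf(u_i, prior_mean, prior_var))
    top = max(log_w)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    return [w / total for w in weights]

def sample_cluster(probs, rng):
    """Draw a cluster index according to the posterior probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical state: two equal-sized clusters and one user's preference vector.
clusters = [(10, [1.0, 1.0], 0.25), (10, [-1.0, -1.0], 0.25)]
u_i = [0.9, 1.1]
probs = cluster_posterior(u_i, clusters, alpha=1.0,
                          prior_mean=[0.0, 0.0], prior_var=4.0)
choice = sample_cluster(probs, random.Random(0))
print([round(p, 4) for p in probs], "sampled cluster:", choice)
```

A user whose preferences fit no existing cluster well gets most of its mass on the final "new cluster" entry, which is the mechanism the spam slides rely on.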

Sampling user parameters

Probability of u_i given cluster parameters x Probability of predicting ratings r_i,j

Recommender distribution is non-conjugate - Can't sample directly!

Use a Laplace approximation and perform Metropolis-Hastings Sampling

Sampling user parameters

Use candidate normal distribution:
Mode of p(u_i)
Variance of p(u_i)
Sample

Metropolis-Hastings Sampling: Keep new with probability [equation]
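The acceptance formula is missing from the transcript, so the sketch below uses the standard Metropolis-Hastings ratio for an independence proposal, which may differ in detail from the slide. It works on a scalar u for readability, and `log_target` is a hypothetical non-Gaussian density chosen only to exercise the steps: find the mode, take the variance from the curvature at the mode, propose from that normal distribution, then accept or reject.

```python
import math
import random

def log_target(u):
    """Hypothetical unnormalized, non-Gaussian log density for a scalar preference u."""
    return -0.5 * u * u - math.log1p(math.exp(-3.0 * u))

def laplace_approximation(log_p, start=0.0, h=1e-4, iters=50):
    """Newton steps on finite-difference derivatives to find the mode;
    the variance is the negative inverse curvature at the mode."""
    u = start
    for _ in range(iters):
        grad = (log_p(u + h) - log_p(u - h)) / (2 * h)
        curv = (log_p(u + h) - 2 * log_p(u) + log_p(u - h)) / (h * h)
        u -= grad / curv
    curv = (log_p(u + h) - 2 * log_p(u) + log_p(u - h)) / (h * h)
    return u, -1.0 / curv

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def metropolis_hastings(log_p, mode, var, n_samples, seed=0):
    """Independence sampler: candidates come from N(mode, var) every step."""
    rng = random.Random(seed)
    current = mode
    samples, accepted = [], 0
    for _ in range(n_samples):
        candidate = rng.gauss(mode, math.sqrt(var))
        # Standard MH ratio: target ratio times reversed proposal ratio.
        log_ratio = (log_p(candidate) - log_p(current)
                     + normal_logpdf(current, mode, var)
                     - normal_logpdf(candidate, mode, var))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate
            accepted += 1
        samples.append(current)
    return samples, accepted / n_samples

mode, var = laplace_approximation(log_target)
samples, acceptance_rate = metropolis_hastings(log_target, mode, var, n_samples=5000)
sample_mean = sum(samples) / len(samples)
print("mode:", round(mode, 3), "variance:", round(var, 3))
print("acceptance rate:", round(acceptance_rate, 3))
```

When the Laplace approximation is close to the true density, most candidates are kept, which is the kind of behavior an acceptance rate measures.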

Sampling Cluster Parameters
Priors
Users/Items in the cluster

Inferring Hyperparameters
Solved directly - no sampling needed!
Prior hidden as additional cluster

Does Metropolis-Hastings work?
Have to use non-standard sampling procedure:
99.12% acceptance rate for Amazon Electronics
77.77% acceptance rate for Netflix 24k

Does it work?
Compare on Predictive Probability (PP) to see how well our model fits the data

                      Uniform   BPMF     CoBaFi (us)
Netflix (24k users)   1.6904    1.2525   1.1827
BeerAdvocate          2.1972    1.9855   1.6741

Handling Spammers

Random naïve spammers in Amazon Electronics dataset
          PP Before   PP After
BPMF      1.7047      1.8146
CoBaFi    1.0549      1.7042

Random hijacked accounts in Netflix 24k dataset
          PP Before   PP After
BPMF      1.2375      1.3057
CoBaFi    0.9670      1.2935