CoBaFi: Collaborative Bayesian Filtering
Alex Beutel
Joint work with Kenton Murray, Christos Faloutsos, Alex Smola
April 9, 2014, Seoul, South Korea

Online Recommendation
[Figure: example ratings linking Users and Movies]

Online Rating Models
Normal Collaborative Filtering: Fit a Gaussian, minimize the error
Reality: Minimizing error isn't good enough. Understanding the shape matters!
Our Model
[Figure: rating distributions for Normal Collaborative Filtering, Reality, and Our Model]

Our Goals and Challenges
Given: A matrix of user ratings
Find: A model that best fits and predicts user preferences
Goals:
G1. Fit the recommender distribution
G2. Understand users who rate few items
G3. Detect abnormal spam behavior

Outline
1. Background
2. Model Formulation
3. Inference
4. Catching Spam
5. Experiments

Collaborative Filtering
[Figure: ratings matrix X (Users × Movies) approximated by factors U and V over Genres, with a worked numeric example whose values are not recoverable]
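
Matrix Factorization
[Figure: X (Users × Movies) factored into U and V, with Genres as the shared dimension]

Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008)
[Figure: prior placed on U; distribution symbols lost in extraction]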
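
As a rough illustration of the factorization pictured above (X approximated by U times V transposed), the sketch below fits low-rank factors to the observed entries of a toy ratings matrix by gradient descent on squared error. The function name `factorize`, the toy matrix, the rank, and the step sizes are all assumptions made for illustration. This is plain matrix factorization, not BPMF and not CoBaFi.

```python
import numpy as np

def factorize(X, mask, rank=2, lr=0.01, reg=0.05, epochs=5000, seed=0):
    """Fit X ~ U @ V.T on observed entries (mask == 1) by gradient descent."""
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    U = 0.1 * rng.standard_normal((n_users, rank))   # user factors
    V = 0.1 * rng.standard_normal((n_items, rank))   # item factors
    for _ in range(epochs):
        err = mask * (X - U @ V.T)      # residual on observed ratings only
        U_step = err @ V - reg * U      # descent direction for the user factors
        V_step = err.T @ U - reg * V    # descent direction for the item factors
        U += lr * U_step
        V += lr * V_step
    return U, V

# Toy data (hypothetical): 0 marks a missing rating.
X = np.array([[5, 5, 0, 1],
              [5, 0, 2, 1],
              [1, 2, 5, 0],
              [0, 1, 5, 5]], dtype=float)
mask = (X > 0).astype(float)

U, V = factorize(X, mask)
pred = U @ V.T                          # includes predictions for missing cells
rmse = np.sqrt(np.sum(mask * (X - pred) ** 2) / mask.sum())
```

The fit only minimizes error on observed cells, which is the "fit a Gaussian, minimize the error" baseline the talk contrasts its model with.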

Our Model
Use user preferences to predict ratings
Cluster users (& items)
Share preferences within clusters

The Recommender Distribution
First introduced by Tan et al., 2013
[Equation: the recommender distribution, with its Linear, Quadratic, and Normalization terms labeled; symbols lost in extraction]
[Figure: distribution shapes with parameter 1 fixed at 0 while parameter 2 varies (values -1.0 and 0.4 shown); parameter symbols lost in extraction]
[Figure: user vector u_i, with entries labeled Genre Preferences, General Leaning, and How Polarized]
Goal 1: Fit the recommender distribution
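
The exact formula for the recommender distribution did not survive in this transcript, so the sketch below is only an assumed form: a discrete exponential family over ratings 1 to 5 with one linear and one quadratic term plus a normalization constant, matching the labels on the slide. The function name `rating_distribution` and the centring of the rating scale are illustrative choices, not the authors' definition.

```python
import numpy as np

def rating_distribution(theta1, theta2, ratings=(1, 2, 3, 4, 5)):
    """Assumed form: p(r) proportional to exp(theta1 * x + theta2 * x**2),
    where x is the rating centred on the middle of the scale."""
    r = np.asarray(ratings, dtype=float)
    x = r - r.mean()                    # centre so theta1 = 0 gives a symmetric shape
    logits = theta1 * x + theta2 * x ** 2
    logits -= logits.max()              # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()                  # normalization term

# Linear parameter fixed at 0, quadratic parameter varied as on the slide.
peaked = rating_distribution(0.0, -1.0)     # mass concentrates on the middle rating
polarized = rating_distribution(0.0, 0.4)   # mass moves to ratings 1 and 5
# Hypothetical values: a positive linear term shifts mass toward high ratings.
leaning = rating_distribution(0.8, -0.3)
```

Under this assumed form, the sign of the quadratic parameter plays the role of "How Polarized" and the linear parameter plays the role of "General Leaning".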

Understanding varying preferences
[Figure: example user ratings]

Resulting Co-clustering
[Figure: co-clustered U and V]

Finding User Preferences
[Figure: user preference matrix U]
Goal 2: Understand users who rate few items

Chinese Restaurant Process
[Figure: clusters 1, 2, 3]
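
A minimal sketch of drawing cluster assignments from a Chinese Restaurant Process, to make the "rich get richer" behaviour concrete: each new user joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter. The function name `crp_assignments` and the parameter `alpha` are names chosen here, not taken from the talk.

```python
import numpy as np

def crp_assignments(n_customers, alpha, seed=0):
    """Sample a partition of n_customers from a CRP with concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments, sizes = [], []
    for _ in range(n_customers):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = np.array(sizes + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(sizes):
            sizes.append(1)             # open a new cluster
        else:
            sizes[k] += 1               # join an existing cluster
        assignments.append(k)
    return assignments, sizes

assignments, sizes = crp_assignments(100, alpha=1.0)
```

Because the number of clusters is not fixed in advance, a user with few ratings can still borrow strength from whichever cluster it lands in.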

Gibbs Sampling - Clusters
Probability of picking a cluster = Probability of a cluster based on size (CRP) × Probability u_i would come from the cluster

Sampling user parameters
Probability of user preferences u_i = Probability of preferences u_i given cluster parameters × Probability of predicting ratings r_{i,j} using new preferences
Recommender distribution is non-conjugate. Can't sample directly!

Review Spam and Fraud
[Figure: spam accounts giving runs of 1-star and 5-star ratings]
Image from http://sinovera.deviantart.com/art/Cute-Devil-117932337

Clustering Fraudsters
[Figure: clusters 1, 2, 3, with a New Spam Cluster next to a Previous Real Cluster]
Too much spam gets separated into a fraud cluster.
Trying to hide just means (a) very little spam or (b) camouflage reinforcing realistic reviews.
[Figure: clusters 1 to 5, labeled Naïve Spammers, Spam + Noise, and Hijacked Accounts]
Goal 3: Detect abnormal spam behavior

Does it work?
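Better Fit

Catching Naïve Spammers
83% are clustered together
[Figure: clustering result with the injected spammers marked "Injection"]

Clustered Hijacked Accounts
Clustered hijacked accounts
Clustered attacked movies
[Figure: clustering result with the injected accounts marked "Injection"]

Real world clusters
[Figure]

Shape of real world data
[Figure]

Shape of Netflix reviews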
Most Gaussian             Most skewed
The Rookie                The O.C. Season 2
The Fan                   Samurai X: Trust and Betrayal
Cadet Kelly               Aqua Teen Hunger Force: Vol. 2
Money Train               Sealab 2001: Season 1
Alice Doesn't Live Here   Aqua Teen Hunger Force: Vol. 2
Sea of Love               Gilmore Girls: Season 3
Boiling Point             Felicity: Season 4
True Believer             The O.C. Season 1
Stakeout                  The Shield Season 3
The Package               Queer as Folk Season 4
[Figure: axis running from More Gaussian to More Skewed]

Shape of Amazon Clothing reviews
Amazon Clothing Most Skewed Reviews:
Bra Disc Nipple Covers
Vanity Fair Women's String Bikini Panty
Lee Men's Relaxed Fit Tapered Jean
Carhartt Men's Dungaree Jean
Wrangler Men's Cowboy Cut Slim Fit Jean
Nearly all are heavily polarized!

Shape of Amazon Electronics reviews
Amazon Electronics Most Skewed Reviews:
Sony CD-R 50 Pack Spindle
Olympus Stylus Epic Zoom Camera
Sony AC Adapter Laptop Charger
Apricorn Hard Drive Upgrade Kit
Corsair 1GB Desktop Memory
Nearly all are heavily polarized!

Shape of BeerAdvocate reviews
BeerAdvocate Most Gaussian Reviews:
Weizenbock (Sierra Nevada)
Ovila Abbey Saison (Sierra Nevada)
Stoudts Abbey Double Ale
Stoudts Fat Dog Stout
Juniper Black Ale
Nearly all are Gaussian!

Hypotheses on shape of data
Hard to evaluate beyond binary
Selection bias: Only committed viewers watch Season 4 of a TV series
Hard to compare value across very different items:
Lots of beers and movies to compare
Fewer TV shows
Even fewer jeans or hard drives

Key Points
Modeling: Fit real data with flexible recommender distribution
Prediction: Predict user preferences
Anomaly Detection: When does a user not match the normal model?

Questions?
Alex [email protected]
alexbeutel.com

Sampling Cluster Parameters
[Figure: graphical model showing user vectors (u5, u6), cluster a, hyperparameters, and priors; Greek symbols other than W lost in extraction]

Gibbs Sampling - Clusters
Probability of a cluster (CRP) × Probability u_i would be sampled from cluster a
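
The product on this slide can be sketched as a single Gibbs step for one user's cluster assignment. The code below assumes, purely for illustration, that each cluster is an isotropic Gaussian with a shared variance; the talk's actual cluster distribution and its symbols are not recoverable from the transcript. The names `cluster_posterior`, `sample_cluster`, `alpha`, and `new_cluster_loglik` (the log-likelihood of u_i under a fresh cluster, supplied by the caller) are hypothetical.

```python
import numpy as np

def cluster_posterior(u_i, sizes, means, variance, alpha, new_cluster_loglik):
    """Probabilities over existing clusters plus one new cluster for user vector u_i.

    Assumption (not from the slides): cluster a is N(means[a], variance * I).
    """
    u_i = np.asarray(u_i, dtype=float)
    means = np.asarray(means, dtype=float)
    d = u_i.shape[0]
    # Log-likelihood of u_i under each existing cluster.
    sq_dist = np.sum((means - u_i) ** 2, axis=1)
    loglik = -0.5 * sq_dist / variance - 0.5 * d * np.log(2 * np.pi * variance)
    # CRP prior: existing cluster weight is its size, new cluster weight is alpha.
    log_w = np.concatenate([np.log(sizes) + loglik,
                            [np.log(alpha) + new_cluster_loglik]])
    log_w -= log_w.max()                # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def sample_cluster(rng, probs):
    """Draw a cluster index; the last index means 'open a new cluster'."""
    return int(rng.choice(len(probs), p=probs))

# Hypothetical example: two clusters, user vector close to the first one.
probs = cluster_posterior(u_i=[0.1, -0.2], sizes=[10, 3],
                          means=[[0.0, 0.0], [5.0, 5.0]],
                          variance=1.0, alpha=1.0, new_cluster_loglik=-10.0)
choice = sample_cluster(np.random.default_rng(0), probs)
```

Working in log space keeps the product of a size term and a small likelihood from underflowing when clusters are far from u_i.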

Sampling user parameters
Probability of u_i given cluster parameters × Probability of predicting ratings r_{i,j}
Recommender distribution is non-conjugate. Can't sample directly!

Sampling user parameters
Use a Laplace approximation and perform Metropolis-Hastings Sampling
Use candidate normal distribution: [equation lost in extraction; annotated with "Mode of p(u_i)", "Variance of p(u_i)", and "Sample"]
Metropolis-Hastings Sampling: Keep the new sample with probability [expression lost in extraction]
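
The acceptance formula is missing from the transcript, so the sketch below uses the standard Metropolis-Hastings rule for an independence proposal, which is one reasonable reading of the slide rather than the authors' confirmed procedure. It is one-dimensional for brevity and assumes the log-density is strictly concave so that Newton's method finds the mode. The names `laplace_fit` and `mh_independence` and the toy target are illustrative.

```python
import numpy as np

def laplace_fit(log_p, x0, steps=50, h=1e-4):
    """Locate the mode of log_p by Newton's method with finite differences.
    Returns (mode, variance) of the Gaussian (Laplace) approximation."""
    x = float(x0)
    for _ in range(steps):
        grad = (log_p(x + h) - log_p(x - h)) / (2 * h)
        hess = (log_p(x + h) - 2 * log_p(x) + log_p(x - h)) / h ** 2
        x -= grad / hess
    hess = (log_p(x + h) - 2 * log_p(x) + log_p(x - h)) / h ** 2
    return x, -1.0 / hess               # variance is the negative inverse curvature

def log_q(x, mode, var):
    """Log-density of the normal candidate, up to an additive constant."""
    return -0.5 * (x - mode) ** 2 / var

def mh_independence(log_p, x0, n, rng):
    """Metropolis-Hastings with candidates drawn from the Laplace approximation."""
    mode, var = laplace_fit(log_p, x0)
    x, samples, accepted = mode, [], 0
    for _ in range(n):
        cand = rng.normal(mode, np.sqrt(var))
        # Standard ratio p(cand) q(x) / (p(x) q(cand)), computed in log space.
        log_ratio = (log_p(cand) - log_p(x)) + (log_q(x, mode, var) - log_q(cand, mode, var))
        if np.log(rng.uniform()) < log_ratio:
            x, accepted = cand, accepted + 1
        samples.append(x)
    return np.array(samples), accepted / n

# Toy non-Gaussian log-density (hypothetical), strictly concave.
target = lambda u: -0.5 * (u - 1.0) ** 2 - 0.1 * u ** 4
samples, rate = mh_independence(target, 0.0, 2000, np.random.default_rng(0))
```

The closer the target is to its Laplace approximation, the closer the acceptance rate gets to 1, which is why the acceptance rate is a useful diagnostic for this kind of sampler.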

Sampling Cluster Parameters
[Equation lost in extraction; annotated with "Priors" and "Users/Items in the cluster"]

Inferring Hyperparameters
Solved directly, no sampling needed!
Prior hidden as additional cluster

Does Metropolis-Hastings work?
Have to use non-standard sampling procedure:
99.12% acceptance rate for Amazon Electronics
77.77% acceptance rate for Netflix 24k

Does it work?
                      Uniform   BPMF     CoBaFi (us)
Netflix (24k users)   1.6904    1.2525   1.1827
BeerAdvocate          2.1972    1.9855   1.6741
Compare on Predictive Probability (PP) to see how well our model fits the data

Handling Spammers
         PP Before   PP After
BPMF     1.7047      1.8146
CoBaFi   1.0549      1.7042

         PP Before   PP After
BPMF     1.2375      1.3057
CoBaFi   0.9670      1.2935
Random naïve spammers in Amazon Electronics dataset
Random hijacked accounts in Netflix 24k dataset

Clustered Naïve Spammers
83% are clustered together

Clustered Hijacked Accounts
Clustered hijacked accounts
Clustered attacked movies