© Copyright IBM Corporation 2007
Story of IBM Research’s success at KDD/Netflix Cup 2007
Saharon Rosset, TAU Statistics (formerly IBM)
IBM Research’s teams:
Task 1: Yan Liu, Zhenzhen Kou (CMU intern)
Task 2: Saharon Rosset, Claudia Perlich, Yan Liu
Slide 2
IBM Research – Mathematical Sciences Department
Slide 3
October 2006: Announcement of the NETFLIX Competition
USA Today headline: “Netflix offers $1 million prize for better movie recommendations”
Details:
• Beat NETFLIX’s current recommender model, ‘Cinematch’, by 10% (based on absolute rating error) prior to 2011
• $50,000 for the annual progress prize (relative to baseline)
• Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies
• Performance is evaluated on holdout movie-user pairs
• The NETFLIX competition has attracted 24,396 contestants on 19,799 teams from 155 different countries
• 14,891 valid submissions from 2,282 different teams
• Current best result is 7.8% better than baseline (up from 6.7% as of March)
Slide 4
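Data Overview: NETFLIX and Internet Movie Data Base
[Diagram: the full NETFLIX ratings matrix covers all movies (80K) and all users (6.8M). The NETFLIX competition data is a subset of it: 17K movies (selection unclear) and 480K users (at least 20 ratings by end of 2005), with 100M ratings. A qualifier dataset of 3M entries has its ratings shown as “?”. The Internet Movie Data Base supplies movie fields: Title, Year, Actors, Awards, Revenue, …]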
Slide 5
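NETFLIX data generation process
[Diagram: timeline from 1998 to 2006 for the 17K movies. Movie arrival and user arrival take place during 1998-2005, which forms the training data, alongside the 3M qualifier dataset with ratings shown as “?”. The KDD CUP period (2006) has NO user or movie arrival; Task 1 and Task 2 are drawn from it.]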
Slide 6
KDD-CUP 2007 based on the NETFLIX competition
Knowledge Discovery and Data Mining (KDD)-CUP
• Annual competition of the premier conference in data mining
• Training: NETFLIX competition data from 1998-2005
• Test: 2006 ratings, randomly split by movie into two tasks
Task 1: Who rated what in 2006
• Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
• Result: we are the second runner-up, No. 3 out of 39 teams
• Many of the competing teams have been working on the Netflix data for over six months, giving them a decided advantage in Task 1
Task 2: Number of ratings per movie in 2006
• Given a list of 8,863 movies, predict the number of additional reviews that all existing users will give in 2006
• Result: we are the winner, No. 1 out of 34 teams
Slide 7
Generation of Test sets from 2006 for Task 1 and Task 2
[Diagram: the 2006 ratings are shown as a users × movies matrix whose movies are split between Task 1 and Task 2, with rating totals along the margins (raw counts and log(n+1)), labeled “Marginal 2006 distribution of rating”.
• Task 1 test set (100K): sample (movie, user) pairs according to the product of marginals, then remove pairs that were rated prior to 2006. Example rows list Movie, User and a 0/1 rating indicator (e.g. M1, U31, 1; M832, U83, 0).
• Task 2 test set (8.8K movies).]
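The test-set generation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' code: the function name `sample_task1_pairs`, its arguments, and the choice to keep drawing until `n_pairs` pairs survive the rejection step are all assumptions made for the example.

```python
import random
from collections import Counter

def sample_task1_pairs(ratings_2006, rated_before_2006, n_pairs, seed=0):
    """Hypothetical helper: draw (movie, user) pairs from the product of the
    2006 marginals and drop pairs that were already rated before 2006."""
    rng = random.Random(seed)
    # Marginal counts: how often each movie / user appears in the 2006 ratings.
    movie_counts = Counter(m for m, _ in ratings_2006)
    user_counts = Counter(u for _, u in ratings_2006)
    movies, movie_weights = zip(*movie_counts.items())
    users, user_weights = zip(*user_counts.items())

    pairs = []
    while len(pairs) < n_pairs:
        # Movie and user are drawn independently, each proportional to its count.
        m = rng.choices(movies, weights=movie_weights)[0]
        u = rng.choices(users, weights=user_weights)[0]
        if (m, u) in rated_before_2006:
            continue  # rejection step: pair was rated prior to 2006
        pairs.append((m, u))

    # Label each surviving pair: 1 if the user rated the movie in 2006, else 0.
    rated = set(ratings_2006)
    return [(m, u, int((m, u) in rated)) for m, u in pairs]
```

The loop assumes at least one pair is not in `rated_before_2006`; otherwise it would never terminate. The rejection step is what later makes the retained marginals differ from the true 2006 marginals.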
Slide 8
Insights from the battlefields: What makes a model successful?
Previous successful ‘engagements’ of our team:
• Competitions: KDD-CUP 1999, 2000, 2003, ILP-Challenge 2005
• Applications: MAP, OnTarget, …
Components of successful modeling:
1. Data and domain understanding
• Generation of data and task
• Cleaning and representation/transformation
2. Statistical insights
• Statistical properties
• Test validity of assumptions
• Performance measure
3. Modeling and learning approach
• Most “publishable” part
• Choice or development of most suitable algorithm
[Side label spanning the three components: “Importance?”]
Slide 9
Task 1: Did User A review Movie B in 2006?
Task formulation
• A classification task: will “existing” users review “existing” movies?
Challenges
• Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical
• Complex affecting factors: decreasing interest in old movies, and a growing tendency of Netflix users to watch (review) more movies
Key solutions
• Effective sampling strategies to keep as much information as possible
• Careful feature extraction from multiple sources
Slide 10
Task 1: Effective Sampling Strategies
• Sample movie-user pairs for “existing” users and “existing” movies, using 2004 and 2005 as the training set and 4Q 2005 as the development set
• The probability of picking a movie is proportional to the number of ratings that movie received; users are sampled the same way
[Diagram: a movie table with sampling probabilities (e.g. Movie5 .0011, Movie3 .001, Movie4 .0007) and a user table (e.g. User7 .0007, User6 .00012, User8 .00003) are combined into sampled pairs (Movie5-User7, Movie3-User7, Movie4-User8). The rating history supplies records such as 1488844,3,2005-09-06; 822109,5,2005-05-13; 885013,4,2005-10-19; 30878,4,2005-12-26; 823519,3,2004-05-03; …]
Slide 11
Task 1: Effective Sampling Strategies
[Figure: the ratio of positive examples]
Slide 12
Task 1: Multiple Information Sources
• Graph-based features based on the NETFLIX training set: construct a graph with users and movies as nodes, and create an edge if the user reviews the movie
• Content-based features: plot, director, actor, genre, movie connections, box office, and scores of the movie crawled from Netflix and IMDB
[Diagram: rating records (e.g. 1488844,3,2005-09-06) are turned into a bipartite graph linking user nodes to movie nodes.]
Slide 13
Task 1: Feature Extraction
Movie-based features
• Graph topology: # of ratings per movie (across different years), adjacency scores between movies calculated using SVD on the graph matrix
• Movie content: similarity of two movies calculated using Latent Semantic Indexing based on a bag of words from (1) plots of the movie and (2) other information, such as director, actors, and genre
User profile
• Graph topology: # of ratings per user (across different years)
• User preferences based on the movies being rated: key word match count, and average/min/max of similarity scores between the movie being predicted and movies already rated by the user
[Diagram: a user node linked to its rated movies, each of which is compared with the movie to predict.]
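As an illustration of the graph-topology and user-preference features, the sketch below derives movie-to-movie scores from a truncated SVD of the user-by-movie graph matrix and then aggregates them per (user, movie) pair. The slides do not define the exact score, so cosine similarity between rank-k movie factors is an assumption here, and all function names are hypothetical.

```python
import numpy as np

def movie_embeddings(graph_matrix, k):
    """Rank-k movie factors from an SVD of the 0/1 user-by-movie matrix."""
    # graph_matrix has shape (n_users, n_movies); entry 1 means "user rated movie".
    _, s, vt = np.linalg.svd(graph_matrix, full_matrices=False)
    return vt[:k].T * s[:k]  # shape (n_movies, k)

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between rows of emb (assumed score)."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return unit @ unit.T

def user_movie_features(sim, rated_movies, target_movie):
    """Average/min/max similarity between the target movie and the
    movies the user has already rated (needs at least one rated movie)."""
    scores = sim[target_movie, rated_movies]
    return {"avg": float(scores.mean()),
            "min": float(scores.min()),
            "max": float(scores.max())}
```

A dense SVD like this is only workable on a small sample; on data of the size described in the deck, a sparse truncated solver would be needed instead.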
Slide 14
Task 1: Learning strategy
Learning algorithms:
• Single classifiers: logistic regression, ridge regression, decision tree, support vector machines
• Naïve ensemble: combining sub-classifiers built on different types of features with pre-set weights
• Ensemble classifiers: combining sub-classifiers with weights learned from the development set
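The difference between the naïve ensemble and the learned ensemble can be sketched as follows. The slides do not say how the weights were learned, so ordinary least squares on the development-set scores is used here purely as a stand-in; `learn_ensemble_weights` and `ensemble_predict` are hypothetical names.

```python
import numpy as np

def learn_ensemble_weights(dev_scores, dev_labels):
    """Fit combination weights on the development set.
    dev_scores: (n_examples, n_classifiers) sub-classifier outputs.
    dev_labels: (n_examples,) 0/1 targets.
    Least squares is an assumption; the original method is not stated."""
    weights, *_ = np.linalg.lstsq(dev_scores, dev_labels, rcond=None)
    return weights

def ensemble_predict(scores, weights):
    """Weighted combination of sub-classifier scores, clipped to [0, 1]."""
    return np.clip(scores @ weights, 0.0, 1.0)

# Naive ensemble: the same combination with pre-set (here equal) weights.
preset_weights = np.array([0.5, 0.5])
```

With two sub-classifiers, `ensemble_predict(scores, preset_weights)` gives the naïve ensemble, while `ensemble_predict(scores, learn_ensemble_weights(dev_scores, dev_labels))` gives the learned one.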
Slide 15
Task 2 description: How many reviews did a Movie receive in 2006?
Task formulation
• Regression task to predict the total count of reviewers from “existing” users for 8,863 “existing” movies
Challenges
• Movie dynamics and life-cycle: interest in movies changes over time
• User dynamics and life-cycle: no new users are added to the database
Key solutions
• Use counts from the test set of Task 1 to learn a model for 2006, adjusting for pair removal
• Build a set of quarterly lagged models to determine the overall scalar
• Use Poisson regression
Slide 16
Some data observations
1. Task 1 test set is a potential response for training a model for Task 2
• It was sampled according to the movie marginal (= # reviews for movie in 06 / # reviews in 06), which is proportional to the Task 2 response (= # reviews for movie in 06)
• BIG advantage: we get a view of 2006 behavior for half the movies
• Build the model on this half, apply it to the other half (Task 2 test set)
• Caveats:
  - Proportional sampling implies there is a scaling parameter left, which we don’t know
  - Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set. Correcting for this is an interesting research challenge: inverse rejection sampling
2. No new movies and reviewers in 2006
• Need to emphasize modeling the life-cycle of movies (and reviewers)
  - How are older movies reviewed relative to newer movies?
  - Does this depend on other features (like the movie’s genre)?
• This is especially critical when we consider the scaling caveat above
Slide 17
Some statistical perspectives
1. Poisson distribution is very appropriate for counts
• Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
• Implies the appropriate modeling approach for true counts is Poisson regression:
  $n_i \sim \mathrm{Pois}(\lambda_i t)$, $\log(\lambda_i) = \sum_j \beta_j x_{ij}$
  $\beta^* = \arg\max_\beta \, l(n; X, \beta)$ (maximum likelihood solution)
• What happens when we sub-sample for the Task 1 test set?
  - Sum is fixed ⇒ multinomial
  - Large N, small p ⇒ each sub-sampled count is well approximated by Poisson
  - Can be shown that Poisson regression (= assuming independence) is appropriate
• What does this imply for the model evaluation approach?
  - Variance stabilizing transformation for Poisson is square root: $\sqrt{n_i}$ has roughly constant variance
  - RMSE of log(prediction + 1) against log(# ratings + 1) emphasizes performance on unpopular movies (small Poisson parameter ⇒ larger log-scale variance)
  - We still assumed that if we do well in a likelihood formulation, we will do well with any evaluation approach
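The Poisson regression model above can be fitted by maximum likelihood with iteratively reweighted least squares (IRLS), the standard method for generalized linear models. The sketch below is one such fit in NumPy; the deck does not say which software or fitting routine the team used, and the names `fit_poisson_regression`, `X`, `n` and `exposure` (playing the role of t) are assumptions for the example.

```python
import numpy as np

def fit_poisson_regression(X, n, exposure=1.0, n_iter=50, tol=1e-10):
    """Maximum-likelihood fit of n_i ~ Pois(exposure * exp(x_i . beta)) via IRLS.
    X: (n_obs, n_features) design matrix (include a column of ones for an intercept).
    n: (n_obs,) observed counts."""
    n = np.asarray(n, dtype=float)
    mu = n + 0.5                       # common GLM starting values for the means
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Working response on the log scale, with the exposure offset removed.
        z = np.log(mu) + (n - mu) / mu - np.log(exposure)
        XtW = X.T * mu                 # X^T W with weights W = diag(mu)
        new_beta = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        mu = exposure * np.exp(X @ beta)
        if converged:
            break
    return beta
```

For an intercept-only model the fitted coefficient is the log of the mean count, which gives a quick sanity check of the routine.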
Slide 18
Some statistical perspectives (ctd.)
2. Can we invert the rejection sampling mechanism?
• This can be viewed as a missing data problem
  - Can we design a practical EM algorithm with our huge data size? Interesting research problem…
• We implemented an ad-hoc inversion algorithm
  - Iterate until convergence between:
    assuming movie marginals are correct and adjusting reviewer marginals
    assuming reviewer marginals are correct and adjusting movie marginals
  - We verified that it indeed improved our data, since it increased correlation with 4Q 2005 counts
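The slide gives only the outline of the ad-hoc inversion, so the following is one possible reading rather than the team's actual algorithm. It assumes pairs were drawn from a product of movie marginals p and reviewer marginals q and rejected if already rated, so that the retained share of movie m is proportional to p_m times (1 minus the total q of reviewers who had already rated m), and symmetrically for reviewers. The function name and update rule are hypothetical.

```python
from collections import Counter, defaultdict

def invert_rejection(sampled_pairs, history, n_iter=100, tol=1e-12):
    """Alternately re-estimate reviewer and movie marginals from the retained
    (movie, user) pairs, given the set of pairs rated before the test period."""
    movie_obs = Counter(m for m, _ in sampled_pairs)
    user_obs = Counter(u for _, u in sampled_pairs)
    total = len(sampled_pairs)
    p = {m: c / total for m, c in movie_obs.items()}   # movie marginals
    q = {u: c / total for u, c in user_obs.items()}    # reviewer marginals

    # Index the history by movie and by user (only entities seen in the sample).
    users_of, movies_of = defaultdict(set), defaultdict(set)
    for m, u in history:
        if m in p and u in q:
            users_of[m].add(u)
            movies_of[u].add(m)

    def adjust(obs, other, neighbours):
        # Divide each observed count by its estimated acceptance probability,
        # then renormalize so the marginals sum to one.
        raw = {k: c / max(1e-12, 1.0 - sum(other[j] for j in neighbours[k]))
               for k, c in obs.items()}
        norm = sum(raw.values())
        return {k: v / norm for k, v in raw.items()}

    for _ in range(n_iter):
        new_q = adjust(user_obs, p, movies_of)      # movie marginals held fixed
        new_p = adjust(movie_obs, new_q, users_of)  # reviewer marginals held fixed
        change = max(abs(new_p[m] - p[m]) for m in p)
        p, q = new_p, new_q
        if change < tol:
            break
    return p, q
```

Movies or reviewers that never appear in the retained sample get no estimate in this sketch. With an empty history the acceptance probabilities are all one, so the output reduces to the observed shares.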
Slide 19
Modeling Approach Schema
[Flow diagram. Boxes, in approximate flow order:
• Task 1 test set (100K)
• Inverse rejection sampling
• Count ratings by movie
• IMDB and movie features: construct movie features
• Estimate Poisson regression M1 and predict on Task 1 movies
• NETFLIX challenge data: construct lagged features for Q1-Q4 2005
• Estimate 4 Poisson regressions G1…G4 and predict for 2006
• Validate against 2006 Task 1 counts
• Find optimal scalar
• Estimate 2006 total ratings for the Task 2 test set
• Use M1 to predict Task 2 movies
• Scale predictions to total]
Slide 20
Some observations on modeling approach
1. Lagged datasets are meant to simulate forward prediction to 2006
• Select a quarter (e.g., Q1 2005) and remove all movies & reviewers that “started” later
• Build a model on this data with, e.g., Q3 2005 as response
• Apply the model to our full dataset, which is naturally cropped at Q4 2005; this gives a prediction for Q2 2006
• With several models like this, predict all of 2006
• Two potential uses:
  - Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  - Consider only the sum of their predictions, to use for scaling the Task 1 model
2. We evaluated models on the Task 1 test set
• Used holdout when also building them on this set
• How can we evaluate the models built on lagged datasets?
  - Missing a scaling parameter between the 2006 prediction and the sampled set
  - Solution: select optimal scaling based on Task 1 test set performance
• Since the other model was still better, we knew we should use it!
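Selecting the optimal scaling can be sketched as a one-dimensional search: multiply the lagged-model predictions by each candidate factor and keep the factor with the lowest log-scale MSE against the Task 1 counts. A grid search is an assumption here (the deck does not state how the optimum was found), and both function names are hypothetical.

```python
import math

def log_scale_mse(predictions, counts):
    """Mean squared error between log(prediction + 1) and log(count + 1)."""
    return sum((math.log(p + 1) - math.log(c + 1)) ** 2
               for p, c in zip(predictions, counts)) / len(counts)

def best_scaling(unscaled_predictions, observed_counts, candidates):
    """Return the candidate factor whose scaled predictions score best."""
    scored = [(log_scale_mse([s * p for p in unscaled_predictions],
                             observed_counts), s)
              for s in candidates]
    return min(scored)[1]  # lowest MSE wins; ties go to the smaller factor
```

The same search works with any other error measure by swapping out `log_scale_mse`.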
Slide 21
Some details on our models and submission
All models at movie level. Features we used:
• Historical reviews in previous months/quarters/years (on log scale)
• Movie’s age since premiere, movie’s age in Netflix (since first review)
  - Also consider log, square, etc., to have flexibility in the form of functional dependence
• Movie’s genre
  - Include interactions between genre and age: “life cycle” seems to differ by genre!
Models we considered (MSE on log scale on Task 1 holdout):
• Poisson regression on Task 1 test set (0.24)
• Log-scale linear regression model on Task 1 test set (0.25)
• Sum of lagged models built on 2005 quarters + best scaling (0.31)
Scaling based on lagged models
• Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
• Implied scaling parameter for predictions: about 90
• Total of our submitted predictions for the Task 2 test set was 9.3M
Slide 22
Competition evaluation
First we were informed that we won with an RMSE of ~770
• They mistakenly evaluated on non-log scale
• Strong emphasis on most popular movies
• We won by a large margin: our model did well on popular movies!
Then they re-evaluated on log scale, and we still won
• On log scale the least popular movies are emphasized
  - Recall that the variance stabilizing transformation is in between (square root)
• So our predictions did well on unpopular movies too!
Interesting question: would we win on square root scale (or similarly, Poisson likelihood-based evaluation)? Sure hope so!
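The three evaluation scales discussed here differ only in the transformation applied before computing RMSE. The short sketch below makes that explicit; the `SCALES` table and the `rmse` helper are illustrative names, not the organizers' evaluation script.

```python
import math

# Transformation applied to both predictions and true counts before scoring.
SCALES = {
    "raw": lambda x: x,                   # emphasizes the most popular movies
    "sqrt": math.sqrt,                    # variance-stabilizing for Poisson counts
    "log": lambda x: math.log(x + 1),     # emphasizes the least popular movies
}

def rmse(predictions, counts, scale):
    """Root mean squared error on the chosen scale."""
    f = SCALES[scale]
    errors = [(f(p) - f(c)) ** 2 for p, c in zip(predictions, counts)]
    return math.sqrt(sum(errors) / len(errors))
```

For example, predicting 11,000 for a movie with 10,000 ratings contributes a squared error of one million on the raw scale but less than 0.01 on the log scale, while predicting 0 for a movie with 5 ratings contributes 25 on the raw scale and more than 3 on the log scale.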
Slide 23
Competition evaluation (ctd.)
Results of competition (log-scale evaluation):
[Table: competition results]
Components of our model’s MSE:
• The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
• Additional error from incorrect scaling factor
Scaling numbers:
• True total reviews: 8.7M
• Sum of our predictions: 9.3M
Interesting question: what would be the best scaling?
• For log-scale evaluation? Conjecture: need to under-estimate true total
• For square-root evaluation? Conjecture: need to estimate about right
Slide 24
Effect of scaling on the two evaluation approaches
Scaling  Total reviews (M)  Log-scale MSE  Square-root scale MSE  Comment
0.7      6.55               0.222          40.28
0.8      7.48               0.208          29.80                  Best log performance
0.9      8.42               0.225          26.38                  Best sqrt performance
0.93     8.70               0.234          26.55                  Correct scaling
1        9.35               0.263          28.86                  Our solution
1.1      10.29              0.316          36.37
Slide 25
Effect of scaling on the two evaluation approaches
[Figure: log-scale MSE (axis from 0.20 to 0.30) and square-root-scale MSE (axis from 25 to 40) plotted against the sum of predictions in millions (7 to 10). The true sum and the submitted sum are marked.]
Slide 26
Acknowledgements
Rick Lawrence, Naoki Abe, Prem Melville, Hisashi Kashima (TRL), Shohei Hido (TRL), Chandan Reddy, Grzegorz Swirszcz, and many more.