
Page 1: Matrix Factorization

Matrix Factorization

Bamshad Mobasher, DePaul University

Page 2: Matrix Factorization

The $1 Million Question


Page 3: Matrix Factorization

Ratings Data

[Figure: ratings matrix, 480,000 users × 17,700 movies, with example entries holding rating values 1-5]

Page 4: Matrix Factorization


Training Data
- 100 million ratings (matrix is 99% sparse)
- Rating = [user, movie-id, time-stamp, rating value]
- Generated by users between Oct 1998 and Dec 2005
- Users randomly chosen among the set with at least 20 ratings
  - Small perturbations to help with anonymity

Page 5: Matrix Factorization

Ratings Data

[Figure: the same 480,000 users × 17,700 movies ratings matrix, with held-out entries marked "?": the Test Data Set (most recent ratings)]

Page 6: Matrix Factorization


Scoring
- Minimize root mean square error (RMSE)
  - Does not necessarily correlate well with user satisfaction
  - But it is a widely used, well-understood quantitative measure
- RMSE baseline scores on test data:
  - 1.054 - just predict the mean user rating for each movie
  - 0.953 - Netflix's own system (Cinematch) as of 2006
  - 0.941 - nearest-neighbor method using correlation
  - 0.857 - required 10% reduction to win $1 million

$\text{MSE} = \frac{1}{|R|} \sum_{(u,i) \in R} \left( r_{ui} - \hat{r}_{ui} \right)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}$
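The scoring metric above takes only a few lines to compute. This is a hedged sketch, not the official Netflix scorer: the `predict` callable interface and the toy ratings are made up for illustration.

```python
import math

def rmse(ratings, predict):
    """RMSE over (user, item, rating) triples; `predict(u, i)` is any
    rating predictor supplied by the caller (illustrative interface)."""
    errors = [(r - predict(u, i)) ** 2 for (u, i, r) in ratings]
    return math.sqrt(sum(errors) / len(errors))

# Toy check: predict the overall mean rating for every (user, item) pair.
ratings = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5)]
mean = sum(r for (_, _, r) in ratings) / len(ratings)
print(rmse(ratings, lambda u, i: mean))
```

Swapping in a better `predict` (e.g. a learned factor model) is how the baselines in the list above are compared.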

Page 7: Matrix Factorization


Matrix Factorization of Ratings Data

- Based on the idea of Latent Factor Analysis
  - Identify latent (unobserved) factors that "explain" observations in the data
  - In this case, observations are user ratings of movies
  - The factors may represent combinations of features or characteristics of movies and users that result in the ratings

[Figure: the m-users × n-movies ratings matrix R factored into an m × f user-factor matrix P and an f × n item-factor matrix Q]

$r_{ui} \approx q_i^T p_u$
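The low-rank model $r_{ui} \approx q_i^T p_u$ can be illustrated with NumPy. The sizes and random factor matrices below are toy values, not the Netflix dimensions:

```python
import numpy as np

# Toy illustration of the low-rank model: an m-users x n-movies rating
# matrix generated from f latent factors (dimensions are made up).
m, n, f = 4, 5, 2
rng = np.random.default_rng(0)
P = rng.standard_normal((m, f))  # rows are user factor vectors p_u
Q = rng.standard_normal((n, f))  # rows are item factor vectors q_i

R = P @ Q.T  # R[u, i] is the predicted rating for user u on movie i
u, i = 1, 3
print(np.allclose(R[u, i], Q[i] @ P[u]))  # a prediction is just an inner product
```

With f much smaller than m and n, the factors compress the huge sparse ratings matrix into two dense, manageable matrices.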

Page 8: Matrix Factorization


Matrix Factorization of Ratings Data

Figure from Koren, Bell, Volinsky, IEEE Computer, 2009

Page 9: Matrix Factorization


Matrix Factorization of Ratings Data

Credit: Alex Lin, Intelligent Mining

Page 10: Matrix Factorization

Predictions as Filling Missing Data

Credit: Alex Lin, Intelligent Mining

Page 11: Matrix Factorization

Learning Factor Matrices
- Need to learn the feature vectors from training data
  - User feature vector: (a, b, c)
  - Item feature vector: (x, y, z)
- Approach: minimize the errors on known ratings

Credit: Alex Lin, Intelligent Mining

Page 12: Matrix Factorization

Learning Factor Matrices

$\min_{q,p} \sum_{(u,i) \in R} \left( r_{ui} - q_i^T p_u \right)^2$

Add regularization:

$\min_{q,p} \sum_{(u,i) \in R} \left( r_{ui} - q_i^T p_u \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right)$


Page 13: Matrix Factorization

Stochastic Gradient Descent (SGD)

$\min_{q,p} \sum_{(u,i) \in R} \left( r_{ui} - q_i^T p_u \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right)$

(first term: goodness of fit; second term: regularization)

Online ("stochastic") gradient update equations:

$e_{ui} = r_{ui} - q_i^T p_u$
$q_i \leftarrow q_i + \gamma \left( e_{ui} p_u - \lambda q_i \right)$
$p_u \leftarrow p_u + \gamma \left( e_{ui} q_i - \lambda p_u \right)$
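The stochastic gradient update equations above can be turned into a minimal training loop. This is a sketch: the hyperparameters (gamma, lambda, epoch count) and toy data are illustrative choices, not the competition settings.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, f=2, gamma=0.05, lam=0.02, epochs=500):
    """Minimal SGD for the regularized factorization objective.
    ratings: (u, i, r_ui) triples for known ratings only."""
    rng = np.random.default_rng(42)
    P = 0.1 * rng.standard_normal((n_users, f))  # user factors p_u (rows)
    Q = 0.1 * rng.standard_normal((n_items, f))  # item factors q_i (rows)
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - Q[i] @ P[u]              # e_ui = r_ui - q_i^T p_u
            qi, pu = Q[i].copy(), P[u].copy()  # use old values in both updates
            Q[i] += gamma * (e * pu - lam * qi)
            P[u] += gamma * (e * qi - lam * pu)
    return P, Q

# Toy data: 3 users x 3 items, 5 observed ratings.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2)]
P, Q = sgd_mf(ratings, n_users=3, n_items=3)
print([round(float(Q[i] @ P[u]), 2) for u, i, _ in ratings])
```

After training, the reconstructed values move toward the observed ratings while regularization keeps the factor norms small.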


Page 14: Matrix Factorization

Components of a Rating Predictor: user-movie interaction + movie bias + user bias

User-movie interaction
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor
- Separates users and movies
- Often overlooked
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

Credit: Yehuda Koren, Google, Inc.

Page 15: Matrix Factorization

Modeling Systematic Biases

$r_{ui} \approx \mu + b_u + b_i + q_i^T p_u$

where $\mu$ is the overall mean rating, $b_u$ is the bias (mean rating deviation) of user u, and $b_i$ is the bias of movie i.

Example:
- Mean rating $\mu = 3.7$
- You are a critical reviewer: your ratings are 1 lower than the mean, so $b_u = -1$
- Star Wars gets a mean rating 0.5 higher than the average movie: $b_i = +0.5$
- Predicted rating for you on Star Wars = 3.7 - 1 + 0.5 = 3.2
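As a sketch, the bias terms can be estimated from data as simple mean deviations. (The objective on the next slide actually learns them jointly with the factors; the user and movie names below are hypothetical toy data.)

```python
# Estimating mu, b_u, b_i as mean deviations, then forming the
# baseline prediction mu + b_u + b_i (no interaction term yet).
ratings = {("alice", "star_wars"): 5, ("alice", "titanic"): 3,
           ("bob", "star_wars"): 4, ("bob", "titanic"): 1}

mu = sum(ratings.values()) / len(ratings)  # overall mean rating

def user_bias(u):
    rs = [r for (uu, _), r in ratings.items() if uu == u]
    return sum(rs) / len(rs) - mu

def item_bias(i):
    rs = [r for (_, ii), r in ratings.items() if ii == i]
    return sum(rs) / len(rs) - mu

pred = mu + user_bias("bob") + item_bias("star_wars")
print(pred)
```

Bob rates below average and Star Wars rates above average, so the two biases partly cancel, exactly as in the slide's 3.7 - 1 + 0.5 example.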

Credit: Padhraic Smyth, University of California, Irvine

Page 16: Matrix Factorization

Objective Function

$\min_{q,p,b} \left\{ \sum_{(u,i) \in R} \left( r_{ui} - (\mu + b_u + b_i + q_i^T p_u) \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2 \right) \right\}$

(first term: goodness of fit; second term: regularization)

$\lambda$ is typically selected via grid search on a validation set.

Credit: Padhraic Smyth, University of California, Irvine

Page 17: Matrix Factorization

[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009; 5% and 8% improvement levels marked on the plot]

Page 18: Matrix Factorization


Page 19: Matrix Factorization

Explanation for increase?


Page 20: Matrix Factorization

Adding Time Effects

$r_{ui} \approx \mu + b_u + b_i + \text{user-movie interactions}$
$r_{ui} \approx \mu + b_u(t) + b_i(t) + \text{user-movie interactions}$

Add time dependence to the biases. The time dependence is parametrized by linear trends, binning, and other methods.

For details see Y. Koren, "Collaborative Filtering with Temporal Dynamics," ACM SIGKDD Conference, 2009.
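One way to read the "binning" parametrization: a time-dependent bias such as $b_i(t)$ becomes a lookup of one learned offset per time bin. The bin edges and per-bin offsets below are invented purely for illustration:

```python
import bisect

# Hypothetical binned time-dependent item bias b_i(t): the timeline is
# split into 4 bins and each bin gets its own learned offset.
bin_edges = [100, 200, 300]           # timestamps separating the bins
b_i_bins = [-0.2, 0.0, 0.3, 0.5]      # one (made-up) offset per bin

def b_i(t):
    """Look up the bias offset for the bin containing timestamp t."""
    return b_i_bins[bisect.bisect_right(bin_edges, t)]

print(b_i(50), b_i(250))
```

In a real model these offsets would be trained with the rest of the parameters; linear trends replace the step function with a slope per user or item.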

Credit: Padhraic Smyth, University of California, Irvine

Page 21: Matrix Factorization

Adding Time Effects

$r_{ui} \approx \mu + b_u(t) + b_i(t) + q_i^T p_u(t)$

Add time dependence to the user "factor weights": this models the fact that a user's interests over "genres" (the q's) may change over time.


Page 22: Matrix Factorization

[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009; 5% and 8% improvement levels marked on the plot]

Page 23: Matrix Factorization

The Kitchen Sink Approach...
- Many options for modeling
  - Variants of the ideas we have seen so far
    - Different numbers of factors
    - Different ways to model time
    - Different ways to handle implicit information
    - ...
  - Other models (not described here)
    - Nearest-neighbor models
    - Restricted Boltzmann machines
- Model averaging was useful...
  - Linear model combining
  - Neural network combining
  - Gradient boosted decision tree combining
  - Note: combining weights learned on validation set ("stacking")

Credit: Padhraic Smyth, University of California, Irvine

Page 24: Matrix Factorization


Page 25: Matrix Factorization

Other Aspects of Model Building
- Automated parameter tuning
  - Using a validation set and grid search, various parameters such as learning rates, regularization parameters, etc., can be optimized
- Memory requirements
  - Can fit within roughly 1 GB of RAM
- Training time
  - On the order of days, but achievable on commodity hardware rather than a supercomputer
  - Some parallelization used
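The automated-tuning bullet can be sketched as a plain grid search. `train` and `validation_rmse` below are hypothetical stand-ins for a real trainer and evaluator (e.g. an SGD factorization fit plus an RMSE check on held-out ratings):

```python
import itertools

def grid_search(train, validation_rmse, gammas, lams):
    """Try every (gamma, lambda) pair; keep the one with the lowest
    validation score. Interfaces are illustrative, not a real API."""
    best = None
    for gamma, lam in itertools.product(gammas, lams):
        model = train(gamma=gamma, lam=lam)
        score = validation_rmse(model)
        if best is None or score < best[0]:
            best = (score, gamma, lam)
    return best

# Synthetic stand-ins so the sketch runs: pretend the validation error
# is minimized at gamma=0.01, lam=0.1 (purely made-up numbers).
best = grid_search(
    train=lambda gamma, lam: (gamma, lam),
    validation_rmse=lambda m: abs(m[0] - 0.01) + abs(m[1] - 0.1),
    gammas=[0.005, 0.01, 0.02], lams=[0.05, 0.1, 0.2])
print(best)
```

Because each (gamma, lambda) pair means a full training run of days, in practice the grid is kept coarse and runs are parallelized.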

Credit: Padhraic Smyth, University of California, Irvine

Page 26: Matrix Factorization

Progress Prize 2008

Sept 2nd: Only 3 teams qualify for 1% improvement over the previous year.

Oct 2nd: Leading team has 9.4% overall improvement.

Progress prize ($50,000) awarded to the BellKor team of 3 AT&T researchers (same as before) plus 2 Austrian graduate students, Andreas Toscher and Martin Jahrer.

Key winning strategy: clever "blending" of predictions from models used by both teams.

Speculation that 10% would be attained by mid-2009.


Page 27: Matrix Factorization

The Leading Team for the Final Prize

- BellKorPragmaticChaos
  - BellKor: Yehuda Koren (now Yahoo!), Bob Bell, Chris Volinsky (AT&T)
  - BigChaos: Michael Jahrer, Andreas Toscher (2 grad students from Austria)
  - Pragmatic Theory: Martin Chabert, Martin Piotte (2 engineers from Montreal, Quebec)


Page 28: Matrix Factorization


Page 29: Matrix Factorization

June 26th 2009: after 1000 days & nights…


Page 30: Matrix Factorization

Million Dollars Awarded Sept 21st 2009
