TRANSCRIPT
RECSYS 1
Recommender Systems, from IERG 4030 by Prof. Wing C. Lau
RECSYS 2
Acknowledgements: The slides used in this chapter are adapted from:
CS246: Mining Massive Data-sets, by Jure Leskovec, Stanford University
Yehuda Koren, Robert Bell & Chris Volinsky, "Lessons from the Netflix Prize", AT&T Research
Some slides and plots borrowed from Yehuda Koren, Robert Bell and Padhraic Smyth,
with the authors' permission. All copyrights belong to the original authors of the material.
RECSYS 3
Content-based Systems & Collaborative Filtering
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 4
High Dimensional Data
High dim data: locality sensitive hashing; clustering; dimensionality reduction
Graph data: PageRank, SimRank; community detection; spam detection
Infinite data: filtering data streams; web advertising; queries on streams
Machine learning: SVM; decision trees; perceptron, kNN
Apps: recommender systems; association rules; duplicate document detection
RECSYS 5
Example Recommender Systems
Customer X: buys a Metallica CD, buys a Megadeth CD.
Customer Y: does a search on Metallica. The recommender system
suggests Megadeth, from data collected about customer X.
RECSYS 6
Recommendations
[Figure: from a large set of Items, users reach products via Search vs. Recommendations]
Examples: products, web sites, blogs, news items, …
RECSYS 8
The Long Tail
Source: Chris Anderson (2004)
RECSYS 9
Physical vs Online
Read http://www.wired.com/wired/archive/12.10/tail.html to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u: X × S → R, where R = set of ratings; R is a totally ordered set (e.g., 0–5 stars, or a real number in [0,1])
RECSYS 12
Utility Matrix
[Figure: sparse utility matrix with rows = users Alice, Bob, Carol, David; columns = movies Avatar, LOTR, Matrix, Pirates; a few known entries in [0,1], most entries unknown]
RECSYS 13
Key Problems
(1) Gathering "known" ratings for the matrix: How to collect the data in the utility matrix?
(2) Extrapolating unknown ratings from the known ones: Mainly interested in high unknown ratings.
• We are not interested in knowing what you don't like, but what you like.
(3) Evaluating extrapolation methods: How to measure the success/performance of recommendation methods?
RECSYS 14
(1) Gathering Ratings
Explicit: Ask people to rate items. Doesn't work well in practice (people can't be bothered).
Implicit: Learn ratings from user actions.
• E.g., a purchase implies a high rating. What about low ratings?
RECSYS 15
(2) Extrapolating Utilities
Key problem: the matrix U is sparse; most people have not rated most items.
Cold start:
• New items have no ratings
• New users have no history
Three approaches to recommender systems:
1) Content-based
2) Collaborative Filtering
• Memory-based: user-based collaborative filtering; item-based collaborative filtering
3) Latent factor based
RECSYS 16
Content-based Recommender Systems
RECSYS 17
Content-based Recommendations
Main idea: recommend to customer x items similar to previous items rated highly by x.
Example
Movie recommendations: recommend movies with the same actor(s), director, genre, …
Websites, blogs, news: recommend other sites with "similar" content.
RECSYS 18
Plan of Action
[Figure: from items the user likes (red circles, triangles), build Item profiles; from those, build a User profile; match the user profile against item profiles to recommend new items]
RECSYS 19
Item Profiles
For each item, create an item profile.
A profile is a set (vector) of features:
• Movies: author, title, actor, director, …
• Text: set of "important" words in the document
How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency):
• Term … Feature
• Document … Item
RECSYS 20
Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj   (note: we normalize TF to discount for "longer" documents)
n_i = number of docs that mention term i; N = total number of docs
IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with the highest TF-IDF scores, together with their scores
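To make the scoring concrete, here is a minimal Python sketch of the TF-IDF profile computation just described. It assumes the normalizations above (TF_ij = f_ij / max_k f_kj, IDF_i = log2(N / n_i)); the function and variable names are illustrative, not from the slides.

```python
import math
from collections import Counter

def tf_idf_profiles(docs):
    """Build a TF-IDF profile per document.

    TF_ij = f_ij / max_k f_kj   (normalized to discount longer docs)
    IDF_i = log2(N / n_i)       (N docs total, n_i docs mentioning term i)
    w_ij  = TF_ij * IDF_i
    """
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    n = Counter()                          # n_i: docs mentioning term i
    for toks in tokenized:
        n.update(set(toks))
    profiles = []
    for toks in tokenized:
        f = Counter(toks)                  # f_ij: term counts in this doc
        max_f = max(f.values())
        profiles.append({t: (c / max_f) * math.log2(N / n[t])
                         for t, c in f.items()})
    return profiles

docs = ["the matrix is a sci fi movie",
        "the lion king is an animated movie",
        "sci fi movies feature space battles"]
for p in tf_idf_profiles(docs):
    print(sorted(p.items(), key=lambda kv: -kv[1])[:3])   # top-3 terms
```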
RECSYS 21
User Profiles and Prediction
User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from the average rating for the item, …
Prediction heuristic: given user profile x and item profile i, estimate
u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
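A small sketch of this prediction heuristic, assuming made-up 3-feature item profiles; the user profile is built as a rating-weighted average of the profiles of items the user rated, as suggested above.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical item profiles (3 features each) and the user's past ratings.
item_profiles = {"item_a": np.array([1.0, 0.0, 1.0]),
                 "item_b": np.array([0.0, 1.0, 1.0])}
ratings = {"item_a": 5, "item_b": 2}

# User profile x: weighted average of rated item profiles.
x = sum(r * item_profiles[i] for i, r in ratings.items()) / sum(ratings.values())

i = np.array([1.0, 0.2, 0.8])      # profile of a candidate item
print(cos(x, i))                   # u(x, i) = cos(x, i)
```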
RECSYS 22
Pros: Content-based Approach
+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can explain recommendations by listing the content features that caused an item to be recommended
RECSYS 23
Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)
– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering: recommend items based on past transactions of users; analyze relations between users and/or items.
Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects.
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
1. Consider user x.
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN).
3. Recommend items to x based on the weighted ratings of items by users in N.
RECSYS 28
Similar Users
Let r_x be the vector of user x's ratings.
Jaccard similarity measure: treat r_x, r_y as sets of rated items. Problem: ignores the value of the rating.
Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||). Problem: treats missing ratings as "negative".
Pearson correlation coefficient: let S_xy = items rated by both users x and y; then
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) ),
where r̄_x, r̄_y are the average ratings of x and y.
Example: r_x = [1, _, _, 1, 3], r_y = [1, _, 2, 2, _]
• r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
• r_x, r_y as points: r_x = (1, 0, 0, 1, 3), r_y = (1, 0, 2, 2, 0)
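The three measures, sketched in Python on the example vectors above (NaN marks a missing rating; the names are mine). Note the Pearson variant here centers by each user's mean over all of their ratings, which is one common convention:

```python
import numpy as np

def jaccard(rx, ry):
    """Similarity of the sets of rated items (ignores rating values)."""
    sx = {i for i, r in enumerate(rx) if not np.isnan(r)}
    sy = {i for i, r in enumerate(ry) if not np.isnan(r)}
    return len(sx & sy) / len(sx | sy)

def cosine_sim(rx, ry):
    """Missing ratings treated as 0, i.e. implicitly 'negative'."""
    x, y = np.nan_to_num(rx), np.nan_to_num(ry)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(rx, ry):
    """Centered cosine over the co-rated items S_xy."""
    both = ~np.isnan(rx) & ~np.isnan(ry)
    dx = rx[both] - np.nanmean(rx)
    dy = ry[both] - np.nanmean(ry)
    return dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))

rx = np.array([1, np.nan, np.nan, 1, 3], dtype=float)
ry = np.array([1, np.nan, 2, 2, np.nan], dtype=float)
print(jaccard(rx, ry), cosine_sim(rx, ry), pearson(rx, ry))
```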
RECSYS 29
Similarity Metric
Intuitively we want: sim(A, B) > sim(A, C).
Jaccard similarity: 1/5 < 2/4 (fails)
Cosine similarity: 0.386 > 0.322 (works, but considers missing ratings as "negative")
Solution: subtract the (row) mean before computing the cosine. Then sim(A,B) vs. sim(A,C): 0.092 > −0.559.
Notice: centered cosine similarity is exactly correlation (data centered at 0).
RECSYS 30
Rating Predictions
Let r_x be the vector of user x's ratings.
Let N be the set of k users most similar to x who have rated item i.
Prediction for item i of user x:
• Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi
• Similarity-weighted average: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy   (shorthand: s_xy = sim(x, y))
Other options? Many other tricks possible…
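A sketch of the similarity-weighted prediction (any of the measures above would do as `sim`; here a cosine with missing entries treated as 0, to avoid degenerate overlaps). R is a users × items array with NaN for unknowns; all names are mine.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine over rating vectors, missing entries treated as 0."""
    a, b = np.nan_to_num(a), np.nan_to_num(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(x, i, R, sim=cosine_sim, k=2):
    """r_xi = sum_{y in N} s_xy * r_yi / sum_{y in N} s_xy, with N the
    k users most similar to x (under `sim`) who have rated item i."""
    rated = [y for y in range(R.shape[0]) if y != x and not np.isnan(R[y, i])]
    N = sorted(rated, key=lambda y: sim(R[x], R[y]), reverse=True)[:k]
    w = np.array([sim(R[x], R[y]) for y in N])
    return w @ R[N, i] / w.sum()

R = np.array([[1, np.nan, 3, np.nan],
              [2, 4, np.nan, 1],
              [np.nan, 5, 4, 2]], dtype=float)
print(predict(0, 1, R))    # estimate user 0's rating of item 1
```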
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering
So far: user-user collaborative filtering. Another view: item-item.
• For item i, find other similar items.
• Estimate the rating for item i based on the target user's ratings on items similar to item i.
• Can use the same similarity metrics and prediction functions as in the user-user model:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
where s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i; x) … set of items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
Utility matrix (rows = movies 1–6, columns = users 1–12; "." = unknown rating, numbers = ratings between 1 and 5):
movie\user:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:     1  .  3  .  .  5  .  .  5  .  4  .
movie 2:     .  .  5  4  .  .  4  .  .  2  1  3
movie 3:     2  4  .  1  2  .  3  .  4  3  5  .
movie 4:     .  2  4  .  5  .  .  4  .  .  2  .
movie 5:     .  .  4  3  4  2  .  .  .  .  2  5
movie 6:     1  .  3  .  3  .  .  2  .  .  4  .
RECSYS 33
Item-Item CF (|N|=2)
(Same utility matrix as above, now treating the rating of movie 1 by user 5 as the unknown to estimate.)
RECSYS 34
Item-Item CF (|N|=2)
(Utility matrix as above.) Neighbor selection: identify movies similar to movie 1 that were rated by user 5.
Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0].
2) Compute cosine similarities between the centered rows:
sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59
RECSYS 35
Item-Item CF (|N|=2)
(Utility matrix as above.) Compute the similarity weights for the neighbors of movie 1 rated by user 5: s_13 = 0.41, s_16 = 0.59.
RECSYS 36
Item-Item CF (|N|=2)
(Utility matrix as above, with the estimate filled in.) Predict by taking the weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
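The whole worked example fits in a few lines of numpy. This sketch reproduces the slide's numbers (s_13 ≈ 0.41, s_16 ≈ 0.59, r_15 = 2.6); the helper names are mine.

```python
import numpy as np

nan = np.nan
# Utility matrix from the slides: rows = movies 1..6, cols = users 1..12.
R = np.array([
    [1,   nan, 3,   nan, nan, 5,   nan, nan, 5,   nan, 4,   nan],
    [nan, nan, 5,   4,   nan, nan, 4,   nan, nan, 2,   1,   3  ],
    [2,   4,   nan, 1,   2,   nan, 3,   nan, 4,   3,   5,   nan],
    [nan, 2,   4,   nan, 5,   nan, nan, 4,   nan, nan, 2,   nan],
    [nan, nan, 4,   3,   4,   2,   nan, nan, nan, nan, 2,   5  ],
    [1,   nan, 3,   nan, 3,   nan, nan, 2,   nan, nan, 4,   nan],
])

def item_pearson(a, b):
    """Center each movie's ratings by its own mean, treat missing as 0,
    then take the cosine -- exactly the procedure on the slide."""
    ca = np.where(np.isnan(a), 0, a - np.nanmean(a))
    cb = np.where(np.isnan(b), 0, b - np.nanmean(b))
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

user, movie = 4, 0            # user 5, movie 1 (0-indexed)
sims = [(item_pearson(R[movie], R[j]), j) for j in range(6)
        if j != movie and not np.isnan(R[j, user])]
top2 = sorted(sims, reverse=True)[:2]    # -> s_13 ~ 0.41, s_16 ~ 0.59
num = sum(s * R[j, user] for s, j in top2)
den = sum(s for s, _ in top2)
print(round(num / den, 1))               # 2.6, matching the slide
```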
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define the similarity s_ij of items i and j.
Select the K nearest neighbors N(i; x): the set of items most similar to item i that were rated by x.
Estimate the rating r_xi as the weighted average:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
where N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
RECSYS 38
Item-Item vs User-User
[Figure: utility matrix over users Alice–David and movies Avatar, LOTR, Matrix, Pirates, as before]
In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed.
– Cold start: need enough users in the system to find a match.
– Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items.
– First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
– Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items.
RECSYS 40
Hybrid Methods
Implement two or more different recommenders and combine their predictions, perhaps using a linear model.
Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem.
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Figure: sparse utility matrix of movie ratings, users × movies]
RECSYS 43
Evaluation
[Figure: the same utility matrix, with some known ratings withheld as the Test Data Set]
RECSYS 44
Evaluating Predictions
Compare predictions with known ratings:
• Root-mean-square error (RMSE): sqrt( Σ_{(x,i)} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
• Precision at top 10: % of relevant items among the top-10 recommendations
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve: Y-axis = True Positive Rate (TPR), X-axis = False Positive Rate (FPR); TPR (aka Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN). See https://en.wikipedia.org/wiki/Precision_and_recall
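Minimal sketches of the two most common of these metrics (function names are mine):

```python
import numpy as np

def rmse(predicted, actual):
    """Square root of the mean squared difference over test ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = list(recommended)[:k]
    return sum(item in set(relevant) for item in top_k) / k

print(rmse([3.5, 2.0, 4.1], [4, 2, 5]))               # ~0.59
print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=3))  # ~0.667
```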
RECSYS 45
Problems with Error Measures
A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions.
In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others.
RECSYS 46
Collaborative Filtering Complexity
The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.
Too expensive to do at runtime; could pre-compute, using clustering as an approximation.
Naïve pre-computation takes time O(N · |C|), where |C| = number of clusters (= k in k-means) and N = number of data points.
We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.
RECSYS 47
Tip: Add Data
• Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work; simple methods on large data do best.
• Add more data, e.g., add IMDB data on genres.
• More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[Figure: movies placed in a 2-D latent space; one axis runs from "geared towards females" to "geared towards males", the other from "funny" to "serious". Example placements: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50
Koren, Bell, Volinsky: IEEE Computer, 2009
RECSYS 51
The Netflix Utility Matrix R
[Figure: sparse rating matrix R, ≈480,000 users × 17,700 movies]
RECSYS 52
Utility Matrix R: Evaluation
[Figure: matrix R (≈480,000 users × 17,700 movies) split into a Training Data Set and a withheld Test Data Set; r̂_36 marks a predicted rating]
SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)², where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · P^T
For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T.
R has missing entries, but let's ignore that for now.
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones.
[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]
SVD: A = U Σ V^T
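On a fully observed toy matrix, the decomposition is plain truncated SVD; a sketch matching the slide's naming (Q = U, P^T = Σ V^T), with made-up data:

```python
import numpy as np

# Toy fully-observed matrix R (items x users). With missing entries plain
# SVD is undefined -- the slides address that a few slides later.
R = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
f = 2
Q  = U[:, :f]                      # items x f   (slide: Q = U)
PT = np.diag(s[:f]) @ Vt[:f, :]    # f x users   (slide: P^T = Sigma V^T)
R_hat = Q @ PT                     # rank-f reconstruction
print(np.round(R_hat, 1))
print("SSE:", np.sum((R - R_hat) ** 2))
```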
RECSYS 54
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, where q_i = row i of Q and p_x = column x of P^T
[Figure: R ≈ Q (items × f factors) · P^T (f factors × users), with the missing entry highlighted]
RECSYS 55
Ratings as Products of Factors
(Same slide, repeated as an animation step.)
RECSYS 56
Ratings as Products of Factors
(Same slide; the estimate r̂_xi = 2.4 is filled into the missing entry.)
RECSYS 57
Latent Factor Models
[Figure: the same movies plotted in the latent space spanned by Factor 1 (geared towards females vs. males) and Factor 2 (funny vs. serious)]
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap: SVD
Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.
[Figure: A (m × n) ≈ U Σ V^T]
SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (A_ij − [U Σ V^T]_ij)²
So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T.
But we are not done yet! The sum goes over all entries, and our R has missing entries.
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q: min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)², with r̂_xi = q_i · p_x^T.
Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]
RECSYS 61
Dealing with Missing Entries
We want to minimize SSE for unseen test data.
Idea: minimize SSE on training data. We want a large f (number of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.
Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
          "error"                                  "length"
λ … regularization parameter
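A sketch of this regularized training objective in Python, with the known ratings kept as a sparse dict; the names and the toy data are illustrative:

```python
import numpy as np

def objective(R_known, P, Q, lam):
    """sum over known (x, i) of (r_xi - q_i . p_x)^2
       + lam * (sum_x ||p_x||^2 + sum_i ||q_i||^2).
    R_known: dict mapping (x, i) -> rating; P: users x f; Q: items x f."""
    err = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in R_known.items())
    length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return err + length

rng = np.random.default_rng(0)
P, Q = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(objective({(0, 1): 4.0, (2, 3): 1.0}, P, Q, lam=0.1))
```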
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the BellKor team)
[Figure: movies and users (e.g., Gus, Dave) plotted together in the latent space: serious vs. escapist, geared towards females vs. males]
RECSYS 63
Dealing with Missing Entries
(Repeats the regularized formulation above.)
RECSYS 64
The Effect of Regularization
min_factors "error" + λ · "length": min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
[Figure: movies in the latent space, Factor 1 (females vs. males) by Factor 2 (serious vs. funny)]
RECSYS 65–67
The Effect of Regularization (continued)
(Animation frames of the same figure: as the regularization term λ · "length" takes effect, the factor vectors of sparsely-rated movies such as The Princess Diaries shrink toward the origin.)
RECSYS 68
Use Gradient Descent to search for the optimal settings
We want to find matrices P and Q for: min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
Gradient descent:
• Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2 λ q_if; here q_if is entry f of row q_i of matrix Q.
  (How to compute the gradient of a matrix? Compute the gradient of every element independently.)
• Observation: computing gradients is slow.
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
(Recap of the gradient-descent formulation above, repeated after the digression.)
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD.
Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2 λ q_if = Σ_{x,i} ∇Q(r_xi); here q_if is entry f of row q_i of matrix Q.
GD: Q ← Q − η · Σ_{x,i} ∇Q(r_xi)
Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.
SGD: Q ← Q − μ · ∇Q(r_xi)
Faster convergence!
• Need more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD: Convergence
[Figure: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T   (the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
  (η … learning rate)
• 2 for-loops: for until convergence; for each r_xi: compute the gradient, do a "step".
RECSYS 75
Koren, Bell, Volinsky: IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
We want to make good recommendations on items that some user has not yet seen. So let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.
This is a case where we apply optimization methods.
[Figure: sparse rating matrix]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006–09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize:
• Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data (2000–2005)
• Test data: last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix Cinematch RMSE: 0.9514
• Competition: 2700+ teams; $1 million grand prize for a 10% improvement on the Cinematch result; $50,000 2007 progress prize for an 8.43% improvement
[Table: sample (user, movie, score) rows of the training data and (user, movie, ?) rows of the test data]
Movie rating data
Overall rating distribution: a third of ratings are 4s; the average rating is 3.68. (From TimelyDevelopment.com)
#ratings per movie: avg #ratings/movie = 5627
#ratings per user: avg #ratings/user = 208
Average movie rating by movie count: more ratings go to better movies. (From TimelyDevelopment.com)
Most loved movies (count / avg rating / title):
• 137,812 / 4.593 / The Shawshank Redemption
• 133,597 / 4.545 / Lord of the Rings: The Return of the King
• 180,883 / 4.306 / The Green Mile
• 150,676 / 4.460 / Lord of the Rings: The Two Towers
• 139,050 / 4.415 / Finding Nemo
• 117,456 / 4.504 / Raiders of the Lost Ark
• 180,736 / 4.299 / Forrest Gump
• 147,932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
• 149,199 / 4.325 / The Sixth Sense
• 144,027 / 4.333 / Indiana Jones and the Last Crusade
Challenges
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
RECSYS 101
Extending the Latent Factor Model to Include Biases
Modeling biases and interactions:
r̂_xi = μ + b_x + b_i + q_i · p_x^T   (user bias + movie bias + user-movie interaction)
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
Baseline predictor (μ + b_x + b_i): separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition.
User-movie interaction (q_i · p_x^T): characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations.
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
• Rating scale of user x
• Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
• (Recent) popularity of movie i
• Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:
r̂ = 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:
min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i p_x^T))²   (goodness of fit)
          + λ (Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)   (regularization)
λ is selected via grid-search on a validation set.
Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
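Extending the earlier SGD sketch with the two bias terms is mechanical; each bias gets the same kind of regularized update. This is a sketch of the idea, not the BellKor code, and the names are mine.

```python
import numpy as np

def sgd_biased(ratings, n_users, n_items, f=10, eta=0.01, lam=0.05,
               epochs=100, seed=0):
    """SGD for r_hat = mu + b_x + b_i + q_i . p_x with L2 regularization;
    biases are learned as ordinary parameters alongside P and Q."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])         # overall mean rating
    bx, bi = np.zeros(n_users), np.zeros(n_items)    # user / movie biases
    P = rng.normal(scale=0.1, size=(n_users, f))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):
        for x, i, r in ratings:
            eps = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
            bx[x] += eta * (eps - lam * bx[x])
            bi[i] += eta * (eps - lam * bi[i])
            qi = Q[i].copy()
            Q[i] += eta * (eps * P[x] - lam * Q[i])
            P[x] += eta * (eps * qi - lam * P[x])
    return mu, bx, bi, P, Q
```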
RECSYS 106
BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (≈0.885–0.92) vs. millions of parameters (1–1000) for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE):
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.
Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones.
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.
Add temporal dependence to the factors: p_x(t) … user preference vector on day t.
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
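A sketch of the two time-dependent bias parameterizations named above, following Koren's KDD '09 paper; the 10-week bin width is from the slide, while the β = 0.4 exponent is the paper's choice, and the array names are mine.

```python
import numpy as np

BIN_DAYS = 70                     # 10 consecutive weeks per bin

def movie_bias(i, t, b_i, b_i_bin):
    """b_i(t) = b_i + b_{i,Bin(t)}: static movie bias plus a per-bin shift."""
    return b_i[i] + b_i_bin[i, t // BIN_DAYS]

def user_bias(x, t, b_x, alpha, t_mean, beta=0.4):
    """b_x(t) = b_x + alpha_x * dev_x(t), with the drift term
    dev_x(t) = sign(t - t_mean_x) * |t - t_mean_x| ** beta."""
    dev = np.sign(t - t_mean[x]) * abs(t - t_mean[x]) ** beta
    return b_x[x] + alpha[x] * dev
```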
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (≈0.875–0.92) vs. millions of parameters (1–10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE):
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of Ensemble Size
[Plot: error (RMSE), falling from ≈0.8925 to ≈0.87, vs. number of predictors (1–50)]
RECSYS 2
Acknowledgements The slides used in this chapter are adapted from
CS246 Mining Massive Data-sets by Jure Leskovec Stanford University
Yehuda Koren Robert Bell amp ChrisVolinsky ldquoLessons from the Netflix Prizerdquo ATampT Research
Some slides and plots borrowed from Yehuda Koren Robert Bell and Padhraic Smyth
with the authorrsquos permission All copyrights belong to the original author of the material
RECSYS 3
Content-based Systems amp Collaborative Filtering
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 4
High Dimensional Data
High dim data
Locality sensitive hashing
Clustering
Dimensionality
reduction
Graph data
PageRank SimRank
Community Detection
Spam Detection
Infinite data
Filtering data
streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
1162019
RECSYS 5
Example Recommender Systems
Customer X Buys Metallica CD Buys Megadeth CD
Customer Y Does search on Metallica Recommender system
suggests Megadeth from data collected about customer X
RECSYS 6
Recommendations
Items
Search Recommendations
Products web sites blogs news items hellip
Examples
RECSYS 8
The Long Tail
Source Chris Anderson (2004)
1162019 8
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully-tuned model
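One way to read the ensemble advice in code: learn a linear blend of several predictors' validation-set outputs. A minimal sketch under that assumption (the slides do not describe BellKor's actual, more elaborate blending):

    import numpy as np

    def blend_weights(preds, truth):
        # preds: (n_ratings, n_predictors) matrix of validation-set predictions,
        # one column per predictor; truth: the true validation ratings.
        w, *_ = np.linalg.lstsq(preds, truth, rcond=None)
        return w

    # Blended prediction on new data: preds_new @ w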
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i
The rating decomposes as r_xi = μ + b_x + b_i + q_i · p_x^T (user bias + movie bias + user-movie interaction around the overall mean).
User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
Baseline predictor:
• Separates users and movies
• Benefits from insights into users’ behavior
• Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x’s attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day (“frequency”)
1162019
RECSYS 104
Putting It All Together
Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
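The same arithmetic as a tiny function (a hypothetical helper, just to make the decomposition explicit):

    def baseline_prediction(mu, b_x, b_i):
        # Baseline estimate: overall mean + user bias + movie bias.
        return mu + b_x + b_i

    # baseline_prediction(3.7, -1.0, 0.5) == 3.2 (up to float rounding)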
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient descent to find the parameters. Note: both biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).
1162019
min over Q, P, and the biases:
Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²    [goodness of fit]
+ λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )    [regularization]
λ is selected via grid-search on a validation set.
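A sketch of the corresponding SGD updates for this extended objective; it mirrors the earlier factor-only loop, with the same illustrative η and λ (q_i and p_x are NumPy vectors):

    def sgd_step_with_biases(mu, b_x, b_i, q_i, p_x, r_xi, eta=0.005, lam=0.02):
        # One SGD update for the biased model r_hat = mu + b_x + b_i + q_i . p_x.
        err = r_xi - (mu + b_x + b_i + q_i @ p_x)
        b_x_new = b_x + eta * (err - lam * b_x)
        b_i_new = b_i + eta * (err - lam * b_i)
        q_i_new = q_i + eta * (err * p_x - lam * q_i)
        p_x_new = p_x + eta * (err * q_i - lam * p_x)
        return b_x_new, b_i_new, q_i_new, p_x_new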
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge
Multi-scale modeling of the data: combine top-level “regional” modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing “regional” effects
• Collaborative filtering: extract local patterns
[Diagram: global effects, then factorization, then collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (about 0.885 to 0.92) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
1162019
RECSYS 108
Grand Prize: 0.8563
Netflix Cinematch: 0.9514
Movie average: 1.0533 / User average: 1.0651
Global average: 1.1296
Performance of Various Methods (RMSE):
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.
Movie age: do users prefer new movies? Not without reason; older movies are just inherently rated better than newer ones.
1162019
Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) use bin-based biases, where each bin corresponds to 10 consecutive weeks.
Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
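For reference, the concrete parameterizations used in that paper (reproduced from memory; treat details such as the exponent as assumptions rather than slide content):

    % user bias with a linear trend around the user's mean rating date t_x
    b_x(t) = b_x + \alpha_x \, \mathrm{dev}_x(t), \qquad
    \mathrm{dev}_x(t) = \operatorname{sign}(t - t_x) \, |t - t_x|^{\beta}
    % movie bias, held constant within each ~10-week bin
    b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}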
RECSYS 111
Adding Temporal Effects
1162019
[Plot: RMSE (about 0.875 to 0.92) vs. millions of parameters (1 to 10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Grand Prize: 0.8563
Netflix Cinematch: 0.9514
Movie average: 1.0533 / User average: 1.0651
Global average: 1.1296
Performance of Various Methods (RMSE):
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
• Latent factors + Biases + Time: 0.876
1162019 112
Still no prize! Getting desperate. Try a “kitchen sink” approach.
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, about 0.87 to 0.8925) vs. number of predictors (1 to 50)]
RECSYS 3
Content-based Systems amp Collaborative Filtering
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 4
High Dimensional Data
High dim data
Locality sensitive hashing
Clustering
Dimensionality
reduction
Graph data
PageRank SimRank
Community Detection
Spam Detection
Infinite data
Filtering data
streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
1162019
RECSYS 5
Example Recommender Systems
Customer X Buys Metallica CD Buys Megadeth CD
Customer Y Does search on Metallica Recommender system
suggests Megadeth from data collected about customer X
RECSYS 6
Recommendations
Items
Search Recommendations
Products web sites blogs news items hellip
Examples
RECSYS 8
The Long Tail
Source Chris Anderson (2004)
1162019 8
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 4
High Dimensional Data
High dim data
Locality sensitive hashing
Clustering
Dimensionality
reduction
Graph data
PageRank SimRank
Community Detection
Spam Detection
Infinite data
Filtering data
streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
1162019
RECSYS 5
Example Recommender Systems
Customer X Buys Metallica CD Buys Megadeth CD
Customer Y Does search on Metallica Recommender system
suggests Megadeth from data collected about customer X
RECSYS 6
Recommendations
Items
Search Recommendations
Products web sites blogs news items hellip
Examples
RECSYS 8
The Long Tail
Source Chris Anderson (2004)
1162019 8
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD Observation: ∇Q = [∇q_if] where
∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
 Here q_if is entry f of row q_i of matrix Q, and ∇Q(r_xi) is the gradient contribution of the single rating r_xi
 Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 GD: Q ← Q − η·Σ_{r_xi} ∇Q(r_xi)
 SGD: Q ← Q − η·∇Q(r_xi)
 Faster convergence!
 Need more steps, but each step is computed much faster
RECSYS 73
SGD vs GD Convergence of GD vs SGD
[Plot: value of the objective function vs. iteration/step for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent: Initialize P and Q (e.g., using SVD; pretend missing ratings are 0) Then iterate over the ratings (multiple times if necessary) and update the factors:
For each rxi:
 ε_xi = r_xi − q_i·p_x^T (the prediction "error")
 q_i ← q_i + η (ε_xi·p_x − λ·q_i) (update equation)
 p_x ← p_x + η (ε_xi·q_i − λ·p_x) (update equation)
 2 for loops: For until convergence:
 For each r_xi: Compute gradient, do a "step"
η … learning rate
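A minimal, self-contained sketch of these two loops follows; the toy data, learning rate eta, and regularizer lam are illustrative assumptions. Each observed rating triggers exactly the two update equations above.

```python
import numpy as np
import random

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]], dtype=float)        # 0 = missing (toy data)
ratings = [(x, i, R[x, i])
           for x in range(R.shape[0])
           for i in range(R.shape[1]) if R[x, i] > 0]
f, eta, lam = 2, 0.02, 0.1
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], f))  # user vectors p_x
Q = rng.normal(scale=0.1, size=(R.shape[1], f))  # item vectors q_i

for epoch in range(200):                         # "for until convergence"
    random.shuffle(ratings)                      # visit ratings in random order
    for x, i, r_xi in ratings:                   # "for each r_xi"
        err = r_xi - Q[i] @ P[x]                 # eps_xi = r_xi - q_i.p_x^T
        q_old = Q[i].copy()                      # use the old q_i in both updates
        Q[i] += eta * (err * P[x] - lam * Q[i])  # q_i update equation
        P[x] += eta * (err * q_old - lam * P[x]) # p_x update equation
```

Compared with the batch sketch above, each step touches a single rating, so an epoch makes many cheap, noisy updates instead of one expensive exact one.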
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary Recommendations via Optimization
 Goal: Make good recommendations Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set the values for P and Q such that they work well on the known (user, item) ratings And hope these values for P and Q will predict the unknown ratings well
 This is a case where we apply Optimization methods
[Figure: the sparse utility matrix R of known ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You Ought To Be Watching This Summer
Netflix Prize
 Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
 Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
 Competition
– 2,700+ teams
– $1 million grand prize for 10% improvement on the Cinematch result
– $50,000 2007 progress prize for 8.43% improvement
[Table: sample (user, movie, score) triples for the training data and the test data (scores withheld in the test set)]
Movie rating data
Overall rating distribution
 Third of ratings are 4s
 Average rating is 3.68
# ratings per movie
 Avg ratings/movie: 5627
# ratings per user
 Avg ratings/user: 208
Average movie rating by movie count
 More ratings to better movies
From TimelyDevelopment.com
Most loved movies
Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
 Size of data
– Scalability
– Keeping data in memory
 Missing data
– 99 percent missing
– Very imbalanced
 Avoiding overfitting
 Test and training data differ significantly (movie 16322)
 Use an ensemble of complementing predictors
 Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂_xi = μ + b_x + b_i + q_i·p_x^T
μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i, q_i·p_x^T = user-movie interaction
 User-Movie interaction Characterizes the matching between users and movies Attracts most research in the field Benefits from algorithmic and mathematical innovations
 Baseline predictor Separates users and movies Benefits from insights into users' behavior Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example: Mean rating: μ = 3.7 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5 Predicted rating for you on Star Wars (ignoring the interaction term):
= 3.7 − 1 + 0.5 = 3.2
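To connect the two previous slides, here is a tiny sketch reproducing this arithmetic in code; the function name predict and the zero interaction vectors are illustrative assumptions, not course code.

```python
import numpy as np

def predict(mu, b_user, b_item, P, Q, x, i):
    """Baseline + interaction: r_hat_xi = mu + b_x + b_i + q_i . p_x^T."""
    return mu + b_user[x] + b_item[i] + Q[i] @ P[x]

mu = 3.7                                     # overall mean rating
b_user = np.array([-1.0])                    # critical reviewer: b_x = -1
b_item = np.array([0.5])                     # Star Wars: b_i = +0.5
P = np.zeros((1, 2)); Q = np.zeros((1, 2))   # interaction term set to 0 here
print(round(predict(mu, b_user, b_item, P, Q, 0, 0), 2))   # -> 3.2
```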
RECSYS 105
Fitting the New Model
 Solve:
min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²   [goodness of fit]
+ λ (Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)   [regularization]
 Stochastic gradient descent to find the parameters Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set
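A hedged sketch of fitting this extended model with SGD, building on the earlier loop: the bias updates follow the same error-plus-shrinkage pattern as the factor updates. The toy data and hyperparameters are assumptions for illustration.

```python
import numpy as np
import random

# Toy ratings as a dict {(user, item): rating}
R = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 3): 1.0,
     (2, 1): 1.0, (2, 2): 5.0, (2, 3): 4.0}
n_users, n_items, f, eta, lam = 3, 4, 2, 0.02, 0.1
mu = sum(R.values()) / len(R)                    # overall mean rating
b_x = np.zeros(n_users)                          # user biases
b_i = np.zeros(n_items)                          # movie biases
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

pairs = list(R.items())
for epoch in range(200):
    random.shuffle(pairs)
    for (x, i), r in pairs:
        err = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])
        b_x[x] += eta * (err - lam * b_x[x])     # user-bias update
        b_i[i] += eta * (err - lam * b_i[i])     # movie-bias update
        q_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])  # factor updates as before
        P[x] += eta * (err * q_old - lam * P[x])
```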
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
 Multi-scale modeling of the data: Combine top-level "regional" modeling of the data with a refined local view
 Global Overall deviations of users/movies
 Factorization Addressing "regional" effects
 Collaborative filtering Extract local patterns
[Diagram: global effects, then factorization, then collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533 User average: 1.0651
Global average: 1.1296
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
RECSYS 109
Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004) Improvements in Netflix GUI Meaning of rating changed
 Movie age Users prefer new movies without any reason? Or older movies are just inherently better than newer ones?
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: r_xi = μ + b_x + b_i + q_i·p_x^T
 Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
 Make the parameters b_x and b_i depend on time: (1) Parameterize the time-dependence by linear trends (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: p_x(t)… user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
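One simple way to realize a binned, time-dependent movie bias b_i(t) is sketched below, assuming the slide's 10-consecutive-week bins and a decomposition into a stationary bias plus a per-bin offset; the constants and names are illustrative, not from the paper's code.

```python
import numpy as np

BIN_WEEKS = 10                                   # each bin = 10 consecutive weeks

def time_bin(day):
    """Map a day index (days since the start of the data) to a bias bin."""
    return day // (7 * BIN_WEEKS)

n_bins = time_bin(6 * 365) + 1                   # ~6 years of data
b_i_static = 0.3                                 # stationary part of the movie bias
b_i_bin = np.zeros(n_bins)                       # learned per-bin offsets

def movie_bias(day):
    """b_i(t) = b_i + b_i,Bin(t): time-dependent movie bias."""
    return b_i_static + b_i_bin[time_bin(day)]

print(movie_bias(0), movie_bias(400))            # bias in bin 0 vs. bin 5
```

The per-bin offsets would be fit by the same SGD loop as the other bias parameters.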
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533 User average: 1.0651
Global average: 1.1296
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: Error (RMSE, 0.87 to 0.8925) vs. number of predictors (1 to 50)]
RECSYS 6
Recommendations
Items
Search Recommendations
Products web sites blogs news items hellip
Examples
RECSYS 8
The Long Tail
Source Chris Anderson (2004)
1162019 8
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q
Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η·∇P
  Q ← Q − η·∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
How to compute the gradient of a matrix? Compute the gradient of every element independently
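To make the update rules concrete, here is a minimal batch-gradient-descent sketch for the regularized objective. The toy data, learning rate η, and λ are illustrative assumptions, and small random initialization stands in for the SVD-based initialization the slide suggests:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, f = 4, 5, 2
ratings = {(0, 1): 3.0, (0, 4): 4.0, (1, 0): 5.0, (2, 2): 2.0, (3, 3): 1.0}
P = rng.normal(scale=0.1, size=(n_users, f))   # stand-in for SVD init
Q = rng.normal(scale=0.1, size=(n_items, f))

eta, lam = 0.01, 0.1
for step in range(200):
    gP = 2 * lam * P                    # gradient of the "length" penalty
    gQ = 2 * lam * Q
    for (x, i), r in ratings.items():   # accumulate over all known ratings
        err = r - Q[i] @ P[x]           # r_xi - q_i . p_x^T
        gQ[i] += -2 * err * P[x]        # -2 (r_xi - q_i.p_x^T) p_x
        gP[x] += -2 * err * Q[i]
    P -= eta * gP                       # P <- P - eta * grad P
    Q -= eta * gQ                       # Q <- Q - eta * grad Q

Each pass touches every known rating once just to build one full gradient, which is what makes batch GD slow on 100 million ratings; the stochastic variant a few slides later avoids this.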
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q
So the full gradient is a sum of per-rating gradients: Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
GD:  Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η·∇Q(r_xi)
Faster convergence!
• Needs more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD
Convergence of GD vs. SGD
(Plot: value of the objective function vs. iteration/step, for GD and SGD.)
GD improves the value of the objective function at every step.
SGD improves the value, but in a "noisy" way.
GD takes fewer steps to converge, but each step takes much longer to compute.
In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i·p_x^T   (the prediction "error")
  q_i ← q_i + η (ε_xi·p_x − λ·q_i)   (update equation)
  p_x ← p_x + η (ε_xi·q_i − λ·p_x)   (update equation)
Two for-loops:
• For until convergence:
  • For each r_xi: compute the gradient, do a "step"
η … learning rate
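The two for-loops translate directly into code. A minimal SGD sketch under the same toy-data assumptions as before (the data, η, λ, and epoch count are illustrative, not tuned values):

import numpy as np
import random

rng = np.random.default_rng(0)
n_users, n_items, f = 4, 5, 2
ratings = [((0, 1), 3.0), ((0, 4), 4.0), ((1, 0), 5.0),
           ((2, 2), 2.0), ((3, 3), 1.0)]
P = rng.normal(scale=0.1, size=(n_users, f))   # stand-in for SVD init
Q = rng.normal(scale=0.1, size=(n_items, f))

eta, lam = 0.02, 0.1
for epoch in range(50):                 # outer loop: "for until convergence"
    random.shuffle(ratings)             # visit the ratings in random order
    for (x, i), r in ratings:           # inner loop: one step per rating
        eps = r - Q[i] @ P[x]           # eps_xi = r_xi - q_i . p_x^T
        q_old = Q[i].copy()             # keep old q_i for p_x's update
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])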
RECSYS 75
(Figure from Koren, Bell, Volinsky, IEEE Computer, 2009.)
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
Let's set the values for P and Q such that they work well on the known (user, item) ratings,
and hope these values for P and Q will predict the unknown ratings well
This is a case where we apply Optimization methods
(Figure: the sparse utility matrix of known ratings.)
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
• Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
• Competition
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement
(Tables: a sample of the training data as (user, movie, score) triples, and of the test data, where the scores are withheld and must be predicted.)
Movie rating data (from TimelyDevelopment.com)
Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
Ratings per movie
• Avg ratings/movie: 5,627
Ratings per user
• Avg ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
Most loved movies
Count    Avg rating   Title
137,812  4.593        The Shawshank Redemption
133,597  4.545        Lord of the Rings: The Return of the King
180,883  4.306        The Green Mile
150,676  4.460        Lord of the Rings: The Two Towers
139,050  4.415        Finding Nemo
117,456  4.504        Raiders of the Lost Ark
180,736  4.299        Forrest Gump
147,932  4.433        Lord of the Rings: The Fellowship of the Ring
149,199  4.325        The Sixth Sense
144,027  4.333        Indiana Jones and the Last Crusade
Challenges
• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending the Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂_xi = μ + b_x + b_i + q_i·p_x^T   (user bias + movie bias + user-movie interaction)
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
User-movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
Baseline predictor
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:
min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²   [goodness of fit]
  + λ (Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)   [regularization]
λ is selected via grid-search on a validation set
Stochastic gradient descent to find the parameters
• Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
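A sketch of how the SGD loop changes once biases enter the model. The bias update rules below follow the same gradient pattern as the factor updates (that symmetry is my assumption from the objective above, not code from the competition); toy data and hyperparameters are again illustrative:

import numpy as np
import random

rng = np.random.default_rng(0)
n_users, n_items, f = 4, 5, 2
ratings = [((0, 1), 3.0), ((0, 4), 4.0), ((1, 0), 5.0),
           ((2, 2), 2.0), ((3, 3), 1.0)]
mu = sum(r for _, r in ratings) / len(ratings)   # overall mean rating
b_user = np.zeros(n_users)                       # b_x, estimated
b_item = np.zeros(n_items)                       # b_i, estimated
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

eta, lam = 0.02, 0.1
for epoch in range(50):
    random.shuffle(ratings)
    for (x, i), r in ratings:
        pred = mu + b_user[x] + b_item[i] + Q[i] @ P[x]
        eps = r - pred
        b_user[x] += eta * (eps - lam * b_user[x])   # bias updates mirror
        b_item[i] += eta * (eps - lam * b_item[i])   # the factor updates
        q_old = Q[i].copy()
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])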
RECSYS 106
BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
(Diagram: global effects, then factorization, then collaborative filtering.)
RECSYS 107
Performance of Various Methods
(Plot: RMSE, roughly 0.885 to 0.92, vs. millions of parameters, 1 to 1000, for three methods: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases.)
RECSYS 108
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004)
• Improvements in the Netflix GUI
• Meaning of rating changed
Movie age
• Users prefer new movies without any reason?
• Or older movies are just inherently better than newer ones?
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x + b_i + q_i·p_x^T
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
• Make the parameters b_x and b_i depend on time:
  (1) Parameterize the time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
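One way to realize the binned time-dependence for the movie bias, as a hypothetical sketch: it assumes each rating carries a day stamp, uses the 10-consecutive-week bins mentioned above, and models b_i(t) as a static bias plus a per-bin offset (both learned like the other parameters):

import numpy as np

DAYS_PER_BIN = 70                      # 10 consecutive weeks, as on the slide

def time_bin(t_days):
    """Map a rating's day index (days since the start of the data) to a bin."""
    return t_days // DAYS_PER_BIN

n_items, n_bins = 5, 30
b_item = np.zeros(n_items)                 # static part of the movie bias
b_item_bin = np.zeros((n_items, n_bins))   # per-bin offsets, learned like b_i

def movie_bias(i, t_days):
    """b_i(t) = b_i + b_{i,Bin(t)}: the binned time-dependent movie bias."""
    return b_item[i] + b_item_bin[i, time_bin(t_days)]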
RECSYS 111
Adding Temporal Effects
(Plot: RMSE, roughly 0.875 to 0.92, vs. millions of parameters, 1 to 10000, for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.)
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
(Plot: Error (RMSE), roughly 0.87 to 0.8925, vs. number of predictors in the ensemble, 1 to 50.)
RECSYS 8
The Long Tail
Source Chris Anderson (2004)
1162019 8
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 9
Physical vs Online
1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
The Effect of Regularization
$\min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"}$:
$\min_{P,Q} \sum_{(x,i)\in \text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda \left[\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\right]$
[Figure, shown over four animation slides (64-67): the 2-D latent factor plot (serious ↔ funny; geared towards females ↔ males) with the same movies as before; as regularization takes effect, sparsely rated movies such as The Princess Diaries are pulled toward the origin.]
RECSYS 65-67
[Animation steps of the regularization figure above.]
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q: $\min_{P,Q} \sum_{(x,i)\in \text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda \left[\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\right]$
 Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η · ∇P
 • Q ← Q − η · ∇Q
 • Where ∇Q is the gradient/derivative of matrix Q: $\nabla Q = [\nabla q_{if}]$ and $\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if}$
 • Here $q_{if}$ is entry f of row $q_i$ of matrix Q
 Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
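A compact sketch of this procedure (my own code, using random initialization rather than the SVD-based one the slide suggests) accumulates the full gradient over all observed ratings before each step:

import numpy as np

def batch_gd(R_known, n_users, n_items, f=2, lam=0.1, eta=0.001, steps=100):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors
    for _ in range(steps):
        gP = 2 * lam * P                           # regularization part
        gQ = 2 * lam * Q
        for x, i, r in R_known:                    # error part, summed over
            e = r - Q[i] @ P[x]                    # all observed ratings
            gQ[i] += -2 * e * P[x]
            gP[x] += -2 * e * Q[i]
        P -= eta * gP                              # one full-gradient step
        Q -= eta * gQ
    return P, Q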
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.
RECSYS 70
[Repeat of the "(Batch) Gradient Descent" slide above.]
RECSYS 72
Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD
 Observation: $\nabla Q = [\nabla q_{if}]$, where
 $\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if} = \sum_{x,i} \nabla Q(r_{xi})$
 • Here $q_{if}$ is entry f of row $q_i$ of matrix Q
 $Q = Q - \eta \nabla Q = Q - \eta \sum_{x,i} \nabla Q(r_{xi})$
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
 GD: $Q \leftarrow Q - \eta \sum_{r_{xi}} \nabla Q(r_{xi})$
 SGD: $Q \leftarrow Q - \eta \, \nabla Q(r_{xi})$
 Faster convergence!
 • Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD
 Convergence of GD vs. SGD
[Figure: value of the objective function vs. iteration/step, for GD and SGD.]
 GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way.
 GD takes fewer steps to converge, but each step takes much longer to compute.
 In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors. For each $r_{xi}$:
 • $\varepsilon_{xi} = r_{xi} - q_i \cdot p_x^T$ (derivative of the "error")
 • $q_i \leftarrow q_i + \eta\left(\varepsilon_{xi}\, p_x - \lambda\, q_i\right)$ (update equation)
 • $p_x \leftarrow p_x + \eta\left(\varepsilon_{xi}\, q_i - \lambda\, p_x\right)$ (update equation)
 2 for loops:
 For until convergence:
 • For each $r_{xi}$: compute gradient, do a "step"
η … learning rate
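Putting the update equations into code, a minimal SGD loop (my own sketch, with toy data and random initialization) could look like this:

import numpy as np

def sgd(R_known, n_users, n_items, f=2, lam=0.1, eta=0.01, epochs=20):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):                 # "for until convergence"
        for x, i, r in R_known:             # for each rating r_xi
            eps = r - Q[i] @ P[x]           # eps_xi, the "error"
            qi_old = Q[i].copy()            # keep old q_i for p_x's update
            Q[i] += eta * (eps * P[x] - lam * Q[i])
            P[x] += eta * (eps * qi_old - lam * P[x])
    return P, Q

# Toy usage: 3 users, 3 items, known ratings given as (x, i, r) triples
P, Q = sgd([(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 2, 1)], 3, 3)
print(Q[1] @ P[2])                          # predicted rating: user 2, item 1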
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009.]
RECSYS 76
Summary: Recommendations via Optimization
 Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set the values of P and Q such that they work well on the known (user, item) ratings
 And hope these values of P and Q will predict the unknown ratings well
 This is a case where we apply optimization methods
[Figure: sparse utility matrix of known ratings.]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement
[Table: sample (user, movie, score) triples for the training data (scores given) and the test data (scores withheld).]
Movie rating data
Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68
#ratings per movie
• Avg #ratings/movie: 5,627
#ratings per user
• Avg #ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
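The ensemble remark can be made concrete with a tiny blending sketch (entirely my own illustration, with made-up numbers): fit linear weights for several predictors' outputs on a held-out probe set by least squares:

import numpy as np

preds = np.array([[3.1, 2.9, 3.3],     # rows: probe ratings,
                  [4.2, 4.5, 4.0],     # columns: 3 different predictors
                  [1.8, 2.2, 2.0]])
truth = np.array([3.0, 4.4, 2.1])      # true probe ratings

w, *_ = np.linalg.lstsq(preds, truth, rcond=None)
print(preds @ w)                       # blended predictions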
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
 (overall mean) + (user bias) + (movie bias) + (user-movie interaction)
 μ = overall mean rating; bx = bias of user x; bi = bias of movie i
 User-movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations
 Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean, bx = −1
 Star Wars gets a mean rating 0.5 higher than the average movie, bi = +0.5
 Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
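In code form, the baseline part of this example is just arithmetic (my own two-liner; the interaction term is omitted here, as on the slide):

mu, b_x, b_i = 3.7, -1.0, 0.5
interaction = 0.0                      # q_i . p_x, not estimated in this example
print(mu + b_x + b_i + interaction)    # -> 3.2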
RECSYS 105
Fitting the New Model
 Solve:

$\min_{Q,P,b} \underbrace{\sum_{(x,i)\in R} \left(r_{xi} - (\mu + b_x + b_i + q_i p_x^T)\right)^2}_{\text{goodness of fit}} + \underbrace{\lambda \left[\sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2\right]}_{\text{regularization}}$

 λ is selected via grid-search on a validation set
 Stochastic gradient descent to find the parameters
 Note: Both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them)
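Extending the earlier SGD sketch to fit the biases as well (again my own code, not the BellKor implementation) could look like this:

import numpy as np

def sgd_biased(R_known, n_users, n_items, f=2, lam=0.1, eta=0.01, epochs=20):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    bx = np.zeros(n_users)                    # user biases
    bi = np.zeros(n_items)                    # movie biases
    mu = np.mean([r for _, _, r in R_known])  # overall mean rating
    for _ in range(epochs):
        for x, i, r in R_known:
            eps = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
            bx[x] += eta * (eps - lam * bx[x])
            bi[i] += eta * (eps - lam * bi[i])
            qi_old = Q[i].copy()
            Q[i] += eta * (eps * P[x] - lam * Q[i])
            P[x] += eta * (eps * qi_old - lam * P[x])
    return mu, bx, bi, P, Q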
RECSYS 106
BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns
[Diagram: Global effects → Factorization → Collaborative filtering.]
RECSYS 107
Performance of Various Methods
[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]
RECSYS 108
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
 Sudden rise in the average movie rating (early 2004):
 Improvements in the Netflix GUI
 Meaning of rating changed
 Movie age:
 Users prefer new movies without any specific reasons
 Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
 Add time dependence to biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^T$
 Make the parameters bx and bi depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to factors:
 px(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
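One simple way to realize the binned time-dependence described above is a lookup table, as in this small sketch (my own; WEEKS_PER_BIN is an assumed constant taken from the slide's "10 consecutive weeks"):

import numpy as np

WEEKS_PER_BIN = 10                         # one bin = 10 consecutive weeks

def make_binned_bias(n_items, n_weeks):
    n_bins = -(-n_weeks // WEEKS_PER_BIN)  # ceiling division
    return np.zeros((n_items, n_bins))     # one bias value per (item, bin)

def item_bias(b, i, week):
    return b[i, week // WEEKS_PER_BIN]     # b_i(t) for the bin containing t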
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]
RECSYS 112
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
 Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of Ensemble Size
[Figure: error (RMSE, 0.87-0.8925) vs. number of predictors (1-50); RMSE falls steadily as more predictors are blended.]
RECSYS 11
Formal Model
X = set of Customers
S = set of Items
Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 12
Utility Matrix
04102
0305021
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
RECSYS 13
Key Problems
(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix
(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings
bull We are not interested in knowing what you donrsquot like but what you like
(3) Evaluating extrapolation methods How to measure successperformance of
recommendation methods
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q: minP,Q Σ(x,i)∈R (rxi − qi · pxT)²
Note:
• We don't require the columns of P, Q to be orthogonal or of unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
[Figure: R ≈ Q · PT; Q is items × f factors, PT is f factors × users; r̂xi = qi · pxT]
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data We want a large f (number of factors) to capture all the signals But SSE on test data begins to rise for f > 2
Regularization is needed to avoid overfitting: Allow a rich model where there are sufficient data Shrink aggressively where data are scarce
minP,Q Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)
"error"  "length"
λ … regularization parameter
[Figure: the sparse utility matrix of known training ratings]
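As a sketch, the regularized objective above can be written out as follows, assuming the known ratings are stored as (x, i, r) triples and P, Q are NumPy arrays whose rows are the user and item factor vectors (the names are illustrative):

```python
import numpy as np

def objective(ratings, P, Q, lam):
    """ "error" + lambda * "length": SSE on the training ratings plus
    the squared lengths of all factor vectors, scaled by lambda."""
    sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    length = (P ** 2).sum() + (Q ** 2).sum()
    return sse + lam * length
```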
RECSYS 62
Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
[Figure: users Gus and Dave placed in the same 2D latent factor space as the movies (serious/escapist vs. geared towards females/males). Titles shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 64
The Effect of Regularization
minfactors "error" + λ "length": minP,Q Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)
[Figure, animated over four slides: the 2D latent factor plot of the movies (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny); as the regularized fit proceeds, sparsely rated movies such as The Princess Diaries are shrunk toward the origin]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q: minP,Q Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)
Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Do gradient descent:
• P ← P − η · ∇P
• Q ← Q − η · ∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σx,i −2 (rxi − qi pxT) pxf + 2λ qif
• Here qif is entry f of row qi of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently
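A sketch of one batch gradient step for Q, following the formula above (the step for P is symmetric; variable names are illustrative):

```python
import numpy as np

def gradient_Q(ratings, P, Q, lam):
    # grad_qif = sum over known ratings of item i of -2*(r_xi - q_i . p_x)*p_xf,
    # plus the regularization term 2*lambda*q_if
    grad = 2 * lam * Q
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]
        grad[i] -= 2 * err * P[x]
    return grad

# One batch GD step (eta is the learning rate):
# Q = Q - eta * gradient_Q(ratings, P, Q, lam)
```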
RECSYS 69
Digression: the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
[Repeats the gradient descent formulation from RECSYS 68]
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD Observation: ∇Q = [∇qif] where ∇qif = Σx,i −2 (rxi − qi pxT) pxf + 2λ qif = Σx,i ∇Q(rxi)
• Here qif is entry f of row qi of matrix Q
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
GD: Q ← Q − η Σrxi ∇Q(rxi)
SGD: Q ← Q − η · ∇Q(rxi)
Faster convergence!
• Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Then iterate over the ratings (multiple times if necessary) and update the factors.
For each rxi:
• εxi = rxi − qi · pxT (the prediction "error")
• qi ← qi + η (εxi px − λ qi) (update equation)
• px ← px + η (εxi qi − λ px) (update equation)
2 for loops: For until convergence:
• For each rxi:
• Compute the gradient, do a "step"
η … learning rate
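Putting the update equations together, a runnable sketch of the full SGD loop (for brevity this initializes P and Q randomly rather than from an SVD; names and defaults are illustrative):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i
    for _ in range(epochs):            # "for until convergence"
        for x, i, r in ratings:        # for each r_xi
            err = r - Q[i] @ P[x]      # eps_xi = r_xi - q_i . p_x
            q_old = Q[i].copy()        # use the old q_i in p_x's update
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * q_old - lam * P[x])
    return P, Q

# Example: P, Q = sgd_factorize([(0, 0, 5.0), (0, 1, 3.0)], n_users=1, n_items=2)
```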
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]
RECSYS 76
Summary Recommendations via Optimization
Goal: Make good recommendations Quantify goodness using SSE: lower SSE means better recommendations We want to make good recommendations on items that some user has not yet seen
Let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will also predict the unknown ratings well
This is a case where we apply optimization methods
[Figure: the sparse utility matrix of known ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data, 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
• Competition
– 2,700+ teams
– $1 million grand prize for 10% improvement on the Cinematch result
– $50,000 2007 progress prize for 8.43% improvement
[Tables: samples of (user, movie, score) triples from the training and test data]
Movie rating data
Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
# ratings per movie
• Avg ratings/movie: 5,627
# ratings per user
• Avg ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
– Scalability
– Keeping data in memory
• Missing data
– 99 percent missing
– Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model
The BellKor recommender system
RECSYS 101
Extending the Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
rxi = μ + bx + bi + qi · pxT (user bias + movie bias + user-movie interaction)
μ = overall mean rating, bx = bias of user x, bi = bias of movie i
User-Movie interaction: Characterizes the matching between users and movies Attracts most research in the field Benefits from algorithmic and mathematical innovations
Baseline predictor: Separates users and movies Benefits from insights into users' behavior Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example: Mean rating μ = 3.7 You are a critical reviewer: your ratings are 1 star lower than the mean, bx = −1 Star Wars gets a mean rating 0.5 higher than the average movie, bi = +0.5 Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
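The baseline prediction is just this three-term sum; as a trivial sketch:

```python
def baseline_predict(mu, b_x, b_i):
    # overall mean + user bias + movie bias
    return mu + b_x + b_i

print(baseline_predict(3.7, -1.0, 0.5))   # 3.2, the Star Wars example above
```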
RECSYS 105
Fitting the New Model
Solve:
minQ,P,b Σ(x,i)∈R ( rxi − (μ + bx + bi + qi pxT) )²   (goodness of fit)
  + λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )   (regularization)
Stochastic gradient descent to find the parameters Note: Both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them)
λ is selected via grid-search on a validation set
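A sketch of the SGD update for this extended model; the bias updates follow from differentiating the objective above, just like the factor updates (names are illustrative; P, Q are NumPy arrays, b_user and b_item are per-user/per-movie bias arrays):

```python
import numpy as np

def sgd_step_with_biases(x, i, r, mu, b_user, b_item, P, Q, eta, lam):
    # Prediction now includes the baseline: mu + b_x + b_i + q_i . p_x
    err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (err - lam * b_user[x])
    b_item[i] += eta * (err - lam * b_item[i])
    q_old = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * q_old - lam * P[x])
```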
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge!
Multi-scale modeling of the data: Combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for three methods: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating (early 2004)
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed
Movie age
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: rxi = μ + bx + bi + qi · pxT
Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · pxT
Make the parameters bx and bi depend on time: (1) Parameterize the time-dependence by linear trends (2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: px(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
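As a sketch of the binned time dependence for the movie bias: one natural reading of the slide is bi(t) = bi + bi,Bin(t), a stationary part plus an offset for the 10-week bin containing day t. The exact decomposition and the names below are assumptions, following Koren's KDD '09 paper:

```python
def movie_bias_at(i, day, b_item, b_item_bin, days_per_bin=70):
    # b_i(t) = b_i + b_{i, Bin(t)}; each bin spans 10 consecutive weeks (70 days)
    return b_item[i] + b_item_bin[i][day // days_per_bin]
```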
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87–0.8925) vs. number of predictors in the ensemble (1–50); RMSE falls as predictors are added]
RECSYS 14
(1) Gathering Ratings Explicit
Ask people to rate items Doesnrsquot work well in practice ndash people
canrsquot be bothered
Implicit Learn ratings from user actions
bull Eg purchase implies high rating What about low ratings
RECSYS 15
(2) Extrapolating Utilities Key problem matrix U is sparse
Most people have not rated most items Cold start
bull New items have no ratingsbull New users have no history
Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering
bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering
bull Latent factor based
1162019
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
 Remember SVD: A = input data matrix, U = left singular vecs, V = right singular vecs, Σ = singular values
 SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_{xi} (A_xi − [UΣV^T]_xi)²
 So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T
 But we are not done yet! R has missing entries
 The sum goes over all entries, but our R has missing entries
[Figure: A_{m×n} ≈ U Σ V^T]
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_x^T)², with r̂_xi = q_i · p_x^T
 Note:
  • We don't require the cols of P, Q to be orthogonal/unit length
  • P, Q map users/movies to a latent space
  • The most popular model among Netflix contestants
[Figure: R ≈ Q · P^T with f latent factors, as before]
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
  Want large f (# of factors) to capture all the signals
  But, SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
  Allow a rich model where there are sufficient data
  Shrink aggressively where data are scarce
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
λ … regularization parameter; the first term is the "error", the second the "length"
[Figure: the training utility matrix of known ratings]
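A small NumPy sketch of this regularized objective, evaluated only on the known entries; setting lam = 0 recovers the plain SSE objective from the earlier slide (the toy matrix and random factors are for illustration):

import numpy as np

def objective(R, Q, P, lam):
    # R: items x users with np.nan for missing entries; Q: items x f; P: users x f
    mask = ~np.isnan(R)                        # known ratings only
    diff = np.where(mask, R - Q @ P.T, 0.0)    # r_xi - q_i . p_x^T on known entries
    error = np.sum(diff ** 2)                  # the "error" term
    length = np.sum(P ** 2) + np.sum(Q ** 2)   # the "length" term
    return error + lam * length

rng = np.random.default_rng(0)
R = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 4.0]])   # 2 items x 3 users
Q, P = rng.normal(size=(2, 2)), rng.normal(size=(3, 2))  # f = 2 factors
print(objective(R, Q, P, lam=0.1))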
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])
[Figure: users (Gus, Dave) and movies plotted together in the latent space spanned by "geared towards females ↔ geared towards males" and "serious ↔ escapist"; movies shown include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; a user is matched to the movies nearest to them in this space]
RECSYS 63
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
  Want large f (# of factors) to capture all the signals
  But, SSE on test data begins to rise for f > 2
 Regularization is needed:
  Allow a rich model where there are sufficient data
  Shrink aggressively where data are scarce
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
λ … regularization parameter; "error" + λ·"length"
RECSYS 64
The Effect of Regularization
min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
[Figure, animated over slides 64-67: movies plotted on Factor 1 ("geared towards females" ↔ "geared towards males") vs. Factor 2 ("serious" ↔ "funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; across the frames The Princess Diaries shifts position, illustrating how the regularized fit shrinks factor values where ratings are scarce]
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q attaining:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
 Gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Do gradient descent:
   • P ← P − η·∇P
   • Q ← Q − η·∇Q
   • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
   • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
 Want to find matrices P and Q attaining:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
 Gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Do gradient descent:
   • P ← P − η·∇P
   • Q ← Q − η·∇Q
   • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
   • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
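Here is a compact NumPy sketch of this recipe, including the SVD-based initialization with missing ratings pretended to be 0; the step size η, λ, and the iteration count are illustrative choices, not values from the slides:

import numpy as np

def batch_gd(R, f=2, eta=0.001, lam=0.1, steps=500):
    # R: items x users rating matrix with np.nan for missing entries
    mask = ~np.isnan(R)
    R0 = np.where(mask, R, 0.0)                  # pretend missing ratings are 0
    U, S, Vt = np.linalg.svd(R0, full_matrices=False)
    Q = U[:, :f].copy()                          # initialize Q = U
    P = (np.diag(S[:f]) @ Vt[:f]).T.copy()       # initialize P^T = Sigma V^T
    for _ in range(steps):
        E = np.where(mask, R - Q @ P.T, 0.0)     # errors on known ratings only
        grad_Q = -2.0 * E @ P + 2.0 * lam * Q    # full gradient, every element at once
        grad_P = -2.0 * E.T @ Q + 2.0 * lam * P
        Q -= eta * grad_Q                        # Q <- Q - eta * grad Q
        P -= eta * grad_P                        # P <- P - eta * grad P
    return Q, P

R = np.array([[1.0, np.nan, 3.0, np.nan],
              [np.nan, 5.0, 4.0, 1.0],
              [4.0, 2.0, np.nan, 5.0]])
Q, P = batch_gd(R)
print(np.round(Q @ P.T, 1))                      # known entries should be fit closely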
RECSYS 72
Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
  • Here q_if is entry f of row q_i of matrix Q
 So GD: Q ← Q − η·∇Q = Q − η Σ_{x,i} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 GD: Q ← Q − η Σ_{r_xi} ∇Q(r_xi)
 SGD: Q ← Q − η·∇Q(r_xi)
 Faster convergence!
  • Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD
 Convergence of GD vs. SGD:
[Figure: value of the objective function vs. iteration/step, for GD and SGD]
 GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
   • ε_xi = r_xi − q_i · p_x^T  (the prediction "error"; the factor of 2 from the derivative is absorbed into η)
   • q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
   • p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)
 2 for loops:
  For until convergence:
   • For each r_xi: compute gradient, do a "step"
η … learning rate
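A runnable sketch of this double loop; random initialization is used here for brevity where the slides initialize from SVD, and η, λ, and the epoch count are illustrative:

import random
import numpy as np

def sgd(R, f=2, eta=0.01, lam=0.1, epochs=50):
    # R: items x users with np.nan for missing entries
    n_items, n_users = R.shape
    rng = np.random.default_rng(0)
    Q = rng.normal(scale=0.1, size=(n_items, f))
    P = rng.normal(scale=0.1, size=(n_users, f))
    known = [(i, x) for i in range(n_items) for x in range(n_users)
             if not np.isnan(R[i, x])]
    for _ in range(epochs):                          # "for until convergence"
        random.shuffle(known)
        for i, x in known:                           # "for each r_xi"
            eps = R[i, x] - Q[i] @ P[x]              # eps_xi, the "error"
            q_old = Q[i].copy()                      # update both from old values
            Q[i] += eta * (eps * P[x] - lam * Q[i])  # q_i update equation
            P[x] += eta * (eps * q_old - lam * P[x]) # p_x update equation
    return Q, P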
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary Recommendations via Optimization
 Goal: Make good recommendations
  Quantify goodness using SSE: lower SSE means better recommendations
  We want to make good recommendations on items that some user has not yet seen
 Let's set the values for P and Q such that they work well on the known (user, item) ratings
 And hope these values for P and Q will predict the unknown ratings well
 This is a case where we apply optimization methods
[Figure: the training utility matrix of known ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement
[Tables: training data as (user, movie, score) triples; test data as (user, movie, ?) triples with the scores withheld]
Movie rating data
Overall rating distribution (from TimelyDevelopment.com):
• A third of the ratings are 4s
• Average rating is 3.68
# ratings per movie: avg. ratings/movie = 5627
# ratings per user: avg. ratings/user = 208
Average movie rating by movie count:
• More ratings go to better movies (from TimelyDevelopment.com)
Most loved movies:
Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂_xi = μ + b_x + b_i + q_i · p_x^T
 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction
Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition
User-Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example:
  Mean rating: μ = 3.7
  You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
  Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
  Predicted rating for you on Star Wars: r̂ = 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
 Solve:
min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )² + λ [ Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² ]
 (the first term is the goodness of fit, the second the regularization; λ is selected via grid-search on a validation set)
 Stochastic gradient descent to find the parameters
  Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
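Sketching the biased model in Python: the prediction is a one-liner, and the per-rating SGD updates below are obtained by differentiating the objective above, analogous to the earlier update equations (the bias updates are a derivation under that assumption, not spelled out on the slide; η and λ are illustrative):

import numpy as np

def predict(mu, b_x, b_i, q_i, p_x):
    # r_hat_xi = mu + b_x + b_i + q_i . p_x^T
    return mu + b_x + b_i + q_i @ p_x

# The slide-104 example, with no interaction term yet:
print(predict(3.7, -1.0, 0.5, np.zeros(2), np.zeros(2)))   # 3.2

def sgd_step(r_xi, mu, b_x, b_i, q_i, p_x, eta=0.01, lam=0.1):
    eps = r_xi - predict(mu, b_x, b_i, q_i, p_x)    # prediction error
    b_x_new = b_x + eta * (eps - lam * b_x)         # user bias update
    b_i_new = b_i + eta * (eps - lam * b_i)         # movie bias update
    q_new = q_i + eta * (eps * p_x - lam * q_i)     # item factors update
    p_new = p_x + eta * (eps * q_i - lam * p_x)     # user factors update
    return b_x_new, b_i_new, q_new, p_new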
RECSYS 106
BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
  Global:
   • Overall deviations of users/movies
  Factorization:
   • Addressing "regional" effects
  Collaborative filtering:
   • Extract local patterns
[Diagram: global effects → factorization → collaborative filtering]
RECSYS 107
Performance of Various Methods
[Figure: RMSE (0.885-0.92) vs. millions of parameters (1 to 1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004):
  Improvements in Netflix (GUI improvements)
  Meaning of rating changed
 Movie age:
  Do users prefer new movies without any reason?
  Or are older movies just inherently better than newer ones?
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: r̂_xi = μ + b_x + b_i + q_i · p_x^T
 Add time dependence to the biases: r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
  Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (0.875-0.92) vs. millions of parameters (1 to 10,000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix: 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Latent factors + Biases + Time: 0.876
 Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Figure: error (RMSE), from about 0.8925 down to 0.87, vs. number of predictors in the ensemble (1 to 50)]
RECSYS 20
Sidenote: TF-IDF
 f_ij = frequency of term (feature) i in doc (item) j
 (note: we normalize TF to discount for "longer" documents)
 n_i = number of docs that mention term i; N = total number of docs
 TF-IDF score: w_ij = TF_ij × IDF_i
 Doc profile = set of words with highest TF-IDF scores, together with their scores
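A short sketch of building doc profiles this way; it assumes the common normalizations TF_ij = f_ij / max_k f_kj and IDF_i = log(N / n_i), and the corpus is invented for illustration:

import math
from collections import Counter

docs = ["the matrix stars keanu", "the lion king", "keanu stars in john wick"]
N = len(docs)                                          # total number of docs
tokenized = [d.split() for d in docs]
n = Counter(t for doc in tokenized for t in set(doc))  # n_i: docs mentioning term i

def doc_profile(doc, top=3):
    tf = Counter(doc)
    max_f = max(tf.values())                           # normalize by the max frequency
    scores = {t: (f / max_f) * math.log(N / n[t]) for t, f in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

print(doc_profile(tokenized[0]))                       # highest-TF-IDF words of doc 0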
RECSYS 21
User Profiles and Prediction
 User profile possibilities:
  Weighted average of rated item profiles
  Variation: weight by difference from average rating for item, …
 Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
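A minimal sketch of both steps, with made-up item-profile vectors (e.g., TF-IDF features) and ratings:

import numpy as np

def cos_sim(x, i):
    # u(x, i) = cos(x, i) = x . i / (|x| * |i|)
    return (x @ i) / (np.linalg.norm(x) * np.linalg.norm(i))

item_profiles = np.array([[1.0, 0.0, 1.0],     # hypothetical profiles of two
                          [0.0, 1.0, 1.0]])    # items the user has rated
ratings = np.array([5.0, 2.0])

# User profile: weighted average of the rated item profiles
user_profile = ratings @ item_profiles / ratings.sum()

candidate = np.array([1.0, 0.0, 0.5])          # profile of an unseen item
print(cos_sim(user_profile, candidate))        # rank candidate items by this score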
RECSYS 22
Pros: Content-based Approach
+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended
RECSYS 23
Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)
– Overspecialization:
 Never recommends items outside the user's content profile
 People might have multiple interests
 Unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering: Recommend items based on past transactions of users
 Analyze relations between users and/or items
 Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by the users in N
RECSYS 28
Similar Users
 Let r_x be the vector of user x's ratings
 Jaccard similarity measure: treat r_x, r_y as sets, e.g. r_x = {1, 4, 5}, r_y = {1, 3, 4}
  Problem: Ignores the value of the rating
 Cosine similarity measure: treat r_x, r_y as points, e.g. r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
  sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)
  Problem: Treats missing ratings as "negative"
 Pearson correlation coefficient: S_xy = items rated by both users x and y; r̄_x, r̄_y … avg. rating of x, y
  sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) )
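The three measures in code, using the vectors r_x = [1, _, _, 1, 3] and r_y = [1, _, 2, 2, _] from the slide (NaN marks a missing rating):

import numpy as np

def jaccard(rx, ry):
    # Treats ratings as sets of rated items; ignores the rating values
    sx, sy = set(np.flatnonzero(~np.isnan(rx))), set(np.flatnonzero(~np.isnan(ry)))
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    # Missing ratings become 0, i.e., are treated as "negative"
    x, y = np.nan_to_num(rx), np.nan_to_num(ry)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(rx, ry):
    # Centered cosine over the co-rated items S_xy
    both = ~np.isnan(rx) & ~np.isnan(ry)
    dx = rx[both] - np.nanmean(rx)      # subtract each user's average rating
    dy = ry[both] - np.nanmean(ry)
    return (dx @ dy) / (np.linalg.norm(dx) * np.linalg.norm(dy))

rx = np.array([1.0, np.nan, np.nan, 1.0, 3.0])
ry = np.array([1.0, np.nan, 2.0, 2.0, np.nan])
print(jaccard(rx, ry), cosine(rx, ry), pearson(rx, ry))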
RECSYS 29
Similarity Metric
 Intuitively we want: sim(A, B) > sim(A, C)
 Jaccard similarity: 1/5 < 2/4 (the wrong order)
 Cosine similarity: 0.386 > 0.322, but it considers missing ratings as "negative"
 Solution: subtract the (row) mean, then compute cosine: sim(A,B) vs. sim(A,C) gives 0.092 > −0.559
 Notice: cosine similarity on mean-centered data is the Pearson correlation
RECSYS 30
Rating Predictions
 Let r_x be the vector of user x's ratings
 Let N be the set of k users most similar to x who have rated item i
 Prediction for item i of user x:
  r̂_xi = (1/k) Σ_{y∈N} r_yi
  r̂_xi = ( Σ_{y∈N} s_xy · r_yi ) / ( Σ_{y∈N} s_xy )   (shorthand: s_xy = sim(x, y))
 Other options? Many other tricks possible…
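A sketch of the similarity-weighted prediction (the ratings and the precomputed similarities s_xy are invented for illustration):

import numpy as np

def predict_user_user(R, s_x, x, i, k=2):
    # r_hat_xi = sum_{y in N} s_xy * r_yi / sum_{y in N} s_xy, where N holds
    # the k users most similar to x that have rated item i.
    # R: users x items with np.nan for missing; s_x[y] = s_xy
    rated = [y for y in range(R.shape[0]) if y != x and not np.isnan(R[y, i])]
    N = sorted(rated, key=lambda y: -s_x[y])[:k]
    return sum(s_x[y] * R[y, i] for y in N) / sum(s_x[y] for y in N)

R = np.array([[np.nan, 4.0],
              [5.0, 5.0],
              [1.0, 2.0],
              [4.0, np.nan]])
s_x = np.array([1.0, 0.8, 0.1, 0.5])            # similarities of each user to x = 0
print(predict_user_user(R, s_x, x=0, i=0))      # (0.8*5 + 0.5*4)/(0.8+0.5) ≈ 4.62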
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
 So far: User-user collaborative filtering
 Another view: Item-item
  For item i, find other similar items
  Estimate the rating for item i based on the target user's ratings on items similar to item i
  Can use the same similarity metrics and prediction functions as in the user-user model
r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )
 s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
[Figure: a 6-movie × 12-user utility matrix; entries are ratings between 1 and 5, blanks are unknown ratings]
RECSYS 33
Item-Item CF (|N|=2)
[Figure: the same matrix; the task is to estimate the rating of movie 1 by user 5]
RECSYS 34
Item-Item CF (|N|=2)
 Neighbor selection: identify movies similar to movie 1 that were rated by user 5
 Here we use Pearson correlation as the similarity:
  1) Subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6, so centered row 1 = [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
  2) Compute cosine similarities between the rows: sim(1, m) = [1.00, −0.18, 0.41, −0.10, −0.31, 0.59] for movies m = 1…6
RECSYS 35
Item-Item CF (|N|=2)
 Compute the similarity weights of the two neighbors rated by user 5: s_{1,3} = 0.41, s_{1,6} = 0.59
RECSYS 36
Item-Item CF (|N|=2)
 Predict by taking the weighted average:
  r̂_{1,5} = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
 (user 5 rated movie 3 with a 2 and movie 6 with a 3)
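The worked example, reproduced in a few lines of Python (the similarities s_1j and user 5's ratings are read off the slides above):

s = {3: 0.41, 6: 0.59}     # s_1j for the two selected neighbor movies j
r = {3: 2.0, 6: 3.0}       # user 5's ratings on movies 3 and 6

r_15 = sum(s[j] * r[j] for j in s) / sum(s.values())
print(round(r_15, 1))      # 2.6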
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
 Define the similarity s_ij of items i and j
 Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x
 Estimate the rating r_xi as the weighted average:
  r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )
 N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
RECSYS 38
Item-Item vs User-User
[Figure: the Alice/Bob/Carol/David × Avatar/LOTR/Matrix/Pirates utility matrix from before]
 In practice, it has been observed that item-item often works better than user-user
 Why? Items are simpler; users have multiple tastes
RECSYS 39
Pros/Cons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed
− Cold start: need enough users in the system to find a match
− Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
− First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
− Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
[Figure: a movies × users utility matrix filled with the known ratings]
RECSYS 43
Evaluation
[Figure: the same movies × users utility matrix with some known entries withheld as the test data set]
RECSYS 44
Evaluating Predictions
 Compare predictions with known ratings:
  Root-mean-square error (RMSE): √( Σ_{(x,i)} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
  Precision at top 10: % of relevant items among the top 10
  Rank correlation: Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
  Coverage: number of items/users for which the system can make predictions
  Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR), X-axis: false positive rate (FPR)
   • TPR (aka recall) = TP / P = TP / (TP + FN)
   • FPR = FP / N = FP / (FP + TN)
   • See https://en.wikipedia.org/wiki/Precision_and_recall
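Two of these metrics as small Python functions (the sample predictions and relevance sets are made up for illustration):

import numpy as np

def rmse(pred, true):
    # Root-mean-square error over the test ratings
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def precision_at_10(recommended, relevant):
    # Fraction of the top-10 recommended items that are relevant
    return len(set(recommended[:10]) & set(relevant)) / 10

print(rmse([3.5, 2.0, 4.0], [4.0, 2.0, 5.0]))        # ~0.65
print(precision_at_10(list(range(10)), {1, 3, 11}))  # 0.2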
RECSYS 45
Problems with Error Measures
 Narrow focus on accuracy sometimes misses the point:
  Prediction diversity
  Prediction context
  Order of predictions
 In practice, we care only about predicting high ratings:
  RMSE might penalize a method that does well for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
 Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
RECSYS 47
Tip: Add Data
 Leverage all the data: don't try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best
 Add more data, e.g., add IMDB data on genres
 More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 16
Content-based Recommender Systems
1162019
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:

$\min_{P,Q} \sum_{(x,i) \in R} \left( r_{xi} - q_i \cdot p_x^T \right)^2$

Note:
- We don't require the columns of P, Q to be orthogonal or unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants
[Figure: R ≈ Q · PT with f factors; $\hat{r}_{xi} = q_i \cdot p_x^T$]
RECSYS 61
Dealing with Missing Entries
Want to minimize SSE for unseen test data.
Idea: Minimize SSE on training data.
- Want large f (# of factors) to capture all the signals
- But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
- Allow a rich model where there are sufficient data
- Shrink aggressively where data are scarce

$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i p_x^T \right)^2 + \lambda \left[ \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right]$

The first term is the "error", the second the "length"; λ is the regularization parameter.
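As a sketch, this training objective translates almost line-for-line into code. Names like `ratings`, a list of known (x, i, r_xi) triples, are assumptions for illustration:

```python
import numpy as np

def regularized_sse(ratings, Q, P, lam):
    """'error' + lambda * 'length' over the known training ratings.
    ratings: list of (x, i, r_xi) triples; Q: items x f; P: users x f."""
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    length = lam * ((P ** 2).sum() + (Q ** 2).sum())
    return error + length
```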
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] team)
[Figure: movies and two users, Gus and Dave, placed in the same latent space. Axis 1: geared towards females vs. males; axis 2: serious vs. escapist. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 64
The Effect of Regularization
[Figure, redrawn with small changes on slides 64-67: the latent factor plot (Factor 1: geared towards females vs. males; Factor 2: serious vs. funny); The Princess Diaries shifts position from frame to frame as the regularization takes effect]

$\min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"} = \min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i p_x^T \right)^2 + \lambda \left[ \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right]$
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:

$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i p_x^T \right)^2 + \lambda \left[ \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right]$

Gradient descent:
- Initialize P and Q (e.g., using SVD, pretending the missing ratings are 0)
- Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: $\nabla Q = [\nabla q_{if}]$ and $\nabla q_{if} = \sum_{x,i} -2 \left( r_{xi} - q_i p_x^T \right) p_{xf} + 2 \lambda q_{if}$
  Here $q_{if}$ is entry f of row $q_i$ of matrix Q.

How to compute the gradient of a matrix? Compute the gradient of every element independently!
Observation: Computing gradients is slow (see the sketch below).
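A sketch of one batch step under these formulas, using the same assumed (x, i, r_xi) data layout as before. Note that every known rating is visited just to form one full gradient, which is why this is slow:

```python
import numpy as np

def batch_gd_step(ratings, Q, P, lam, eta):
    """One full (batch) gradient-descent step on the regularized SSE."""
    grad_Q = 2 * lam * Q          # regularization part: 2*lambda*q_if
    grad_P = 2 * lam * P
    for x, i, r in ratings:       # sum over ALL known ratings -> slow
        err = r - Q[i] @ P[x]
        grad_Q[i] += -2 * err * P[x]
        grad_P[x] += -2 * err * Q[i]
    return Q - eta * grad_Q, P - eta * grad_P
```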
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: $\nabla Q = [\nabla q_{if}]$, where

$\nabla q_{if} = \sum_{x,i} -2 \left( r_{xi} - q_i p_x^T \right) p_{xf} + 2 \lambda q_{if} = \sum_{x,i} \nabla Q(r_{xi})$

Here $q_{if}$ is entry f of row $q_i$ of matrix Q, so a full GD step is $Q \leftarrow Q - \eta \sum_{x,i} \nabla Q(r_{xi})$.

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
- GD: $Q \leftarrow Q - \eta \sum_{r_{xi}} \nabla Q(r_{xi})$
- SGD: $Q \leftarrow Q - \eta \, \nabla Q(r_{xi})$

Faster convergence! Need more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD
Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
- Initialize P and Q (e.g., using SVD, pretending the missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each $r_{xi}$:
  $\varepsilon_{xi} = r_{xi} - q_i \cdot p_x^T$ (derivative of the "error")
  $q_i \leftarrow q_i + \eta \left( \varepsilon_{xi} \, p_x - \lambda \, q_i \right)$ (update equation)
  $p_x \leftarrow p_x + \eta \left( \varepsilon_{xi} \, q_i - \lambda \, p_x \right)$ (update equation)

2 for loops:
- For until convergence:
  - For each $r_{xi}$: compute the gradient, do a "step"

η … learning rate
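Putting the two loops and the update equations together, a minimal SGD trainer might look like the sketch below. Random initialization is used for brevity instead of the SVD-based initialization on the slide, and the hyperparameter defaults are placeholders:

```python
import numpy as np

def sgd_train(ratings, n_items, n_users, f, eta=0.01, lam=0.1, epochs=20):
    """Fit the latent factor model with SGD; ratings = [(x, i, r_xi), ...]."""
    rng = np.random.default_rng(0)
    Q = rng.normal(scale=0.1, size=(n_items, f))
    P = rng.normal(scale=0.1, size=(n_users, f))
    for _ in range(epochs):            # "for until convergence"
        for x, i, r in ratings:        # "for each r_xi"
            err = r - Q[i] @ P[x]      # eps_xi
            q_old = Q[i].copy()        # update q_i and p_x simultaneously
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * q_old - lam * P[x])
    return Q, P
```

For example, `sgd_train([(4, 0, 5.0), (7, 2, 3.0)], n_items=6, n_users=12, f=3)` would return fitted 6 × 3 and 12 × 3 factor matrices.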
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations.
- Quantify goodness using SSE: lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen

Let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope that these values of P and Q will also predict the unknown ratings well.

This is a case where we apply optimization methods.
[Figure: the utility matrix with known and missing ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
- Training data:
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
- Test data:
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition:
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement
[Tables: sample (user, movie, score) triples from the training data and the test data (test scores hidden)]
Movie rating data
Overall rating distribution:
- A third of ratings are 4s
- Average rating is 3.68
# ratings per movie:
- Avg #ratings/movie: 5,627
# ratings per user:
- Avg #ratings/user: 208
Average movie rating by movie count:
- More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
- Size of data:
  - Scalability
  - Keeping data in memory
- Missing data:
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g., movie 16322)
- Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions

$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$

(overall mean + user bias + movie bias + user-movie interaction)
- μ = overall mean rating
- bx = bias of user x
- bi = bias of movie i

User-movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
- Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5

Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:

$\min_{Q,P,b} \sum_{(x,i) \in R} \left( r_{xi} - (\mu + b_x + b_i + q_i p_x^T) \right)^2 + \lambda \left[ \sum_i \lVert q_i \rVert^2 + \sum_x \lVert p_x \rVert^2 + \sum_x b_x^2 + \sum_i b_i^2 \right]$

The first term measures goodness of fit; the second is the regularization. λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them).
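The same SGD machinery extends to the biased model. A sketch of one update, with the bias terms now learned alongside the factors (Q, P, b_user, b_item are NumPy arrays as before; the exact update order and per-parameter learning rates vary across implementations, so this is not BellKor's exact code):

```python
def sgd_step_biased(x, i, r, mu, b_user, b_item, Q, P, eta, lam):
    """One SGD update for r_hat = mu + b_x + b_i + q_i . p_x^T."""
    err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (err - lam * b_user[x])   # user bias update
    b_item[i] += eta * (err - lam * b_item[i])   # movie bias update
    q_old = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])      # factor updates as before
    P[x] += eta * (err * q_old - lam * P[x])
```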
RECSYS 106
BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns
[Diagram: global effects → factorization → collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000, log scale) for three model families: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004):
- Improvements in Netflix GUI
- Meaning of rating changed

Movie age:
- Users prefer new movies, without any reason
- Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$

Add time dependence to the biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^T$
Make the parameters bx and bi depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
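One simple way to realize the binned, time-dependent item bias bi(t) is sketched below. The static-plus-per-bin-offset decomposition is an assumption for illustration; the slide only fixes the 10-week bins:

```python
def item_bias_at(i, t_weeks, b_item, b_item_bin, weeks_per_bin=10):
    """Sketch of b_i(t) = b_i + b_{i,Bin(t)}: a static item bias plus a
    learned offset for the 10-week bin that week t falls into."""
    return b_item[i] + b_item_bin[i][t_weeks // weeks_per_bin]
```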
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, range 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 17
Content-based Recommendations Main idea Recommend items to customer x similar to
previous items rated highly by x
Example
Movie recommendations Recommend movies with same actor(s)
director genre hellip
Websites blogs news Recommend other sites with ldquosimilarrdquo content
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 18
Plan of Action
likes
Item profiles
RedCircles
Triangles
User profile
match
recommendbuild
1162019
RECSYS 19
Item Profiles For each item create an item profile
Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document
How to pick important features Usual heuristic from text mining is TF-IDF
(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item
1162019
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: users (Gus, Dave) and movies embedded in the same two-factor plane ("geared towards females" vs. "geared towards males"; "serious" vs. "escapist"); each user is matched to nearby movies. Titles shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 64

The Effect of Regularization

min_factors "error" + λ · "length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)^2 + λ (Σ_x ‖p_x‖^2 + Σ_i ‖q_i‖^2)

[Figure, shown over four animation frames on slides 64-67: movies on the two-factor plane (Factor 1: "geared towards females" to "geared towards males"; Factor 2: "serious" to "funny"). As regularization takes effect, the estimate for a sparsely rated movie such as The Princess Diaries is shrunk toward the origin, while well-supported movies barely move.]
RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)^2 + λ (Σ_x ‖p_x‖^2 + Σ_i ‖q_i‖^2)

Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Do gradient descent:
• P ← P − η · ∇P
• Q ← Q − η · ∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
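As a concrete sketch (not from the slides; the names R, Q, P, lam, eta and the nan-masking are illustrative assumptions), one batch step of this gradient descent in numpy:

```python
import numpy as np

def gd_step(R, Q, P, lam, eta):
    """One batch gradient-descent step on the regularized SSE.

    R: items x users, np.nan marks missing; Q: items x f; P: users x f.
    Returns the updated (Q, P).
    """
    mask = ~np.isnan(R)
    err = np.where(mask, R - Q @ P.T, 0.0)   # residual on known entries
    grad_Q = -2 * err @ P + 2 * lam * Q      # gradient of error + penalty
    grad_P = -2 * err.T @ Q + 2 * lam * P    # symmetric role for P
    return Q - eta * grad_Q, P - eta * grad_P
```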
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

So: Q ← Q − η ∇Q = Q − η Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

GD: Q ← Q − η Σ_{r_xi} ∇Q(r_xi)
SGD: Q ← Q − η ∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster
RECSYS 73

SGD vs. GD Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD. GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!]
RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
• ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
• q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
• p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

2 for loops: For until convergence:
• For each r_xi: compute the gradient, do a "step"

η … learning rate
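A minimal sketch of this SGD loop (illustrative helper; ratings are assumed stored as (i, x, r) triples, and random initialization is used here for brevity instead of the SVD-based one):

```python
import numpy as np

def sgd_train(ratings, n_items, n_users, f=10, eta=0.01, lam=0.1, epochs=20):
    """SGD for R ~ Q P^T, following the update equations above.

    ratings: list of (i, x, r_xi) triples of known ratings.
    """
    rng = np.random.default_rng(0)
    Q = rng.normal(scale=0.1, size=(n_items, f))
    P = rng.normal(scale=0.1, size=(n_users, f))
    for _ in range(epochs):                    # "for until convergence"
        for i, x, r in ratings:                # "for each r_xi"
            eps = r - Q[i] @ P[x]              # error on this rating
            qi, px = Q[i].copy(), P[x].copy()
            Q[i] += eta * (eps * px - lam * qi)   # update q_i
            P[x] += eta * (eps * qi - lam * px)   # update p_x
    return Q, P
```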
RECSYS 75

[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]
RECSYS 76
Summary: Recommendations via Optimization

Goal: Make good recommendations Quantify goodness using SSE: lower SSE means better recommendations We want to make good recommendations on items that some user has not yet seen So: let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well

This is a case where we apply Optimization methods

[Sketch: sparse utility matrix of known ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples for the training data and the test data, with scores withheld in the test set]
Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg #ratings/movie: 5627

# ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

[Histograms of the above distributions]
Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction)

User-Movie interaction: Characterizes the matching between users and movies Attracts most research in the field Benefits from algorithmic and mathematical innovations

Baseline predictor: Separates users and movies Benefits from insights into user's behavior Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example: Mean rating: μ = 3.7 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5 Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2
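As a tiny sketch (toy numbers from this slide; the function name is illustrative), the biased predictor in code:

```python
def predict(mu, b_x, b_i, q_i=None, p_x=None):
    """Baseline + interaction prediction: r_hat = mu + b_x + b_i + q_i . p_x."""
    interaction = 0.0
    if q_i is not None and p_x is not None:
        interaction = sum(qf * pf for qf, pf in zip(q_i, p_x))
    return mu + b_x + b_i + interaction

# The Star Wars example from the slide: 3.7 - 1 + 0.5 = 3.2
print(predict(mu=3.7, b_x=-1.0, b_i=0.5))   # -> 3.2
```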
RECSYS 105
Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))^2   [goodness of fit]
  + λ (Σ_i ‖q_i‖^2 + Σ_x ‖p_x‖^2 + Σ_x b_x^2 + Σ_i b_i^2)   [regularization]

λ is selected via grid-search on a validation set

Stochastic gradient descent to find parameters Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
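For concreteness, a hedged sketch of the SGD updates for this extended objective, following the same pattern as the earlier plain-SGD loop (it assumes the Q, P, eta, lam names from that sketch plus bias arrays bx, bi and the global mean mu; these are the standard update rules for this objective, not copied from the slides):

```python
# One SGD pass over the ratings for the biased model.
for i, x, r in ratings:
    err = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])   # residual for this rating
    bx[x] += eta * (err - lam * bx[x])             # user bias update
    bi[i] += eta * (err - lam * bi[i])             # movie bias update
    qi, px = Q[i].copy(), P[x].copy()
    Q[i] += eta * (err * px - lam * qi)            # factor updates as before
    P[x] += eta * (err * qi - lam * px)
```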
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge!

Multi-scale modeling of the data: Combine top-level, "regional" modeling of the data with a refined local view:
• Global: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns

[Diagram: Global effects → Factorization → Collaborative filtering]
RECSYS 107
Performance of Various Methods

[Plot: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000, log scale), for three curves: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89

Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): Improvements in Netflix GUI Meaning of rating changed

Movie age: Users prefer new movies without any reason Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time: (1) Parameterize time-dependence by linear trends (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
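A sketch of one possible parameterization along these lines, loosely following the cited KDD '09 paper (the bin width, the signed-power deviation dev_x(t), and beta = 0.4 are assumptions taken from that paper, not from the slides):

```python
import math

def movie_bias(t, b_i, b_i_bin, bin_size_days=70):
    """b_i(t) = b_i + b_{i,Bin(t)}: static movie bias plus a per-bin
    offset, each bin covering ~10 consecutive weeks (70 days)."""
    return b_i + b_i_bin[t // bin_size_days]

def user_bias(t, b_x, alpha_x, t_mean, beta=0.4):
    """b_x(t) = b_x + alpha_x * dev_x(t), a linear trend in the
    signed-power deviation dev_x(t) = sign(t - t_mean) * |t - t_mean|^beta."""
    dev = math.copysign(abs(t - t_mean) ** beta, t - t_mean)
    return b_x + alpha_x * dev
```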
RECSYS 111
Adding Temporal Effects

[Plot: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89
Latent factors+Biases+Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size

[Plot: error (RMSE, 0.87 to 0.8925) vs. number of predictors in the ensemble (1 to 50)]
RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj (note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores
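A small sketch of this scheme in Python (function and variable names are illustrative):

```python
import math
from collections import Counter

def tfidf_profiles(docs):
    """Build TF-IDF doc profiles for a list of token lists.

    Returns one {term: w_ij} dict per doc, following the formulas above.
    """
    N = len(docs)
    # n_i: number of docs that mention term i
    n = Counter(term for doc in docs for term in set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)             # f_ij: raw term counts in this doc
        max_f = max(f.values())      # normalizer, discounts "long" docs
        profiles.append({
            term: (cnt / max_f) * math.log(N / n[term])
            for term, cnt in f.items()
        })
    return profiles
```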
RECSYS 21

User Profiles and Prediction

User profile possibilities: Weighted average of rated item profiles Variation: weight by difference from average rating for item …

Prediction heuristic: Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (|x| · |i|)
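A minimal sketch of the prediction heuristic over sparse profiles stored as {feature: weight} dicts (names are illustrative):

```python
import math

def cosine(x, i):
    """u(x, i) = (x . i) / (|x| |i|) for sparse {feature: weight} profiles."""
    dot = sum(w * i.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ni = math.sqrt(sum(w * w for w in i.values()))
    return dot / (nx * ni) if nx and ni else 0.0

# The user profile x itself can be built as a weighted average of the
# TF-IDF profiles of the items the user rated.
```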
RECSYS 22

Pros: Content-based Approach

+ No need for data on other users: No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: No first-rater problem
+ Able to provide explanations: Can provide explanations of recommended items by listing content-features that caused an item to be recommended
RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard E.g., images, movies, music
– Overspecialization Never recommends items outside user's content profile People might have multiple interests Unable to exploit quality judgments of other users
– Recommendations for new users How to build a user profile?
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant Domain-free: user/item attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let r_x be the vector of user x's ratings

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|, treating r_x, r_y as sets, e.g., r_x = {1, 4, 5}, r_y = {1, 3, 4} Problem: Ignores the value of the rating

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (‖r_x‖ · ‖r_y‖), treating r_x, r_y as points, e.g., r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0] Problem: Treats missing ratings as "negative"

Pearson correlation coefficient: sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)^2) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)^2) ), where S_xy = items rated by both users x and y, and r̄_x, r̄_y … avg rating of x, y
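A compact sketch of the three measures, with each user's ratings stored as an {item: rating} dict (illustrative; this Pearson variant centers by the user's mean rating and sums over the co-rated items S_xy):

```python
import math

def jaccard(rx, ry):
    """Jaccard over the sets of rated items (ignores rating values)."""
    sx, sy = set(rx), set(ry)
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    """Cosine over rating vectors; missing ratings enter as 0."""
    dot = sum(rx.get(i, 0) * ry.get(i, 0) for i in set(rx) | set(ry))
    return dot / (math.hypot(*rx.values()) * math.hypot(*ry.values()))

def pearson(rx, ry):
    """Pearson: center each user's ratings by their mean, sum over S_xy."""
    common = set(rx) & set(ry)
    mx = sum(rx.values()) / len(rx)
    my = sum(ry.values()) / len(ry)
    num = sum((rx[s] - mx) * (ry[s] - my) for s in common)
    den = (math.sqrt(sum((rx[s] - mx) ** 2 for s in common))
           * math.sqrt(sum((ry[s] - my) ** 2 for s in common)))
    return num / den if den else 0.0
```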
RECSYS 29
Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322 Considers missing ratings as "negative" Solution: subtract the (row) mean

After centering, cosine sim(A, B) vs. sim(A, C): 0.092 > −0.559 Notice: cosine sim is correlation when data is centered at 0
RECSYS 30
Rating Predictions

Let r_x be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
• Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi
• Similarity-weighted average: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy (shorthand: s_xy = sim(x, y))

Other options? Many other tricks possible…
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

So far: User-user collaborative filtering

Another view: Item-item For item i, find other similar items Estimate rating for item i based on the target user's ratings on items similar to item i Can use same similarity metrics and prediction functions as in user-user model

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i; x) … set of items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)

movies × users utility matrix (blank = unknown rating; known ratings are between 1 and 5):

          u1 u2 u3 u4 u5 u6 u7 u8 u9 u10 u11 u12
movie 1:   1     3        5        5       4
movie 2:         5  4        4        2    1   3
movie 3:   2  4     1  2     3     4   3   5
movie 4:      2  4     5        4          2
movie 5:         4  3  4  2                2   5
movie 6:   1     3     3        2          4
RECSYS 33
Item-Item CF (|N|=2)

[Same matrix; the goal is to estimate the rating of movie 1 by user 5, i.e., the empty cell at (movie 1, user 5)]
RECSYS 34
Item-Item CF (|N|=2)

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i, e.g., m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

sim(1, m) for m = 1..6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59
RECSYS 35
Item-Item CF (|N|=2)

[Same matrix and sim(1, m) column as above]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59 (movies 3 and 6 are the most similar movies to movie 1 that user 5 rated)
RECSYS 36
Item-Item CF (|N|=2)

Predict by taking the weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
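The whole worked example fits in a few lines of numpy; this sketch (illustrative code, not from the slides) reproduces the r_15 = 2.6 estimate:

```python
import numpy as np

# movies x users matrix from the slides; np.nan = unknown rating
R = np.array([
    [1, np.nan, 3, np.nan, np.nan, 5, np.nan, np.nan, 5, np.nan, 4, np.nan],
    [np.nan, np.nan, 5, 4, np.nan, np.nan, 4, np.nan, np.nan, 2, 1, 3],
    [2, 4, np.nan, 1, 2, np.nan, 3, np.nan, 4, 3, 5, np.nan],
    [np.nan, 2, 4, np.nan, 5, np.nan, np.nan, 4, np.nan, np.nan, 2, np.nan],
    [np.nan, np.nan, 4, 3, 4, 2, np.nan, np.nan, np.nan, np.nan, 2, 5],
    [1, np.nan, 3, np.nan, 3, np.nan, np.nan, 2, np.nan, np.nan, 4, np.nan],
])

def pearson_rows(a, b):
    """Center each row by its own mean over known entries, then cosine."""
    ac = np.where(np.isnan(a), 0, a - np.nanmean(a))
    bc = np.where(np.isnan(b), 0, b - np.nanmean(b))
    return ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

x = 4                                   # user 5 (0-indexed)
sims = np.array([pearson_rows(R[0], R[j]) for j in range(6)])
rated = [j for j in range(1, 6) if not np.isnan(R[j, x])]
top2 = sorted(rated, key=lambda j: sims[j], reverse=True)[:2]  # movies 3, 6
r_15 = sum(sims[j] * R[j, x] for j in top2) / sum(sims[j] for j in top2)
print(round(r_15, 1))                   # -> 2.6
```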
RECSYS 37
Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j

Select K nearest neighbors (KNN) N(i; x): Set of items most similar to item i that were rated by x

Estimate rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
RECSYS 38
Item-Item vs User-User

[Example utility matrix (Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates) with estimated entries]

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes
RECSYS 39
Pros/Cons of Collaborative Filtering

+ Works for any kind of item: No feature selection needed
- Cold Start: Need enough users in the system to find a match
- Sparsity: The user/ratings matrix is sparse Hard to find users that have rated the same items
- First rater: Cannot recommend an item that has not been previously rated New items, Esoteric items
- Popularity bias: Cannot recommend items to someone with unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Utility matrix (movies × users) with known ratings 1-5]
RECSYS 43
Evaluation
[Same matrix; a subset of the known entries is withheld as the Test Data Set]
RECSYS 44
Evaluating Predictions Compare predictions with known ratings

• Root-mean-square error (RMSE): sqrt( (1/N) Σ_{(x,i)} (r̂_xi − r_xi)^2 ), where r̂_xi is the predicted and r_xi the true rating of x on i
• Precision at top 10: % of those in top 10
• Rank Correlation: Spearman's correlation between system's and user's complete rankings
• Another approach: 0/1 model
  – Coverage: Number of items/users for which the system can make predictions
  – Precision = TP / (TP + FP); Accuracy = (TP+TN) / (TP + FP + TN + FN)
  – Receiver Operating Characteristic (ROC) Curve: Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
    TPR (aka Recall) = TP / P = TP / (TP+FN); FPR = FP / N = FP / (FP + TN)
    See https://en.wikipedia.org/wiki/Precision_and_recall
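A quick sketch of the two most common metrics (illustrative helpers):

```python
import math

def rmse(pairs):
    """RMSE over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top = ranked_items[:k]
    return sum(1 for item in top if item in relevant) / len(top)
```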
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes misses the point: Prediction Diversity, Prediction Context, Order of predictions

In practice, we only care to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others
RECSYS 46
Collaborative Filtering: Complexity

Expensive step is finding the k most similar customers: O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as an approximation

Naïve pre-computation takes time O(N · |C|) (|C| = # of clusters = k in the k-means; N = # of data points)

We already know how to do this: Near-neighbor search in high dimensions (LSH), Clustering, Dimensionality reduction
RECSYS 47
Tip: Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data, e.g., add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies on a two-factor plane ("geared towards females" vs. "geared towards males"; "serious" vs. "funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50

[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]
RECSYS 51
The Netflix Utility Matrix R

[Sparse ratings matrix R: ~480,000 users × 17,700 movies, entries 1-5]
RECSYS 52
Utility Matrix R: Evaluation

[Matrix R (480,000 users × 17,700 movies) split into a Training Data Set and a withheld Test Data Set; r̂_36 marks a predicted test entry]

SSE = Σ_{(x,i)∈Test} (r̂_xi − r_xi)^2, where r̂_xi is the predicted and r_xi the true rating of user x on item i
RECSYS 53
Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Matrix picture: R (items × users) ≈ Q (items × f) · PT (f × users); compare SVD: A = U Σ V^T]
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 20
Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j
ni = number of docs that mention term iN = total number of docs
TF-IDF score wij = TFij times IDFi
Doc profile = set of words with highest TF-IDF scores together with their scores
1162019
Note we normalize TFto discount for ldquolongerrdquo documents
RECSYS 21
User Profiles and Prediction
User profile possibilities Weighted average of rated item profiles Variation weight by difference from average
rating for item hellip
Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946
| 119961119961 |sdot| 119946119946 |
1162019
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
The Effect of Regularization
min_factors "error" + λ·"length" (same objective as above)
[Figure: the same 2-D factor plot; as λ grows, sparsely rated movies such as The Princess Diaries are pulled toward the origin]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:
min_{P,Q} Σ_{training} (r_xi - q_i·p_x^T)^2 + λ [Σ_x ||p_x||^2 + Σ_i ||q_i||^2]
Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P - η·∇P
  Q ← Q - η·∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
RECSYS 69
Digression: the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q:
min_{P,Q} Σ_{training} (r_xi - q_i·p_x^T)^2 + λ [Σ_x ||p_x||^2 + Σ_i ||q_i||^2]
Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P - η·∇P
  Q ← Q - η·∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
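As a sketch, the batch update above might look as follows in NumPy (toy data; the learning rate, λ, and iteration count are assumed values, and the SVD initialization is replaced by small random factors for brevity):

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4, 5, 0, 3, 1],
              [0, 4, 4, 0, 5],
              [2, 0, 3, 4, 0],
              [5, 3, 0, 0, 2]], dtype=float)       # toy ratings; 0 = missing (assumption)
known = R > 0
f, eta, lam = 2, 0.01, 0.1                         # factors, learning rate, lambda (assumed)
P = rng.normal(scale=0.1, size=(R.shape[0], f))
Q = rng.normal(scale=0.1, size=(R.shape[1], f))

for step in range(500):
    E = np.where(known, R - P @ Q.T, 0.0)          # errors on known ratings, 0 elsewhere
    grad_P = -2 * E @ Q + 2 * lam * P              # every entry's gradient, computed at once
    grad_Q = -2 * E.T @ P + 2 * lam * Q
    P -= eta * grad_P                              # P <- P - eta * grad_P
    Q -= eta * grad_Q                              # Q <- Q - eta * grad_Q

print(np.sum((R - P @ Q.T)[known] ** 2))           # SSE on known ratings falls as we iterate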
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q
So Q ← Q - η·∇Q = Q - η Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
• GD: Q ← Q - η Σ_{x,i} ∇Q(r_xi)
• SGD: Q ← Q - η·∇Q(r_xi)
Faster convergence!
• Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
[Figure: value of the objective function vs. iteration/step; GD decreases smoothly, SGD decreases noisily]
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
  ε_xi = r_xi - q_i·p_x^T  (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x - λ q_i)  (update equation)
  p_x ← p_x + η (ε_xi q_i - λ p_x)  (update equation)
2 for loops:
• For until convergence:
  For each r_xi: compute gradient, do a "step"
η … learning rate
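A minimal sketch of this loop (same toy setup and assumed hyperparameters as the earlier sketches; not the contest code):

import random
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4, 5, 0, 3, 1],
              [0, 4, 4, 0, 5],
              [2, 0, 3, 4, 0],
              [5, 3, 0, 0, 2]], dtype=float)     # toy ratings; 0 = missing (assumption)
pairs = list(zip(*np.nonzero(R)))                # the observed (user, item) pairs
f, eta, lam = 2, 0.02, 0.1                       # assumed hyperparameters
P = rng.normal(scale=0.1, size=(R.shape[0], f))  # user factors, one row p_x per user
Q = rng.normal(scale=0.1, size=(R.shape[1], f))  # item factors, one row q_i per item

for epoch in range(50):                          # iterate over the ratings multiple times
    random.shuffle(pairs)                        # visit ratings in random order
    for x, i in pairs:
        err = R[x, i] - P[x] @ Q[i]              # eps_xi = r_xi - q_i . p_x^T
        Q[i] += eta * (err * P[x] - lam * Q[i])  # update equation for q_i
        P[x] += eta * (err * Q[i] - lam * P[x])  # update equation for p_x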
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary Recommendations via Optimization
Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
Let's set values for P and Q such that they work well on known (user, item) ratings,
and hope these values for P and Q will predict the unknown ratings well
This is a case where we apply Optimization methods
[Figure: a small users × movies rating matrix with known and unknown entries]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement
[Tables: sample training data (user, movie, score) triples and test data (user, movie, ?) triples]
Movie rating data: Training data / Test data
Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
#ratings per movie
• Avg. #ratings/movie: 5,627
#ratings per user
• Avg. #ratings/user: 208
Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)
Most loved movies
Count / Avg rating / Title
137,812  4.593  The Shawshank Redemption
133,597  4.545  Lord of the Rings: The Return of the King
180,883  4.306  The Green Mile
150,676  4.460  Lord of the Rings: The Two Towers
139,050  4.415  Finding Nemo
117,456  4.504  Raiders of the Lost Ark
180,736  4.299  Forrest Gump
147,932  4.433  Lord of the Rings: The Fellowship of the Ring
149,199  4.325  The Sixth Sense
144,027  4.333  Indiana Jones and the Last Crusade
Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂_xi = μ + b_x + b_i + q_i·p_x^T
 μ = overall mean rating; b_x = bias of user x (user bias); b_i = bias of movie i (movie bias); q_i·p_x^T = user-movie interaction
Baseline predictor
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
User-Movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = -1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
= 3.7 - 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:
min_{Q,P,b} Σ_{(x,i)∈R} (r_xi - (μ + b_x + b_i + q_i·p_x^T))^2   [goodness of fit]
          + λ (Σ_i ||q_i||^2 + Σ_x ||p_x||^2 + Σ_x b_x^2 + Σ_i b_i^2)   [regularization]
λ is selected via grid-search on a validation set
Stochastic gradient descent to find the parameters
• Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
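A sketch of fitting the biased model with the same per-rating updates as before; the bias update equations below mirror the factor updates (a standard pattern, assumed here rather than spelled out on the slide), and all data and hyperparameters are toy values:

import random
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4, 5, 0, 3, 1],
              [0, 4, 4, 0, 5],
              [2, 0, 3, 4, 0],
              [5, 3, 0, 0, 2]], dtype=float)     # toy ratings; 0 = missing (assumption)
pairs = list(zip(*np.nonzero(R)))
f, eta, lam = 2, 0.02, 0.1                       # assumed hyperparameters
mu = R[R > 0].mean()                             # overall mean rating
b_user = np.zeros(R.shape[0])                    # b_x: one bias per user
b_item = np.zeros(R.shape[1])                    # b_i: one bias per movie
P = rng.normal(scale=0.1, size=(R.shape[0], f))
Q = rng.normal(scale=0.1, size=(R.shape[1], f))

for epoch in range(50):
    random.shuffle(pairs)
    for x, i in pairs:
        pred = mu + b_user[x] + b_item[i] + P[x] @ Q[i]
        err = R[x, i] - pred
        b_user[x] += eta * (err - lam * b_user[x])   # bias updates follow the
        b_item[i] += eta * (err - lam * b_item[i])   # same pattern as the factors
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])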
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge!
Multi-scale modeling of the data: combine a top-level, "regional" view of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed
Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones
(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x + b_i + q_i·p_x^T
Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
• Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors:
• p_x(t) … user preference vector on day t
(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)
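One way to realize the binned movie bias b_i(t) mentioned above, as a sketch (the helper, array shapes, and sizes are illustrative assumptions, not Koren's exact model):

import numpy as np

BIN_DAYS = 70                                   # each bin = 10 consecutive weeks

def item_bias(i, t_days, b_item, b_item_bin):
    # b_i(t) = b_i + b_{i,Bin(t)}: a static movie bias plus a per-bin offset
    return b_item[i] + b_item_bin[i, t_days // BIN_DAYS]

n_items = 5                                     # toy sizes (assumed)
n_bins = (6 * 365) // BIN_DAYS + 1              # enough bins for ~6 years of ratings
b_item = np.zeros(n_items)                      # both arrays are fit like any other parameter
b_item_bin = np.zeros((n_items, n_bins))
print(item_bias(2, 400, b_item, b_item_bin))    # bias of movie 2 on day 400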
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10,000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Figure: error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50); RMSE falls as predictors are added]
RECSYS 21
User Profiles and Prediction
User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from the average rating for the item
• …
Prediction heuristic: given user profile x and item profile i, estimate
u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
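A small sketch of the profile-building and cosine heuristic (the item features, the ratings, and the weighting scheme shown are made up for illustration):

import numpy as np

# Toy item profiles: rows = items, cols = boolean features (assumed features)
item_profiles = np.array([[1, 0, 1],
                          [1, 1, 0],
                          [0, 1, 1]], dtype=float)
user_ratings = {0: 5.0, 1: 1.0}            # items this user has rated

# User profile: rating-weighted average of the rated items' profiles
w = np.array(list(user_ratings.values()))
x = (w[:, None] * item_profiles[list(user_ratings)]).sum(axis=0) / w.sum()

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = item_profiles[2]                       # an unseen item
print(f"u(x, i) = cos(x, i) = {cos(x, i):.3f}")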
RECSYS 22
Pros: Content-based Approach
+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended
RECSYS 23
Cons: Content-based Approach
– Finding the appropriate features is hard
 E.g., images, movies, music
– Overspecialization
 Never recommends items outside the user's content profile
 People might have multiple interests
 Unable to exploit quality judgments of other users
– Recommendations for new users
 How to build a user profile?
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering: recommend items based on past transactions of users
Analyze relations between users and/or items
Specific data characteristics are irrelevant
• Domain-free: user/item attributes are not necessary
• Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N
RECSYS 28
Similar Users
Let r_x be the vector of user x's ratings
Jaccard similarity measure: treat r_x, r_y as sets, e.g. r_x = {1, 4, 5}, r_y = {1, 3, 4}
• Problem: ignores the value of the rating
Cosine similarity measure: treat r_x, r_y as points, e.g. r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
 sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)
• Problem: treats missing ratings as "negative"
Pearson correlation coefficient:
 sim(x, y) = Σ_{s∈S_xy} (r_xs - r̄_x)(r_ys - r̄_y) / [ sqrt(Σ_{s∈S_xy} (r_xs - r̄_x)^2) · sqrt(Σ_{s∈S_xy} (r_ys - r̄_y)^2) ]
 S_xy = items rated by both users x and y; r̄_x, r̄_y … avg. rating of x, y
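The three measures as code, using the slide's two example vectors in their "points" form, with 0 marking an unrated item (a sketch; the Pearson variant below centers each user by their mean over rated items, as done on the item-item slides later):

import numpy as np

def jaccard(rx, ry):
    # Similarity of the *sets* of rated items; rating values are ignored
    sx, sy = set(np.nonzero(rx)[0]), set(np.nonzero(ry)[0])
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    # Treats missing ratings (0s) as if they were low scores
    return rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry))

def pearson(rx, ry):
    # Cosine after subtracting each user's mean rating, on co-rated items S_xy
    both = (rx > 0) & (ry > 0)
    cx = rx[both] - rx[rx > 0].mean()
    cy = ry[both] - ry[ry > 0].mean()
    return cx @ cy / (np.linalg.norm(cx) * np.linalg.norm(cy))

rx = np.array([1.0, 0, 0, 1, 3])
ry = np.array([1.0, 0, 2, 2, 0])
print(jaccard(rx, ry), cosine(rx, ry), pearson(rx, ry))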
RECSYS 29
Similarity Metric
Intuitively we want: sim(A, B) > sim(A, C)
Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322
• Considers missing ratings as "negative"
• Solution: subtract the (row) mean
Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > -0.559
• Notice: cosine similarity is correlation when the data is centered at 0
RECSYS 30
Rating Predictions
Let r_x be the vector of user x's ratings
Let N be the set of k users most similar to x who have rated item i
Prediction for item i of user x:
• Option 1: r̂_xi = (1/k) Σ_{y∈N} r_yi
• Option 2: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy  (shorthand: s_xy = sim(x, y))
Many other tricks possible…
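Both prediction options as a small helper (the neighborhood ratings and similarity weights below are hypothetical):

import numpy as np

def predict(ratings_i, sims):
    # Predictions for r_xi from the k most similar users N who rated item i:
    # a plain average vs. a similarity-weighted average
    ratings_i, sims = np.asarray(ratings_i, float), np.asarray(sims, float)
    plain = ratings_i.mean()                    # (1/k) * sum of r_yi
    weighted = sims @ ratings_i / sims.sum()    # sum s_xy * r_yi / sum s_xy
    return plain, weighted

# Hypothetical neighborhood: 3 similar users rated item i as 4, 5, 3
print(predict([4, 5, 3], sims=[0.9, 0.7, 0.4]))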
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
So far: User-user collaborative filtering
Another view: Item-item
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
 s_ij … similarity of items i and j
 r_xj … rating of user x on item j
 N(i; x) … set of items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
[Figure: utility matrix with 6 movies (rows) and 12 users (columns); shaded = unknown rating, numbers = ratings between 1 and 5]
RECSYS 33
Item-Item CF (|N|=2)
[Figure: the same utility matrix; goal: estimate the rating of movie 1 by user 5]
RECSYS 34
Item-Item CF (|N|=2)
Neighbor selection: identify movies similar to movie 1 that were rated by user 5
Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
sim(1, m) for movies m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59
RECSYS 35
Item-Item CF (|N|=2)
Compute similarity weights: s_13 = 0.41, s_16 = 0.59
RECSYS 36
Item-Item CF (|N|=2)
Predict by taking the weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity s_ij of items i and j
Select K nearest neighbors (KNN) N(i; x)
• Set of items most similar to item i that were rated by x
Estimate rating r_xi as the weighted average:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
 N(i; x) = set of items similar to item i that were rated by x
 s_ij = similarity of items i and j
 r_xj = rating of user x on item j
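This reproduces the worked example above for r_15, using the two neighbors of movie 1 that user 5 rated (s_1,3 = 0.41 with r_5,movie3 = 2, and s_1,6 = 0.59 with r_5,movie6 = 3):

# Neighborhood N(1; 5): movies 3 and 6
sims = [0.41, 0.59]
ratings = [2, 3]

r_15 = sum(s * r for s, r in zip(sims, ratings)) / sum(sims)
print(round(r_15, 1))   # 2.6, matching the slide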
RECSYS 38
Item-Item vs. User-User
[Figure: a small Avatar / LOTR / Matrix / Pirates utility matrix for users Alice, Bob, Carol, David]
In practice, it has been observed that item-item often works better than user-user
Why? Items are simpler; users have multiple tastes
RECSYS 39
Pros/Cons of Collaborative Filtering
+ Works for any kind of item
 No feature selection needed
- Cold Start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
Implement two or more different recommenders and combine predictions
• Perhaps using a linear model
Add content-based methods to collaborative filtering
• Item profiles for the new item problem
• Demographics to deal with the new user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Figure: users × movies utility matrix with scattered known ratings]
RECSYS 43
Evaluation
[Figure: the same utility matrix with some known ratings withheld as a test data set]
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
• Root-mean-square error (RMSE): sqrt( Σ_xi (r̂_xi - r_xi)^2 / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
• Precision at top 10: % of those in the top 10
• Rank Correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) Curve: Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
 TPR (aka Recall) = TP / P = TP / (TP + FN)
 FPR = FP / N = FP / (FP + TN)
 See https://en.wikipedia.org/wiki/Precision_and_recall
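Minimal sketches of two of these metrics (the inputs are made-up examples, and the relevance judgments are hypothetical):

import numpy as np

def rmse(pred, truth):
    # Root-mean-square error over the held-out (x, i) ratings
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.sqrt(np.mean((pred - truth) ** 2))

def precision_at_k(recommended, relevant, k=10):
    # Fraction of the top-k recommendations that are relevant
    return len(set(recommended[:k]) & set(relevant)) / k

print(rmse([3.5, 2.0, 4.1], [4, 2, 5]))
print(precision_at_k(list(range(10)), relevant={1, 3, 11}, k=10))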
RECSYS 45
Problems with Error Measures
Narrow focus on accuracy sometimes misses the point:
• Prediction Diversity
• Prediction Context
• Order of predictions
In practice, we care only to predict high ratings
• RMSE might penalize a method that does well for high ratings and badly for others
RECSYS 46
Collaborative Filtering: Complexity
The expensive step is finding the k most similar customers: O(|X|)
• Recall that X = set of customers in the system
Too expensive to do at runtime
• Could pre-compute using clustering as an approximation
• Naïve pre-computation takes time O(N·|C|), where |C| = # of clusters (= k in k-means) and N = # of data points
We already know how to do this:
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction
RECSYS 47
Tip: Add Data
Leverage all the data
• Don't try to reduce data size in an effort to make fancy algorithms work
• Simple methods on large data do best
Add more data
• e.g., add IMDB data on genres
More data beats better algorithms
(http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[Figure: movies placed in a 2-D latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny"; movies shown include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 51
The Netflix Utility Matrix R
[Figure: the matrix R, ~480,000 users × 17,700 movies, with sparse known ratings]
RECSYS 52
Utility Matrix R: Evaluation
SSE = Σ_{(x,i)∈R} (r̂_xi - r_xi)^2
 r̂_xi: predicted rating; r_xi: true rating of user x on item i
[Figure: the 480,000 users × 17,700 movies matrix R as the training data set, with some entries, e.g. r_3,6, withheld as the test data set]
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · P^T
For now, let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T
• R has missing entries, but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
[Figure: R (users × items) ≈ Q (items × f factors) · P^T (f factors × users); cf. SVD: A = U Σ V^T]
RECSYS 54
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 q_i = row i of Q; p_x = column x of P^T
[Figure: the unknown entry of R picked out as the dot product of row q_i of Q and column p_x of P^T]
RECSYS 55
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 q_i = row i of Q; p_x = column x of P^T
[Figure: the same R ≈ Q · P^T picture, highlighting row q_i and column p_x]
RECSYS 56
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf (= 2.4 in the example)
 q_i = row i of Q; p_x = column x of P^T
[Figure: the same picture with the estimated value 2.4 filled into the missing entry]
RECSYS 57
Latent Factor Models
[Figure: movies plotted on Factor 1 ("geared towards females" to "geared towards males") vs. Factor 2 ("serious" to "funny")]
RECSYS 58
Latent Factor Models
[Figure: the same 2-D factor plot, repeated as a build slide]
RECSYS 59
Recap: SVD
Remember SVD:
• A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values
SVD gives minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij - [U Σ V^T]_ij)^2
So in our case, "SVD" on Netflix data: R ≈ Q · P^T
• A = R, Q = U, P^T = Σ V^T
• r̂_xi = q_i · p_x^T
The sum goes over all entries, but our R has missing entries. So we are not done yet!
[Figure: A (m × n) ≈ U Σ V^T]
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 22
Pros Content-based Approach
+ No need for data on other users No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new amp unpopular items No first-rater problem
+ Able to provide explanations Can provide explanations of recommended items by listing content-
features that caused an item to be recommended
1162019
RECSYS 23
Cons Content-based Approach ndash Finding the appropriate features is hard
Eg images movies music
ndash Overspecialization Never recommends items outside userrsquos
content profile People might have multiple interests Unable to exploit quality judgments of other users
ndash Recommendations for new users How to build a user profile
1162019
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent:
 • Initialize P and Q (using SVD; pretend missing ratings are 0)
 • Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
   ε_xi = r_xi − q_i · p_x^T  (the prediction "error")
   q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
   p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)
 2 for loops: for ... until convergence: for each r_xi, compute the gradient and do a "step"
 η … learning rate
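The update equations translate almost line-for-line into code. A minimal sketch, reusing the toy setup of the batch example above; the epoch count and the shuffling are illustrative choices, not prescribed by the slide:

```python
import random
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
n_users, n_items, f = 3, 2, 2
eta, lam = 0.02, 0.1

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

for epoch in range(100):                 # the "until convergence" loop
    random.shuffle(ratings)              # visit ratings in random order
    for x, i, r in ratings:              # one step per individual rating
        err = r - Q[i] @ P[x]            # eps_xi = r_xi - q_i . p_x^T
        q_old = Q[i].copy()              # use the old q_i in both updates
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * q_old - lam * P[x])
```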
RECSYS 75: [Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary: Recommendations via Optimization
 Goal: Make good recommendations
 • Quantify goodness using SSE: lower SSE means better recommendations
 • We want to make good recommendations on items that some user has not yet seen
 Let's set the values of P and Q such that they work well on the known (user, item) ratings
 And hope these values of P and Q will predict the unknown ratings well
 This is a case where we apply Optimization methods
[Figure: sample utility matrix]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement
[Tables: sample of the training data (user, movie, score) and the test data (user, movie, score withheld)]
Movie rating data
Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
# ratings per movie
• Avg # ratings/movie: 5,627
# ratings per user
• Avg # ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
 r̂_xi = μ + b_x + b_i + q_i · p_x^T
 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction
 Baseline predictor (μ + b_x + b_i)
 • Separates users and movies
 • Benefits from insights into users' behavior
 • Among the main practical contributions of the competition
 User-Movie interaction (q_i · p_x^T)
 • Characterizes the matching between users and movies
 • Attracts most research in the field
 • Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
We have expectations for the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example:
 • Mean rating: μ = 3.7
 • You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 • Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
 Solve:
 \min_{Q,P} \sum_{(x,i)\in R} \big( r_{xi} - (\mu + b_x + b_i + q_i p_x^T) \big)^2 + \lambda \Big[ \sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2 \Big]
 (first term: goodness of fit; second term: regularization; λ is selected via grid-search on a validation set)
 Stochastic gradient descent to find the parameters
 • Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
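A sketch of how the earlier SGD loop extends to the bias terms; the update rules for b_x and b_i follow the same gradient pattern as q_i and p_x, derived from the objective above, while the toy data and hyperparameters are again invented:

```python
import random
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
n_users, n_items, f = 3, 2, 2
eta, lam = 0.02, 0.1

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))
b_user = np.zeros(n_users)                 # b_x, estimated like any parameter
b_item = np.zeros(n_items)                 # b_i
mu = np.mean([r for _, _, r in ratings])   # overall mean rating

for epoch in range(100):
    random.shuffle(ratings)
    for x, i, r in ratings:
        err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
        b_user[x] += eta * (err - lam * b_user[x])
        b_item[i] += eta * (err - lam * b_item[i])
        q_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * q_old - lam * P[x])
```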
RECSYS 106
BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
 • Global effects: overall deviations of users/movies
 • Factorization: addressing "regional" effects
 • Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
Global average:                 1.1296
User average:                   1.0651
Movie average:                  1.0533
Netflix (Cinematch):            0.9514
Basic Collaborative filtering:  0.94
Collaborative filtering++:      0.91
Latent factors:                 0.90
Latent factors + Biases:        0.89
Grand Prize:                    0.8563
RECSYS 109
Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004)
 • Improvements in Netflix
 • GUI improvements
 • Meaning of rating changed
 Movie age
 • Users prefer new movies without any reasons
 • Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
 Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make the parameters b_x and b_i depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Bin the dates: each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
Global average:                 1.1296
User average:                   1.0651
Movie average:                  1.0533
Netflix (Cinematch):            0.9514
Basic Collaborative filtering:  0.94
Collaborative filtering++:      0.91
Latent factors:                 0.90
Latent factors + Biases:        0.89
Latent factors + Biases + Time: 0.876
Grand Prize:                    0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 23
Cons: Content-based Approach
 – Finding the appropriate features is hard
 • E.g., images, movies, music
 – Overspecialization
 • Never recommends items outside the user's content profile
 • People might have multiple interests
 • Unable to exploit quality judgments of other users
 – Recommendations for new users
 • How to build a user profile?
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering: Recommend items based on past transactions of users
 Analyze relations between users and/or items
 Specific data characteristics are irrelevant
 • Domain-free: user/item attributes are not necessary
 • Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g. using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N
RECSYS 28
Similar Users
 Let r_x be the vector of user x's ratings
 Jaccard similarity measure: treat r_x, r_y as sets of rated items, e.g. r_x = {1, 4, 5}, r_y = {1, 3, 4}
 • Problem: Ignores the value of the rating
 Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (‖r_x‖ · ‖r_y‖), treating r_x, r_y as points with 0 for missing ratings, e.g. r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
 • Problem: Treats missing ratings as "negative"
 Pearson correlation coefficient:
 sim(x, y) = \frac{\sum_{s\in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s\in S_{xy}} (r_{xs} - \bar{r}_x)^2} \sqrt{\sum_{s\in S_{xy}} (r_{ys} - \bar{r}_y)^2}}
 • S_xy = items rated by both users x and y; r̄_x, r̄_y … avg. rating of x, y
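A small sketch of the three measures, assuming rating vectors in which 0 marks a missing rating (the slide's example vectors are reused); the mean-centering in the Pearson version anticipates the fix on the next slide:

```python
import numpy as np

def jaccard_sim(rx, ry):
    # Similarity of the SETS of rated items; ignores the rating values.
    sx, sy = set(np.flatnonzero(rx)), set(np.flatnonzero(ry))
    return len(sx & sy) / len(sx | sy)

def cosine_sim(rx, ry):
    # Unrated items stay 0, i.e. they are treated as "negative" ratings.
    return rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry))

def pearson_sim(rx, ry):
    # Centered cosine over S_xy: items rated by BOTH users, each vector
    # centered by that user's own average rating.
    both = (rx != 0) & (ry != 0)
    cx = rx[both] - rx[rx != 0].mean()
    cy = ry[both] - ry[ry != 0].mean()
    return cx @ cy / (np.linalg.norm(cx) * np.linalg.norm(cy))

rx = np.array([1.0, 0, 0, 1, 3])   # the slide's example vectors, 0 = missing
ry = np.array([1.0, 0, 2, 2, 0])
print(jaccard_sim(rx, ry), cosine_sim(rx, ry), pearson_sim(rx, ry))
```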
RECSYS 29
Similarity Metric
 Intuitively we want: sim(A, B) > sim(A, C)
 Jaccard similarity: 1/5 < 2/4 (gives the wrong order)
 Cosine similarity: 0.386 > 0.322 (correct order, but barely)
 • Considers missing ratings as "negative"
 • Solution: subtract the (row) mean before taking the cosine sim
 With centering: sim(A,B) vs. sim(A,C): 0.092 > −0.559
 Notice: cosine sim is correlation when the data is centered at 0
RECSYS 30
Rating Predictions
 Let r_x be the vector of user x's ratings
 Let N be the set of k users most similar to x who have rated item i
 Prediction for item i of user x:
 • Simple average: \hat{r}_{xi} = \frac{1}{k} \sum_{y\in N} r_{yi}
 • Similarity-weighted average: \hat{r}_{xi} = \frac{\sum_{y\in N} s_{xy} \cdot r_{yi}}{\sum_{y\in N} s_{xy}}
 • Many other tricks possible…
 Shorthand: s_xy = sim(x, y)
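A sketch of the similarity-weighted option, reusing pearson_sim from the block above; R is a users × items matrix with 0 for missing ratings, and k = 2 is an arbitrary choice:

```python
import numpy as np

def predict_user_user(R, x, i, k=2):
    # N: the k users most similar to x who have rated item i.
    cands = [y for y in range(R.shape[0]) if y != x and R[y, i] != 0]
    top = sorted(((pearson_sim(R[x], R[y]), y) for y in cands), reverse=True)[:k]
    num = sum(s * R[y, i] for s, y in top)     # sum_y s_xy * r_yi
    den = sum(s for s, _ in top)               # sum_y s_xy
    return num / den if den else 0.0
```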
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering
 So far: User-user collaborative filtering
 Another view: Item-item
 • For item i, find other similar items
 • Estimate the rating for item i based on the target user's ratings on items similar to item i
 • Can use the same similarity metrics and prediction functions as in the user-user model
\hat{r}_{xi} = \frac{\sum_{j\in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j\in N(i;x)} s_{ij}}
 s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x that are similar to i
RECSYS 32
Item-Item CF (|N|=2)
         users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:         1  .  3  .  .  5  .  .  5  .  4  .
movie 2:         .  .  5  4  .  .  4  .  .  2  1  3
movie 3:         2  4  .  1  2  .  3  .  4  3  5  .
movie 4:         .  2  4  .  5  .  .  4  .  .  2  .
movie 5:         .  .  4  3  4  2  .  .  .  .  2  5
movie 6:         1  .  3  .  3  .  2  .  .  .  4  .
("." = unknown rating; ratings are between 1 and 5)
RECSYS 33
Item-Item CF (|N|=2)
[Same utility matrix as above]
 Goal: estimate the rating of movie 1 by user 5
RECSYS 34
Item-Item CF (|N|=2)
[Same utility matrix as above]
 Neighbor selection: identify movies similar to movie 1 that were rated by user 5
 sim(1,m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59
 Here we use Pearson correlation as the similarity:
 1) Subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6
    row 1 becomes: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between the rows
RECSYS 35
Item-Item CF (|N|=2)
[Same utility matrix and sim(1,m) values as above]
 Compute similarity weights: s_13 = 0.41, s_16 = 0.59
RECSYS 36
Item-Item CF (|N|=2)
[Same utility matrix, now with the estimated entry r_15 = 2.6 filled in]
 Predict by taking the weighted average:
 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
 Define the similarity s_ij of items i and j
 Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x
 Estimate the rating r_xi as the weighted average:
 \hat{r}_{xi} = \frac{\sum_{j\in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j\in N(i;x)} s_{ij}}
 N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
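Putting the last few slides together, a sketch that reproduces the worked example: centered-cosine item similarities, the two nearest rated neighbors, and the weighted average. The matrix below is the slides' utility matrix (rows = movies 1-6, columns = users 1-12):

```python
import numpy as np

R = np.array([
    [1,0,3,0,0,5,0,0,5,0,4,0],
    [0,0,5,4,0,0,4,0,0,2,1,3],
    [2,4,0,1,2,0,3,0,4,3,5,0],
    [0,2,4,0,5,0,0,4,0,0,2,0],
    [0,0,4,3,4,2,0,0,0,0,2,5],
    [1,0,3,0,3,0,2,0,0,0,4,0],
], dtype=float)                    # 0 = unknown rating

def centered_cosine(a, b):
    ca = np.where(a != 0, a - a[a != 0].mean(), 0.0)   # subtract the row mean
    cb = np.where(b != 0, b - b[b != 0].mean(), 0.0)   # over rated entries only
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

def predict_item_item(R, x, i, k=2):
    rated = [j for j in range(R.shape[0]) if j != i and R[j, x] != 0]
    top = sorted(((centered_cosine(R[i], R[j]), j) for j in rated),
                 reverse=True)[:k]                     # N(i; x), |N| = k
    return sum(s * R[j, x] for s, j in top) / sum(s for s, _ in top)

print(round(predict_item_item(R, x=4, i=0), 1))  # movie 1, user 5 -> 2.6
```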
RECSYS 38
Item-Item vs. User-User
[Example utility matrix: Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates]
 In practice, it has been observed that item-item often works better than user-user
 Why? Items are simpler; users have multiple tastes
RECSYS 39
Pros/Cons of Collaborative Filtering
 + Works for any kind of item: no feature selection needed
 − Cold Start: need enough users in the system to find a match
 − Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
 − First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
 − Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
 Implement two or more different recommenders and combine their predictions, perhaps using a linear model
 Add content-based methods to collaborative filtering
 • Item profiles to deal with the new-item problem
 • Demographics to deal with the new-user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Figure: utility matrix (movies × users) with known ratings 1-5]
RECSYS 43
Evaluation
[Figure: the same utility matrix, with some entries withheld as the Test Data Set]
RECSYS 44
Evaluating Predictions
 Compare predictions with known ratings
 • Root-mean-square error (RMSE): \sqrt{\frac{1}{|T|}\sum_{(x,i)\in T} (\hat{r}_{xi} - r_{xi})^2}, where r̂_xi is the predicted rating and r_xi is the true rating of x on i
 • Precision at top 10: % of relevant items among the top 10
 • Rank Correlation: Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
 • Coverage: number of items/users for which the system can make predictions
 • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 • Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
   TPR (aka Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
 • See https://en.wikipedia.org/wiki/Precision_and_recall
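For instance, RMSE on a held-out test set is only a few lines (the sample numbers below are made up):

```python
import numpy as np

def rmse(predicted, true):
    # Root-mean-square error over the test ratings.
    predicted, true = np.asarray(predicted), np.asarray(true)
    return np.sqrt(np.mean((predicted - true) ** 2))

print(rmse([3.2, 4.1, 1.9], [3, 5, 2]))   # error over three test ratings
```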
RECSYS 45
Problems with Error Measures
 A narrow focus on accuracy sometimes misses the point
 • Prediction Diversity
 • Prediction Context
 • Order of predictions
 In practice, we care only about predicting high ratings
 • RMSE might penalize a method that does well for high ratings and badly for others
RECSYS 46
Collaborative Filtering: Complexity
 The expensive step is finding the k most similar customers: O(|X|)
 • Recall that X = set of customers in the system
 Too expensive to do at runtime
 • Could pre-compute, using clustering as an approximation
 • Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points
 We already know how to do this!
 • Near-neighbor search in high dimensions (LSH)
 • Clustering
 • Dimensionality reduction
RECSYS 47
Tip: Add Data
 Leverage all the data
 • Don't try to reduce data size in an effort to make fancy algorithms work
 • Simple methods on large data do best
 Add more data
 • E.g., add IMDB data on genres
 More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g. SVD)
[Plot: movies in a 2-D latent space; x-axis: geared towards females ↔ geared towards males; y-axis: serious ↔ funny; movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50: [Figure from Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 51
The Netflix Utility Matrix R
[Figure: Matrix R, ~480,000 users × 17,700 movies, sparsely filled with ratings 1-5]
RECSYS 52
Utility Matrix R: Evaluation
SSE = \sum_{(i,x)\in R} (\hat{r}_{xi} - r_{xi})^2
 r̂_xi … predicted rating; r_xi … true rating of user x on item i
[Figure: Matrix R (~480,000 users × 17,700 movies) split into a Training Data Set and a withheld Test Data Set; the highlighted entry r̂₃₆ is a predicted rating]
RECSYS 53
Latent Factor Models
 "SVD" on Netflix data: R ≈ Q · P^T
 For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T
 R has missing entries, but let's ignore that for now!
 • Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones
[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]
 SVD: A = U Σ V^T
RECSYS 54
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 \hat{r}_{xi} = q_i \cdot p_x^T = \sum_f q_{if} \cdot p_{xf}
 q_i = row i of Q; p_x = column x of P^T
[Figure: R ≈ Q · P^T, with the missing entry of R highlighted]
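In code the estimate is just a dot product over the f factor values; the numbers here are hypothetical, not the ones in the figure:

```python
import numpy as np

q_i = np.array([0.5, 0.6, -0.5])   # hypothetical row i of Q (f = 3 factors)
p_x = np.array([-2.0, 0.3, 2.4])   # hypothetical column x of P^T
r_hat = q_i @ p_x                  # sum over f of q_if * p_xf
print(r_hat)
```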
RECSYS 55
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 \hat{r}_{xi} = q_i \cdot p_x^T = \sum_f q_{if} \cdot p_{xf}
 q_i = row i of Q; p_x = column x of P^T
[Figure: as above, highlighting row q_i of Q and column p_x of P^T]
RECSYS 56
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 \hat{r}_{xi} = q_i \cdot p_x^T = \sum_f q_{if} \cdot p_{xf} = 2.4 for the highlighted entry
 q_i = row i of Q; p_x = column x of P^T
[Figure: as above, with the estimate 2.4 filled into the missing entry]
RECSYS 57
Latent Factor Models
[Plot: movies embedded in a 2-D latent space; Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny; movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 58
Latent Factor Models
[Plot: repeated from the previous slide]
RECSYS 59
Recap: SVD
 Remember SVD: A ≈ U Σ V^T
 • A: input data matrix (m × n); U: left singular vectors; V: right singular vectors; Σ: singular values
 SVD gives the minimum reconstruction error (SSE): \min_{U,\Sigma,V} \sum_{ij} \big( A_{ij} - (U \Sigma V^T)_{ij} \big)^2
 So, in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and \hat{r}_{xi} = q_i \cdot p_x^T
 But we are not done yet: the sum goes over all entries, and our R has missing entries!
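A sketch of that identification using NumPy's SVD, with missing ratings filled with 0 just to make the factorization defined (a toy matrix stands in for R):

```python
import numpy as np

R = np.array([[5, 3, 0],           # toy rating matrix, 0 = missing
              [4, 0, 1],
              [0, 2, 4.0]])
f = 2                              # number of factors to keep

U, S, Vt = np.linalg.svd(R, full_matrices=False)
Q = U[:, :f]                       # Q = U           (items x f)
PT = np.diag(S[:f]) @ Vt[:f, :]    # P^T = Sigma V^T (f x users)

print(np.round(Q @ PT, 2))         # rank-f reconstruction of R
```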
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q such that \hat{r}_{xi} = q_i \cdot p_x^T:
 \min_{P,Q} \sum_{(x,i)\in R} (r_{xi} - q_i \cdot p_x^T)^2
 Note:
 • We don't require the columns of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • This was the most popular model among Netflix contestants
[Figure: R ≈ Q · P^T]
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for the unseen test data
 Idea: Minimize SSE on the training data
 • We want a large f (# of factors) to capture all the signals
 • But SSE on the test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
 ("error" + "length"; λ … regularization parameter)
 • Allow a rich model where there are sufficient data
 • Shrink aggressively where data are scarce
[Figure: sample utility matrix]
RECSYS 62
Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
[Plot: movies and two users (Gus, Dave) embedded in the latent space; x-axis: geared towards females ↔ geared towards males; y-axis: serious ↔ escapist; movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 63
Dealing with Missing Entries
 Want to minimize SSE for the unseen test data
 Idea: Minimize SSE on the training data
 • We want a large f (# of factors) to capture all the signals
 • But SSE on the test data begins to rise for f > 2
 Regularization is needed:
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
 ("error" + "length"; λ … regularization parameter)
 • Allow a rich model where there are sufficient data
 • Shrink aggressively where data are scarce
[Figure: sample utility matrix]
RECSYS 64
The Effect of Regularization
[Plot: movies in the latent space; Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]
 \min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"}
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
RECSYS 65
The Effect of Regularization
[Plot: same latent space; this slide and the next two illustrate the factor positions shifting as the regularization takes effect]
 \min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"}
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
RECSYS 66
The Effect of Regularization
[Plot: same latent space, regularization continuing to shift the factor positions]
 \min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"}
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
RECSYS 67
The Effect of Regularization
[Plot: same latent space, final frame of the regularization sequence]
 \min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"}
 \min_{P,Q} \sum_{(x,i)\in\text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big]
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 24
Collaborative Filtering
RECSYS 25
Collaborative filtering Recommend items based on past transactions of users
Analyze relations between users andor items
Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System
The winner of the Netflix Challenge. Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
- Global effects: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (y-axis, 0.885-0.92) vs. millions of parameters (x-axis, 1-1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
- Global average: 1.1296
- User average: 1.0651
- Movie average: 1.0533
- Netflix (Cinematch): 0.9514
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
- Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed
- Movie age: older movies receive higher average ratings. Candidate explanations: users pick new movies without much selection, while older movies are chosen deliberately; or older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: rxi = μ + bx + bi + qi·pxT
Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi·pxT
- Make the parameters bx and bi depend on time: (1) parameterize the time dependence by linear trends; (2) bin the dates, each bin covering 10 consecutive weeks
Add temporal dependence to the factors: px(t) is the user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
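A minimal sketch of a binned, time-dependent movie bias bi(t), assuming t is measured in weeks since the start of the data (the helper and the array layout are hypothetical):

```python
def movie_bias(i, t_weeks, b_i, b_i_bin, bin_size=10):
    """b_i(t) = b_i + b_{i,Bin(t)}, with bins of 10 consecutive weeks."""
    return b_i[i] + b_i_bin[i][t_weeks // bin_size]
```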
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (y-axis, 0.875-0.92) vs. millions of parameters (x-axis, 1-10,000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
- Global average: 1.1296
- User average: 1.0651
- Movie average: 1.0533
- Netflix (Cinematch): 0.9514
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Latent factors + biases + time: 0.876

Grand Prize: 0.8563. Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, y-axis, 0.87-0.8925) vs. number of predictors in the ensemble (x-axis, 1-50)]
RECSYS 25
Collaborative filtering
- Recommend items based on past transactions of users
- Analyze relations between users and/or items
- Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N
RECSYS 28
Similar Users
Let rx be the vector of user x's ratings.
- Jaccard similarity: treat rx, ry as sets of rated items, e.g., rx = {1, 4, 5}, ry = {1, 3, 4}. Problem: ignores the value of the rating.
- Cosine similarity: sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||), treating rx, ry as points with missing ratings as 0, e.g., rx = [1, 0, 0, 1, 3], ry = [1, 0, 2, 2, 0]. Problem: treats missing ratings as "negative".
- Pearson correlation coefficient: sim(x, y) = Σ_{s∈Sxy} (rxs - r̄x)(rys - r̄y) / ( sqrt(Σ_{s∈Sxy} (rxs - r̄x)²) · sqrt(Σ_{s∈Sxy} (rys - r̄y)²) ), where Sxy = items rated by both users x and y, and r̄x, r̄y are the average ratings of x and y.
RECSYS 29
Similarity Metric
Intuitively we want sim(A, B) > sim(A, C):
- Jaccard similarity: 1/5 < 2/4 (fails)
- Cosine similarity: 0.386 > 0.322 (works, but considers missing ratings as "negative")
- Solution: subtract the (row) mean, then take the cosine. Centered cosine gives sim(A, B) vs. sim(A, C): 0.092 > -0.559. Notice that cosine similarity is correlation when the data is centered at 0.
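The three measures can be compared in a few lines; the small utility matrix below is a hypothetical stand-in for the lecture's A, B, C example (0 marks a missing rating):

```python
import numpy as np

def jaccard(rx, ry):
    a, b = set(np.flatnonzero(rx)), set(np.flatnonzero(ry))
    return len(a & b) / len(a | b)

def cosine(rx, ry):
    return rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry))

def centered_cosine(rx, ry):
    # subtract each user's mean over rated items; missing entries stay 0
    def center(r):
        r = r.astype(float).copy()
        rated = r != 0
        r[rated] -= r[rated].mean()
        return r
    return cosine(center(rx), center(ry))

A = np.array([4, 0, 0, 5, 1, 0, 0])
B = np.array([5, 5, 4, 0, 0, 0, 0])
C = np.array([0, 0, 0, 2, 4, 5, 0])
print(jaccard(A, B), cosine(A, B), centered_cosine(A, B))
print(jaccard(A, C), cosine(A, C), centered_cosine(A, C))
```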
RECSYS 30
Rating Predictions
Let rx be the vector of user x's ratings, and N the set of k users most similar to x who have rated item i. Prediction for item i of user x:
- Simple average: r̂xi = (1/k) Σ_{y∈N} r_yi
- Similarity-weighted average: r̂xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy, with shorthand s_xy = sim(x, y)
- Other options; many other tricks possible...
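A sketch of the weighted-average prediction, assuming a dense ratings array with 0 for missing entries and a similarity function such as the centered cosine from the previous sketch (both are illustrative assumptions):

```python
import numpy as np

def predict(R, sim, x, i, k=2):
    """User-user CF: r_hat_xi as the similarity-weighted average over the
    k users most similar to x who have rated item i."""
    candidates = [y for y in range(R.shape[0]) if y != x and R[y, i] != 0]
    top = sorted(candidates, key=lambda y: sim(R[x], R[y]), reverse=True)[:k]
    w = np.array([sim(R[x], R[y]) for y in top])
    r = np.array([R[y, i] for y in top])
    return w @ r / w.sum()
```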
RECSYS 31
Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering
So far: user-user collaborative filtering. Another view: item-item.
- For item i, find other similar items
- Estimate the rating for item i based on the target user's ratings on items similar to item i
- Can use the same similarity metrics and prediction functions as in the user-user model:

  r̂xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

  where s_ij = similarity of items i and j, r_xj = rating of user x on item j, and N(i;x) = set of items rated by x that are similar to i
RECSYS 32
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

("." = unknown rating; numbers = ratings between 1 and 5)
RECSYS 33
Item-Item CF (|N|=2)
[Same utility matrix; task: estimate the rating of movie 1 by user 5 (marked "?")]
RECSYS 34
Item-Item CF (|N|=2)
Neighbor selection: identify movies similar to movie 1 that were rated by user 5. Here we use Pearson correlation as the similarity:
1) Subtract the mean rating mi from each movie i, e.g., m1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows, giving sim(1, m) = [1.00, -0.18, 0.41, -0.10, -0.31, 0.59] for m = 1..6
RECSYS 35
Item-Item CF (|N|=2)
Compute the similarity weights for the two nearest neighbors of movie 1 that user 5 rated: s13 = 0.41, s16 = 0.59
RECSYS 36
Item-Item CF (|N|=2)
Predict by taking the weighted average:

  r̂15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
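The whole walkthrough fits in a short script; this sketch rebuilds the matrix above (0 = missing) and reproduces s13 ≈ 0.41, s16 ≈ 0.59 and r̂15 ≈ 2.6:

```python
import numpy as np

R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],   # movie 1
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],   # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],   # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],   # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],   # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],   # movie 6
], dtype=float)

def centered(row):
    out = row.copy()
    rated = out != 0
    out[rated] -= out[rated].mean()   # subtract the movie's mean rating
    return out

def sim(i, j):                        # Pearson = cosine of centered rows
    a, b = centered(R[i]), centered(R[j])
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# estimate r_15 (movie 1, user 5; 0-indexed: row 0, column 4)
neighbors = [2, 5]                              # movies 3 and 6, rated by user 5
w = np.array([sim(0, j) for j in neighbors])    # ~[0.41, 0.59]
r = np.array([R[j, 4] for j in neighbors])      # [2, 3]
print(w @ r / w.sum())                          # ~2.6
```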
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
- Define the similarity s_ij of items i and j
- Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x
- Estimate the rating r̂xi as the weighted average:

  r̂xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

  where N(i;x) = set of items similar to item i that were rated by x, s_ij = similarity of items i and j, r_xj = rating of user x on item j
RECSYS 38
Item-Item vs User-User
[Utility matrix example: Alice, Bob, Carol, David rating Avatar, LOTR, Matrix, Pirates]
In practice, it has been observed that item-item often works better than user-user. Why? Items are "simpler"; users have multiple tastes.
RECSYS 39
Pros/Cons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
Implement two or more different recommenders and combine their predictions, perhaps using a linear model. Add content-based methods to collaborative filtering: item profiles to handle the new-item problem; demographics to deal with the new-user problem.
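In the simplest form, the combination is just a weighted sum of the component recommenders' scores; a minimal sketch (the weight value is a made-up example):

```python
def hybrid_score(cf_score: float, content_score: float, w: float = 0.7) -> float:
    """Linear blend of a collaborative-filtering score and a content-based score."""
    return w * cf_score + (1.0 - w) * content_score
```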
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Utility matrix figure: movies (rows) x users (columns), with the known ratings filled in]
RECSYS 43
Evaluation
[Same utility matrix with some ratings withheld as the Test Data Set]
RECSYS 44
Evaluating Predictions
Compare predictions with known ratings:
- Root-mean-square error (RMSE): sqrt( (1/|T|) Σ_{(x,i)∈T} ( r̂xi - rxi )² ), where r̂xi is the predicted rating and rxi is the true rating of x on i
- Precision at top 10: % of those in the top 10
- Rank correlation: Spearman's correlation between the system's and the user's complete rankings
- Another approach: 0/1 model
  - Coverage: number of items/users for which the system can make predictions
  - Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  - Receiver operating characteristic (ROC) curve: y-axis true positive rate (TPR), x-axis false positive rate (FPR). TPR (a.k.a. recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN). See https://en.wikipedia.org/wiki/Precision_and_recall
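A quick sketch of RMSE over a held-out test set (the array layout is assumed for illustration):

```python
import numpy as np

def rmse(predicted, actual):
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([3.2, 4.1, 1.9], [3, 5, 2]))  # ~0.54
```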
RECSYS 45
Problems with Error Measures
A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions. In practice, we care mostly about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for others.
RECSYS 46
Collaborative Filtering Complexity
The expensive step is finding the k most similar customers: O(|X|), where X = set of customers in the system. Too expensive to do at runtime; could pre-compute, using clustering as an approximation. Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (the k in k-means) and N = # of data points. We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.
RECSYS 47
Tip: Add Data
- Leverage all the data; don't try to reduce data size in an effort to make fancy algorithms work
- Simple methods on large data do best
- Add more data, e.g., add IMDB data on genres
- More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[Figure: movies plotted on two latent dimensions; x-axis: geared towards females ↔ geared towards males, y-axis: serious ↔ funny. Examples: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 51
The Netflix Utility Matrix R
[Matrix R: ~480,000 users x 17,700 movies; sparse known ratings (1-5), mostly missing]
RECSYS 52
Utility Matrix R Evaluation
[Matrix R split into a Training Data Set and a withheld Test Data Set; the highlighted entry r̂36 is a predicted rating]

SSE = Σ_{(x,i)∈R} ( r̂xi - rxi )², where r̂xi is the predicted rating and rxi is the true rating of user x on item i
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · PT
For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT. R has missing entries, but let's ignore that for now: basically, we want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[Figure: the users x items rating matrix R (the 6-movie x 12-user example above) factored into Q (items x f) times PT (f x users), with f latent factors; compare SVD: A = U Σ VT]
RECSYS 54-56
Ratings as Products of Factors
How to estimate the missing rating of user x for item i? Take the dot product of the item's and the user's factor vectors:

  r̂xi = qi · pxT = Σf qif · pxf

where qi = row i of Q and px = column x of PT. In the running example, the estimate for the highlighted missing entry works out to r̂ = 2.4.
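In code, the prediction is just a dot product; a sketch with random illustrative factors (real P and Q come from the fitting procedure described below):

```python
import numpy as np

rng = np.random.default_rng(0)
f = 3                          # number of latent factors (illustrative)
Q = rng.normal(size=(6, f))    # item (movie) factor rows q_i
P = rng.normal(size=(12, f))   # user factor rows p_x

i, x = 0, 4                    # movie 1, user 5 (0-indexed)
r_hat = Q[i] @ P[x]            # r_hat_xi = q_i . p_x^T
```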
RECSYS 57-58
Latent Factor Models
[Figure: the same movies plotted in the latent space spanned by Factor 1 (geared towards females ↔ geared towards males) and Factor 2 (serious ↔ funny)]
RECSYS 59
Recap: SVD
Remember SVD: A ≈ U Σ VT, where A is the input data matrix (m x n), U the left singular vectors, V the right singular vectors, and Σ the diagonal matrix of singular values. SVD gives the minimum reconstruction error (SSE):

  min_{U,Σ,V} Σ_ij ( A_ij - [U Σ VT]_ij )²

So in our case, "SVD" on Netflix data R ≈ Q · PT corresponds to A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT.
But we are not done yet! The sum goes over all entries, and our R has missing entries.
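For reference, a sketch of a rank-f SVD reconstruction on a small, fully observed matrix (the matrix itself is a made-up example):

```python
import numpy as np

A = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = 2                                       # keep the top-f singular values
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f]    # best rank-f approximation in SSE
```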
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing! Use specialized methods to find P, Q that minimize the error on the known ratings only:

  min_{P,Q} Σ_{(i,x)∈R} ( r_xi - q_i·p_x^T )²

with r̂xi = qi · pxT. Note:
- We don't require the columns of P, Q to be orthogonal or of unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants

[Figure: R ≈ Q · PT as before, with f latent factors]
RECSYS 61
Dealing with Missing Entries
Want to minimize SSE for unseen test data. Idea: minimize SSE on the training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2. Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce:

  min_{P,Q} Σ_{(x,i)∈training} ( r_xi - q_i·p_x^T )²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
        ("error")                                          ("length")

where λ is the regularization parameter.
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] team)
[Figure: movies and users (e.g., Gus, Dave) embedded in the same latent space; x-axis: geared towards females ↔ geared towards males, y-axis: serious ↔ escapist]
RECSYS 64-67
The Effect of Regularization

  min_{factors} "error" + λ · "length"
  min_{P,Q} Σ_{(x,i)∈training} ( r_xi - q_i·p_x^T )² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

[Figure sequence: movies in the latent space (Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny); the build-up highlights a sparsely rated movie (The Princess Diaries) being pulled toward the origin as regularization takes effect]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q minimizing

  min_{P,Q} Σ_{(x,i)∈training} ( r_xi - q_i·p_x^T )² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
- Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
- Iterate: P ← P - η·∇P, Q ← Q - η·∇Q, where ∇Q is the gradient/derivative of the objective with respect to the matrix Q: ∇Q = [∇q_if], with

  ∇q_if = Σ_{(x,i)} -2 ( r_xi - q_i·p_x^T ) p_xf + 2λ q_if

  Here q_if is entry f of row q_i of matrix Q. (How to compute the gradient of a matrix: compute the gradient of every element independently.)
- Observation: computing gradients is slow!
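A sketch of the batch gradient for Q following the formula above (the gradient for P is symmetric); the data layout is an assumption for illustration:

```python
import numpy as np

def grad_Q(ratings, P, Q, lam):
    """Full (batch) gradient of the regularized SSE w.r.t. Q.
    ratings: list of (x, i, r_xi) known training ratings."""
    G = 2.0 * lam * Q                    # from the regularization term
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]            # r_xi - q_i . p_x^T
        G[i] += -2.0 * err * P[x]        # -2 (r_xi - q_i p_x^T) p_x
    return G

# one batch GD step: Q -= eta * grad_Q(ratings, P, Q, lam), similarly for P
```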
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.
RECSYS 72
Stochastic Gradient Descent
Gradient descent (GD) vs. stochastic gradient descent (SGD). Observation: ∇Q = [∇q_if], where

  ∇q_if = Σ_{(x,i)} -2 ( r_xi - q_i·p_x^T ) p_xf + 2λ q_if = Σ_{(x,i)} ∇Q(r_xi)

i.e., the full gradient is a sum of per-rating contributions ∇Q(r_xi). Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

  GD:  Q ← Q - η Σ_{r_xi} ∇Q(r_xi)
  SGD: Q ← Q - μ ∇Q(r_xi)

Faster convergence: SGD needs more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD: convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step]
GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
- Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:

  ε_xi = r_xi - q_i·p_x^T               (the prediction error)
  q_i ← q_i + η ( ε_xi p_x - λ q_i )    (update equation)
  p_x ← p_x + η ( ε_xi q_i - λ p_x )    (update equation)

  where η is the learning rate.

- Two for-loops: for until convergence: for each r_xi: compute the gradient, do a "step"
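Putting the update equations together, a minimal SGD training loop might look like this (the initialization and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    """Latent-factor fitting with SGD; ratings is a list of (x, i, r_xi)."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i
    for _ in range(epochs):
        for x, i, r in ratings:
            q_i, p_x = Q[i].copy(), P[x].copy()
            err = r - q_i @ p_x                    # eps_xi
            Q[i] += eta * (err * p_x - lam * q_i)  # q_i update
            P[x] += eta * (err * q_i - lam * p_x)  # p_x update
    return P, Q
```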
RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
- Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen
- So: set the values of P and Q such that they work well on the known (user, item) ratings, and hope that these values of P and Q will predict the unknown ratings well
- This is a case where we apply optimization methods
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 26
Collaborative Filtering (CF)
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 27
Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering
1 Consider user x
2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)
3 Recommend items to xbased on the weighted ratings of items by users in N
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
[same utility matrix as above, now with the estimate 2.6 filled in for movie 1, user 5]
 Predict by taking a weighted average:
 r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
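The whole worked example fits in a few lines of NumPy (a sketch: the 0-for-missing encoding and the function names are my own). It reproduces s13 ≈ 0.41, s16 ≈ 0.59, and the prediction r15 ≈ 2.6:

    import numpy as np

    # The 6-movies x 12-users matrix from the slides (0 = unknown rating).
    R = np.array([[1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
                  [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
                  [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
                  [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
                  [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
                  [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0]], dtype=float)

    def item_sims(R, i):
        """Pearson-style: subtract each movie's own mean, cosine between rows."""
        mask = R > 0
        means = R.sum(axis=1) / mask.sum(axis=1)      # e.g. m1 = 3.6
        C = np.where(mask, R - means[:, None], 0.0)
        C /= np.linalg.norm(C, axis=1, keepdims=True)
        return C @ C[i]                               # sim(i, m) for all m

    def predict_item_item(R, x, i, k=2):
        """Weighted average of x's ratings on the k items most similar to i."""
        s = item_sims(R, i)
        rated = np.where(R[:, x] > 0)[0]              # items user x has rated
        rated = rated[rated != i]
        top = rated[np.argsort(-s[rated])][:k]        # here: movies 6 and 3
        return float(s[top] @ R[top, x] / s[top].sum())

    print(round(predict_item_item(R, x=4, i=0), 1))   # user 5, movie 1 -> 2.6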
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
 Select K nearest neighbors (KNN): N(i; x) = set of items most similar to item i that were rated by x
 Estimate the rating rxi as the weighted average:
r̂xi = ( Σ_{j∈N(i;x)} sij · rxj ) / ( Σ_{j∈N(i;x)} sij )
 N(i;x) = set of items similar to item i that were rated by x; sij = similarity of items i and j; rxj = rating of user x on item j
RECSYS 38
Item-Item vs User-User
[small example utility matrix: Avatar, LOTR, Matrix, Pirates × Alice, Bob, Carol, David]
 In practice, it has been observed that item-item often works better than user-user
 Why? Items are simpler; users have multiple tastes
RECSYS 39
Pros/Cons of Collaborative Filtering
 + Works for any kind of item: no feature selection needed
 − Cold start: need enough users in the system to find a match
 − Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
 − First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
 − Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
 Implement two or more different recommenders and combine predictions, perhaps using a linear model
 Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[sparse utility matrix: movies × users, entries are known ratings 1–5]
RECSYS 43
Evaluation
[same matrix; a held-out subset of the entries forms the Test Data Set]
RECSYS 44
Evaluating Predictions
 Compare predictions with known ratings
 Root-mean-square error (RMSE): RMSE = √( (1/N) Σ_{(x,i)} (r̂xi − rxi)² ), where r̂xi is the predicted rating and rxi is the true rating of x on i
 Precision at top 10: % of relevant items among the top 10
 Rank correlation: Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
 Coverage: number of items/users for which the system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) curve: Y-axis = true positive rate (TPR), X-axis = false positive rate (FPR)
• TPR (a.k.a. recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
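Two of these metrics in Python, as a minimal sketch (function names and toy inputs are illustrative):

    import numpy as np

    def rmse(predicted, actual):
        """Root-mean-square error between predicted and true ratings."""
        predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
        return float(np.sqrt(np.mean((predicted - actual) ** 2)))

    def precision_at_k(recommended, relevant, k=10):
        """Fraction of the top-k recommendations that are actually relevant."""
        top = list(recommended)[:k]
        return sum(item in set(relevant) for item in top) / k

    print(rmse([3.5, 2.0, 4.1], [4, 2, 5]))                  # ~0.594
    print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=3))  # ~0.667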
RECSYS 45
Problems with Error Measures
 A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions
 In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for others
RECSYS 46
Collaborative Filtering Complexity
 The expensive step is finding the k most similar customers: O(|X|); recall that X = set of customers in the system
 Too expensive to do at runtime; could pre-compute, using clustering as an approximation
 Naïve pre-computation takes time O(N·|C|), where |C| = # of clusters (= k in k-means) and N = # of data points
 We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction
RECSYS 47
Tip: Add Data
 Leverage all the data: don't try to reduce data size in an effort to make fancy algorithms work; simple methods on large data do best
 Add more data, e.g., add IMDB data on genres
 More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
httpcs246stanfordedu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[2D latent-space plot: one axis runs "geared towards females" ↔ "geared towards males", the other "serious" ↔ "funny"; movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility are placed in this space]
RECSYS 50
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 51
The Netflix Utility Matrix R
[sparse matrix R: 480,000 users × 17,700 movies; entries are known ratings 1–5]
RECSYS 52
Utility Matrix R Evaluation
[matrix R split into a Training Data Set and a held-out Test Data Set; a predicted rating such as r̂₃₆ is compared against the held-out true rating]
SSE = Σ_{(x,i)∈Test} (r̂xi − rxi)²
 r̂xi … predicted rating; rxi … true rating of user x on item i
RECSYS 53
Latent Factor Models
 "SVD" on Netflix data: R ≈ Q · PT
 For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT
 R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
[matrix illustration: R (items × users, with missing entries) ≈ Q (items × f factors) · PT (f factors × users); compare SVD: A = U Σ VT]
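The "thin" factorization idea is easy to see on a small, fully-observed matrix (an illustration with made-up numbers; on the real R the missing entries break this, as the following slides explain). Keeping the top f singular values gives the best rank-f approximation in the SSE sense:

    import numpy as np

    A = np.array([[5, 4, 1, 1],      # toy items x users matrix, fully observed
                  [4, 5, 1, 2],
                  [1, 1, 5, 4],
                  [2, 1, 4, 5]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    f = 2
    Q = U[:, :f]                     # "items to factors" (Q = U)
    PT = np.diag(s[:f]) @ Vt[:f]     # "factors to users" (PT = Sigma VT)
    A_hat = Q @ PT                   # rank-f reconstruction
    print(np.round(A_hat, 1))        # close to A: rank 2 captures the structure
    print("SSE:", round(float(np.sum((A - A_hat) ** 2)), 3))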
RECSYS 54
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
[matrix illustration: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]
r̂xi = qi · pxT = Σf qif · pxf
 qi = row i of Q; px = column x of PT
RECSYS 55
Ratings as Products of Factors (continued)
 How to estimate the missing rating of user x for item i?
[same matrix illustration and formula as the previous slide]
RECSYS 56
Ratings as Products of Factors (continued)
[same illustration; the missing entry is estimated as r̂xi = 2.4]
r̂xi = qi · pxT = Σf qif · pxf
RECSYS 57
Latent Factor Models
[latent-space plot: Factor 1 ("geared towards females" ↔ "geared towards males") vs. Factor 2 ("serious" ↔ "funny"), with the same movies embedded]
RECSYS 58
Latent Factor Models (continued)
[same latent-space plot]
RECSYS 59
Recap: SVD
 Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values
 SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (Aij − [U Σ VT]ij)²
 So in our case, "SVD" on Netflix data R ≈ Q · PT means: A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT
 But we are not done yet: the SSE sum goes over all entries, and our R has missing entries
[diagram: A (m × n) ≈ U Σ VT]
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (rxi − qi · pxT)², with r̂xi = qi · pxT
 Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
[matrix illustration as above: R ≈ Q · PT]
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce
min_{P,Q} Σ_{(x,i)∈training} (rxi − qi·pxT)² + λ [ Σx ||px||² + Σi ||qi||² ]
 (first term: "error"; second term: "length")
 λ … regularization parameter
[mini utility matrix illustration]
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[latent-space plot: "serious" ↔ "escapist" vs. "geared towards females" ↔ "geared towards males"; the movies as before, now with users such as Gus and Dave embedded in the same space]
RECSYS 63
Dealing with Missing Entries (recap)
 Minimize SSE on training data; want large f to capture all the signals, but SSE on test data begins to rise for f > 2
 Regularization is needed: allow a rich model where there are sufficient data; shrink aggressively where data are scarce
min_{P,Q} Σ_{(x,i)∈training} (rxi − qi·pxT)² + λ [ Σx ||px||² + Σi ||qi||² ]
 λ … regularization parameter ("error" + "length")
RECSYS 64
The Effect of Regularization
[latent-space plot: Factor 1 ("geared towards females" ↔ "geared towards males") vs. Factor 2 ("serious" ↔ "funny"), movies as before]
 min_factors "error" + λ · "length":
min_{P,Q} Σ_{(x,i)∈training} (rxi − qi·pxT)² + λ [ Σx ||px||² + Σi ||qi||² ]
RECSYS 65
The Effect of Regularization (next build)
[same latent-space plot]
 min_factors "error" + λ · "length"
RECSYS 66
The Effect of Regularization (next build)
[same latent-space plot]
 min_factors "error" + λ · "length"
RECSYS 67
The Effect of Regularization (next build)
[same latent-space plot]
 min_factors "error" + λ · "length"
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q: min_{P,Q} Σ_{(x,i)∈training} (rxi − qi·pxT)² + λ [ Σx ||px||² + Σi ||qi||² ]
 Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
• P ← P − η·∇P
• Q ← Q − η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σ_{x,i} −2 (rxi − qi·pxT) pxf + 2λ qif
• here qif is entry f of row qi of matrix Q
 How to compute the gradient of a matrix? Compute the gradient of every element independently!
 Observation: computing gradients is slow
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
 Want to find matrices P and Q: min_{P,Q} Σ_{(x,i)∈training} (rxi − qi·pxT)² + λ [ Σx ||px||² + Σi ||qi||² ]
 Initialize P and Q (using SVD; pretend missing ratings are 0), then repeat:
• P ← P − η·∇P, Q ← Q − η·∇Q, with ∇Q = [∇qif] and ∇qif = Σ_{x,i} −2 (rxi − qi·pxT) pxf + 2λ qif
• here qif is entry f of row qi of matrix Q; compute the gradient of every element independently
 Observation: computing gradients is slow, since every step touches every training rating
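A minimal batch-gradient-descent step in NumPy, assuming the masked-SSE objective above (shapes, hyperparameters, and the random demo data are illustrative, not from the slides). Note how each step uses all known ratings at once, which is exactly why the slide calls this slow:

    import numpy as np

    def batch_gd_step(R, mask, P, Q, eta=0.001, lam=0.1):
        """One step on  sum_known (R - Q P^T)^2 + lam (||P||^2 + ||Q||^2)."""
        E = np.where(mask, R - Q @ P.T, 0.0)     # residuals on known entries only
        grad_Q = -2 * E @ P + 2 * lam * Q
        grad_P = -2 * E.T @ Q + 2 * lam * P
        return P - eta * grad_P, Q - eta * grad_Q

    rng = np.random.default_rng(0)
    R = rng.integers(1, 6, size=(6, 12)).astype(float)   # fake ratings
    mask = rng.random((6, 12)) < 0.5                     # ~half observed
    Q = 0.1 * rng.normal(size=(6, 3))                    # items x f
    P = 0.1 * rng.normal(size=(12, 3))                   # users x f
    for _ in range(500):
        P, Q = batch_gd_step(R, mask, P, Q)
    sse = float(np.sum((np.where(mask, R - Q @ P.T, 0.0)) ** 2))
    print("training SSE:", round(sse, 2))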
RECSYS 72
Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇qif], where ∇qif = Σ_{x,i} −2 (rxi − qi·pxT) pxf + 2λ qif = Σ_{x,i} ∇Q(rxi); here qif is entry f of row qi of matrix Q
 GD: Q ← Q − η · Σ_{rxi} ∇Q(rxi)
 Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − η · ∇Q(rxi)
 Faster convergence!
• Needs more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[plot: value of the objective function vs. iteration/step, for GD and SGD]
 GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
• εxi = rxi − qi · pxT (the prediction "error")
• qi ← qi + η (εxi px − λ qi) (update equation)
• px ← px + η (εxi qi − λ px) (update equation)
 2 for loops: for until convergence: for each rxi, compute the gradient and do a "step"
 η … learning rate
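The two loops above in runnable form (a sketch: the triple-list encoding, hyperparameters, and initialization scale are assumptions; the update equations are the ones on the slide). The demo triples are known entries of movies 1, 3, and 6 from the earlier 6 × 12 example:

    import numpy as np
    import random

    def sgd_factorize(ratings, n_items, n_users, f=3, eta=0.01, lam=0.1, epochs=200):
        """SGD for R ~ Q P^T over (item i, user x, rating r) triples."""
        rng = np.random.default_rng(0)
        Q = 0.1 * rng.normal(size=(n_items, f))
        P = 0.1 * rng.normal(size=(n_users, f))
        for _ in range(epochs):                  # "for until convergence"
            random.shuffle(ratings)
            for i, x, r in ratings:              # "for each r_xi"
                err = r - Q[i] @ P[x]            # eps_xi
                Q[i] += eta * (err * P[x] - lam * Q[i])   # update equations
                P[x] += eta * (err * Q[i] - lam * P[x])
        return Q, P

    triples = [(0, 0, 1), (0, 2, 3), (0, 5, 5), (0, 8, 5), (0, 10, 4),
               (2, 0, 2), (2, 1, 4), (2, 3, 1), (2, 4, 2), (2, 10, 5),
               (5, 0, 1), (5, 2, 3), (5, 4, 3), (5, 7, 2), (5, 10, 4)]
    Q, P = sgd_factorize(triples, n_items=6, n_users=12)
    print(round(float(Q[0] @ P[4]), 2))          # predicted r_15 (movie 1, user 5)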
RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data, 2000–2005
• Test data: last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix Cinematch RMSE: 0.9514
• Competition: 2,700+ teams; $1 million grand prize for 10% improvement on the Cinematch result; $50,000 2007 progress prize for 8.43% improvement
[table residue: training data as (user, movie, score) triples vs. test data as (user, movie, ?) triples]
Movie rating data (from TimelyDevelopment.com)
 Overall rating distribution:
• A third of the ratings are 4s; the average rating is 3.68
 # ratings per movie: avg. ratings/movie = 5,627
 # ratings per user: avg. ratings/user = 208
 Average movie rating by movie count: more ratings go to better movies
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully-tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂xi = μ + bx + bi + qi · pxT
 (user bias + movie bias + user-movie interaction)
 μ = overall mean rating; bx = bias of user x; bi = bias of movie i
 Baseline predictor (μ + bx + bi): separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition
 User-movie interaction (qi · pxT): characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example: mean rating μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean, so bx = −1
 Star Wars gets a mean rating 0.5 higher than the average movie, so bi = +0.5
 Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
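The same arithmetic as a tiny function, with the interaction term of the full model included as an option (names and the example factor vectors are illustrative):

    import numpy as np

    def predict(mu, b_x, b_i, q_i=None, p_x=None):
        """Baseline mu + b_x + b_i, plus the q_i . p_x interaction if given."""
        interaction = float(np.dot(q_i, p_x)) if q_i is not None else 0.0
        return mu + b_x + b_i + interaction

    # The slide's example: critical reviewer rating Star Wars.
    print(predict(mu=3.7, b_x=-1.0, b_i=0.5))                          # -> 3.2
    print(predict(3.7, -1.0, 0.5, q_i=[0.2, -0.4], p_x=[1.0, -0.5]))   # +0.4 -> 3.6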
RECSYS 105
Fitting the New Model
 Solve:
min_{Q,P} Σ_{(x,i)∈R} ( rxi − (μ + bx + bi + qi·pxT) )² + λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )
 (first term: goodness of fit; second term: regularization)
 λ is selected via grid-search on a validation set
 Stochastic gradient descent to find the parameters
 Note: both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them)
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global effects: overall deviations of users/movies
 Factorization: addressing "regional" effects
 Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
 Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed
 Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: rxi = μ + bx + bi + qi · pxT
 Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · pxT
 Make the parameters bx and bi depend on time:
 (1) parameterize the time-dependence by linear trends
 (2) each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: px(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
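One plausible reading of (1)–(2), following the functional forms in Koren's KDD '09 paper (this parameterization is from that paper, not spelled out on the slide; all constants are illustrative): the movie bias gets a per-bin offset over ~10-week bins, and the user bias drifts away from the user's mean rating date along a learned trend:

    import numpy as np

    def dev_u(t, t_u, beta=0.4):
        """Signed, sub-linear deviation of day t from user u's mean date t_u."""
        return np.sign(t - t_u) * abs(t - t_u) ** beta

    def user_bias(t, b_u, alpha_u, t_u):
        """b_x(t): static bias plus a learned drift along dev_u(t)."""
        return b_u + alpha_u * dev_u(t, t_u)

    def movie_bias(t, b_i, bin_offsets, bin_days=70):
        """b_i(t): static bias plus an offset for t's ~10-week bin."""
        return b_i + bin_offsets[int(t // bin_days)]

    print(round(user_bias(t=300, b_u=-0.2, alpha_u=0.01, t_u=150), 3))
    print(movie_bias(t=200, b_i=0.3, bin_offsets=[0.0, -0.1, 0.2, 0.25]))  # 0.5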
RECSYS 111
Adding Temporal Effects
[plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
Global average: 1.1296; User average: 1.0651; Movie average: 1.0533; Netflix: 0.9514
Basic collaborative filtering: 0.94; Collaborative filtering++: 0.91; Latent factors: 0.90; Latent factors + biases: 0.89; Latent factors + biases + time: 0.876
Grand Prize: 0.8563
 Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[plot: error (RMSE, 0.87–0.8925) vs. number of predictors in the ensemble (1–50); RMSE keeps falling as predictors are added]
RECSYS 28
Similar Users Let rx be the vector of user xrsquos ratings
Jaccard similarity measure Problem Ignores the value of the rating
Cosine similarity measure
sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||
Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient
Sxy = items rated by both users x and y
1162019
rx = [ _ _ ]ry = [ _ _]
rx ry as setsrx = 1 4 5ry = 1 3 4
rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0
rx ry hellip avgrating of x y
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 29
Similarity Metric
Intuitively we want sim(A B) gt sim(A C)
Jaccard similarity 15 lt 24
Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean
1162019
sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0
Cosine sim
RECSYS 30
Rating Predictions
Let rx be the vector of user xrsquos ratings
Let N be the set of k users most similar to xwho have rated item i
Prediction for item i of user x 119903119903119909119909119909119909 = 1
119896119896sum119910119910isin119873119873 119903119903119910119910119909119909
119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910
Other options
Many other tricks possiblehellip1162019
Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap: SVD

Remember SVD:
- A: input data matrix
- U: left singular vectors
- V: right singular vectors
- Σ: singular values

SVD gives the minimum reconstruction error (SSE):

min_{U,Σ,V} \sum_{x,i} (A_{xi} - [UΣV^T]_{xi})^2

So, in our case, "SVD" on Netflix data R ≈ Q · PT means A = R, Q = U, PT = Σ VT, and \hat{r}_{xi} = q_i \cdot p_x^T.

But we are not done yet: the sum goes over all entries of A, while our R has missing entries.

[Figure: A (m x n) ≈ U Σ VT]
RECSYS 60
Latent Factor Models

SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:

min_{P,Q} \sum_{(x,i) \in R} (r_{xi} - q_i \cdot p_x^T)^2

Note:
- We don't require the columns of P, Q to be orthogonal or unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants

[Figure: R ≈ Q · PT, with Q of size items x f factors and PT of size f factors x users]
RECSYS 61
Dealing with Missing Entries

Want to minimize SSE for unseen test data.
Idea: Minimize SSE on training data.
- Want large f (# of factors) to capture all the signals
- But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
- Allow a rich model where there are sufficient data
- Shrink aggressively where data are scarce

min_{P,Q} \sum_{(x,i) \in training} (r_{xi} - q_i p_x^T)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]

The first term is the "error", the second the "length"; λ is the regularization parameter.
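To make the objective concrete, here is a small sketch (names and data layout are illustrative assumptions, not from the slides) that evaluates this regularized SSE for a given P and Q over the known ratings only:

```python
import numpy as np

def regularized_sse(ratings, P, Q, lam):
    """Objective from the slide: SSE on known ratings plus L2 penalty.

    ratings: list of (x, i, r) triples -- user x, item i, rating r
    P: f x n_users matrix (column p_x per user)
    Q: n_items x f matrix (row q_i per item)
    lam: regularization parameter lambda
    """
    error = sum((r - Q[i] @ P[:, x]) ** 2 for x, i, r in ratings)
    length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return error + length
```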
RECSYS 62
Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
[Figure: users (e.g. Gus, Dave) and movies embedded in the same two-factor space ("geared towards females" vs. "geared towards males"; "serious" vs. "escapist"); a user is recommended the movies closest to them in the latent space]
RECSYS 64
The Effect of Regularization

min_{factors} "error" + λ "length"
= min_{P,Q} \sum_{(x,i) \in training} (r_{xi} - q_i p_x^T)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]

[Figure, shown over several animation frames: movies in the two-factor space ("geared towards females" vs. "geared towards males"; "serious" vs. "funny"); as λ grows, sparsely-rated movies such as The Princess Diaries are pulled toward the origin]
RECSYS 68
Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} \sum_{(x,i) \in training} (r_{xi} - q_i p_x^T)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]

Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_{if}] and ∇q_{if} = \sum_{x,i} -2 (r_{xi} - q_i p_x^T) p_{xf} + 2λ q_{if}
  - Here q_{if} is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently.
Observation: Computing gradients is slow.
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.
RECSYS 70
(Batch) Gradient Descent
(Same setup as above: initialize P and Q via SVD with missing ratings set to 0, then take full-batch steps P ← P − η·∇P and Q ← Q − η·∇Q, recomputing ∇P and ∇Q over all training ratings at every step. Computing these gradients is slow.)
RECSYS 72
Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD.
Observation: ∇Q = [∇q_{if}], where

∇q_{if} = \sum_{x,i} -2 (r_{xi} - q_i \cdot p_x^T) p_{xf} + 2λ q_{if} = \sum_{x,i} ∇Q(r_{xi})

Here q_{if} is entry f of row q_i of matrix Q, and ∇Q(r_{xi}) is the contribution of the single rating r_{xi} to the gradient.

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
- GD: Q ← Q − η \sum_{xi} ∇Q(r_{xi})
- SGD: Q ← Q − μ ∇Q(r_{xi})

Faster convergence!
- Needs more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD
Convergence of GD vs. SGD:
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_{xi}:
  - ε_{xi} = r_{xi} - q_i \cdot p_x^T   (the prediction "error")
  - q_i ← q_i + η (ε_{xi} p_x - λ q_i)   (update equation)
  - p_x ← p_x + η (ε_{xi} q_i - λ p_x)   (update equation)

2 for loops:
- For until convergence:
  - For each r_{xi}: compute the gradient, do a "step"

η … learning rate
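A minimal runnable sketch of this SGD loop follows. The function name, the toy ratings, and the random initialization (instead of the slides' SVD-based initialization) are assumptions for illustration:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.01, lam=0.1, epochs=50):
    """SGD for the latent factor model, trained on known ratings only.

    ratings: list of (x, i, r) triples -- user x, item i, rating r.
    Returns P (f x n_users) and Q (n_items x f); prediction is Q[i] @ P[:, x].
    """
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(f, n_users))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):                  # outer loop: until convergence
        for x, i, r in ratings:              # inner loop: one step per rating
            err = r - Q[i] @ P[:, x]         # eps_xi = r_xi - q_i . p_x
            qi = Q[i].copy()                 # keep old q_i for the p_x update
            Q[i]    += eta * (err * P[:, x] - lam * Q[i])     # q_i update
            P[:, x] += eta * (err * qi      - lam * P[:, x])  # p_x update
    return P, Q

# toy usage
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
P, Q = sgd_factorize(ratings, n_users=3, n_items=2)
print(round(Q[0] @ P[:, 0], 2))  # reconstructed rating for user 0, item 0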
RECSYS 75
[Figure from Koren, Bell, Volinsky: IEEE Computer, 2009]
RECSYS 76
Summary: Recommendations via Optimization

Goal: Make good recommendations.
- Quantify goodness using SSE: lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen
- Let's set the values of P and Q such that they work well on the known (user, item) ratings
- And hope these values of P and Q will predict the unknown ratings well

This is a case where we apply optimization methods.
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"

Netflix Prize:
- Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
- Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition
  - 2700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement
[Tables: training data and test data, each with columns user / movie / score; the test-set scores are withheld]
Movie rating data
Overall rating distribution:
- A third of the ratings are 4s
- Average rating is 3.68

#ratings per movie:
- Avg #ratings/movie: 5627

#ratings per user:
- Avg #ratings/user: 208

Average movie rating by movie count:
- Better movies attract more ratings

(From TimelyDevelopment.com)
Most loved movies
Count   Avg rating  Title
137812  4.593       The Shawshank Redemption
133597  4.545       Lord of the Rings: The Return of the King
180883  4.306       The Green Mile
150676  4.460       Lord of the Rings: The Two Towers
139050  4.415       Finding Nemo
117456  4.504       Raiders of the Lost Ark
180736  4.299       Forrest Gump
147932  4.433       Lord of the Rings: The Fellowship of the Ring
149199  4.325       The Sixth Sense
144027  4.333       Indiana Jones and the Last Crusade
Challenges
- Size of data
  - Scalability
  - Keeping data in memory
- Missing data
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g. movie 16322)
- Use an ensemble of complementing predictors
- Two half-tuned models are worth more than a single fully-tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions

\hat{r}_{xi} = μ + b_x + b_i + q_i \cdot p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

User-movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor

We have expectations for the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1
- Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
- Predicted rating for you on Star Wars: 3.7 - 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model

Solve:

min_{Q,P,b} \sum_{(x,i) \in R} \left( r_{xi} - (μ + b_x + b_i + q_i p_x^T) \right)^2 + \lambda \left[ \sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2 \right]

(the first term is the goodness of fit, the second the regularization; λ is selected via grid-search on a validation set)

Stochastic gradient descent to find the parameters.
Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
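A sketch extending the earlier SGD loop to this biased model follows. The slides only give the q_i and p_x updates; the analogous bias updates below are a standard choice and are an assumption here, as are the function name and initialization:

```python
import numpy as np

def sgd_with_biases(ratings, n_users, n_items, f=2, eta=0.01, lam=0.1, epochs=50):
    """SGD sketch for r_hat = mu + b_x + b_i + q_i . p_x on known ratings."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(f, n_users))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    b_user = np.zeros(n_users)
    b_item = np.zeros(n_items)
    mu = np.mean([r for _, _, r in ratings])    # overall mean rating
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[:, x])
            b_user[x] += eta * (err - lam * b_user[x])   # assumed bias update
            b_item[i] += eta * (err - lam * b_item[i])   # assumed bias update
            qi = Q[i].copy()
            Q[i]    += eta * (err * P[:, x] - lam * Q[i])
            P[:, x] += eta * (err * qi      - lam * P[:, x])
    return mu, b_user, b_item, P, Q
```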
RECSYS 106
BellKor Recommender System

The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

[Figure: three nested scales: global effects, factorization, collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000, log scale) for CF (no time bias), basic latent factors, and latent factors with biases]
RECSYS 108
Performance of Various Methods

Global average:                 1.1296
User average:                   1.0651
Movie average:                  1.0533
Netflix (Cinematch):            0.9514
Basic collaborative filtering:  0.94
Collaborative filtering++:      0.91
Latent factors:                 0.90
Latent factors + biases:        0.89
Grand Prize:                    0.8563
RECSYS 109
Temporal Biases of Users

Sudden rise in the average movie rating (early 2004):
- Improvements in Netflix
- GUI improvements
- Meaning of rating changed

Movie age:
- Users prefer new movies without any reasons
- Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors

Original model: r_{xi} = μ + b_x + b_i + q_i \cdot p_x^T
Add time dependence to the biases: r_{xi} = μ + b_x(t) + b_i(t) + q_i \cdot p_x^T
Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
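One plausible reading of the binned bias (following the referenced KDD '09 model, b_i(t) = b_i + b_{i,Bin(t)}) is sketched below; the function and its arguments are hypothetical illustrations, not the slides' notation:

```python
def item_bias_t(b_i, b_i_bin, t, bin_weeks=10):
    """Binned time-dependent movie bias: b_i(t) = b_i + b_{i,Bin(t)}.

    t: day index of the rating; each bin spans 10 consecutive weeks (70 days).
    b_i_bin: learned per-bin offsets for this movie (assumed long enough).
    """
    return b_i + b_i_bin[t // (bin_weeks * 7)]
```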
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10000, log scale) for CF (no time bias), basic latent factors, CF (time bias), latent factors with biases, + linear time factors, + per-day user biases, + CF]
RECSYS 112
Performance of Various Methods

Global average:                   1.1296
User average:                     1.0651
Movie average:                    1.0533
Netflix (Cinematch):              0.9514
Basic collaborative filtering:    0.94
Collaborative filtering++:        0.91
Latent factors:                   0.90
Latent factors + biases:          0.89
Latent factors + biases + time:   0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of Ensemble Size
[Plot: error (RMSE, 0.87-0.8925) vs. number of predictors (1-50); RMSE falls steadily as more predictors are blended]
RECSYS 30
Rating Predictions

Let r_x be the vector of user x's ratings.
Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
- Simple average: \hat{r}_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}
- Similarity-weighted average: \hat{r}_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}
  (shorthand: s_{xy} = sim(x, y))
- Many other tricks possible…
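A minimal sketch of the similarity-weighted prediction, assuming a dense ratings array with NaN for unknown entries and a precomputed user-user similarity matrix (both layouts are illustrative assumptions):

```python
import numpy as np

def predict_user_user(x, i, R, sim, k=2):
    """Similarity-weighted prediction r_hat_xi from the slide.

    R: ratings array (users x items), np.nan for unknown entries
    sim: user-user similarity matrix, sim[x, y] = s_xy
    """
    # N = the k users most similar to x who have rated item i
    rated = [y for y in range(R.shape[0]) if y != x and not np.isnan(R[y, i])]
    N = sorted(rated, key=lambda y: sim[x, y], reverse=True)[:k]
    num = sum(sim[x, y] * R[y, i] for y in N)
    den = sum(sim[x, y] for y in N)
    return num / den
```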
RECSYS 31
Another Type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

So far: user-user collaborative filtering.
Another view: item-item.
- For item i, find other similar items
- Estimate the rating for item i based on the target user's ratings on items similar to item i
- Can use the same similarity metrics and prediction functions as in the user-user model

\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}

s_{ij} … similarity of items i and j
r_{xj} … rating of user x on item j
N(i;x) … set of items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)

          users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:          1  -  3  -  ?  5  -  -  5  -  4  -
movie 2:          -  -  5  4  -  -  4  -  -  2  1  3
movie 3:          2  4  -  1  2  -  3  -  4  3  5  -
movie 4:          -  2  4  -  5  -  -  4  -  -  2  -
movie 5:          -  -  4  3  4  2  -  -  -  -  2  5
movie 6:          1  -  3  -  3  -  -  2  -  -  4  -

("-" = unknown rating; known ratings are between 1 and 5)
RECSYS 33
Item-Item CF (|N|=2)
Estimate the rating of movie 1 by user 5 (the "?" entry above).
RECSYS 34
Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.
Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i, e.g.
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the rows:
   sim(1, m) for m = 1..6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59
RECSYS 35
Item-Item CF (|N|=2)
Compute the similarity weights: the two most similar movies rated by user 5 are movies 3 and 6, with s_{1,3} = 0.41 and s_{1,6} = 0.59.
RECSYS 36
Item-Item CF (|N|=2)
Predict by taking the weighted average:
r_{1,5} = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
RECSYS 37
Common Practice for Item-Item Collaborative Filtering

- Define the similarity s_{ij} of items i and j
- Select the K nearest neighbors (KNN) N(i; x): the items most similar to item i that were rated by x
- Estimate the rating r_{xi} as the weighted average:

\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}

N(i;x) = set of items similar to item i that were rated by x
s_{ij} = similarity of items i and j
r_{xj} = rating of user x on item j
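The sketch below reproduces the worked example from the previous slides, assuming the rating matrix as reconstructed above; the mean-centered-cosine similarity is one common reading of the slides' "Pearson correlation" step:

```python
import numpy as np

# Estimate movie 1's rating by user 5 via item-item CF with |N| = 2.
nan = np.nan
R = np.array([
    [1, nan, 3, nan, nan, 5, nan, nan, 5, nan, 4, nan],   # movie 1
    [nan, nan, 5, 4, nan, nan, 4, nan, nan, 2, 1, 3],     # movie 2
    [2, 4, nan, 1, 2, nan, 3, nan, 4, 3, 5, nan],         # movie 3
    [nan, 2, 4, nan, 5, nan, nan, 4, nan, nan, 2, nan],   # movie 4
    [nan, nan, 4, 3, 4, 2, nan, nan, nan, nan, 2, 5],     # movie 5
    [1, nan, 3, nan, 3, nan, nan, 2, nan, nan, 4, nan],   # movie 6
])

# Pearson-style similarity: mean-center each row, then cosine between rows.
M = np.nan_to_num(R - np.nanmean(R, axis=1, keepdims=True))  # unknowns -> 0
norms = np.linalg.norm(M, axis=1)
sim = (M @ M.T) / np.outer(norms, norms)

i, x, k = 0, 4, 2                          # movie 1, user 5 (0-indexed), |N| = 2
rated = [j for j in range(R.shape[0]) if j != i and not np.isnan(R[j, x])]
N = sorted(rated, key=lambda j: sim[i, j], reverse=True)[:k]
pred = sum(sim[i, j] * R[j, x] for j in N) / sum(sim[i, j] for j in N)
print(N, round(pred, 1))                   # movies 6 and 3 (indices 5, 2) -> 2.6
```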
RECSYS 38
Item-Item vs. User-User

[Table: the Avatar / LOTR / Matrix / Pirates utility matrix for Alice, Bob, Carol, David]

In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.
RECSYS 39
Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods

- Implement two or more different recommenders and combine their predictions, perhaps using a linear model
- Add content-based methods to collaborative filtering:
  - Item profiles to deal with the new-item problem
  - Demographics to deal with the new-user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Figure: the movies x users utility matrix of known ratings]
RECSYS 43
Evaluation
[Figure: the same utility matrix with a subset of known entries withheld as the test data set]
RECSYS 44
Evaluating Predictions

Compare predictions with known ratings:
- Root-mean-square error (RMSE): \sqrt{\frac{1}{N}\sum_{xi} (\hat{r}_{xi} - r_{xi})^2}, where \hat{r}_{xi} is the predicted and r_{xi} the true rating of x on i
- Precision at top 10: % of relevant items among the top 10
- Rank correlation: Spearman's correlation between the system's and the user's complete rankings
- Another approach: 0/1 model
  - Coverage: number of items/users for which the system can make predictions
  - Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  - Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR), X-axis: false positive rate (FPR)
    - TPR (a.k.a. Recall) = TP / P = TP / (TP + FN)
    - FPR = FP / N = FP / (FP + TN)
    - See https://en.wikipedia.org/wiki/Precision_and_recall
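Minimal sketches of two of these metrics (function names and the tiny examples are mine, for illustration):

```python
import numpy as np

def rmse(preds, truths):
    """Root-mean-square error over (prediction, true rating) pairs."""
    preds, truths = np.asarray(preds), np.asarray(truths)
    return np.sqrt(np.mean((preds - truths) ** 2))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

print(rmse([3.2, 4.1], [3.0, 5.0]))                 # ~0.65
print(precision_at_k(list(range(10)), {1, 3, 99}))  # 0.2
```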
RECSYS 45
Problems with Error Measures

A narrow focus on accuracy sometimes misses the point:
- Prediction diversity
- Prediction context
- Order of predictions

In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for the others.
RECSYS 46
Collaborative Filtering: Complexity

The expensive step is finding the k most similar customers: O(|X|). (Recall that X = set of customers in the system.)
Too expensive to do at runtime; could pre-compute, using clustering as an approximation.
Naive pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points.
We already know how to do this:
- Near-neighbor search in high dimensions (LSH)
- Clustering
- Dimensionality reduction
RECSYS 47
Tip: Add Data

- Leverage all the data: don't try to reduce data size in an effort to make fancy algorithms work; simple methods on large data do best
- Add more data, e.g. add IMDB data on genres
- More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 31
Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
So far User-user collaborative filtering
Another view Item-item For item i find other similar items Estimate rating for item i based
on the target userrsquos ratings on items similar to item i Can use same similarity metrics and
prediction functions as in user-user model
1162019
sumsum
isin
isinsdot
=)(
)(
xiNj ij
xiNj xjijxi s
rsr
sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i
RECSYS 32
Item-Item CF (|N|=2)
121110987654321
455311
3124452
534321423
245424
5224345
423316
users
mov
ies
- unknown rating - rating between 1 to 5
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary: Recommendations via Optimization
 Goal: Make good recommendations
 • Quantify goodness using SSE: lower SSE means better recommendations
 • We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict the unknown ratings well
 This is a case where we apply Optimization methods
[Small example utility matrix with known and unknown ratings]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
Movie rating data
[Tables: training data as (user, movie, score) triples; test data as (user, movie, ?) triples with scores withheld]
Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
Ratings per movie
• Avg ratings/movie: 5627
Ratings per user
• Avg ratings/user: 208
Average movie rating by movie count
• More ratings to better movies
From TimelyDevelopment.com
Most loved movies
Count    Avg rating  Title
137,812  4.593  The Shawshank Redemption
133,597  4.545  Lord of the Rings: The Return of the King
180,883  4.306  The Green Mile
150,676  4.460  Lord of the Rings: The Two Towers
139,050  4.415  Finding Nemo
117,456  4.504  Raiders of the Lost Ark
180,736  4.299  Forrest Gump
147,932  4.433  Lord of the Rings: The Fellowship of the Ring
149,199  4.325  The Sixth Sense
144,027  4.333  Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully-tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
 rxi = µ + bx + bi + qi·pxT
   (user bias + movie bias + user-movie interaction)
 μ = overall mean rating, bx = bias of user x, bi = bias of movie i
 User-Movie interaction:
 • Characterizes the matching between users and movies
 • Attracts most research in the field
 • Benefits from algorithmic and mathematical innovations
 Baseline predictor:
 • Separates users and movies
 • Benefits from insights into users' behavior
 • Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Example:
 Mean rating: µ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
 Predicted rating for you on Star Wars: = 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
 Solve:
   min_{Q,P,b} Σ_{(x,i)∈R} ( rxi − (µ + bx + bi + qi·pxT) )²   (goodness of fit)
        + λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )   (regularization)
 Stochastic gradient descent to find parameters
 Note: Both biases bx, bi as well as interactions qi, px are treated as parameters (we estimate them); see the sketch below
 λ is selected via grid-search on a validation set
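For concreteness, one SGD step for this biased model might look as follows. This is a sketch: the update form mirrors the earlier factor updates extended with bias terms, and the names are illustrative:

```python
import numpy as np

def sgd_step_with_biases(r, x, i, mu, b_user, b_item, P, Q, eta=0.005, lam=0.02):
    """One SGD update for a single observed rating r = r_xi."""
    pred = mu + b_user[x] + b_item[i] + Q[i] @ P[x]
    err = r - pred
    b_user[x] += eta * (err - lam * b_user[x])   # update user bias b_x
    b_item[i] += eta * (err - lam * b_item[i])   # update movie bias b_i
    q_old = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])      # update item factors q_i
    P[x] += eta * (err * q_old - lam * P[x])     # update user factors p_x
```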
RECSYS 106
BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns
[Diagram: Global effects → Factorization → Collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix: 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004):
 • Improvements in Netflix
 • GUI improvements
 • Meaning of rating changed
 Movie age:
 • Users prefer new movies without any reasons
 • Older movies are just inherently better than newer ones
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: rxi = µ + bx + bi + qi·pxT
 Add time dependence to biases: rxi = µ + bx(t) + bi(t) + qi·pxT
 Make parameters bx and bi depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to factors: px(t) … user preference vector on day t
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
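A minimal sketch of the binned time-dependent movie bias bi(t) = bi + bi,Bin(t). Assumption: this follows the KDD '09 parameterization only in outline; the constants and names are illustrative:

```python
import numpy as np

BIN_DAYS = 70  # each bin = 10 consecutive weeks

def movie_bias_at(i, t_days, b_item, b_item_bin, t0=0):
    """Binned time-dependent movie bias: b_i(t) = b_i + b_{i,Bin(t)}."""
    b = (t_days - t0) // BIN_DAYS          # bin index of day t
    return b_item[i] + b_item_bin[i, b]
```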
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix: 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Latent factors + Biases + Time: 0.876
 Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: Error (RMSE, 0.87-0.8925) vs. number of predictors (1-50)]
RECSYS 32
Item-Item CF (|N|=2)
          users
          1  2  3  4  5  6  7  8  9 10 11 12
 movie 1  1  .  3  .  .  5  .  .  5  .  4  .
 movie 2  .  .  5  4  .  .  4  .  .  2  1  3
 movie 3  2  4  .  1  2  .  3  .  4  3  5  .
 movie 4  .  2  4  .  5  .  .  4  .  .  2  .
 movie 5  .  .  4  3  4  2  .  .  .  .  2  5
 movie 6  1  .  3  .  3  .  .  2  .  .  4  .
 ("." = unknown rating; numbers = rating between 1 to 5)
RECSYS 33
Item-Item CF (|N|=2)
[Same utility matrix; the unknown rating r15 is highlighted]
 - estimate rating of movie 1 by user 5
RECSYS 34
Item-Item CF (|N|=2)
[Same utility matrix; user 5's column and the movies user 5 rated are highlighted]
 Neighbor selection: Identify movies similar to movie 1, rated by user 5
 sim(1,m): 1.00, -0.18, 0.41, -0.10, -0.31, 0.59  (for m = 1..6)
 Here we use Pearson correlation as similarity:
 1) Subtract mean rating mi from each movie i
    m1 = (1+3+5+5+4)/5 = 3.6
    row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
[Same utility matrix; movies 3 and 6 highlighted as the two nearest neighbors of movie 1]
 Compute similarity weights: s1,3 = 0.41, s1,6 = 0.59
RECSYS 36
Item-Item CF (|N|=2)
[Same utility matrix, with the missing entry r15 filled in as 2.6]
 Predict by taking weighted average:
 r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
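The whole worked example can be reproduced in a few lines of numpy. The matrix layout is the one from the slides; the variable names are mine:

```python
import numpy as np

nan = np.nan
R = np.array([  # movies x users (6 x 12); nan = unknown rating
    [1,   nan, 3,   nan, nan, 5,   nan, nan, 5,   nan, 4,   nan],
    [nan, nan, 5,   4,   nan, nan, 4,   nan, nan, 2,   1,   3  ],
    [2,   4,   nan, 1,   2,   nan, 3,   nan, 4,   3,   5,   nan],
    [nan, 2,   4,   nan, 5,   nan, nan, 4,   nan, nan, 2,   nan],
    [nan, nan, 4,   3,   4,   2,   nan, nan, nan, nan, 2,   5  ],
    [1,   nan, 3,   nan, 3,   nan, nan, 2,   nan, nan, 4,   nan],
])

def centered(row):
    c = row - np.nanmean(row)                 # subtract the movie's mean rating
    return np.where(np.isnan(row), 0.0, c)    # treat missing entries as 0

def sim(i, j):                                # Pearson = cosine of centered rows
    a, b = centered(R[i]), centered(R[j])
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

s13, s16 = sim(0, 2), sim(0, 5)
print(round(s13, 2), round(s16, 2))           # 0.41 0.59
r15 = (s13 * R[2, 4] + s16 * R[5, 4]) / (s13 + s16)
print(round(r15, 1))                          # 2.6
```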
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
 Define similarity sij of items i and j
 Select K nearest neighbors (KNN) N(i; x): set of items most similar to item i that were rated by x
 Estimate rating rxi as the weighted average:
   r̂xi = ( Σ_{j∈N(i;x)} sij · rxj ) / ( Σ_{j∈N(i;x)} sij )
 N(i;x) = set of items similar to item i that were rated by x
 sij = similarity of items i and j
 rxj = rating of user x on item j
RECSYS 38
Item-Item vs User-User
[Small example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]
 In practice, it has been observed that item-item often works better than user-user.
 Why? Items are simpler; users have multiple tastes.
RECSYS 39
Pros/Cons of Collaborative Filtering
 + Works for any kind of item: no feature selection needed
 - Cold Start: need enough users in the system to find a match
 - Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
 - First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
 - Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
 Implement two or more different recommenders and combine predictions, perhaps using a linear model
 Add content-based methods to collaborative filtering:
 • Item profiles for new item problem
 • Demographics to deal with new user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Utility matrix figure: movies × users with known ratings (1-5) and many missing entries]
RECSYS 43
Evaluation
[Same utility matrix; a subset of the known ratings is withheld as the Test Data Set]
RECSYS 44
Evaluating Predictions
 Compare predictions with known ratings:
 Root-mean-square error (RMSE): sqrt( Σxi (r̂xi − rxi)² / N )
 • where r̂xi is the predicted rating and rxi is the true rating of x on i
 Precision at top 10:
 • % of those in top 10
 Rank Correlation:
 • Spearman's correlation between system's and user's complete rankings
 Another approach: 0/1 model
 Coverage:
 • Number of items/users for which system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve:
 Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
 • TPR (aka Recall) = TP / P = TP / (TP + FN)
 • FPR = FP / N = FP / (FP + TN)
 • See https://en.wikipedia.org/wiki/Precision_and_recall
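Two of these metrics as small Python helpers (a sketch; the names are illustrative):

```python
import numpy as np

def rmse(pred, true):
    """Root-mean-square error over held-out (test) ratings."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.sqrt(np.mean((pred - true) ** 2))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = list(recommended)[:k]
    return sum(item in set(relevant) for item in top_k) / k

print(rmse([3.2, 4.1], [3, 5]))                    # example usage
print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=3))
```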
RECSYS 45
Problems with Error Measures
 Narrow focus on accuracy sometimes misses the point:
 • Prediction Diversity
 • Prediction Context
 • Order of predictions
 In practice, we care only to predict high ratings:
 • RMSE might penalize a method that does well for high ratings and badly for others
RECSYS 46
Collaborative Filtering: Complexity
 Expensive step is finding the k most similar customers: O(|X|)
 • Recall that X = set of customers in the system
 Too expensive to do at runtime; could pre-compute, using clustering as an approximation
 • Naïve pre-computation takes time O(N·|C|) (|C| = # of clusters = k in the k-means, N = # of data points)
 We already know how to do this:
 • Near-neighbor search in high dimensions (LSH)
 • Clustering
 • Dimensionality reduction
RECSYS 47
Tip: Add Data
 Leverage all the data:
 • Don't try to reduce data size in an effort to make fancy algorithms work
 • Simple methods on large data do best
 Add more data:
 • e.g., add IMDB data on genres
 More data beats better algorithms
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[2D map of movies along two latent axes: "Geared towards females" ↔ "Geared towards males" and "Serious" ↔ "Funny"; movies placed include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)
RECSYS 51
The Netflix Utility Matrix R
[Utility matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings 1-5]
RECSYS 52
Utility Matrix R: Evaluation
[Matrix R (480,000 users × 17,700 movies) split into a Training Data Set and a withheld Test Data Set; example held-out entry r3,6]
 SSE = Σ_{(i,x)∈R} ( r̂xi − rxi )²
 • r̂xi: predicted rating; rxi: true rating of user x on item i
RECSYS 53
Latent Factor Models
 "SVD" on Netflix data: R ≈ Q · PT
 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
 R has missing entries, but let's ignore that for now
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); SVD: A = U Σ VT]
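A shape-level numpy sketch of this factorization (illustrative sizes, not the Netflix data):

```python
import numpy as np

n_items, n_users, f = 6, 12, 3      # illustrative sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_items, f))   # items x f factors
P = rng.normal(size=(n_users, f))   # users x f factors (PT is f x users)
R_hat = Q @ P.T                     # items x users: r_hat_xi = q_i . p_x^T
print(R_hat.shape)                  # (6, 12)
print(Q[0] @ P[4])                  # estimated rating of item 0 by user 4
```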
RECSYS 54
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
[Figure: R ≈ Q · PT, with the missing entry estimated from row i of Q and column x of PT]
 r̂xi = qi · pxT = Σf qif · pxf
 qi = row i of Q; px = column x of PT
RECSYS 55
(Slide repeated: same figure and estimate formula as above)
RECSYS 56
(Slide repeated: same figure, with the missing rating estimated as 2.4)
RECSYS 57
Latent Factor Models
[2D map of the movies on Factor 1 ("geared towards females" ↔ "geared towards males") vs. Factor 2 ("serious" ↔ "funny")]
RECSYS 58
(Slide repeated: same latent-factor map as above)
RECSYS 59
Recap: SVD
 Remember SVD:
 • A: Input data matrix
 • U: Left singular vecs
 • V: Right singular vecs
 • Σ: Singular values
 SVD gives minimum reconstruction error (SSE): min_{U,V,Σ} Σ_{ij} ( Aij − (UΣVT)ij )²
 • The sum goes over all entries. But our R has missing entries!
 So in our case: "SVD" on Netflix data: R ≈ Q · PT
 A = R, Q = U, PT = Σ VT
 r̂xi = qi · pxT
 But we are not done yet! R has missing entries
[Figure: A (m × n) ≈ U Σ VT]
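The SVD-based initialization mentioned in the SGD slides can be sketched as follows, zero-filling missing ratings as the slides suggest; np.linalg.svd is standard numpy, the helper name is mine:

```python
import numpy as np

def init_PQ_from_svd(R, f):
    """R: items x users with np.nan for missing. Returns Q (items x f), PT (f x users)."""
    A = np.nan_to_num(R, nan=0.0)             # pretend missing ratings are 0
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Q = U[:, :f]                              # Q = U (truncated to f factors)
    PT = np.diag(s[:f]) @ Vt[:f]              # PT = Sigma VT
    return Q, PT
```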
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q:
   min_{P,Q} Σ_{(x,i)∈R} ( rxi − qi · pxT )²
 Note:
 • We don't require cols of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • The most popular model among Netflix contestants
[Figure: R ≈ Q · PT, r̂xi = qi · pxT]
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
 • Want large f (# of factors) to capture all the signals
 • But, SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
 • Allow rich model where there are sufficient data
 • Shrink aggressively where data are scarce
   min_{P,Q} Σ_{(x,i)∈training} ( rxi − qi pxT )² + λ [ Σx ||px||² + Σi ||qi||² ]
   ("error" + λ · "length"; λ … regularization parameter)
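A small helper computing this regularized training objective, useful for monitoring convergence during (stochastic) gradient descent; a sketch with illustrative names:

```python
def objective(ratings, P, Q, lam):
    """SSE on known ratings plus L2 penalty; ratings = [(x, i, r_xi), ...]."""
    sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    penalty = lam * ((P ** 2).sum() + (Q ** 2).sum())
    return sse + penalty
```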
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])
[2D latent space ("geared towards females" ↔ "geared towards males", "serious" ↔ "escapist") with movies and users (e.g., Gus, Dave) embedded; a user is recommended the nearby movies]
RECSYS 63
(Slide repeated: Dealing with Missing Entries, as above)
RECSYS 64
The Effect of Regularization
[2D latent space (Factor 1 vs. Factor 2) with the movies from before]
 minfactors "error" + λ · "length":
   min_{P,Q} Σ_{(x,i)∈training} ( rxi − qi pxT )² + λ [ Σx ||px||² + Σi ||qi||² ]
RECSYS 65-67
(Animation frames: the same figure repeated, with sparsely-rated movies such as The Princess Diaries pulled towards the origin as the regularization term takes effect)
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q:
   min_{P,Q} Σ_{(x,i)∈training} ( rxi − qi pxT )² + λ [ Σx ||px||² + Σi ||qi||² ]
 Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σ_{x,i} −2 (rxi − qi pxT) pxf + 2λ qif
 • Here qif is entry f of row qi of matrix Q
 Observation: Computing gradients is slow!
 How to compute the gradient of a matrix? Compute the gradient of every element independently.
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
(Slide repeated: the same GD setup and update equations as above)
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 33
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
- estimate rating of movie 1 by user 5
1162019
mov
ies
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 34
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Neighbor selectionIdentify movies similar to movie 1 rated by user 5
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i
m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]
2) Compute cosine similarities between rows
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · P^T
 For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T
 R has missing entries, but let's ignore that for now
  Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values of the missing ones
[Figure: R (items × users) approximated by Q (items × f factors) times P^T (f factors × users); in SVD terms, A = U Σ V^T]
RECSYS 54
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
  q_i = row i of Q, p_x = column x of P^T
[Figure (repeated over three animation slides): multiplying the highlighted row of Q by the highlighted column of P^T yields the estimate r̂_xi = 2.4 for the missing entry]
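As a tiny numeric sketch of r̂_xi = q_i · p_x^T (the factor values below are made up, not the ones pictured on the slide):

```python
import numpy as np

# Hypothetical factor vectors for one item and one user (f = 3 factors)
q_i = np.array([1.2, -0.4, 0.3])   # row i of Q
p_x = np.array([0.9,  0.2, 1.1])   # column x of P^T
r_hat = float(q_i @ p_x)           # sum over factors: q_if * p_xf
print(round(r_hat, 2))             # 1.33
```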
RECSYS 57
Latent Factor Models
[Figure: the movies plotted in the latent space; Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny"]
RECSYS 59
Recap: SVD
 Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values
 SVD gives the minimum reconstruction error (SSE): min_(U,Σ,V) Σ_(ij) ( A_ij − [U Σ V^T]_ij )²
 So in our case, "SVD" on Netflix data R ≈ Q · P^T means: A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T
 But we are not done yet! The SSE sum goes over all entries, and our R has missing entries
[Figure: A (m × n) ≈ U Σ V^T]
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q: min_(P,Q) Σ_((x,i)∈R) ( r_xi − q_i · p_x^T )²
 Note:
  We don't require the columns of P, Q to be orthogonal or unit length
  P, Q map users/movies to a latent space
  This was the most popular model among Netflix contestants
[Figure: R ≈ Q · P^T with f factors, giving r̂_xi = q_i · p_x^T]
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: minimize SSE on training data
  We want a large f (# of factors) to capture all the signals
  But SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
  Allow a rich model where there are sufficient data
  Shrink aggressively where data are scarce

 min_(P,Q) Σ_((x,i)∈training) ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]
  "error" + λ · "length"; λ ... regularization parameter
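A minimal sketch of evaluating this regularized objective, assuming ratings are stored as a dict from (user, item) to r_xi, and P, Q as dicts of NumPy factor vectors (all names are illustrative):

```python
import numpy as np

def objective(ratings, P, Q, lam):
    """Regularized SSE: the 'error' term plus lambda times the 'length' term."""
    error = sum((r - float(Q[i] @ P[x])) ** 2
                for (x, i), r in ratings.items())       # known ratings only
    length = (sum(float(p @ p) for p in P.values())
              + sum(float(q @ q) for q in Q.values()))  # squared norms
    return error + lam * length
```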
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] team)
[Figure: users (e.g., Gus, Dave) and movies embedded in the same latent space; the axes run from "geared towards females" to "geared towards males" and from "serious" to "escapist"]
RECSYS 64
The Effect of Regularization
 min_factors "error" + λ · "length":
 min_(P,Q) Σ_((x,i)∈training) ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]
[Figure (animated over several slides): the movies in the latent space (Factor 1: geared towards females/males, Factor 2: serious/funny); as the regularization term takes effect, sparsely rated movies such as The Princess Diaries are pulled toward the origin]
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q minimizing
 Σ_((x,i)∈training) ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]
 Gradient descent:
  Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
  Do gradient descent:
   P ← P − η · ∇P
   Q ← Q − η · ∇Q
   where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_(x,i) −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
   (here q_if is entry f of row q_i of matrix Q)
 How to compute the gradient of a matrix? Compute the gradient of every element independently
 Observation: computing gradients is slow
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
 Gradient descent (GD) vs. stochastic gradient descent (SGD)
 Observation: ∇Q = [∇q_if], where ∇q_if = Σ_(x,i) −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if = Σ_(x,i) ∇Q(r_xi)
  (here q_if is entry f of row q_i of matrix Q)
 GD: Q ← Q − η · Σ_(x,i) ∇Q(r_xi)
 Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − μ · ∇Q(r_xi)
 Faster convergence!
  Needs more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD
 Convergence of GD vs. SGD
[Figure: value of the objective function vs. iteration/step for GD and SGD]
 GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way
 GD takes fewer steps to converge, but each step takes much longer to compute
 In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
 Stochastic gradient descent:
  Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
  Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
   ε_xi = r_xi − q_i · p_x^T   (the prediction "error")
   q_i ← q_i + η ( ε_xi p_x − λ q_i )   (update equation)
   p_x ← p_x + η ( ε_xi q_i − λ p_x )   (update equation)
 2 for loops:
  For until convergence:
   For each r_xi: compute the gradient, do a "step"
 η ... learning rate
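A minimal self-contained sketch of this SGD loop (random initialization instead of the SVD-based one, a fixed epoch count instead of a convergence test; all names are illustrative):

```python
import numpy as np

def train_sgd(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    """ratings: list of (user_index, item_index, rating) triples."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))  # one row p_x per user
    Q = rng.normal(scale=0.1, size=(n_items, f))  # one row q_i per item
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # for each r_xi
            q_i, p_x = Q[i].copy(), P[x].copy()
            err = r - q_i @ p_x                    # eps_xi = r_xi - q_i . p_x^T
            Q[i] += eta * (err * p_x - lam * q_i)  # update equation for q_i
            P[x] += eta * (err * q_i - lam * p_x)  # update equation for p_x
    return P, Q
```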
RECSYS 75
Koren, Bell, Volinsky: IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
 Goal: make good recommendations
  Quantify goodness using SSE: lower SSE means better recommendations
  We want to make good recommendations on items that some user has not yet seen
 So let's set the values of P and Q such that they work well on the known (user, item) ratings
 And hope that these values of P and Q will also predict the unknown ratings well
 This is a case where we apply optimization methods
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
 Netflix Prize
  Training data:
   100 million ratings
   480,000 users
   17,770 movies
   6 years of data: 2000-2005
  Test data:
   Last few ratings of each user (2.8 million)
   Evaluation criterion: root mean squared error (RMSE)
   Netflix Cinematch RMSE: 0.9514
  Competition:
   2,700+ teams
   $1 million grand prize for a 10% improvement on the Cinematch result
   $50,000 2007 progress prize for an 8.43% improvement
[Figure: sample (user, movie, score) rows from the training data and (user, movie, ?) rows from the test data]
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies (count / avg. rating / title):
 137,812 / 4.593 / The Shawshank Redemption
 133,597 / 4.545 / Lord of the Rings: The Return of the King
 180,883 / 4.306 / The Green Mile
 150,676 / 4.460 / Lord of the Rings: The Two Towers
 139,050 / 4.415 / Finding Nemo
 117,456 / 4.504 / Raiders of the Lost Ark
 180,736 / 4.299 / Forrest Gump
 147,932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
 149,199 / 4.325 / The Sixth Sense
 144,027 / 4.333 / Indiana Jones and the Last Crusade
Challenges
 Size of data: scalability; keeping data in memory
 Missing data: 99 percent missing; very imbalanced
 Avoiding overfitting
 Test and training data differ significantly (e.g., movie 16322)
 Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
 r̂_xi = μ + b_x + b_i + q_i · p_x^T   (user bias + movie bias + user-movie interaction)
  μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
 Baseline predictor (μ + b_x + b_i):
  Separates users and movies
  Benefits from insights into users' behavior
  Among the main practical contributions of the competition
 User-movie interaction (q_i · p_x^T):
  Characterizes the matching between users and movies
  Attracts most research in the field
  Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
  Rating scale of user x
  Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
  (Recent) popularity of movie i
  Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
 Baseline prediction: r̂_xi = μ + b_x + b_i
 Example:
  Mean rating: μ = 3.7
  You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1
  Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
 Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
 Solve:
 min_(Q,P) Σ_((x,i)∈R) ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²   ← goodness of fit
   + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   ← regularization
 Use stochastic gradient descent to find the parameters
  Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid search on a validation set
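The SGD updates extend naturally to the bias terms; a sketch under the same illustrative conventions as before (mu is the global mean, b_u and b_i are NumPy arrays of user and movie biases; the update rules mirror the factor updates and are not claimed to be the winning team's exact code):

```python
import numpy as np

def sgd_step(r, x, i, mu, b_u, b_i, P, Q, eta=0.005, lam=0.02):
    """One SGD step on rating r of user x for movie i, with biases."""
    err = r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])  # prediction error
    b_u[x] += eta * (err - lam * b_u[x])            # user bias update
    b_i[i] += eta * (err - lam * b_i[i])            # movie bias update
    q_old = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])         # factor updates use
    P[x] += eta * (err * q_old - lam * P[x])        # the pre-update values
```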
RECSYS 106
BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling with a refined local view:
  Global effects: overall deviations of users/movies
  Factorization: addressing "regional" effects
  Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Figure: RMSE (from ~0.885 to ~0.92) vs. millions of parameters (1 to 1000, log scale) for CF (no time bias), basic latent factors, and latent factors with biases]
RECSYS 108
Performance of Various Methods (RMSE):
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + biases: 0.89
 Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
 Sudden rise in the average movie rating (early 2004):
  Improvements in Netflix
  GUI improvements
  Meaning of rating changed
 Movie age:
  Users prefer new movies, without any obvious reasons
  Older movies are just inherently better than newer ones
[Y. Koren, Collaborative filtering with temporal dynamics, KDD '09]
RECSYS 110
Temporal Biases & Factors
 Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
 Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
  Make the parameters b_x and b_i depend on time: (1) parameterize the time dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: p_x(t) ... user preference vector on day t
[Y. Koren, Collaborative filtering with temporal dynamics, KDD '09]
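A sketch of the binned time-dependent movie bias b_i(t): a base bias plus a per-bin offset, with each bin spanning 10 consecutive weeks (the names and bin layout are illustrative assumptions):

```python
WEEKS_PER_BIN = 10  # each bin corresponds to 10 consecutive weeks

def movie_bias_at(b_i_base, b_i_bins, weeks_since_start):
    """b_i(t) = b_i + b_{i,Bin(t)}, with Bin(t) indexing 10-week bins."""
    bin_idx = min(weeks_since_start // WEEKS_PER_BIN, len(b_i_bins) - 1)
    return b_i_base + b_i_bins[bin_idx]
```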
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (from ~0.875 to ~0.92) vs. millions of parameters (1 to 10,000, log scale) for CF (no time bias), basic latent factors, CF (time bias), latent factors with biases, + linear time factors, + per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE):
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + biases: 0.89
 Latent factors + biases + time: 0.876
 Grand Prize: 0.8563
 Still no prize! Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of Ensemble Size
[Figure: error (RMSE, from ~0.8925 down to ~0.87) vs. number of predictors blended (1 to 50)]
RECSYS 35
Item-Item CF (|N|=2)
121110987654321
455 311
3124452
534321423
245424
5224345
423316
users
Compute similarity weightss13=041 s16=059
1162019
mov
ies
100
-018
041
-010
-031
059
sim(1m)
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 36
Item-Item CF (|N|=2)
121110987654321
45526311
3124452
534321423
245424
5224345
423316
users
Predict by taking weighted average
r15 = (0412 + 0593) (041+059) = 26
mov
ies
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define similarity sij of items i and j
Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x
Estimate rating rxi as the weighted average
1162019
sumsum
isin
isinsdot
=)(
)(ˆxiNj ij
xiNj xjijxi s
rsr
N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
[Tables: sample (user, movie, score) tuples for the training data and, with the scores withheld, the test data]
Movie rating data
Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
#ratings per movie
• Avg. #ratings/movie: 5,627
#ratings per user
• Avg. #ratings/user: 208
Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)
Most loved movies
Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
Model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$ (user bias + movie bias + user-movie interaction)
User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
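The same arithmetic as a two-line sketch (values taken from the example above):

    mu, b_x, b_i = 3.7, -1.0, 0.5    # global mean, user bias, movie bias
    print(mu + b_x + b_i)            # baseline prediction: 3.2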
RECSYS 105
Fitting the New Model
Solve:
$\min_{Q,P,b} \sum_{(x,i) \in R} \left( r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x^T) \right)^2 + \lambda \left( \sum_i \lVert q_i \rVert^2 + \sum_x \lVert p_x \rVert^2 + \sum_x b_x^2 + \sum_i b_i^2 \right)$
(first term: goodness of fit; second term: regularization)
Stochastic gradient descent to find the parameters
• Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
λ is selected via grid-search on a validation set
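A sketch of one SGD update for this extended model, reusing the toy conventions of the earlier snippet; the bias arrays and the constants η, λ are illustrative assumptions:

    import numpy as np

    def sgd_step_biased(r, x, i, mu, bu, bi, P, Q, eta=0.005, lam=0.02):
        """One SGD update for r_xi ~ mu + b_x + b_i + q_i . p_x^T."""
        err = r - (mu + bu[x] + bi[i] + Q[i] @ P[x])   # residual of the biased model
        bu[x] += eta * (err - lam * bu[x])             # user-bias update
        bi[i] += eta * (err - lam * bi[i])             # movie-bias update
        q_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])        # factor updates as before
        P[x] += eta * (err * q_old - lam * P[x])

    # Hypothetical toy state for one update step:
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(3, 10))
    Q = rng.normal(scale=0.1, size=(2, 10))
    bu, bi = np.zeros(3), np.zeros(2)
    sgd_step_biased(5.0, 0, 0, mu=3.7, bu=bu, bi=bi, P=P, Q=Q)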
RECSYS 106
BellKor Recommender System: The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns
[Diagram: global effects → factorization → collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
• Grand Prize: 0.8563
• Netflix (Cinematch): 0.9514
• Movie average: 1.0533; User average: 1.0651
• Global average: 1.1296
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed
Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones
(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)
RECSYS 110
Temporal Biases & Factors
Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
Add time dependence to the biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^T$
• Make the parameters b_x and b_i depend on time
• (1) Parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: $p_x(t)$ … user preference vector on day t
(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)
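A sketch of both time-dependent bias forms, following the cited KDD '09 paper; the bin width (70 days ≈ 10 consecutive weeks, per the slide), the deviation exponent β, and all argument names are illustrative assumptions:

    import numpy as np

    def movie_bias_at(b_i, b_i_bins, t, bin_days=70):
        """b_i(t): static movie bias plus the bias of the time bin containing day t."""
        return b_i + b_i_bins[t // bin_days]

    def user_bias_at(b_x, alpha_x, t, t_mean, beta=0.4):
        """b_x(t) as a linear trend in dev(t) = sign(t - t_mean) * |t - t_mean|^beta."""
        dev = np.sign(t - t_mean) * abs(t - t_mean) ** beta
        return b_x + alpha_x * dev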
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
• Grand Prize: 0.8563
• Netflix (Cinematch): 0.9514
• Movie average: 1.0533; User average: 1.0651
• Global average: 1.1296
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
• Latent factors + Biases + Time: 0.876
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 37
Common Practice for Item-Item Collaborative Filtering
Define the similarity $s_{ij}$ of items i and j
Select the K nearest neighbors (KNN) $N(i; x)$: the set of items most similar to item i that were rated by x
Estimate the rating $r_{xi}$ as the weighted average:
$\hat{r}_{xi} = \dfrac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
($N(i; x)$ = set of items similar to item i that were rated by x; $s_{ij}$ = similarity of items i and j; $r_{xj}$ = rating of user x on item j)
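A direct sketch of this weighted average; the dictionary-based data layout and the toy numbers are assumptions for illustration:

    def predict_item_item(x_ratings, neighbors, sim):
        """r_hat_xi = sum_{j in N(i;x)} s_ij * r_xj / sum_{j in N(i;x)} s_ij.

        x_ratings: dict item -> rating given by user x
        neighbors: the K items in N(i; x)
        sim:       dict item j -> similarity s_ij to item i
        """
        num = sum(sim[j] * x_ratings[j] for j in neighbors)
        den = sum(sim[j] for j in neighbors)
        return num / den if den else None

    print(predict_item_item({1: 5.0, 2: 3.0}, [1, 2], {1: 0.8, 2: 0.2}))  # 4.6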
RECSYS 38
Item-Item vs User-User
[Example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates, with estimated entries filled in]
In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.
RECSYS 39
Pros/Cons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
RECSYS 40
Hybrid Methods
• Implement two or more different recommenders and combine their predictions, perhaps using a linear model
• Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem
RECSYS 41
- Evaluation
- Error metrics
- Complexity / Speed
RECSYS 42
Evaluation
[Utility matrix: users × movies, with observed ratings 1-5 and many missing entries]
RECSYS 43
Evaluation
[The same utility matrix, with some known entries held out as the Test Data Set]
RECSYS 44
Evaluating Predictions: Compare predictions with known ratings
• Root-mean-square error (RMSE): $\sqrt{\tfrac{1}{|T|} \sum_{(x,i) \in T} (\hat{r}_{xi} - r_{xi})^2}$, where $\hat{r}_{xi}$ is the predicted and $r_{xi}$ the true rating of x on i
• Precision at top 10: % of relevant items among those in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
• TPR (a.k.a. recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
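Minimal sketches of RMSE and the TPR/FPR pair as defined above (the numbers are made up for illustration):

    import numpy as np

    def rmse(pred, true):
        return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))

    def tpr_fpr(tp, fp, tn, fn):
        return tp / (tp + fn), fp / (fp + tn)   # (recall, false-positive rate)

    print(rmse([3.5, 4.0], [3.0, 5.0]))   # ~0.79
    print(tpr_fpr(40, 10, 45, 5))         # (0.888..., 0.1818...)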
RECSYS 45
Narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions
In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for others.
RECSYS 46
Collaborative Filtering Complexity
The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.
• Too expensive to do at runtime: could pre-compute, using clustering as an approximation
• Naïve pre-computation takes time O(N · |C|) (|C| = # of clusters = k in k-means; N = # of data points)
We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.
RECSYS 47
Tip: Add Data
• Leverage all the data
• Don't try to reduce data size in an effort to make fancy algorithms work: simple methods on large data do best
• Add more data, e.g. add IMDB data on genres
• More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g. SVD)
[Figure: movies placed on two latent dimensions: x-axis "geared towards females ↔ geared towards males", y-axis "serious ↔ funny"; movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)
RECSYS 51
The Netflix Utility Matrix R
[Utility matrix R: ~480,000 users × 17,700 movies; ratings 1-5, mostly missing]
RECSYS 52
Utility Matrix R: Evaluation
[Figure: matrix R split into a Training Data Set and a held-out Test Data Set; the test entry $r_{3,6}$ is highlighted]
SSE = $\sum_{(x,i) \in R} (\hat{r}_{xi} - r_{xi})^2$
($\hat{r}_{xi}$ = predicted rating; $r_{xi}$ = true rating of user x on item i)
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · P^T
For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T
R has missing entries, but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
[Figure: R (users × items) ≈ Q (items × f factors) · P^T (f factors × users); cf. SVD: A = U Σ V^T]
RECSYS 54
Ratings as Products of Factors: How to estimate the missing rating of user x for item i?
$\hat{r}_{xi} = q_i \cdot p_x^T = \sum_f q_{if} \cdot p_{xf}$
($q_i$ = row i of Q; $p_x$ = column x of $P^T$)
[Figure: R ≈ Q · P^T with the missing entry (x, i) highlighted]
RECSYS 55-56
Ratings as Products of Factors (continued): the same figure, with the row $q_i$ and column $p_x$ highlighted and the resulting estimate (2.4) filled into the missing cell.
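The estimate is just an inner product of an item row and a user column; a sketch with illustrative 2-factor vectors:

    import numpy as np

    q_i = np.array([1.2, -0.5])    # row i of Q (item factors)
    p_x = np.array([1.5, 0.9])     # column x of P^T (user factors)
    r_hat = q_i @ p_x              # r_hat_xi = sum_f q_if * p_xf
    print(round(float(r_hat), 2)) # 1.35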
RECSYS 57-58
Latent Factor Models
[Figure: the same movies placed in the latent space: Factor 1 (geared towards females ↔ geared towards males), Factor 2 (serious ↔ funny)]
RECSYS 59
Recap: SVD
Remember SVD: A = U Σ V^T
• A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values
SVD gives the minimum reconstruction error (SSE): $\min_{U,\Sigma,V} \sum_{x,i} \left( A_{xi} - (U \Sigma V^T)_{xi} \right)^2$
So in our case, "SVD" on Netflix data R ≈ Q · P^T means: A = R, Q = U, P^T = Σ V^T, and $\hat{r}_{xi} = q_i \cdot p_x^T$
But we are not done yet! The sum goes over all entries, and our R has missing entries.
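A sketch of this full-data factorization with NumPy; it works here only because the toy matrix A has no missing entries:

    import numpy as np

    A = np.array([[5.0, 3.0, 1.0],
                  [4.0, 3.0, 2.0],
                  [1.0, 1.0, 5.0]])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Sigma V^T
    Q = U                        # per the slide: Q = U
    Pt = np.diag(s) @ Vt         # P^T = Sigma V^T
    print(np.allclose(A, Q @ Pt))   # True: exact reconstruction at full rank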
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q: $\min_{P,Q} \sum_{(x,i) \in R} \left( r_{xi} - q_i \cdot p_x^T \right)^2$
Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
[Figure: R ≈ Q · P^T, with $\hat{r}_{xi} = q_i \cdot p_x^T$]
RECSYS 61
Dealing with Missing Entries: Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data
• Want a large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2
Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i \cdot p_x^T \right)^2 + \lambda \left( \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)$
(λ … regularization parameter; "error" term + λ · "length" penalty)
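As noted on the earlier "Fitting the New Model" slide, λ is selected via grid search on a validation set. A sketch of that selection, assuming the `sgd_factorize` and `rmse` helpers from the earlier snippets and hypothetical `train`/`valid` lists of (user, item, rating) triples:

    def pick_lambda(train, valid, n_users, n_items, grid=(0.01, 0.03, 0.1, 0.3, 1.0)):
        """Grid-search lambda: fit on `train`, score RMSE on `valid`."""
        best = None
        for lam in grid:
            P, Q = sgd_factorize(train, n_users, n_items, lam=lam)
            preds = [Q[i] @ P[x] for x, i, _ in valid]
            truth = [r for _, _, r in valid]
            err = rmse(preds, truth)
            if best is None or err < best[0]:
                best = (err, lam)
        return best   # (validation RMSE, chosen lambda)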
RECSYS 62
Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
[Figure: movies and two users (Gus, Dave) embedded in the latent space: x-axis "geared towards females ↔ geared towards males", y-axis "serious ↔ escapist"]
RECSYS 63
Dealing with Missing Entries (body repeats slide 61: minimize the regularized SSE on training data)
RECSYS 64
The Effect of Regularization (slides 64-67)
[Figure sequence: movies and users in the latent space (Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny); across the frames, sparsely-rated points such as The Princess Diaries are shrunk toward the origin]
$\min_{\text{factors}} \text{"error"} + \lambda \cdot \text{"length"} = \min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i \cdot p_x^T \right)^2 + \lambda \left( \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)$
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q such that:
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left( r_{xi} - q_i \cdot p_x^T \right)^2 + \lambda \left( \sum_x \lVert p_x \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)$
Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: $\nabla Q = [\nabla q_{if}]$ and $\nabla q_{if} = \sum_{x,i} -2 \left( r_{xi} - q_i \cdot p_x^T \right) p_{xf} + 2 \lambda q_{if}$
  - Here $q_{if}$ is entry f of row $q_i$ of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.
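A batch-GD counterpart to the earlier SGD sketch: it accumulates the full gradient over all known ratings before each step. Same toy conventions as before; the constants are illustrative:

    import numpy as np

    def gd_factorize(ratings, n_users, n_items, f=10, eta=0.005, lam=0.1, steps=200):
        """Batch gradient descent: P <- P - eta*grad_P, Q <- Q - eta*grad_Q."""
        rng = np.random.default_rng(0)
        P = rng.normal(scale=0.1, size=(n_users, f))
        Q = rng.normal(scale=0.1, size=(n_items, f))
        for _ in range(steps):
            gP = 2 * lam * P                  # regularization part of the gradient
            gQ = 2 * lam * Q
            for x, i, r in ratings:           # sum the error part over all ratings
                err = r - Q[i] @ P[x]
                gQ[i] += -2 * err * P[x]      # d/dq_i of the squared error
                gP[x] += -2 * err * Q[i]      # d/dp_x of the squared error
            P -= eta * gP                     # one full-gradient step
            Q -= eta * gQ
        return P, Q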
RECSYS 69
Digression: the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
(Body repeats slide 68: gradient-descent updates for P and Q, and the observation that computing gradients is slow.)
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 38
Item-Item vs User-User
04180109
0305081
Avatar LOTR Matrix Pirates
Alice
Bob
Carol
David
1162019
In practice it has been observed that item-itemoften works better than user-user
Why Items are simpler users have multiple tastes
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 39
ProsCons of Collaborative Filtering
+ Works for any kind of item No feature selection needed
- Cold Start Need enough users in the system to find a match
- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items
- First rater Cannot recommend an item that has not been
previously rated New items Esoteric items
- Popularity bias Cannot recommend items to someone with
unique taste Tends to recommend popular items
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries
Want to minimize SSE for unseen test data.
Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.
Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]$
("error" term + λ · "length" term; λ … regularization parameter)
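As a sketch, the regularized objective can be written directly from the formula above, assuming the training set is given as (user, item, rating) triples; the function name and data layout are illustrative:

```python
import numpy as np

def objective(ratings, P, Q, lam):
    """Regularized SSE: squared error on observed ratings plus the 'length' penalty.

    ratings: iterable of (x, i, r_xi) training triples (x = user index, i = item index)
    P: n_users x f (row x is p_x);  Q: n_items x f (row i is q_i)
    """
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    length = np.sum(P ** 2) + np.sum(Q ** 2)
    return error + lam * length
```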
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] Team)
[Figure: the same two-dimensional latent space (geared towards females ↔ males; serious ↔ escapist), now with users Gus and Dave embedded alongside the movies (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility); a user is recommended the movies closest to them in this space.]
RECSYS 64
The Effect of Regularization
[Figure: movies in the (Factor 1, Factor 2) latent space (geared towards females ↔ males; serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
$\min_{\text{factors}}$ "error" + λ · "length":
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]$
RECSYS 65-67
The Effect of Regularization (continued)
(The same plot repeated as an animation: minimizing "error" + λ · "length" gradually pulls sparsely rated movies such as The Princess Diaries toward the origin, while movies with plenty of ratings keep their latent positions.)
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:
$\min_{P,Q} \sum_{(x,i) \in \text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda \left[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \right]$
Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent: P ← P − η·∇P and Q ← Q − η·∇Q, where ∇Q is the gradient/derivative of matrix Q: $\nabla Q = [\nabla q_{if}]$ and $\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if}$. Here $q_{if}$ is entry f of row $q_i$ of matrix Q.
How to compute the gradient of a matrix? Compute the gradient of every element independently!
Observation: Computing gradients is slow!
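A sketch of the full (batch) gradient with respect to Q under the objective above; the explicit loop over all observed ratings is exactly what makes each step slow:

```python
import numpy as np

def grad_Q(ratings, P, Q, lam):
    """Batch gradient: grad_q_if = sum over r_xi of -2 (r_xi - q_i . p_x) p_xf + 2 lam q_if."""
    G = 2.0 * lam * Q                          # regularization term for every entry
    for x, i, r in ratings:                    # the slow part: touches every rating
        G[i] += -2.0 * (r - Q[i] @ P[x]) * P[x]
    return G

# One batch GD step (a symmetric function computes grad_P):
# Q = Q - eta * grad_Q(ratings, P, Q, lam)
```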
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent: the same formulation as above, shown again after the digression; each step computes the full gradient over all observed ratings, which is slow.
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: $\nabla Q = [\nabla q_{if}]$, where $\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if} = \sum_{x,i} \nabla Q(r_{xi})$; here $q_{if}$ is entry f of row $q_i$ of matrix Q, and $\nabla Q(r_{xi})$ is the contribution of a single rating to the gradient.
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
GD: $Q \leftarrow Q - \eta \sum_{r_{xi}} \nabla Q(r_{xi})$
SGD: $Q \leftarrow Q - \eta\, \nabla Q(r_{xi})$
Faster convergence: SGD needs more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD
Convergence of GD vs. SGD
[Plot: value of the objective function (y-axis) vs. iteration/step (x-axis)]
GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each $r_{xi}$:
  $\varepsilon_{xi} = r_{xi} - q_i \cdot p_x^T$ (the prediction "error")
  $q_i \leftarrow q_i + \eta \left(\varepsilon_{xi}\, p_x - \lambda\, q_i\right)$ (update equation)
  $p_x \leftarrow p_x + \eta \left(\varepsilon_{xi}\, q_i - \lambda\, p_x\right)$ (update equation)
Two for-loops: for until convergence: for each $r_{xi}$: compute the gradient, do a "step".
(η … learning rate)
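Putting the two loops and the update equations together, a minimal SGD sketch; η, λ, the epoch count, and the random initialization are illustrative choices (the slide initializes from SVD instead):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    """Minimal SGD sketch for R ~ Q . PT (random init for simplicity)."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # row x is p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))   # row i is q_i
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            err = r - Q[i] @ P[x]                  # eps_xi = r_xi - q_i . p_x^T
            qi_old = Q[i].copy()                   # use pre-update q_i for p_x's step
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * qi_old - lam * P[x])
    return P, Q
```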
RECSYS 75
[Figure: Koren, Bell, Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
We want to make good recommendations on items that some user has not yet seen. So: set the values of P and Q so that they work well on the known (user, item) ratings, and hope these values of P and Q will also predict the unknown ratings well.
This is a case where we apply optimization methods.
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data (2000-2005)
• Test data: the last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix's Cinematch RMSE: 0.9514
• Competition: 2,700+ teams; $1 million grand prize for a 10% improvement on the Cinematch result; $50,000 2007 progress prize for an 8.43% improvement
Movie rating data
[Tables: (user, movie, score) triples, split into training data and test data]
Overall rating distribution: a third of the ratings are 4s; the average rating is 3.68.
#ratings per movie: average 5,627 ratings/movie.
#ratings per user: average 208 ratings/user.
Average movie rating by movie count: more ratings go to better movies.
(From TimelyDevelopment.com)
Most loved movies (Count | Avg rating | Title):
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)
• Use an ensemble of complementing predictors; two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$ (user bias + movie bias + user-movie interaction)
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i.
Baseline predictor (μ + b_x + b_i): separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition.
User-movie interaction ($q_i \cdot p_x^T$): characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations.
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
• Rating scale of user x
• Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
• (Recent) popularity of movie i
• Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5.
Predicted (baseline) rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
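The same arithmetic as a quick two-line check in code (values taken from the example above):

```python
mu, b_x, b_i = 3.7, -1.0, 0.5          # mean rating, critical user, Star Wars
print(round(mu + b_x + b_i, 1))        # 3.2: the baseline part of r_hat_xi
```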
RECSYS 105
Fitting the New Model
Solve:
$\min_{Q,P,b} \sum_{(x,i) \in R} \left(r_{xi} - (\mu + b_x + b_i + q_i p_x^T)\right)^2 + \lambda \left( \sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2 \right)$
(first term: goodness of fit; second term: regularization; λ is selected via grid-search on a validation set)
Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
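A sketch of one SGD step for this extended model; the slide only states the objective, so the exact update equations here are the standard ones obtained by differentiating it (an assumption of the sketch; arrays are assumed to be numpy):

```python
def sgd_step_with_biases(x, i, r, mu, b_user, b_item, P, Q, eta=0.005, lam=0.02):
    """One SGD update under r_hat = mu + b_x + b_i + q_i . p_x^T (all but mu learned)."""
    err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (err - lam * b_user[x])   # bias updates mirror the factor updates
    b_item[i] += eta * (err - lam * b_item[i])
    qi_old = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * qi_old - lam * P[x])
```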
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (y-axis, 0.885-0.92) vs. millions of parameters (x-axis, 1-1,000, log scale) for three curves: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE):
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.
Movie age: do users prefer new movies without any reason, or are older movies just inherently better than newer ones?
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: $\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
Add time dependence to the biases: $\hat{r}_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^T$
Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.
Add temporal dependence to the factors: p_x(t) … user preference vector on day t.
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
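A minimal sketch of a binned time-dependent movie bias, b_i(t) = b_i + b_i,Bin(t); the day-to-bin mapping via integer division is an assumption of the sketch, the slide only says that each bin spans 10 consecutive weeks:

```python
def movie_bias_at(day, b_i, b_i_bin, days_per_bin=70):
    """b_i(t) = b_i + b_{i,Bin(t)}: a static bias plus a bin-level offset.

    day: integer day index since the start of the data
    b_i_bin: sequence of per-bin offsets; each bin covers 10 weeks (70 days)
    """
    return b_i + b_i_bin[day // days_per_bin]
```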
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (y-axis, 0.875-0.92) vs. millions of parameters (x-axis, 1-10,000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE):
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Latent factors + biases + time: 0.876
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, y-axis, 0.87-0.8925) vs. number of predictors blended (x-axis, 1-50)]
RECSYS 40
Hybrid Methods Implement two or more different recommenders and
combine predictions Perhaps using a linear model
Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem
1162019
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 41
- Evaluation
- Error metrics
- Complexity Speed
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 42
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
movies
users
1162019
RECSYS 43
Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
users
movies
1162019
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
[Figure: sparse matrix R of ~480,000 users × 17,700 movies, with a scattering of known ratings 1–5]
RECSYS 52
Utility Matrix R: Evaluation
[Figure: the same 480,000 users × 17,700 movies matrix R, split into a training data set (known ratings) and a test data set of withheld entries, e.g. the predicted rating r̂_3,6]
SSE = Σ_(x,i)∈R (r̂_xi − r_xi)², where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i
RECSYS 53
Latent Factor Models
“SVD” on Netflix data: R ≈ Q · PT
• For now let’s assume we can approximate the rating matrix R as a product of “thin” Q · PT
• R has missing entries, but let’s ignore that for now
  • Basically, we will want the reconstruction error to be small on known ratings, and we don’t care about the values on the missing ones
[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]
SVD: A = U Σ VT
RECSYS 54
Ratings as Products of Factors: How to estimate the missing rating of user x for item i?
[Figure: R ≈ Q · PT with a missing entry highlighted]
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, where q_i = row i of Q and p_x = column x of PT
RECSYS 55
Ratings as Products of Factors: How to estimate the missing rating of user x for item i?
[Figure: the same R ≈ Q · PT, now highlighting row q_i of Q and column p_x of PT]
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
RECSYS 56
Ratings as Products of Factors: How to estimate the missing rating of user x for item i?
[Figure: the same R ≈ Q · PT; the dot product q_i · p_x^T fills in the highlighted missing entry with the estimate 2.4]
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
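A minimal NumPy sketch of this estimate (toy factor matrices; the 2.4 above is the slide’s own example, the values below are random):

```python
import numpy as np

f = 3                         # number of factors
Q = np.random.randn(10, f)    # items × factors
P = np.random.randn(20, f)    # users × factors

def predict(i, x):
    """r̂_xi = q_i · p_x = Σ_f Q[i, f] * P[x, f]"""
    return Q[i] @ P[x]

R_hat = Q @ P.T               # reconstruct the full rating matrix at once
assert np.isclose(R_hat[4, 7], predict(4, 7))
```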
RECSYS 57
Latent Factor Models
[Figure: movies in the 2-D factor space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean’s 11, Sense and Sensibility]
RECSYS 58
Latent Factor Models
[Figure: the same 2-D factor space as the previous slide]
RECSYS 59
Recap: SVD
• Remember SVD: A = input data matrix, U = left singular vecs, V = right singular vecs, Σ = singular values
[Figure: A (m × n) ≈ U Σ VT]
• SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [U Σ VT]_ij)²
• So in our case, “SVD” on Netflix data: R ≈ Q · PT with A = R, Q = U, PT = Σ VT, and r̂_xi = q_i · p_x^T
• But we are not done yet! The sum goes over all entries, and our R has missing entries
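A quick NumPy check of this minimum-reconstruction-error property on a fully observed toy matrix (the assertion is the Eckart–Young theorem for the Frobenius norm):

```python
import numpy as np

A = np.random.randn(6, 4)                 # fully observed toy matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                     # keep the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# SSE of the best rank-k approximation equals the sum of the
# squared discarded singular values.
sse = np.sum((A - A_k) ** 2)
assert np.isclose(sse, np.sum(s[k:] ** 2))
```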
RECSYS 60
Latent Factor Models
• SVD isn’t defined when entries are missing!
• Use specialized methods to find P, Q: min_{P,Q} Σ_(x,i)∈R (r_xi − q_i · p_x^T)², with r̂_xi = q_i · p_x^T
• Note:
  • We don’t require the cols of P, Q to be orthogonal/unit length
  • P, Q map users/movies to a latent space
  • This was the most popular model among Netflix contestants
[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]
RECSYS 61
Dealing with Missing Entries: Want to minimize SSE for unseen test data
• Idea: Minimize SSE on training data
  • Want a large f (# of factors) to capture all the signals
  • But SSE on test data begins to rise for f > 2
• Regularization is needed to avoid overfitting:
  • Allow a rich model where there are sufficient data
  • Shrink aggressively where data are scarce
min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)²  +  λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
        “error”                                  “length”
λ … regularization parameter
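A direct transcription of this regularized objective into Python (a sketch; R_known, lam, and the layout of P and Q are my assumptions):

```python
import numpy as np

def regularized_sse(R_known, Q, P, lam):
    """Σ_(x,i)∈training (r_xi − q_i·p_x)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²).

    R_known maps (user x, item i) -> rating for observed entries only;
    Q is items × f, P is users × f, lam is λ.
    """
    error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in R_known.items())
    length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return error + length
```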
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[Figure: movies and users (e.g., Gus, Dave) embedded together in the 2-D factor space (geared towards females ↔ males; serious ↔ escapist); a user is matched to nearby movies]
RECSYS 63
Dealing with Missing Entries: Want to minimize SSE for unseen test data
• Idea: Minimize SSE on training data; want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2
• Regularization is needed: allow a rich model where there are sufficient data, shrink aggressively where data are scarce
min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)²  +  λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
λ … regularization parameter
RECSYS 64
The Effect of Regularization
min_factors “error” + λ “length”:  min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
[Figure: movies in the 2-D factor space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny)]
RECSYS 65
The Effect of Regularization
min_factors “error” + λ “length”
[Figure: next frame of the animation; with regularization, the factors of sparsely rated movies such as The Princess Diaries shrink toward the origin]
RECSYS 66
The Effect of Regularization
min_factors “error” + λ “length”
[Figure: another frame of the same regularization animation]
RECSYS 67
The Effect of Regularization
min_factors “error” + λ “length”
[Figure: final frame of the regularization animation]
RECSYS 68
Use Gradient Descent to search for the optimal settings
• Want to find matrices P and Q: min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
• Gradient descent:
  • Initialize P and Q (using SVD; pretend missing ratings are 0)
  • Do gradient descent:
    • P ← P − η · ∇P
    • Q ← Q − η · ∇Q
    • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_(x,i) −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
    • Here q_if is entry f of row q_i of matrix Q
• How to compute the gradient of a matrix? Compute the gradient of every element independently!
• Observation: Computing gradients is slow!
RECSYS 69
Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng’s Machine Learning course on Coursera
RECSYS 70
(Batch) Gradient Descent
• Want to find matrices P and Q: min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
• Gradient descent:
  • Initialize P and Q (using SVD; pretend missing ratings are 0)
  • Do gradient descent: P ← P − η · ∇P, Q ← Q − η · ∇Q, where ∇Q = [∇q_if] and ∇q_if = Σ_(x,i) −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
• How to compute the gradient of a matrix? Compute the gradient of every element independently!
• Observation: Computing gradients is slow!
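A compact sketch of one batch step under these definitions (toy setup; the eta and lam values are illustrative):

```python
import numpy as np

def batch_gd_step(R_known, Q, P, eta=0.01, lam=0.1):
    """One batch step: P ← P − η·∇P, Q ← Q − η·∇Q over all known ratings."""
    grad_Q = 2 * lam * Q          # the 2λq_if terms
    grad_P = 2 * lam * P
    for (x, i), r in R_known.items():
        err = r - Q[i] @ P[x]     # r_xi − q_i·p_x
        grad_Q[i] -= 2 * err * P[x]
        grad_P[x] -= 2 * err * Q[i]
    return Q - eta * grad_Q, P - eta * grad_P
```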
RECSYS 72
Stochastic Gradient Descent
• Gradient Descent (GD) vs. Stochastic GD
• Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_(x,i) −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_(x,i) ∇Q(r_xi)
  • Here q_if is entry f of row q_i of matrix Q
• GD: Q ← Q − η Σ_(x,i) ∇Q(r_xi)
• Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
• SGD: Q ← Q − η ∇Q(r_xi)
• Faster convergence!
  • Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a “noisy” way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
• Stochastic gradient descent:
  • Initialize P and Q (using SVD; pretend missing ratings are 0)
  • Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
    • ε_xi = r_xi − q_i · p_x^T (derivative of the “error”)
    • q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
    • p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)
• 2 for loops: for until convergence; for each r_xi: compute the gradient, do a “step”
η … learning rate
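Putting the update equations into a runnable sketch (toy ratings dictionary; the hyperparameters are illustrative, not the contest settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, f = 20, 10, 3
P = 0.1 * rng.standard_normal((n_users, f))   # user factors
Q = 0.1 * rng.standard_normal((n_items, f))   # item factors

R_known = {(0, 1): 4.0, (0, 3): 1.0, (5, 1): 5.0}  # (user, item) -> rating
eta, lam = 0.02, 0.1                               # learning rate η, λ

for epoch in range(100):                # "for until convergence"
    for (x, i), r in R_known.items():   # "for each r_xi"
        eps = r - Q[i] @ P[x]           # ε_xi
        q_old = Q[i].copy()             # use the old q_i in p_x's update
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])
```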
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]
RECSYS 76
Summary: Recommendations via Optimization
• Goal: Make good recommendations
  • Quantify goodness using SSE: lower SSE means better recommendations
  • We want to make good recommendations on items that some user has not yet seen
• Let’s set values for P and Q such that they work well on known (user, item) ratings
• And hope these values for P and Q will predict the unknown ratings well
• This is a case where we apply optimization methods
[Figure: the training utility matrix]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
“We Know What You Ought To Be Watching This Summer”
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
[Tables: training and test data as (user, movie, score) triples; test scores withheld]
Movie rating data (from TimelyDevelopment.com)
• Overall rating distribution
  – A third of the ratings are 4s
  – Average rating is 3.68
• # ratings per movie: avg ratings/movie = 5,627
• # ratings per user: avg ratings/user = 208
• Average movie rating by movie count: more ratings go to better movies
Most loved movies (count / avg rating / title):
• 137,812 / 4.593 / The Shawshank Redemption
• 133,597 / 4.545 / Lord of the Rings: The Return of the King
• 180,883 / 4.306 / The Green Mile
• 150,676 / 4.460 / Lord of the Rings: The Two Towers
• 139,050 / 4.415 / Finding Nemo
• 117,456 / 4.504 / Raiders of the Lost Ark
• 180,736 / 4.299 / Forrest Gump
• 147,932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
• 149,199 / 4.325 / The Sixth Sense
• 144,027 / 4.333 / Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)
• Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂_xi = μ + b_x + b_i + q_i · p_x^T  (user bias + movie bias + user-movie interaction)
• μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
• Baseline predictor (μ + b_x + b_i)
  • Separates users and movies
  • Benefits from insights into users’ behavior
  • Among the main practical contributions of the competition
• User-movie interaction (q_i · p_x^T)
  • Characterizes the matching between users and movies
  • Attracts most research in the field
  • Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x’s attitude towards movies like i
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day (“frequency”)
RECSYS 104
Putting It All Together
r̂_xi = μ + b_x + b_i + q_i · p_x^T
• Example:
  • Mean rating: μ = 3.7
  • You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
  • Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
  • Predicted baseline rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
• Solve:
min_{Q,P} Σ_(x,i)∈R ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²   [goodness of fit]
          + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   [regularization]
• Stochastic gradient descent to find the parameters
  • Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
• λ is selected via grid-search on a validation set
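Extending the earlier SGD sketch with the bias parameters of this objective (again a hedged sketch, not the BellKor implementation; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, f = 20, 10, 3
P = 0.1 * rng.standard_normal((n_users, f))
Q = 0.1 * rng.standard_normal((n_items, f))
b_user = np.zeros(n_users)
b_item = np.zeros(n_items)

R_known = {(0, 1): 4.0, (0, 3): 1.0, (5, 1): 5.0}
mu = sum(R_known.values()) / len(R_known)    # overall mean rating μ
eta, lam = 0.02, 0.1

for epoch in range(100):
    for (x, i), r in R_known.items():
        eps = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
        b_user[x] += eta * (eps - lam * b_user[x])
        b_item[i] += eta * (eps - lam * b_item[i])
        q_old = Q[i].copy()
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])
```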
RECSYS 106
BellKor Recommender System: The winner of the Netflix Challenge!
• Multi-scale modeling of the data: combine top-level, “regional” modeling of the data with a refined local view
  • Global: overall deviations of users/movies
  • Factorization: addressing “regional” effects
  • Collaborative filtering: extract local patterns
[Diagram: global effects → factorization → collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (~0.92 down to below 0.89) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix (Cinematch): 0.9514
• Grand Prize: 0.8563
RMSE of our methods:
• Basic collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
RECSYS 109
Temporal Biases of Users
• Sudden rise in the average movie rating (early 2004)
  • Improvements in Netflix
  • GUI improvements
  • Meaning of rating changed
• Movie age
  • Do users prefer new movies without any reason?
  • Or are older movies just inherently better than newer ones?
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 44
Evaluating Predictions Compare predictions with known ratings
Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i
Precision at top 10 bull of those in top 10
Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete
rankings Another approach 01 model
Coveragebull Number of itemsusers for which system can make predictions
Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve
Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall
RECSYS 45
Problems with Error Measures Narrow focus on accuracy sometimes
misses the point Prediction Diversity Prediction Context Order of predictions
In practice we care only to predict high ratings RMSE might penalize a method that does well
for high ratings and badly for others
1162019
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
• Sudden rise in the average movie rating (early 2004):
 – Improvements in Netflix
 – GUI improvements
 – Meaning of rating changed
• Movie age (ratings increase with movie age):
 – Users prefer new movies without any reason
 – Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics. KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x(t) + b_i + q_i·p_x^T with static biases: r_xi = μ + b_x + b_i + q_i·p_x^T
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
Make the parameters b_x and b_i depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics. KDD '09
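A sketch of the binned movie bias b_i(t) = b_i + b_i,Bin(t) under these assumptions (Python; time_bin, b_i_bin, and day-based timestamps are our illustrative choices):

def time_bin(t, t0, width_weeks=10):
    # t, t0 in days; each bin covers 10 consecutive weeks, as on the slide
    return int((t - t0) // (width_weeks * 7))

def movie_bias(i, t, t0, b_i, b_i_bin):
    # static movie bias plus the bias of the bin that day t falls into
    return b_i[i] + b_i_bin[i][time_bin(t, t0)]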
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000); series: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Latent factors + biases + time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate... Try a "kitchen sink" approach?
RECSYS 113
Effect of ensemble size
[Plot: Error (RMSE) vs. number of predictors in the ensemble (1–50); RMSE falls from about 0.8925 towards 0.87 as predictors are added]
RECSYS 45
Problems with Error Measures
A narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions
In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for the others.
RECSYS 46
Collaborative Filtering: Complexity
The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.
Too expensive to do at runtime; we could pre-compute, using clustering as an approximation.
Naïve pre-computation takes time O(N·|C|), where |C| = # of clusters (= k in k-means) and N = # of data points.
We already know how to do this:
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction
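A minimal sketch of the clustering-based pre-computation (Python; build_index, similar_users, and the use of scikit-learn's KMeans are our illustrative choices, not prescribed by the slides):

import numpy as np
from sklearn.cluster import KMeans

def build_index(X, n_clusters=100):
    # cluster customer profiles once, offline
    return KMeans(n_clusters=n_clusters, n_init=10).fit(X)

def similar_users(km, X, x, k=10):
    # at query time, search only within x's cluster instead of all of X
    members = np.where(km.labels_ == km.labels_[x])[0]
    d = np.linalg.norm(X[members] - X[x], axis=1)
    return members[np.argsort(d)][1:k + 1]   # drop x itself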
RECSYS 47
Tip: Add Data
• Leverage all the data: don't try to reduce the data size in an effort to make fancy algorithms work; simple methods on large data do best.
• Add more data, e.g., add IMDB data on genres.
• More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
[Figure: movies placed in a two-dimensional latent space, one axis from "geared towards females" to "geared towards males", the other from "serious" to "funny"; examples: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50
Koren, Bell, Volinsky, IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
[Figure: sparse utility matrix R of 480,000 users × 17,700 movies; a handful of entries hold ratings 1–5, the rest are missing]
RECSYS 52
Utility Matrix R: Evaluation
[Figure: the matrix R (480,000 users × 17,700 movies) split into a training data set and a test data set; withheld test entries such as r̂_36 must be predicted]
SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²
where r̂_xi is the predicted rating and r_xi the true rating of user x on item i.
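A one-line sketch of this evaluation (Python; sse and predict are our names, and predict stands in for any rating predictor):

def sse(test, predict):
    # test: (x, i, r_xi) triples held out from training
    return sum((predict(x, i) - r) ** 2 for x, i, r in test)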
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · P^T
For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T.
R has missing entries, but let's ignore that for now: basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.
[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]
SVD: A = U Σ V^T
RECSYS 54
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
where q_i = row i of Q and p_x = column x of P^T.
[Figure: the missing entry of R is the dot product of row i of Q and column x of P^T; in the example, the estimate is 2.4]
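In code, the estimate is just a dot product (Python sketch with made-up shapes and values):

import numpy as np

Q = np.array([[1.0, 0.2, -0.5],    # q_i: one row of factors per item
              [0.4, -1.0, 0.3]])
P = np.array([[0.6, 1.1, 0.0],     # p_x: one row of factors per user
              [-0.2, 0.5, 0.9]])

def predict(x, i):
    return Q[i] @ P[x]             # r_hat_xi = sum_f q_if * p_xf

print(predict(x=0, i=1))           # 0.4*0.6 + (-1.0)*1.1 + 0.3*0.0 = -0.86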
RECSYS 57
Latent Factor Models
[Figure: the movies of the utility matrix mapped into the latent space spanned by Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny)]
RECSYS 59
Recap: SVD
Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.
SVD gives the minimum reconstruction error (SSE):
 min_{U,Σ,V} Σ_ij ( A_ij − [U Σ V^T]_ij )²
So, in our case, "SVD" on the Netflix data R ≈ Q · P^T means A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T.
But we are not done yet: the sum goes over all entries, and our R has missing entries!
[Figure: A (m × n) ≈ U Σ V^T]
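A sketch of this correspondence on a fully observed toy matrix (Python; the matrix and f are arbitrary choices of ours):

import numpy as np

R = np.arange(30, dtype=float).reshape(6, 5)   # toy matrix, no missing entries
U, s, Vt = np.linalg.svd(R, full_matrices=False)
f = 2                                          # keep f factors
Q = U[:, :f]                                   # Q = U
Pt = np.diag(s[:f]) @ Vt[:f, :]                # P^T = Sigma V^T
R_hat = Q @ Pt                                 # best rank-f approximation in SSE
print(np.sum((R - R_hat) ** 2))                # reconstruction SSE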
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²,  with r̂_xi = q_i · p_x^T
Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]
RECSYS 61
Dealing with Missing Entries
Want to minimize SSE for unseen test data.
Idea: minimize SSE on the training data. We want a large f (# of factors) to capture all the signals, but SSE on the test data begins to rise for f > 2.
Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce
 min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
   "error"   "length"
λ … regularization parameter
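Evaluating this objective for given factors is straightforward (Python sketch; objective and lam are our names; P and Q are numpy arrays of per-user/per-item factor rows):

def objective(ratings, P, Q, lam):
    # ratings: observed training triples (x, i, r); lam = lambda
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    length = (P ** 2).sum() + (Q ** 2).sum()
    return error + lam * length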
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[Figure: movies and two users (Gus, Dave) placed in the latent space, one axis from "geared towards females" to "geared towards males", the other from "serious" to "escapist"]
RECSYS 64
The Effect of Regularization
 min_factors "error" + λ · "length":
 min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
[Figure: a sequence of frames showing the movies' positions in the latent space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny) as regularization is applied]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q minimizing:
 min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
Gradient descent:
• Initialize P and Q (using SVD; pretend the missing ratings are 0)
• Do gradient descent:
 – P ← P − η·∇P
 – Q ← Q − η·∇Q
 – where ∇Q is the gradient/derivative of the matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_x −2 ( r_xi − q_i·p_x^T ) p_xf + 2λ q_if
 – Here q_if is entry f of row q_i of matrix Q
How to compute the gradient of a matrix? Compute the gradient of every element independently!
Observation: computing gradients is slow!
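A sketch of one such batch step for Q (Python; gd_step_Q, eta, and lam are our names; the step for P is symmetric):

import numpy as np

def gd_step_Q(ratings, P, Q, eta=0.01, lam=0.1):
    grad = 2 * lam * Q                            # regularization term
    for x, i, r in ratings:                       # full pass over all ratings
        grad[i] -= 2 * (r - Q[i] @ P[x]) * P[x]   # ... which is why GD is slow
    return Q - eta * grad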
RECSYS 69
Digression: the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD:
Observation: ∇Q = [∇q_if], where
 ∇q_if = Σ_x −2 ( r_xi − q_i·p_x^T ) p_xf + 2λ q_if = Σ_x ∇Q(r_xi)
Here q_if is entry f of row q_i of matrix Q.
GD: Q ← Q − η · Σ_{r_xi} ∇Q(r_xi)
Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
SGD: Q ← Q − η · ∇Q(r_xi)
Faster convergence: SGD needs more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD
Convergence of GD vs. SGD:
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend the missing ratings are 0)
• Then iterate over the ratings (multiple times, if necessary) and update the factors. For each r_xi:
 – ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
 – q_i ← q_i + η ( ε_xi p_x − λ q_i )   (update equation)
 – p_x ← p_x + η ( ε_xi q_i − λ p_x )   (update equation)
2 for loops:
• For until convergence:
 – For each r_xi: compute the gradient, do a "step"
η … learning rate
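Put together, the whole training loop looks like this (Python sketch; train_sgd and the values of f, eta, lam, epochs, and the random initialization are our illustrative choices):

import numpy as np

def train_sgd(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            eps = r - Q[i] @ P[x]                  # error
            qi = Q[i].copy()
            Q[i] += eta * (eps * P[x] - lam * Q[i])   # update equations
            P[x] += eta * (eps * qi - lam * P[x])
    return P, Q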
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]
RECSYS 76
Summary: Recommendations via Optimization
Goal: make good recommendations.
• Quantify goodness using SSE: lower SSE means better recommendations.
• We want to make good recommendations on items that some user has not yet seen.
• Let's set the values of P and Q so that they work well on the known (user, item) ratings, and hope that these values of P and Q will predict the unknown ratings well.
This is a case where we apply optimization methods.
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 46
Collaborative Filtering Complexity
Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system
Too expensive to do at runtime Could pre-compute using clustering as approx
Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points
We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction
1162019
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 47
Tip Add Data Leverage all the data
Donrsquot try to reduce data size in an effort to make fancy algorithms work
Simple methods on large data do best
Add more data eg add IMDB data on genres
More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml
RECSYS 48
Recommender SystemsLatent Factor Models
CS246 Mining Massive DatasetsJure Leskovec Stanford University
httpcs246stanfordedu
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
 • Mean rating: µ = 3.7
 • You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 • Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
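The worked example maps directly to one line of arithmetic; a tiny sketch with the numbers from the slide:

    # Baseline prediction r_hat = mu + b_x + b_i (values from the example above)
    mu, b_x, b_i = 3.7, -1.0, 0.5
    r_hat = mu + b_x + b_i
    print(r_hat)   # 3.2, the predicted rating for you on Star Wars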
RECSYS 105
Fitting the New Model
Solve:

 min_{Q,P} Σ_(x,i)∈R ( r_xi − (µ + b_x + b_i + q_i · p_x^T) )²        (goodness of fit)
           + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )      (regularization)

 Stochastic gradient descent to find the parameters (a code sketch follows below)
 Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set
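A hedged sketch of fitting this objective with SGD, extending the earlier update rules with steps for the biases; the function name, constants, and zero-initialized biases are illustrative assumptions, not the BellKor code:

    import numpy as np

    def sgd_biased(ratings, n_users, n_items, f=10, eta=0.005, lam=0.05, epochs=20):
        rng = np.random.default_rng(0)
        P = rng.normal(scale=0.1, size=(n_users, f))
        Q = rng.normal(scale=0.1, size=(n_items, f))
        b_user = np.zeros(n_users)
        b_item = np.zeros(n_items)
        mu = np.mean([r for _, _, r in ratings])       # overall mean rating
        for _ in range(epochs):
            for x, i, r in ratings:
                err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
                b_user[x] += eta * (err - lam * b_user[x])   # biases are parameters too
                b_item[i] += eta * (err - lam * b_item[i])
                q_old = Q[i].copy()
                Q[i] += eta * (err * P[x] - lam * Q[i])
                P[x] += eta * (err * q_old - lam * P[x])
        return mu, b_user, b_item, P, Q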
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
 • Global effects: overall deviations of users/movies
 • Factorization: addressing "regional" effects
 • Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
(Plot: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE falls from about 0.92 towards 0.885 as parameters are added.)
RECSYS 108
Performance of Various Methods (RMSE)
 Global average:                1.1296
 User average:                  1.0651
 Movie average:                 1.0533
 Netflix (Cinematch):           0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++:     0.91
 Latent factors:                0.90
 Latent factors + Biases:       0.89
 Grand Prize:                   0.8563
RECSYS 109
Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004):
 • Improvements in the Netflix GUI
 • Meaning of rating changed
 Movie age:
 • It is not that users simply prefer newer movies;
 • older movies are just inherently better than newer ones
Y. Koren, "Collaborative Filtering with Temporal Dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
 Original model: r_xi = µ + b_x + b_i + q_i · p_x^T
 Add time dependence to the biases: r_xi = µ + b_x(t) + b_i(t) + q_i · p_x^T
 Make the parameters b_x and b_i depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, "Collaborative Filtering with Temporal Dynamics," KDD '09
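For concreteness, one common parameterization from the cited paper looks roughly as follows; the exact functional forms are an assumption recalled from the paper, not something stated on this slide:

    % item bias: a constant plus a per-bin offset, bins of 10 consecutive weeks
    b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}
    % user bias: a drift away from the user's mean rating date t_x
    b_x(t) = b_x + \alpha_x \cdot \mathrm{dev}_x(t), \qquad
    \mathrm{dev}_x(t) = \mathrm{sign}(t - t_x) \cdot |t - t_x|^{\beta}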
RECSYS 111
Adding Temporal Effects
(Plot: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; RMSE falls from about 0.92 towards 0.875.)
RECSYS 112
Performance of Various Methods (RMSE)
 Global average:                  1.1296
 User average:                    1.0651
 Movie average:                   1.0533
 Netflix (Cinematch):             0.9514
 Basic Collaborative filtering:   0.94
 Collaborative filtering++:       0.91
 Latent factors:                  0.90
 Latent factors + Biases:         0.89
 Latent factors + Biases + Time:  0.876
 Grand Prize:                     0.8563
Still no prize! Getting desperate... Try a "kitchen sink" approach
RECSYS 113
Effect of ensemble size
(Plot: Error (RMSE) vs. number of predictors, from 1 to 50; RMSE falls from about 0.8925 to about 0.87 as predictors are added to the ensemble.)
RECSYS 48
Recommender Systems: Latent Factor Models
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
RECSYS 49
Collaborative Filtering via Latent Factor Models (e.g., SVD)
(Figure: movies placed in a 2-D latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.)
RECSYS 50 (Koren, Bell & Volinsky, IEEE Computer, 2009)
RECSYS 51
The Netflix Utility Matrix R
(Figure: Matrix R, about 480,000 users × 17,700 movies; the matrix is mostly empty, with scattered known ratings on a 1-5 scale.)
RECSYS 52
Utility Matrix R: Evaluation
(Figure: Matrix R, 480,000 users × 17,700 movies, split into a training data set and a test data set; a held-out test entry r_3,6 is highlighted.)
 SSE = Σ_(x,i)∈R ( r̂_xi − r_xi )²
 where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i
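A minimal sketch of this evaluation in Python, assuming predictions come from a predict(x, i) callable (a hypothetical interface):

    import numpy as np

    def sse(test_ratings, predict):
        """test_ratings: list of (x, i, r_xi); predict(x, i) returns r_hat_xi."""
        return sum((predict(x, i) - r) ** 2 for x, i, r in test_ratings)

    def rmse(test_ratings, predict):
        # RMSE, the criterion Netflix actually scored, is just normalized SSE
        return float(np.sqrt(sse(test_ratings, predict) / len(test_ratings)))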
RECSYS 53
Latent Factor Models
 "SVD" on Netflix data: R ≈ Q · P^T
 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T
 R has missing entries, but let's ignore that for now
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values of the missing ones
(Figure: the items × users matrix R written as Q (items × f factors) times P^T (f factors × users); compare SVD: A = U Σ V^T.)
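If R had no missing entries, the factorization could be read straight off an SVD; a small numpy sketch under that assumption (toy values, and f = 2 chosen arbitrarily):

    import numpy as np

    # Toy, fully observed items x users matrix (values illustrative only)
    R = np.array([[4., 5., 5., 3., 1.],
                  [3., 1., 2., 4., 4.],
                  [2., 4., 5., 4., 2.]])
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    f = 2
    Q = U[:, :f]                     # items x f   (Q = U)
    Pt = np.diag(s[:f]) @ Vt[:f]     # f x users   (P^T = Sigma V^T)
    R_hat = Q @ Pt                   # best rank-f approximation in the SSE sense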
RECSYS 54
Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 where q_i = row i of Q and p_x = column x of P^T (a toy dot-product example follows below)
(Figure: R ≈ Q · P^T with f factors; row q_i and column p_x are highlighted, and their dot product, here 2.4, fills in the missing entry.)
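The estimate itself is just a dot product; a two-factor toy example with assumed numbers:

    import numpy as np

    # r_hat_xi = q_i . p_x^T = sum over factors f of q_if * p_xf
    q_i = np.array([1.2, -0.5])    # row i of Q (hypothetical values)
    p_x = np.array([1.5, 1.8])     # column x of P^T (hypothetical values)
    r_hat = float(q_i @ p_x)       # 1.2*1.5 + (-0.5)*1.8 = 0.9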
RECSYS 57
Latent Factor Models
(Figure: the same movies placed in the plane of factor 1 vs. factor 2; factor 1 runs from "geared towards females" to "geared towards males", factor 2 from "serious" to "funny".)
RECSYS 59
Recap SVD
 Remember SVD: A = input data matrix, U = left singular vecs, V = right singular vecs, Σ = singular values
 SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij ( A_ij − (U Σ V^T)_ij )²
 So in our case, "SVD" on Netflix data R ≈ Q · P^T means: A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T
 But we are not done yet! The sum goes over all entries, and our R has missing entries
(Figure: A (m × n) ≈ U Σ V^T.)
RECSYS 60
Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q: min_{P,Q} Σ_(x,i)∈R ( r_xi − q_i · p_x^T )²
 Note:
 • We don't require the cols of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • This was the most popular model among Netflix contestants
(Figure: R ≈ Q · P^T, with r̂_xi = q_i · p_x^T.)
RECSYS 61
Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
 • We want a large f (# of factors) to capture all the signals
 • But SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
 • Allow a rich model where there are sufficient data
 • Shrink aggressively where data are scarce
 min_{P,Q} Σ_training ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
 λ … regularization parameter; the first term is the "error", the second the "length" (a code rendering follows below)
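Written as code, the regularized training objective is compact; this sketch (the name and the dict-of-ratings representation are assumptions) simply evaluates it for a given P and Q:

    import numpy as np

    def objective(known, Q, P, lam):
        """known: dict mapping (x, i) -> r_xi; Q: items x f; P: users x f."""
        error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in known.items())   # "error"
        length = (P ** 2).sum() + (Q ** 2).sum()                             # "length"
        return error + lam * length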
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
(Figure: users Gus and Dave embedded in the same latent space as the movies; the axes run from "geared towards females" to "geared towards males" and from "serious" to "escapist".)
RECSYS 64
The Effect of Regularization
 min_factors "error" + λ "length"
 min_{P,Q} Σ_training ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
(Figure: the latent space from before, factor 1 vs. factor 2; over the animation steps, sparsely rated items such as The Princess Diaries are shrunk towards the origin.)
RECSYS 68
Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q:
 min_{P,Q} Σ_training ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
 Gradient descent:
 • Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
 • Do gradient descent:
  – P ← P − η · ∇P
  – Q ← Q − η · ∇Q
  – where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_x,i −2 ( r_xi − q_i · p_x^T ) p_xf + 2λ q_if
  – here q_if is entry f of row q_i of matrix Q
 How to compute the gradient of a matrix? Compute the gradient of every element independently! (a batch-step sketch follows below)
 Observation: Computing gradients is slow!
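A sketch of one such batch step in numpy, computing every element's gradient at once via a 0/1 mask of known entries; the mask representation and the constants are assumptions:

    import numpy as np

    def gd_step(R, M, Q, P, eta=0.001, lam=0.1):
        """R: items x users ratings (0 where unknown); M: 0/1 mask of known entries;
        Q: items x f; P: users x f. Returns the updated (Q, P)."""
        E = M * (R - Q @ P.T)                  # residuals on known entries only
        grad_Q = -2 * E @ P + 2 * lam * Q      # one entry of the gradient per element of Q
        grad_P = -2 * E.T @ Q + 2 * lam * P
        return Q - eta * grad_Q, P - eta * grad_P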
RECSYS 69
Digression: see the notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_x,i −2 ( r_xi − q_i · p_x^T ) p_xf + 2λ q_if = Σ_x,i ∇Q( r_xi )
 • Here q_if is entry f of row q_i of matrix Q
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
 GD:  Q ← Q − η Σ_r_xi ∇Q( r_xi )
 SGD: Q ← Q − η ∇Q( r_xi )
 Faster convergence!
 • Needs more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
(Plot: value of the objective function vs. iteration/step for both methods.)
 GD improves the value of the objective function at every step.
 SGD improves the value, but in a "noisy" way.
 GD takes fewer steps to converge, but each step takes much longer to compute.
 In practice, SGD is much faster!
RECSYS 49
Geared towards females
Geared towards males
Serious
Funny
Collaborative Filtering viaLatent Factor Models (eg SVD)
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 50Koren Bell Volinksy IEEE Computer 2009
RECSYS 51
The Netflix Utility Matrix R
1 3 4
3 5 5
4 5 5
3
3
2 2 2
5
2 1 1
3 3
1
480000 users
17700 movies
1162019
Matrix R
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004): improvements in Netflix, GUI improvements, the meaning of a rating changed.
Movie age: do users simply prefer new movies without any particular reason, or are older movies just inherently better than newer ones?
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: rxi = μ + bx + bi + qi · pxT
Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · pxT
Make the parameters bx and bi depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.
Add temporal dependence to the factors: px(t) … user preference vector on day t.
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
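A hedged sketch of what such time-dependent biases could look like in code, following the slide's linear-trend and 10-week-bin parameterization; the names (b_item_bin, alpha, mean_day) are illustrative assumptions, not the paper's exact notation:

```python
DAYS_PER_BIN = 70  # 10 consecutive weeks

def movie_bias(b_item, b_item_bin, i, t):
    """b_i(t) = b_i + b_{i,Bin(t)}: static part plus a per-bin offset."""
    return b_item[i] + b_item_bin[i][t // DAYS_PER_BIN]

def user_bias(b_user, alpha, mean_day, x, t):
    """b_x(t) = b_x + alpha_x * (t - mean rating day of x): linear trend."""
    return b_user[x] + alpha[x] * (t - mean_day[x])
```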
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87–0.8925) vs. number of predictors in the ensemble (1–50)]
RECSYS 51
The Netflix Utility Matrix R
[Figure: the sparse utility matrix R, 480,000 users × 17,700 movies; known ratings (1–5) are scattered among mostly missing entries]
RECSYS 52
Utility Matrix R Evaluation
[Figure: the utility matrix R (480,000 users × 17,700 movies) split into a training data set (known ratings) and a held-out test data set; the entry r̂3,6 marks a predicted rating]
SSE = Σ(x,i)∈Test ( r̂xi − rxi )²
where r̂xi is the predicted rating and rxi the true rating of user x on item i.
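A minimal sketch of this evaluation in Python, assuming the test set is an iterable of (x, i, true rating) triples and predict(x, i) returns the model's estimate (both assumptions of mine):

```python
def sse(test_ratings, predict):
    """Sum of squared errors between predicted and true test ratings."""
    return sum((predict(x, i) - r) ** 2 for x, i, r in test_ratings)
```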
RECSYS 53
Latent Factor Models
"SVD" on Netflix data: R ≈ Q · PT
For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT. R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.
[Figure: R (items × users, with missing entries) ≈ Q (items × f factors) · PT (f factors × users); cf. SVD: A = U Σ VT]
RECSYS 54
Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂xi = qi · pxT = Σf qif · pxf
where qi = row i of Q (f factors) and px = column x of PT (f factors).
[Figure: the missing entry of R (items × users) is estimated from row i of Q and column x of PT]
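In code, the estimate is just a dot product; a minimal sketch assuming Q is an (items × f) and P a (users × f) numpy array (my own layout choice):

```python
import numpy as np

def predict(Q, P, i, x):
    """Estimate r_hat(x, i) = q_i . p_x = sum_f q_if * p_xf."""
    return Q[i] @ P[x]
```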
RECSYS 55–56
Ratings as Products of Factors (cont.)
[Figure, animated across two slides: multiplying row qi of Q by column px of PT produces the estimate for the highlighted missing entry of R, here r̂xi = 2.4]
RECSYS 57
Latent Factor Models
[Figure, shown twice (RECSYS 57–58): movies placed in a two-factor latent space. Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny". Examples: The Color Purple, Amadeus, Sense and Sensibility, The Princess Diaries, The Lion King, Ocean's 11, Independence Day, Braveheart, Lethal Weapon, Dumb and Dumber]
RECSYS 59
Recap SVD
Remember SVD: A = U Σ VT, where A is the input data matrix (m × n), U are the left singular vectors, V the right singular vectors, and Σ the singular values.
SVD gives the minimum reconstruction error (SSE):  min over U, V, Σ of Σij ( Aij − [U Σ VT]ij )²
So in our case, "SVD" on Netflix data R ≈ Q · PT with A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT.
But we are not done yet! The SSE sum goes over all entries, and our R has missing entries.
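A small numpy sketch of this recap on a toy, fully observed matrix: the rank-f truncation of the SVD attains the minimum reconstruction SSE among all rank-f matrices (the toy data and f are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))                         # toy fully observed matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = 2
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]    # rank-f approximation
print(((A - A_f) ** 2).sum())                  # reconstruction SSE
```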
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:  min over P, Q of Σ(x,i)∈R ( rxi − qi · pxT )²
Note:
• We don't require the columns of P, Q to be orthogonal or unit length.
• P, Q map users/movies to a latent space.
• This was the most popular model among Netflix contestants.
[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users), with r̂xi = qi · pxT fitted only on the known entries]
RECSYS 61
Dealing with Missing Entries: We want to minimize SSE for unseen test data.
Idea: Minimize SSE on the training data. We want a large f (number of factors) to capture all the signals, but SSE on the test data begins to rise for f > 2.
Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.
min over P, Q of  Σtraining ( rxi − qi · pxT )²  +  λ ( Σx ‖px‖² + Σi ‖qi‖² )
("error" term + λ · "length" term; λ … regularization parameter)
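A vectorized sketch of this training objective, assuming a 0/1 mask M that marks the known entries of R (items × users); the layout and names are my own:

```python
import numpy as np

def regularized_loss(R, M, P, Q, lam):
    E = M * (R - Q @ P.T)          # residuals on known ratings only
    return (E ** 2).sum() + lam * ((P ** 2).sum() + (Q ** 2).sum())
```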
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[Figure: movies and users (Gus, Dave) embedded in the same two-factor latent space (geared towards females/males vs. serious/escapist), so a user can be matched to nearby movies; shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Dumb and Dumber]
RECSYS 64–67
The Effect of Regularization
minfactors "error" + λ · "length":  min over P, Q of Σtraining ( rxi − qi · pxT )² + λ ( Σx ‖px‖² + Σi ‖qi‖² )
[Figure, animated across four slides: movies in the two-factor latent space (geared towards females/males vs. serious/funny); with regularization, sparsely rated movies such as The Princess Diaries are shrunk towards the origin]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:
min over P, Q of Σtraining ( rxi − qi · pxT )² + λ ( Σx ‖px‖² + Σi ‖qi‖² )
Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative with respect to matrix Q: ∇Q = [∇qif] with ∇qif = Σx,i −2 ( rxi − qi · pxT ) pxf + 2λ qif. Here qif is entry f of row qi of matrix Q.
How to compute the gradient of a matrix? Compute the gradient of every element independently.
Observation: Computing the gradients is slow.
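A minimal sketch of the initialization step above, assuming the missing ratings have already been filled with 0 in R_filled (items × users); folding the singular values into P is my own choice so that R ≈ Q · PT:

```python
import numpy as np

def init_factors(R_filled, f):
    """Truncated-SVD initialization: R_filled ~ Q @ P.T with f factors."""
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    Q = U[:, :f]                          # items x f
    P = (np.diag(s[:f]) @ Vt[:f, :]).T    # users x f
    return P, Q
```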
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.
RECSYS 70
(Batch) Gradient Descent
(As above: initialize P and Q via SVD with missing ratings set to 0, then repeatedly step P ← P − η · ∇P and Q ← Q − η · ∇Q. Computing the full gradients over all ratings at every step is slow.)
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇qif], where ∇qif = Σx,i −2 ( rxi − qi · pxT ) pxf + 2λ qif = Σx,i ∇Q(rxi), i.e., the full gradient is a sum of per-rating contributions ∇Q(rxi). Here qif is entry f of row qi of matrix Q.
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
GD:  Q ← Q − η · Σrxi ∇Q(rxi)
SGD: Q ← Q − η · ∇Q(rxi)
Faster convergence!
• Needs more steps, but each step is computed much faster.
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
  εxi = rxi − qi · pxT   (derivative of the "error")
  qi ← qi + η ( εxi px − λ qi )   (update equation)
  px ← px + η ( εxi qi − λ px )   (update equation)
Two for-loops: for until convergence: for each rxi: compute the gradient, do a "step".
η … learning rate
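Putting the two loops together, a minimal Python sketch of this SGD procedure; ratings as (x, i, r) triples, eta, lam, and epochs are illustrative assumptions:

```python
import numpy as np

def sgd(ratings, P, Q, eta=0.01, lam=0.1, epochs=20):
    """Per-rating SGD updates for P (users x f) and Q (items x f)."""
    for _ in range(epochs):                # "for until convergence"
        for x, i, r in ratings:            # "for each r_xi"
            err = r - Q[i] @ P[x]          # eps_xi
            qi = Q[i].copy()               # keep old q_i for the p_x update
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * qi - lam * P[x])
    return P, Q
```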
RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen.
So: let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope that these values of P and Q will also predict the unknown ratings well.
This is a case where we apply optimization methods.
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for a 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for an 8.43% improvement
[Table: sample (user, movie, score) triples for the training data (scores known) and the test data (scores withheld)]
Movie rating data
Overall rating distribution
• A third of the ratings are 4s
• The average rating is 3.68
From TimelyDevelopment.com
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 52
Utility Matrix R Evaluation
1 3 4
3 5 5
4 5 5
3
3
2
2 1
3
1
Test Data Set
SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2
1162019
480000 users
17700 movies
Predicted rating
True rating of user x on item i
119955119955120785120785120788120788
Matrix R
Training Data Set
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 53
Latent Factor Models
ldquoSVDrdquo on Netflix data R asymp Q middot PT
For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT
R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on
known ratings and we donrsquot care about the values on the missing ones
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
users
item
s
PT
Q
item
s
users
R
SVD A = U Σ VT
f factors
ffactors
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 54
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
ffac
tors
Qf factors
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57–58
Latent Factor Models
[Figure: movies placed in a 2-D latent factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 from "serious" to "funny". Example movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
RECSYS 59
Recap: SVD
Remember SVD: A ≈ U Σ VT
  A: input data matrix (m × n); U: left singular vectors; V: right singular vectors; Σ: singular values
SVD gives the minimum reconstruction error (SSE): min over U, V, Σ of Σxi (Axi − (U Σ VT)xi)²
So, in our case, "SVD" on the Netflix data means R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT.
But we are not done yet: the SSE sum goes over all entries, and our R has missing entries!
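To make the recap concrete, here is a minimal numpy sketch (illustrative, not from the slides) of the rank-k reconstruction and the SSE it minimizes, on a fully observed toy matrix:

import numpy as np

A = np.random.rand(6, 5)                     # toy, fully observed "ratings" matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # keep the k strongest factors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A

sse = np.sum((A - A_k) ** 2)                 # the reconstruction error SVD minimizes
print(f"rank-{k} SSE: {sse:.4f}")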
RECSYS 60
Latent Factor Models
SVD isn't defined when entries are missing!
Use specialized methods to find P, Q: min over P, Q of Σ(x,i)∈R (rxi − qi · pxT)², with r̂xi = qi · pxT
Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants
RECSYS 61
Dealing with Missing Entries
Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2
Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce
min over P, Q of Σ(x,i)∈training (rxi − qi · pxT)²  +  λ [Σx ‖px‖² + Σi ‖qi‖²]
  "error"                                               "length"
λ … regularization parameter
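As a sanity check on the objective above, a small Python sketch (the helper name and data layout are mine) that evaluates the regularized SSE over the observed ratings only:

import numpy as np

def objective(ratings, P, Q, lam):
    # ratings: list of observed (x, i, r_xi) triples; P: users x f; Q: items x f
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)  # "error" term
    length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))            # lambda * "length"
    return error + length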
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[Figure: users Gus and Dave placed in the same 2-D latent space as the movies (axes: geared towards females/males, serious/escapist); a user is matched to nearby movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
RECSYS 64–67
The Effect of Regularization
minfactors "error" + λ · "length":  min over P, Q of Σ(x,i)∈training (rxi − qi · pxT)² + λ [Σx ‖px‖² + Σi ‖qi‖²]
[Figure sequence over four animation steps: the same 2-D latent space (Factor 1: geared towards females/males; Factor 2: serious/funny) with The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; as λ grows, factors estimated from little data are shrunk toward the origin.]
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q: min over P, Q of Σ(x,i)∈training (rxi − qi · pxT)² + λ [Σx ‖px‖² + Σi ‖qi‖²]
Gradient descent:
• Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif], and ∇qif = Σ(x,i) −2 (rxi − qi · pxT) pxf + 2λ qif
  Here qif is entry f of row qi of matrix Q
How to compute the gradient of a matrix? Compute the gradient of every element independently!
Observation: Computing gradients is slow!
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.
RECSYS 70
(Batch) Gradient Descent
Recap of RECSYS 68: find P and Q by initializing them (e.g., from SVD, with missing ratings set to 0) and repeatedly stepping P ← P − η · ∇P and Q ← Q − η · ∇Q, with ∇qif = Σ(x,i) −2 (rxi − qi · pxT) pxf + 2λ qif. Each step sums over every observed rating, which is why computing gradients is slow. A sketch of one such step follows.
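A sketch of one batch step under the gradient above (assuming the same (x, i, r) triple format as earlier; an illustration, not the contest code):

import numpy as np

def gd_step(ratings, P, Q, lam, eta):
    gP = 2 * lam * P                    # regularization part of the gradient for P
    gQ = 2 * lam * Q                    # ... and for Q
    for x, i, r in ratings:             # error part: sum over all observed ratings
        e = r - Q[i] @ P[x]             # (r_xi - q_i . p_x)
        gQ[i] -= 2 * e * P[x]           # accumulate -2 (r_xi - q_i . p_x) p_x
        gP[x] -= 2 * e * Q[i]
    return P - eta * gP, Q - eta * gQ   # P <- P - eta*gradP, Q <- Q - eta*gradQ

The full pass over all ratings per step is exactly the cost that motivates the stochastic variant on the next slide.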
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇qif], where ∇qif = Σ(x,i) [−2 (rxi − qi · pxT) pxf + 2λ qif] = Σ(x,i) ∇Q(rxi) — the full gradient is a sum of per-rating gradients.
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
GD:  Q ← Q − η · Σ(x,i) ∇Q(rxi)
SGD: Q ← Q − η · ∇Q(rxi), one rating at a time
Faster convergence!
• Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD (smooth, monotone) and SGD (noisy).]
GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (e.g., using SVD, pretending missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
  εxi = rxi − qi · pxT    (the prediction "error")
  qi ← qi + η (εxi px − λ qi)    (update equation)
  px ← px + η (εxi qi − λ px)    (update equation)
η … learning rate
2 for loops:
• For until convergence:
  • For each rxi: compute the gradient, do a "step"
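Putting the update equations above into a runnable Python sketch (f, η, λ, the epoch count, and the random initialization are illustrative choices, not the contest settings; the slides suggest initializing from SVD instead):

import numpy as np

def sgd(ratings, n_users, n_items, f=10, eta=0.01, lam=0.1, epochs=20):
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))    # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))    # item factors q_i
    for _ in range(epochs):                         # "for until convergence"
        for x, i, r in ratings:                     # for each rating r_xi
            e = r - Q[i] @ P[x]                     # eps_xi = r_xi - q_i . p_x
            Q[i] += eta * (e * P[x] - lam * Q[i])   # q_i <- q_i + eta(eps p_x - lam q_i)
            P[x] += eta * (e * Q[i] - lam * P[x])   # p_x <- p_x + eta(eps q_i - lam p_x)
    return P, Q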
RECSYS 75
[Figure from Koren, Bell, Volinsky, IEEE Computer, 2009.]
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set the values of P and Q so that they work well on the known (user, item) ratings
• And hope these values of P and Q will also predict the unknown ratings well
This is a case where we apply optimization methods.
[Toy utility matrix with a few known ratings and many blanks.]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
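The evaluation criterion in code, for concreteness (a straightforward sketch):

import numpy as np

def rmse(predicted, actual):
    # root mean squared error over a set of (prediction, truth) pairs
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

print(rmse([3.8, 2.1, 4.9], [4, 2, 5]))   # ~0.141 on this toy example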
[Two sample tables, one of training data and one of test data, each with columns (user, movie, score); the test-set scores are withheld and must be predicted.]
Movie rating data
Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68
# ratings per movie
• Avg ratings/movie: 5,627
# ratings per user
• Avg ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending the Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r̂xi = µ + bx + bi + qi · pxT
  µ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · pxT = user–movie interaction
Baseline predictor (µ + bx + bi):
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
User–movie interaction (qi · pxT):
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
r̂xi = µ + bx + bi + qi · pxT
Example:
• Mean rating: µ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, so bx = −1
• Star Wars gets a mean rating 0.5 higher than the average movie, so bi = +0.5
• Predicted rating for you on Star Wars (before the interaction term): 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:
min over Q, P, b of Σ(x,i)∈R (rxi − (µ + bx + bi + qi · pxT))²  +  λ [Σi ‖qi‖² + Σx ‖px‖² + Σx bx² + Σi bi²]
  goodness of fit                                                    regularization
λ is selected via grid-search on a validation set.
Use stochastic gradient descent to find the parameters.
Note: both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them).
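A sketch of the corresponding bias-aware SGD update (the names and the single shared η and λ are my simplifications; in practice separate learning rates and regularization strengths per parameter type were tuned):

def sgd_step_biased(r, x, i, mu, b_user, b_item, P, Q, eta=0.005, lam=0.02):
    # error for the model r_hat = mu + b_x + b_i + q_i . p_x
    e = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (e - lam * b_user[x])   # user bias b_x
    b_item[i] += eta * (e - lam * b_item[i])   # movie bias b_i
    Q[i] += eta * (e * P[x] - lam * Q[i])      # interaction factors q_i
    P[x] += eta * (e * Q[i] - lam * P[x])      # interaction factors p_x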
RECSYS 106
BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]
RECSYS 108
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed
Movie age: ratings rise with the age of the movie. Possible explanations:
• Users simply prefer new movies, without any quality-related reason
• Older movies are just inherently better than newer ones
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: r̂xi = µ + bx + bi + qi · pxT
Add time dependence to the biases: r̂xi = µ + bx(t) + bi(t) + qi · pxT
Make the parameters bx and bi depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: px(t) … user preference vector on day t
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
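One common way to realize the binned bias of point (2), sketched under the assumption of Koren's bi(t) = bi + bi,Bin(t) form from the KDD '09 paper (the day arithmetic and names here are illustrative):

DAYS_PER_BIN = 70                        # 10 consecutive weeks per bin

def movie_bias(b_item, b_item_bin, i, day):
    # b_item[i]: static bias of movie i; b_item_bin[i][k]: offset in time bin k
    k = day // DAYS_PER_BIN              # Bin(t)
    return b_item[i] + b_item_bin[i][k]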
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate. Try a "kitchen sink" approach!
RECSYS 113
Effect of Ensemble Size
[Plot: error (RMSE, 0.87–0.8925) vs. number of predictors (1–50); RMSE falls steadily as more predictors are blended into the ensemble.]
RECSYS 55
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
PT
ffac
tors
Qf factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 56
Ratings as Products of Factors How to estimate the missing rating of
user x for item i
1162019
45531
312445
53432142
24542
522434
42331
item
s
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421
asymp
item
s
users
users
QPT
24
ffac
tors
f factors
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
= 119943119943
119954119954119946119946119943119943 sdot 119953119953119961119961119943119943
qi = row i of Qpx = column x of PT
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 57
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors. For each $r_{xi}$:
 • $\varepsilon_{xi} = r_{xi} - q_i \cdot p_x^T$ (the "error"; it appears in the derivative)
 • $q_i \leftarrow q_i + \eta\left(\varepsilon_{xi}\, p_x - \lambda\, q_i\right)$ (update equation)
 • $p_x \leftarrow p_x + \eta\left(\varepsilon_{xi}\, q_i - \lambda\, p_x\right)$ (update equation)
 Two for loops: for ... until convergence:
 • For each $r_{xi}$: compute the gradient, do a "step" (see the sketch below)
η … learning rate
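A matching stochastic sketch of the same loop, again with illustrative data and hyperparameters rather than anything prescribed by the slides:

```python
import numpy as np

# Same toy setup as in the batch sketch above (illustrative data).
R = np.array([[1., 3., 0., 4.],
              [3., 5., 5., 0.],
              [4., 5., 0., 3.],
              [0., 2., 1., 3.]])
f, eta, lam = 2, 0.01, 0.1
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], f))
Q = rng.normal(scale=0.1, size=(R.shape[1], f))
obs = list(zip(*np.nonzero(R)))           # observed (user x, item i) pairs

for epoch in range(100):                  # outer loop: "until convergence"
    rng.shuffle(obs)                      # visit the ratings in random order
    for x, i in obs:                      # inner loop: "for each r_xi"
        err = R[x, i] - P[x] @ Q[i]               # eps_xi = r_xi - q_i.p_x^T
        p_x = P[x].copy()                         # keep old p_x for the q_i update
        P[x] += eta * (err * Q[i] - lam * P[x])   # p_x <- p_x + eta*(eps*q_i - lam*p_x)
        Q[i] += eta * (err * p_x - lam * Q[i])    # q_i <- q_i + eta*(eps*p_x - lam*q_i)
```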
RECSYS 75
Koren, Bell, Volinsky: IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
 Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
 We want to make good recommendations on items that some user has not yet seen.
 So let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope that these values of P and Q will also predict the unknown ratings well.
 This is a case where we apply optimization methods.
[Example: sparse utility matrix of user–movie ratings with missing entries]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You Ought To Be Watching This Summer
Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE), defined below
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for a 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for an 8.43% improvement
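For reference, RMSE over the test set $T$ is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(x,i)\in T} \left(\hat{r}_{xi} - r_{xi}\right)^2}$$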
Movie rating data
[Table: sample (user, movie, score) triples from the training data, and (user, movie, ?) pairs from the test data whose scores must be predicted]
Overall rating distribution:
• A third of the ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
Number of ratings per movie:
• Avg. ratings/movie: 5,627
Number of ratings per user:
• Avg. ratings/user: 208
Average movie rating by movie count:
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count | Avg. rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
μ = overall mean rating; $b_x$ = bias of user x; $b_i$ = bias of movie i; $q_i \cdot p_x^T$ = user-movie interaction
User-movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations
Baseline predictor ($\mu + b_x + b_i$):
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor: We have expectations for the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example: Mean rating µ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so $b_x = -1$. Star Wars gets a mean rating 0.5 higher than the average movie, so $b_i = +0.5$. Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
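As a tiny sketch of the same computation (the function name is illustrative, not from the slides):

```python
def baseline_rating(mu: float, b_x: float, b_i: float) -> float:
    """Baseline prediction: overall mean + user bias + movie bias."""
    return mu + b_x + b_i

# The Star Wars example from the slide:
print(baseline_rating(mu=3.7, b_x=-1.0, b_i=0.5))   # -> 3.2
```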
RECSYS 105
Fitting the New Model
Solve:
$$\min_{Q,P} \sum_{(x,i)\in R} \left(r_{xi} - (\mu + b_x + b_i + q_i\, p_x^T)\right)^2 + \lambda\left(\sum_i \lVert q_i\rVert^2 + \sum_x \lVert p_x\rVert^2 + \sum_x b_x^2 + \sum_i b_i^2\right)$$
The first term measures goodness of fit; the λ term is the regularization. λ is selected via grid-search on a validation set.
Use stochastic gradient descent to find the parameters. Note: both the biases $b_x$, $b_i$ and the interactions $q_i$, $p_x$ are treated as parameters (we estimate them).
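A sketch of the corresponding SGD updates for this extended objective (the slides do not spell them out; these are the standard per-rating steps, with the factor of 2 absorbed into η): given $\varepsilon_{xi} = r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x^T)$,
$$b_x \leftarrow b_x + \eta\,(\varepsilon_{xi} - \lambda\, b_x) \qquad b_i \leftarrow b_i + \eta\,(\varepsilon_{xi} - \lambda\, b_i)$$
$$q_i \leftarrow q_i + \eta\,(\varepsilon_{xi}\, p_x - \lambda\, q_i) \qquad p_x \leftarrow p_x + \eta\,(\varepsilon_{xi}\, q_i - \lambda\, p_x)$$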
RECSYS 106
BellKor Recommender System: The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns
[Diagram: global effects → factorization → collaborative filtering]
RECSYS 107
Performance of Various Methods
[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases of Users
 Sudden rise in the average movie rating (early 2004):
 • Improvements in the Netflix GUI
 • Meaning of a rating changed
 Movie age:
 • Do users prefer new movies without any reason?
 • Or are older movies just inherently better than newer ones?
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases &amp; Factors
 Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$
 Add time dependence to the biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^T$
 Make the parameters $b_x$ and $b_i$ depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors: $p_x(t)$ … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
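Concretely, the parameterization in the cited paper is (up to notation): the movie bias gets a per-bin offset, and the user bias a linear drift term,
$$b_i(t) = b_i + b_{i,\mathrm{Bin}(t)} \qquad b_x(t) = b_x + \alpha_x \cdot \mathrm{dev}_x(t)$$
$$\mathrm{dev}_x(t) = \mathrm{sign}(t - t_x)\cdot \lvert t - t_x\rvert^{\beta}$$
where $t_x$ is the mean date of user x's ratings and β ≈ 0.4 is chosen on a validation set.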
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Figure: Error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 58
Geared towards females
Geared towards males
Serious
Funny
Latent Factor Models
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 59
Recap SVD
Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values
SVD gives minimum reconstruction error (SSE) min119880119880VΣ
sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942
So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT
A = R Q = U PT = Σ VT
But we are not done yet R has missing entries1162019
U
The sum goes over all entriesBut our R has missing entries
119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931
Am
n
Σm
n
VT
asymp
U
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 60
Latent Factor Models
SVD isnrsquot defined when entries are missing
Use specialized methods to find P Q min
119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2
Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants
1162019
45531
312445
53432142
24542
522434
42331
2-41
56-5
53-2
32111
-221-7
37-1
-924143-48-5-253-211
13-112-72914-131457-8
1-6784-3924176-421asymp
PTQ
users
item
s
119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879
f factors
ffactorsitem
s
users
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 61
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 62
Geared towards females
Geared towards males
serious
escapist
The PrincessDiaries
The Lion King
Braveheart
Independence Day
AmadeusThe Color Purple
Oceanrsquos 11
Sense and Sensibility
Gus
Dave
Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
1162019
Lethal Weapon
Dumb and Dumber
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster.
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
  Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
  Then iterate over the ratings (multiple times if necessary) and update the factors.
  For each r_xi:
  • ε_xi = r_xi − q_i·p_x   (from the derivative of the "error"; the factor 2 is absorbed into η)
  • q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
2 nested for-loops:
  For until convergence:
  • For each r_xi: compute gradient, do a "step"
η … learning rate
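The two nested loops translate directly into code; a minimal sketch (illustrative names, not the course's implementation), with both factor vectors read before either is updated, matching the simultaneous updates above:

    import numpy as np

    def sgd_epoch(ratings, P, Q, lam=0.1, eta=0.01):
        # ratings: iterable of (x, i, r_xi) triples over the known ratings.
        # P, Q: NumPy factor matrices (n_users x f, n_items x f).
        for x, i, r in ratings:
            px, qi = P[x].copy(), Q[i].copy()      # read old values first
            err = r - qi @ px                      # eps_xi = r_xi - q_i . p_x
            Q[i] += eta * (err * px - lam * qi)    # q_i update equation
            P[x] += eta * (err * qi - lam * px)    # p_x update equation
        return P, Q

    # Usage: sweep the known ratings until convergence
    rng = np.random.default_rng(0)
    P, Q = rng.normal(scale=0.1, size=(4, 2)), rng.normal(scale=0.1, size=(4, 2))
    ratings = [(0, 0, 1.0), (0, 1, 3.0), (1, 2, 5.0), (3, 3, 3.0)]
    for _ in range(50):
        P, Q = sgd_epoch(ratings, P, Q)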
RECSYS 75
[Slide figure credited to Koren, Bell & Volinsky, IEEE Computer, 2009]
RECSYS 76
Summary: Recommendations via Optimization
Goal: Make good recommendations
• Quantify goodness using SSE, so lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set values for P and Q such that they work well on the known (user, item) ratings
• And hope these values for P and Q will also predict the unknown ratings well
This is a case where we apply Optimization methods
[Example utility matrix with known ratings and missing entries]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge, 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data, 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
Movie rating data
[Tables: training data as (user, movie, score) triples; test data as (user, movie) pairs with the score withheld]
Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)
# ratings per movie
• Avg. ratings/movie: 5,627
# ratings per user
• Avg. ratings/user: 208
Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies
Count     Avg rating  Title
137,812   4.593       The Shawshank Redemption
133,597   4.545       Lord of the Rings: The Return of the King
180,883   4.306       The Green Mile
150,676   4.460       Lord of the Rings: The Two Towers
139,050   4.415       Finding Nemo
117,456   4.504       Raiders of the Lost Ark
180,736   4.299       Forrest Gump
147,932   4.433       Lord of the Rings: The Fellowship of the Ring
149,199   4.325       The Sixth Sense
144,027   4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
r_xi = μ + b_x + b_i + q_i·p_x
  μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i·p_x = user-movie interaction
User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
RECSYS 105
Fitting the New Model
Solve:
  min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x))²   [goodness of fit]
              + λ (Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)   [regularization]
Stochastic gradient descent to find the parameters
• Note: Both the biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them)
• λ is selected via grid-search on a validation set
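Extending the earlier SGD sketch with the bias terms (again illustrative names and code of ours, not BellKor's actual implementation):

    def sgd_epoch_biased(ratings, mu, b_user, b_item, P, Q, lam=0.1, eta=0.01):
        # One SGD pass for r_xi ~ mu + b_x + b_i + q_i . p_x.
        # b_user, b_item: 1-D NumPy arrays of user and movie biases.
        for x, i, r in ratings:
            err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
            b_user[x] += eta * (err - lam * b_user[x])   # user-bias update
            b_item[i] += eta * (err - lam * b_item[i])   # movie-bias update
            px, qi = P[x].copy(), Q[i].copy()            # read old values first
            Q[i] += eta * (err * px - lam * qi)
            P[x] += eta * (err * qi - lam * px)
        return b_user, b_item, P, Q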
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
[Diagram: three nested levels — global effects, factorization, collaborative filtering]
RECSYS 107
Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1,000) for three methods: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]
RECSYS 108
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed
Movie age:
• Users prefer new movies without any reason
• Older movies are just inherently better than newer ones
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: r_xi = μ + b_x + b_i + q_i·p_x
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x
Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Bin the movie bias; each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: p_x(t) … user preference vector on day t
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
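One plausible reading of these two parameterizations, as a hedged sketch (the exact functional forms and constants in Koren's KDD '09 paper differ in detail; the names below are ours):

    import numpy as np

    def user_bias_at(t, b_x, alpha_x, t_mean, beta=0.4):
        # Linear-trend user bias: b_x(t) = b_x + alpha_x * dev_x(t), where
        # dev_x(t) is a signed, dampened deviation from the user's mean rating day.
        dev = np.sign(t - t_mean) * np.abs(t - t_mean) ** beta
        return b_x + alpha_x * dev

    def movie_bias_at(t, b_i, b_i_bin, bin_days=70):
        # Binned movie bias: b_i(t) = b_i + b_i,Bin(t); one bin = 10 weeks = 70 days.
        return b_i + b_i_bin[int(t // bin_days)]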
RECSYS 111
Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Performance of Various Methods (RMSE)
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563
Still no prize! Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 62
Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] team)
[Scatter plot: movies and users (e.g., Gus, Dave) embedded in the same latent factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 from "serious" to "escapist". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
RECSYS 63
Dealing with Missing Entries
Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2
Regularization is needed:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]
  "error" + "length"; λ … regularization parameter
[Example utility matrix with known ratings and missing entries]
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 63
Dealing with Missing Entries Want to minimize SSE for unseen test data
Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2
Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce
1162019
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
1 3 43 5 5
4 5 533
2
2 1 3
1
λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 64
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The PrincessDiaries
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 65
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 66
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
1162019
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
The PrincessDiaries
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Performance of Various Methods (RMSE)
Global average:                 1.1296
User average:                   1.0651
Movie average:                  1.0533
Netflix (Cinematch):            0.9514
Basic Collaborative filtering:  0.94
Collaborative filtering++:      0.91
Latent factors:                 0.90
Latent factors + Biases:        0.89
Latent factors + Biases + Time: 0.876
Still no Grand Prize (0.8563). Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Plot: error (RMSE, 0.87–0.8925) vs. number of predictors in the ensemble (1–50)]
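A hedged sketch of the underlying idea: blend several predictors with linear weights fit by least squares on held-out ratings. This illustrates why adding predictors can lower RMSE; it is not the winning team's exact blending method:

import numpy as np

def fit_blend_weights(preds, truth):
    """preds: (n_ratings, n_predictors) hold-out predictions; truth: (n_ratings,)."""
    w, *_ = np.linalg.lstsq(preds, truth, rcond=None)  # least-squares weights
    return w

def blend(preds, w):
    return preds @ w               # weighted combination of the predictors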
RECSYS 66
The Effect of Regularization
[Scatter plot of movies in latent factor space. Factor 1 (x-axis): geared towards females ↔ geared towards males. Factor 2 (y-axis): serious ↔ funny. Movies plotted: The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, The Princess Diaries]
min_factors "error" + λ·"length":
min over P, Q:  Σ_{(x,i)∈training} ( rxi − qi · pxᵀ )²  +  λ ( Σ_x ‖px‖² + Σ_i ‖qi‖² )
RECSYS 68
Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:
min over P, Q:  Σ_{(x,i)∈training} ( rxi − qi · pxᵀ )²  +  λ ( Σ_x ‖px‖² + Σ_i ‖qi‖² )
Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  – P ← P − η · ∇P
  – Q ← Q − η · ∇Q
  – where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( rxi − qi · pxᵀ ) p_xf + 2λ q_if
  – Here q_if is entry f of row qi of matrix Q
• How to compute the gradient of a matrix? Compute the gradient of every element independently
• Observation: computing gradients is slow
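A sketch of one batch gradient step for this objective in numpy, to make the "computing gradients is slow" observation concrete: every step touches every rating. All names are illustrative:

import numpy as np

def gd_step(P, Q, ratings, lam=0.1, eta=0.001):
    grad_P = 2 * lam * P               # gradient of the regularization term
    grad_Q = 2 * lam * Q
    for x, i, r in ratings:            # full pass over all ratings per step
        err = r - Q[i] @ P[x]
        grad_Q[i] += -2 * err * P[x]   # contribution to grad q_i
        grad_P[x] += -2 * err * Q[i]   # contribution to grad p_x
    return P - eta * grad_P, Q - eta * grad_Q   # P <- P - eta*gradP, Q likewise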
RECSYS 69
Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD:
Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 ( rxi − qi · pxᵀ ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(rxi)
• Here q_if is entry f of row qi of matrix Q
Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
• GD:  Q ← Q − η Σ_{x,i} ∇Q(rxi)
• SGD: Q ← Q − η ∇Q(rxi)
Faster convergence:
• Need more steps, but each step is computed much faster
RECSYS 73
SGD vs. GD: Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step for GD and SGD]
GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster.
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
  – εxi = rxi − qi · pxᵀ (the prediction error)
  – qi ← qi + η ( εxi px − λ qi ) (update equation)
  – px ← px + η ( εxi qi − λ px ) (update equation)
• Two for-loops: for until convergence { for each rxi: compute the gradient, do a "step" }
η … learning rate
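A minimal runnable sketch of exactly this loop (factors only, no biases), using the slide's update equations; random initialization replaces the SVD-based one for brevity:

import numpy as np

def sgd_factorize(ratings, n_users, n_items, k=10, eta=0.01, lam=0.1, epochs=50):
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))    # user factor rows p_x
    Q = 0.1 * rng.standard_normal((n_items, k))    # item factor rows q_i
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            err = r - Q[i] @ P[x]                  # eps_xi = r_xi - q_i . p_x
            qi = Q[i].copy()                       # keep old q_i for p_x's update
            Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update equation
            P[x] += eta * (err * qi - lam * P[x])     # p_x update equation
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # toy triples
P, Q = sgd_factorize(ratings, n_users=3, n_items=2)
print(Q[1] @ P[0])  # reconstructed estimate for (user 0, item 1)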
RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009
RECSYS 76
Summary: Recommendations via Optimization
• Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• So: set the values of P and Q such that they work well on the known (user, item) ratings, and hope that these values of P and Q will predict the unknown ratings well
• This is a case where we apply optimization methods
[Example utility matrix with known ratings (1–5) and empty cells to predict]
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge: 2006-09
"We Know What You Ought To Be Watching This Summer"
Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for a 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for an 8.43% improvement
Movie rating data
[Two tables of (user, movie, score) triples: training data with scores shown, test data with scores withheld]
Overall rating distribution (from TimelyDevelopment.com)
• A third of the ratings are 4s; the average rating is 3.68
• Ratings per movie: 5,627 on average
• Ratings per user: 208 on average
• Average movie rating by movie count: more ratings go to better movies
Most loved movies
Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 67
Geared towards females
Geared towards males
serious
funny
The Effect of Regularization
The Lion King
Braveheart
Lethal Weapon
Independence Day
AmadeusThe Color Purple
Dumb and Dumber
Oceanrsquos 11
Sense and Sensibility
Factor 1
Fact
or 2
The PrincessDiaries
minfactors ldquoerrorrdquo + λ ldquolengthrdquo
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 68
Use Gradient Descent to search forthe optimal settings
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 69
Degression to the lecture notes ofRegression and Gradient Descent
by Andrew Ngrsquos Machine Learning course from Coursera
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 70
(Batch) Gradient Descent
Want to find matrices P and Q
Gradient descent Initialize P and Q (using SVD pretend missing ratings are
0) Do gradient descent
bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow
++minus sumsumsum
ii
xx
training
Txixi
QPqppqr 222
)(min λ
How to compute gradient of a matrixCompute gradient of every element independently
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 72
Stochastic Gradient Descent
Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where
120571120571119902119902119909119909119894119894 = 119909119909119909119909
minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946
nabla119928119928 119955119955119961119961119946119946
bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for
each individual rating and make a step
GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)
SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence
bull Need more steps but each step is computed much faster
1162019
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution:
• A third of ratings are 4s
• Average rating is 3.68
Ratings per movie:
• Avg. ratings/movie: 5627
Ratings per user:
• Avg. ratings/user: 208
Average movie rating by movie count:
• More ratings go to better movies
(From TimelyDevelopment.com)
Most loved movies:
Count   | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade
Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
RECSYS 102
Modeling Biases and Interactions
$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x$
μ = overall mean rating; $b_x$ = bias of user x; $b_i$ = bias of movie i; $q_i \cdot p_x$ = user-movie interaction.

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
RECSYS 103
Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: $b_x$ = -1
• Star Wars gets a mean rating 0.5 higher than the average movie: $b_i$ = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 - 1 + 0.5 = 3.2
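The baseline prediction is plain arithmetic; a tiny Python sketch reproducing the example (the function name is illustrative):

```python
def baseline_rating(mu, b_x, b_i):
    """Baseline predictor: global mean plus user and movie biases."""
    return mu + b_x + b_i

# The Star Wars example from the slide:
print(baseline_rating(3.7, -1.0, 0.5))  # ≈ 3.2
```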
RECSYS 105
Fitting the New Model
Solve:

$\min_{Q,P} \sum_{(x,i)\in R} \left(r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x)\right)^2 + \lambda\left(\sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2\right)$

(the first term is the goodness of fit; the λ term is the regularization)

Stochastic gradient descent to find the parameters.
Note: both biases $b_x$, $b_i$ as well as interactions $q_i$, $p_x$ are treated as parameters (we estimate them).
λ is selected via grid-search on a validation set. (A sketch of the extended SGD loop follows below.)
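A hedged sketch of how the earlier SGD loop extends to this model: the biases are parameters too, updated with the same error signal. Names and hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def sgd_fit_biased(ratings, n_users, n_items, k=10, eta=0.01, lam=0.1, n_epochs=20):
    """Sketch of fitting r_xi ~ mu + b_x + b_i + q_i . p_x by SGD."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    b_x = np.zeros(n_users)                            # user biases
    b_i = np.zeros(n_items)                            # movie biases
    mu = sum(r for _, _, r in ratings) / len(ratings)  # overall mean rating
    for _ in range(n_epochs):
        for x, i, r in ratings:
            err = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])
            b_x[x] += eta * (err - lam * b_x[x])       # bias updates
            b_i[i] += eta * (err - lam * b_i[i])
            q_old = Q[i].copy()
            Q[i] += eta * (err * P[x] - lam * Q[i])    # factor updates
            P[x] += eta * (err * q_old - lam * P[x])
    return mu, b_x, b_i, P, Q
```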
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns
RECSYS 107
Performance of Various Methods
[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]
RECSYS 108
Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Performance of Various Methods:
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
RECSYS 110
Temporal Biases & Factors
Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x$
Add time dependence to the biases: $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x$
Make the parameters $b_x$ and $b_i$ depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: $p_x(t)$ … user preference vector on day t (see the sketch below)
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
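One simple way to realize a binned, time-dependent movie bias $b_i(t)$, as a hedged Python sketch (the storage layout and week-based timestamps are assumptions, not from the paper):

```python
def movie_bias_at(i, week, b_i_static, b_i_bins, bin_weeks=10):
    """Hypothetical binned time-dependent movie bias b_i(t).

    b_i_static[i] is movie i's static bias; b_i_bins[i][b] is a learned
    offset for time bin b, where each bin spans 10 consecutive weeks and
    'week' counts weeks since the start of the data (both assumptions).
    """
    b = week // bin_weeks            # which 10-week bin the timestamp falls into
    return b_i_static[i] + b_i_bins[i][b]
```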
RECSYS 111
Adding Temporal Effects
[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]
RECSYS 112
Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Performance of Various Methods:
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876
Still no prize! Getting desperate... Try a "kitchen sink" approach!
RECSYS 113
Effect of ensemble size
[Figure: Error (RMSE, 0.87-0.8925) vs. number of predictors in the ensemble (1-50)]
RECSYS 73
SGD vs GD Convergence of GD vs SGD
1162019
Iterationstep
Valu
e of
the
obje
ctiv
e fu
nctio
n
GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 74
Stochastic Gradient Descent
Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are
0) Then iterate over the ratings (multiple times if necessary) and update
factorsFor each rxi
bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)
2 for loops For until convergence
bull For each rxibull Compute gradient do a ldquosteprdquo
1162019
η hellip learning rate
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 75Koren Bell Volinksy IEEE Computer 2009
1162019
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 76
Summary Recommendations via Optimization
Goal Make good recommendations Quantify goodness using SSE
So Lower SSE means better recommendations We want to make good recommendations on items
that some user has not yet seen Letrsquos set values for P and Q such that they work
well on known (user item) ratings And hope these values for P and Q will predict well
the unknown ratings
This is the a case where we apply Optimization methods
1162019
1 3 43 5 5
4 5 533
2
2 1 3
1
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 77
Backup Slides
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 78
The Netflix Challenge 2006-09
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
We Know What You OughtTo Be Watching This
Summer
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
Netflix Prizebull Training data
ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005
bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514
bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
scoremovieuser121152131434524123237682576344541568523425
22345
57664566
scoremovieuser6219617232473153414284935
745
696836
Training data Test data
Movie rating data
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
r_xi = μ + b_x + b_i + q_i^T · p_x   (user bias + movie bias + user-movie interaction)
where μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i
Baseline predictor (μ + b_x + b_i):
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition
User-movie interaction (q_i^T · p_x):
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations
RECSYS 103
Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")
RECSYS 104
Putting It All Together
Example: mean rating μ = 3.7. You are a critical reviewer, and your ratings are 1 star lower than the mean: b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5. Predicted rating for you on Star Wars:
r_xi = 3.7 − 1 + 0.5 = 3.2
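As a worked sketch, the baseline biases can be estimated directly from the data as deviations from the global mean. This is one common simple estimator, not the slide's prescribed procedure, and the names are illustrative:

```python
import numpy as np

def baseline_params(ratings):
    """ratings: iterable of (user, movie, rating) triples."""
    mu = np.mean([r for _, _, r in ratings])  # global mean rating
    by_user, by_movie = {}, {}
    for x, i, r in ratings:
        by_user.setdefault(x, []).append(r)
        by_movie.setdefault(i, []).append(r)
    b_x = {x: np.mean(rs) - mu for x, rs in by_user.items()}   # user bias
    b_i = {i: np.mean(rs) - mu for i, rs in by_movie.items()}  # movie bias
    return mu, b_x, b_i

def baseline_predict(mu, b_x, b_i, x, i):
    # r_hat = mu + b_x + b_i, e.g. 3.7 - 1 + 0.5 = 3.2 in the slide's example
    return mu + b_x.get(x, 0.0) + b_i.get(i, 0.0)
```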
RECSYS 105
Fitting the New Model
Solve:
min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i^T · p_x) )²        ← goodness of fit
        + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )      ← regularization
λ is selected via grid-search on a validation set.
Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
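A compact stochastic-gradient-descent sketch for exactly this objective (one pass shown; the learning rate, regularization value, and initialization are illustrative assumptions, not the slide's values):

```python
import numpy as np

def sgd_epoch(ratings, mu, b_x, b_i, P, Q, lr=0.005, lam=0.05):
    """One SGD pass over (user, movie, rating) triples for
    min Σ (r - (mu + b_x + b_i + q_i·p_x))² + λ(||q||² + ||p||² + b_x² + b_i²)."""
    for x, i, r in ratings:
        err = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])  # prediction error
        b_x[x] += lr * (err - lam * b_x[x])             # update user bias
        b_i[i] += lr * (err - lam * b_i[i])             # update movie bias
        qi = Q[i].copy()                                # keep old q_i for p_x's update
        Q[i] += lr * (err * P[x] - lam * Q[i])
        P[x] += lr * (err * qi - lam * P[x])

# Illustrative setup, with k latent factors and small random init:
# P = 0.1 * np.random.randn(n_users, k); Q = 0.1 * np.random.randn(n_movies, k)
# b_x = np.zeros(n_users); b_i = np.zeros(n_movies); mu = global mean rating
```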
RECSYS 106
BellKor Recommender System: the winner of the Netflix Challenge
Multi-scale modeling of the data: combine top-level "regional" modeling with a refined local view
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extracting local patterns
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
Overall rating distribution
bull Third of ratings are 4sbull Average rating is 368
From TimelyDevelopmentcom
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
ratings per movie
bull Avg ratingsmovie 5627
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
ratings per user
bull Avg ratingsuser 208
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
Average movie rating by movie count
bull More ratings to better movies
From TimelyDevelopmentcom
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
Most loved movies
CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
Challenges
bull Size of datandash Scalabilityndash Keeping data in memory
bull Missing datandash 99 percent missingndash Very imbalanced
bull Avoiding overfittingbull Test and training data differ significantly movie 16322
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully
tuned model
The BellKor recommender system
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 101
Extending Latent Factor Model to Include Biases
1162019
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 102
Modeling Biases and Interactions
μ = overall mean rating bx = bias of user x bi = bias of movie i
user-movie interactionmovie biasuser bias
User-Movie interaction Characterizes the matching between
users and movies Attracts most research in the field Benefits from algorithmic and
mathematical innovations
Baseline predictor Separates users and movies Benefits from insights into userrsquos
behavior Among the main practical
contributions of the competition
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 103
Baseline Predictor We have expectations on the rating by
user x of movie i even without estimating xrsquos attitude towards movies like i
ndash Rating scale of user xndash Values of other ratings user
gave recently (day-specific mood anchoring multi-user accounts)
ndash (Recent) popularity of movie indash Selection bias related to
number of ratings user gave on the same day (ldquofrequencyrdquo)
1162019
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 104
Putting It All Together
Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star
lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than
average movie bi = + 05 Predicted rating for you on Star Wars
= 37 - 1 + 05 = 32
1162019
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 105
Fitting the New Model
Solve
Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated
as parameters (we estimate them)
1162019
regularization
goodness of fit
λ is selected via grid-search on a validation set
( )
++++
+++minus
sumsumsumsum
sumisin
ii
xx
xx
ii
Rix
Txiixxi
PQ
bbpq
pqbbr
2222
2
)(
)(min
λ
micro
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 106
BellKor Recommender System The winner of the Netflix Challenge
Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global
bull Overall deviations of usersmovies Factorization
bull Addressing ldquoregionalrdquo effects Collaborative filtering
bull Extract local patterns
Global effects
Factorization
Collaborative filtering
RECSYS 107
Performance of Various Methods
0885
089
0895
09
0905
091
0915
092
1 10 100 1000
RMSE
Millions of parameters
CF (no time bias)
Basic Latent Factors
Latent Factors w Biases
1162019
RECSYS 108
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019
RECSYS 109
Temporal Biases Of Users
Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed
Movie age Users prefer new movies
without any reasons Older movies are just
inherently better than newer ones
1162019
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 110
Temporal Biases amp Factors
Original modelrxi = micro +bx + bi + qi middotpx
T
Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px
T
Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to factors px(t)hellip user preference vector on day t
Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
RECSYS 111
Adding Temporal Effects
1162019
0875
088
0885
089
0895
09
0905
091
0915
092
1 10 100 1000 10000
RMSE
Millions of parameters
CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF
RECSYS 112
Grand Prize 08563
Netflix 09514
Movie average 10533User average 10651
Global average 11296
Performance of Various Methods
Basic Collaborative filtering 094
Latent factors 090
Latent factors+Biases 089
Collaborative filtering++ 091
1162019 112
Latent factors+Biases+Time 0876
Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach
RECSYS 113
Effect of ensemble size
087
08725
0875
08775
088
08825
0885
08875
089
08925
1 8 15 22 29 36 43 50
Predictors
Erro
r - R
MSE
RECSYS 115
Standing on June 26th 2009
June 26th submission triggers a 30-day "last call"
RECSYS 116
The Last 30 Days
Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score (over 10% improvement).
BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.
Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, but this alerts the other team of your latest score.
RECSYS 117
24 Hours from the Deadline
Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.
24 hours before the deadline... a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.
Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.
Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later... and everyone waits...
RECSYS 118
RECSYS 119
Million $ Awarded Sept 21st 2009
RECSYS 120
Further reading Y Koren Collaborative filtering with temporal
dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom
1162019
Data underlying the "Effect of ensemble size" chart (slide 113): number of blended predictors vs. RMSE.
predictors | RMSE
1 | 0.889067
2 | 0.880586
3 | 0.878345
4 | 0.877715
5 | 0.876668
6 | 0.876179
7 | 0.875557
8 | 0.875013
9 | 0.874477
10 | 0.874209
11 | 0.873997
12 | 0.873691
13 | 0.873569
14 | 0.873352
15 | 0.873212
16 | 0.873035
17 | 0.872942
18 | 0.87283
19 | 0.872748
20 | 0.872654
21 | 0.872563
22 | 0.872468
23 | 0.872382
24 | 0.872275
25 | 0.87222
26 | 0.872201
27 | 0.872135
28 | 0.872086
29 | 0.872021
30 | 0.871945
31 | 0.871898
32 | 0.87183
33 | 0.871756
34 | 0.871722
35 | 0.871665
36 | 0.871632
37 | 0.871585
38 | 0.871558
39 | 0.871524
40 | 0.871482
41 | 0.871452
42 | 0.871428
43 | 0.871398
44 | 0.871346
45 | 0.871312
46 | 0.871287
47 | 0.871262
48 | 0.871242
49 | 0.871236
50 | 0.871197
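The shape of this table (large gains from the first few predictors, diminishing returns beyond roughly 20) is characteristic of linear blending. Below is a toy sketch that reproduces the qualitative curve, with synthetic predictors standing in for the real models; in the competition the blend weights were fit on a held-out probe set, whereas here, for brevity, fitting and evaluation share one set.

```python
import numpy as np

def blend_rmse(preds, y, k):
    """RMSE of a least-squares linear blend of the first k predictors."""
    X = preds[:, :k]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # blend weights
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

# Synthetic stand-ins: 50 noisy predictors of a common rating signal.
rng = np.random.default_rng(0)
y = rng.normal(3.6, 1.0, size=5000)                     # "true" ratings
preds = y[:, None] + rng.normal(0.0, 0.9, size=(5000, 50))

for k in (1, 8, 15, 29, 50):
    print(f"{k:2d} predictors -> blend RMSE {blend_rmse(preds, y, k):.4f}")
```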