
RECSYS 1

Recommender Systems (from IERG 4030 by Prof. Wing C. Lau)

RECSYS 2

Acknowledgements: The slides used in this chapter are adapted from:

CS246: Mining Massive Datasets, by Jure Leskovec, Stanford University

Yehuda Koren, Robert Bell & Chris Volinsky, "Lessons from the Netflix Prize", AT&T Research

Some slides and plots borrowed from Yehuda Koren, Robert Bell and Padhraic Smyth,

with the authors' permission. All copyrights belong to the original authors of the material.

RECSYS 3

Content-based Systems amp Collaborative Filtering

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 4

High Dimensional Data

Course map (1/16/2019):

• High-dim data: locality-sensitive hashing, clustering, dimensionality reduction
• Graph data: PageRank, SimRank, community detection, spam detection
• Infinite data: filtering data streams, web advertising, queries on streams
• Machine learning: SVM, decision trees, perceptron, kNN
• Apps: recommender systems, association rules, duplicate document detection

RECSYS 5

Example Recommender Systems

Customer X: buys a Metallica CD, buys a Megadeth CD.

Customer Y: does a search on Metallica. The recommender system suggests Megadeth, from the data collected about customer X.

RECSYS 6

Recommendations

Items: products, web sites, blogs, news items, … reached either via Search or via Recommendations.

Examples:

RECSYS 8

The Long Tail

Source: Chris Anderson (2004)

RECSYS 9

Physical vs Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more

RECSYS 11

Formal Model

X = set of Customers

S = set of Items

Utility function u: X × S → R
R = set of ratings
R is a totally ordered set, e.g., 0–5 stars, or a real number in [0,1]

RECSYS 12

Utility Matrix

        Avatar  LOTR  Matrix  Pirates
Alice     1             0.2
Bob              0.5             0.3
Carol    0.2             1
David                            0.4

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: How to collect the data in the utility matrix?

(2) Extrapolating unknown ratings from the known ones: Mainly interested in high unknown ratings
• We are not interested in knowing what you don't like, but what you like

(3) Evaluating extrapolation methods: How to measure the success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings: Explicit

Ask people to rate items. Doesn't work well in practice: people can't be bothered.

Implicit: Learn ratings from user actions
• E.g., a purchase implies a high rating. What about low ratings?

RECSYS 15

(2) Extrapolating Utilities: Key problem: matrix U is sparse

Most people have not rated most items. Cold start:
• New items have no ratings
• New users have no history

Three approaches to recommender systems: 1) Content-based, 2) Collaborative Filtering
• Memory-based: User-based Collaborative Filtering, Item-based Collaborative Filtering
• Latent factor based

RECSYS 16

Content-based Recommender Systems


RECSYS 17

Content-based Recommendations: Main idea: recommend to customer x items similar to previous items rated highly by x

Examples:

Movie recommendations: recommend movies with the same actor(s), director, genre, …

Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Diagram: from the items the user likes, build item profiles; aggregate them into a user profile; match the user profile against the catalog to recommend new items.]

RECSYS 19

Item Profiles: For each item, create an item profile.

A profile is a set (vector) of features. Movies: author, title, actor, director, … Text: set of "important" words in the document.

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
• Term … Feature
• Document … Item

RECSYS 20

Sidenote: TF-IDF

fij = frequency of term (feature) i in doc (item) j
TFij = fij / maxk fkj  (note: we normalize TF to discount for "longer" documents)

ni = number of docs that mention term i
N = total number of docs
IDFi = log(N / ni)

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
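The scoring heuristic above can be sketched in a few lines of plain Python (the three-document corpus is a made-up illustration):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights w_ij = TF_ij * IDF_i for a list of tokenized docs.

    TF is normalized by the most frequent term in the doc (discounts long docs);
    IDF_i = log2(N / n_i), where n_i = number of docs mentioning term i.
    """
    N = len(docs)
    df = {}  # n_i: document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        max_f = max(counts.values())
        weights.append({term: (f / max_f) * math.log2(N / df[term])
                        for term, f in counts.items()})
    return weights

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "apple"]]
w = tf_idf(docs)
# "banana" appears in 2 of 3 docs -> IDF = log2(3/2); its TF in doc 0 is 1/2
print(round(w[0]["banana"], 3))
```

The item (doc) profile would then keep only the highest-weighted terms of each row.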

RECSYS 21

User Profiles and Prediction

User profile possibilities: a weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, …

Prediction heuristic: given user profile x and item profile i, estimate

u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
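The cosine prediction heuristic can be sketched as follows (the profile vectors are made-up illustrative values, e.g. four binary or weighted features):

```python
import math

def cosine(x, i):
    """u(x, i) = cos(x, i) = x.i / (|x| * |i|) between user and item profiles."""
    dot = sum(a * b for a, b in zip(x, i))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in i))
    return dot / norm

user_profile = [0.8, 0.1, 0.0, 0.5]   # e.g. weighted average of rated item profiles
item_profile = [1.0, 0.0, 0.0, 1.0]
print(round(cosine(user_profile, item_profile), 3))  # 0.969
```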

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can list the content features that caused an item to be recommended

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard: e.g., images, movies, music

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: recommend items based on the past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let rx be the vector of user x's ratings

Jaccard similarity measure: sim(x, y) = |rx ∩ ry| / |rx ∪ ry|, treating rx, ry as sets of rated items. Problem: ignores the value of the rating.

Cosine similarity measure:
sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)
Problem: treats missing ratings as "negative".

Pearson correlation coefficient: let Sxy = items rated by both users x and y
sim(x, y) = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / [ √(Σs∈Sxy (rxs − r̄x)²) · √(Σs∈Sxy (rys − r̄y)²) ]
where r̄x, r̄y are the average ratings of x and y.

Example: rx = [1, _, _, 1, 3], ry = [1, _, 2, 2, _]
• as sets (indices of rated items): rx = {1, 4, 5}, ry = {1, 3, 4}
• as points: rx = (1, 0, 0, 1, 3), ry = (1, 0, 2, 2, 0)

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322; considers missing ratings as "negative". Solution: subtract the (row) mean.

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > −0.559. Notice: cosine similarity is correlation when the data is centered at 0.
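A minimal sketch of the centered ("subtract the row mean") cosine, with rating vectors chosen to reproduce the slide's numbers sim(A, B) = 0.092 and sim(A, C) = −0.559:

```python
import math

def centered_cosine(rx, ry):
    """Pearson-style similarity: subtract each user's mean over rated items
    (missing ratings, marked None, become 0 after centering), then take the
    cosine of the centered vectors."""
    def center(r):
        rated = [v for v in r if v is not None]
        mean = sum(rated) / len(rated)
        return [v - mean if v is not None else 0.0 for v in r]
    cx, cy = center(rx), center(ry)
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    return dot / norm

# Ratings of users A, B, C over 7 movies; None marks a missing rating
A = [4, None, None, 5, 1, None, None]
B = [5, 5, 4, None, None, None, None]
C = [None, None, None, 2, 4, 5, None]
print(round(centered_cosine(A, B), 3))  # 0.092
print(round(centered_cosine(A, C), 3))  # -0.559
```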

RECSYS 30

Rating Predictions

Let rx be the vector of user x's ratings

Let N be the set of the k users most similar to x who have rated item i

Prediction for item i of user x:

Option 1 (simple average): r̂xi = (1/k) Σy∈N ryi

Option 2 (similarity-weighted): r̂xi = Σy∈N sxy · ryi / Σy∈N sxy

Shorthand: sxy = sim(x, y). Many other tricks possible…
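The similarity-weighted option can be sketched as follows (the neighbor ratings and similarities are made-up illustrative values):

```python
def predict(neighbor_ratings, sims):
    """Weighted-average prediction: r_xi = sum_y s_xy * r_yi / sum_y s_xy,
    over the k most similar users y who have rated item i."""
    num = sum(s * r for s, r in zip(sims, neighbor_ratings))
    den = sum(sims)
    return num / den

ratings = [4, 5, 3]        # the neighbors' ratings of item i
sims = [0.9, 0.8, 0.3]     # their similarities to user x
print(round(predict(ratings, sims), 2))  # 4.25
```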

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering. Another view: item-item
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

sij … similarity of items i and j
rxj … rating of user x on item j
N(i; x) … set of items rated by x that are similar to i

RECSYS 32

Item-Item CF (|N|=2)

movies \ users   1  2  3  4  5  6  7  8  9 10 11 12
      1          1     3        5        5     4
      2                5  4        4       2  1  3
      3          2  4     1  2     3     4  3  5
      4             2  4     5        4        2
      5                4  3  4  2              2  5
      6          1     3     3        2        4

(blank = unknown rating; known ratings are between 1 and 5)

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix as on the previous slide]

Goal: estimate the rating of movie 1 by user 5.

RECSYS 34

Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating mi from each movie i:
   m1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
   centered row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows:
   sim(1, m) for m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

Compute the similarity weights: s1,3 = 0.41, s1,6 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

Predict by taking the weighted average (user 5 rated movie 3 with 2 and movie 6 with 3):

r1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
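The arithmetic of this worked example in a few lines of Python:

```python
# Item-item prediction for movie 1, user 5 (|N| = 2): the two most similar
# movies rated by user 5 are movie 3 (s = 0.41, rating 2) and movie 6
# (s = 0.59, rating 3).
sims = {3: 0.41, 6: 0.59}
ratings_by_user5 = {3: 2, 6: 3}

r_15 = sum(sims[j] * ratings_by_user5[j] for j in sims) / sum(sims.values())
print(round(r_15, 1))  # 2.6
```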

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity sij of items i and j

Select the K nearest neighbors N(i; x): the set of items most similar to item i that were rated by x

Estimate the rating rxi as the weighted average:

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

N(i; x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Utility matrix example with users Alice, Bob, Carol, David and movies Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine their predictions, perhaps using a linear model

Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix of users × movies, with the known ratings filled in]

RECSYS 43

Evaluation

[The same users × movies matrix, with some known ratings withheld as the Test Data Set]

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

Root-mean-square error (RMSE): √( Σxi (r̂xi − rxi)² / N ), where r̂xi is the predicted rating and rxi is the true rating of x on i

Precision at top 10: % of relevant items among the top 10

Rank correlation: Spearman's correlation between the system's and the user's complete rankings

Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
  TPR (a.k.a. Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
  See https://en.wikipedia.org/wiki/Precision_and_recall
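Minimal sketches of two of these metrics, RMSE and precision-at-k (the sample predictions and relevance sets are made-up):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are actually relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

pred = [3.5, 4.0, 2.0, 5.0]
truth = [4, 4, 1, 5]
print(round(rmse(pred, truth), 3))  # 0.559
print(round(precision_at_k(["a", "b", "c"], {"a", "c", "d"}, k=3), 2))  # 0.67
```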

RECSYS 45

Problems with Error Measures: A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions

In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for the others

RECSYS 46

Collaborative Filtering: Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.

Too expensive to do at runtime. Could pre-compute, using clustering as an approximation. Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction

RECSYS 47

Tip: Add Data. Leverage all the data.

Don't try to reduce data size in an effort to make fancy algorithms work. Simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[2-D map of movies along two latent axes, "geared towards females ↔ geared towards males" and "serious ↔ funny": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility placed in this plane]

RECSYS 50

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Sparse matrix R: ~480,000 users × 17,700 movies, with known ratings 1–5 in some entries]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies) split into a Training Data Set and a Test Data Set of withheld ratings]

SSE = Σ(i,x)∈R (r̂xi − rxi)², where r̂xi is the predicted rating and rxi is the true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT

R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[R (items × users) ≈ Q (items × f factors) · PT (f factors × users); compare with SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂xi = qi · px^T = Σf qif · pxf

where qi = row i of Q, and px = column x of PT
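A one-line illustration of the estimate r̂xi = qi · px (the 2-factor vectors are made-up):

```python
# The estimated rating is the dot product of item i's row of Q with
# user x's column of P^T (illustrative 2-factor vectors).
q_i = [1.2, -0.5]   # item factors, e.g. on two latent axes
p_x = [1.5, 0.8]    # user factors on the same latent axes

r_xi = sum(qf * pf for qf, pf in zip(q_i, p_x))
print(round(r_xi, 2))  # 1.4
```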


RECSYS 57

Latent Factor Models

[The same 2-D movie map: the latent factors place movies along "geared towards females ↔ males" (Factor 1) and "serious ↔ funny" (Factor 2)]


RECSYS 59

Recap SVD

Remember SVD:
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

SVD gives the minimum reconstruction error (SSE): minU,V,Σ Σij (Aij − (U Σ VT)ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · px^T

The sum goes over all entries. But we are not done yet: our R has missing entries!
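A quick NumPy sketch of this identity on a toy matrix with no missing entries (in the Netflix analogy, Q = U and PT = Σ VT):

```python
import numpy as np

# Toy ratings matrix (fully observed, so plain SVD applies)
A = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [0.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: the best rank-k approximation in the SSE sense
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# In the Netflix analogy: R ~ Q . P^T with Q = U and P^T = Sigma V^T
Q = U[:, :k]
Pt = np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(A_k, Q @ Pt))  # True
```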

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min P,Q Σ(i,x)∈R (rxi − qi · px^T)²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

RECSYS 61

Dealing with Missing Entries: We want to minimize SSE for the unseen test data

Idea: Minimize SSE on the training data. We want a large f (# of factors) to capture all the signals. But SSE on the test data begins to rise for f > 2!

Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min P,Q Σtraining (rxi − qi px^T)² + λ [ Σx ||px||² + Σi ||qi||² ]

λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the BellKor Team)

[2-D movie map along "geared towards females ↔ males" and "serious ↔ escapist", with users (Gus, Dave) embedded in the same latent space as the movies]


RECSYS 64

The Effect of Regularization

minfactors "error" + λ "length":

min P,Q Σtraining (rxi − qi px^T)² + λ [ Σx ||px||² + Σi ||qi||² ]

[2-D movie map (Factor 1 vs. Factor 2); regularization shrinks the factors of sparsely-rated movies toward the origin]


RECSYS 68

Use Gradient Descent to search for the optimal settings: we want to find matrices P and Q

Gradient descent:
• Initialize P and Q (e.g., using SVD, pretending the missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σx,i −2(rxi − qi px^T) pxf + 2λ qif
  (here qif is entry f of row qi of matrix Q)

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow.

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

(Batch) gradient descent, as on the previous slide: initialize P and Q (using SVD, pretending the missing ratings are 0), then repeatedly update P ← P − η·∇P and Q ← Q − η·∇Q, where ∇Q = [∇qif] with ∇qif = Σx,i −2(rxi − qi px^T) pxf + 2λ qif.

Observation: Computing the full gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇qif], where ∇qif = Σx,i −2(rxi − qi px^T) pxf + 2λ qif = Σx,i ∇Q(rxi): the full gradient is a sum of per-rating contributions ∇Q(rxi)

GD: Q ← Q − η · Σrxi ∇Q(rxi)

SGD: Q ← Q − η · ∇Q(rxi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step. Faster convergence!
• Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (e.g., using SVD, pretending the missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
  εxi = rxi − qi · px^T  (derivative of the "error")
  qi ← qi + η (εxi px − λ qi)  (update equation)
  px ← px + η (εxi qi − λ px)  (update equation)

2 for loops: for until convergence: for each rxi: compute the gradient, do a "step"

η … learning rate
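A compact sketch of this SGD loop in plain Python (the tiny rating set and the hyperparameter values η, λ are made-up toy choices, not from the slides):

```python
import random

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.1, epochs=500):
    """SGD for R ~ Q . P^T on the known ratings only.

    For each known r_xi:  e_xi = r_xi - q_i . p_x
                          q_i <- q_i + eta * (e_xi * p_x - lam * q_i)
                          p_x <- p_x + eta * (e_xi * q_i - lam * p_x)
    """
    random.seed(0)
    P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            e = r - sum(Q[i][k] * P[x][k] for k in range(f))
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (e * pxk - lam * qik)
                P[x][k] += eta * (e * qik - lam * pxk)
    return P, Q

# made-up (user, item, rating) triples: 2 users, 3 items
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=3)
r_hat_00 = sum(Q[0][k] * P[0][k] for k in range(2))
print(round(r_hat_00, 2))  # close to the known rating 5 (shrunk slightly by lambda)
```

Note the regularization term pulls the reconstruction slightly below the observed rating, as the slides' discussion of shrinkage predicts.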

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen. Let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006–09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize:
• Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data, 2000–2005
• Test data: the last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix Cinematch RMSE: 0.9514
• Competition: 2,700+ teams; $1 million grand prize for a 10% improvement on the Cinematch result; $50,000 2007 progress prize for an 8.43% improvement

Movie rating data

Training data: (user, movie, score) triples. Test data: (user, movie, ?) pairs whose scores must be predicted.

Overall rating distribution (from TimelyDevelopment.com):
• A third of the ratings are 4s
• The average rating is 3.68

# ratings per movie: avg. ratings/movie: 5,627

# ratings per user: avg. ratings/user: 208

Average movie rating by movie count: more ratings go to better movies

Most loved movies:

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges:
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions: r̂xi = μ + bx + bi + qi · px^T

μ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · px^T = user-movie interaction

Baseline predictor (μ + bx + bi):
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

• The rating scale of user x
• The values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
• The (recent) popularity of movie i
• Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, bx = −1. Star Wars gets a mean rating 0.5 higher than the average movie, bi = +0.5.

Predicted rating for you on Star Wars: = 3.7 − 1 + 0.5 = 3.2
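The worked example as code:

```python
def baseline(mu, b_x, b_i):
    """Baseline estimate: b_xi = mu + b_x + b_i."""
    return mu + b_x + b_i

# mean rating 3.7; critical user (-1 star); Star Wars rated +0.5 above average
print(baseline(3.7, -1.0, 0.5))  # 3.2
```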

RECSYS 105

Fitting the New Model

Solve:

min Q,P Σ(x,i)∈R (rxi − (μ + bx + bi + qi px^T))² + λ [ Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² ]

(the first term is the goodness of fit, the second the regularization)

Stochastic gradient descent to find the parameters. Note: both the biases bx, bi and the interactions qi, px are treated as parameters (we estimate them).

λ is selected via grid-search on a validation set.

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), basic latent factors, and latent factors with biases]

RECSYS 108

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases of Users

Sudden rise in the average movie rating (early 2004): improvements in Netflix; GUI improvements; the meaning of a rating changed

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: rxi = μ + bx + bi + qi · px^T

Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · px^T

Make the parameters bx and bi depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10000) for CF (no time bias), basic latent factors, CF (time bias), latent factors w/ biases, + linear time factors, + per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Reference RMSE points:
  Global average: 1.1296
  User average: 1.0651
  Movie average: 1.0533
  Netflix (Cinematch): 0.9514
  Grand Prize target: 0.8563

Method RMSE:
  Basic collaborative filtering: 0.94
  Collaborative filtering++: 0.91
  Latent factors: 0.90
  Latent factors + biases: 0.89
  Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!
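A "kitchen sink" blend of many predictors can be sketched as a least-squares fit of linear weights on held-out data; everything below (predictor count, noise levels, seed) is synthetic for illustration:

```python
import numpy as np

# Synthetic held-out ratings and three hypothetical predictors with
# different error characteristics.
rng = np.random.default_rng(0)
truth = rng.uniform(1, 5, size=200)
preds = np.stack([truth + rng.normal(0, 0.9, 200),
                  truth + rng.normal(0, 1.0, 200),
                  truth + rng.normal(0, 1.1, 200)], axis=1)

# Fit blending weights by least squares on the held-out set.
weights, *_ = np.linalg.lstsq(preds, truth, rcond=None)
blend = preds @ weights

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# The blend cannot do worse than any single predictor on the fit set.
print([round(rmse(preds[:, j], truth), 3) for j in range(3)],
      round(rmse(blend, truth), 3))
```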

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE, 0.87–0.8925) vs. number of predictors in the ensemble (1–50); RMSE falls from ≈0.889 with one predictor to ≈0.871 with 50, with most of the gain coming from the first ten predictors]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: A group of other teams on the leaderboard forms a new team that relies on combining their models; it quickly also gets a qualifying score over 10%

BellKor: Continue to get small improvements in their scores; realize that they are in direct competition with Ensemble

Strategy: Both teams carefully monitor the leaderboard; the only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …And everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 2: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 2

Acknowledgements The slides used in this chapter are adapted from

CS246 Mining Massive Data-sets by Jure Leskovec Stanford University

Yehuda Koren Robert Bell amp ChrisVolinsky ldquoLessons from the Netflix Prizerdquo ATampT Research

Some slides and plots borrowed from Yehuda Koren Robert Bell and Padhraic Smyth

with the authorrsquos permission All copyrights belong to the original author of the material

RECSYS 3

Content-based Systems amp Collaborative Filtering

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 4

High Dimensional Data

High dim data

Locality sensitive hashing

Clustering

Dimensionality

reduction

Graph data

PageRank SimRank

Community Detection

Spam Detection

Infinite data

Filtering data

streams

Web advertising

Queries on streams

Machine learning

SVM

Decision Trees

Perceptron kNN

Apps

Recommender systems

Association Rules

Duplicate document detection

1162019

RECSYS 5

Example Recommender Systems

Customer X Buys Metallica CD Buys Megadeth CD

Customer Y Does search on Metallica Recommender system

suggests Megadeth from data collected about customer X

RECSYS 6

Recommendations

Items

Search Recommendations

Products web sites blogs news items hellip

Examples

RECSYS 8

The Long Tail

Source Chris Anderson (2004)

1162019 8

RECSYS 9

Physical vs Online

1162019 9Read httpwwwwiredcomwiredarchive1210tailhtml to learn more

RECSYS 11

Formal Model

X = set of Customers

S = set of Items

Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]

RECSYS 12

Utility Matrix

04102

0305021

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

RECSYS 13

Key Problems

(1) Gathering ldquoknownrdquo ratings for matrix How to collect the data in the utility matrix

(2) Extrapolate unknown ratings from the known ones Mainly interested in high unknown ratings

bull We are not interested in knowing what you donrsquot like but what you like

(3) Evaluating extrapolation methods How to measure successperformance of

recommendation methods

RECSYS 14

(1) Gathering Ratings Explicit

Ask people to rate items Doesnrsquot work well in practice ndash people

canrsquot be bothered

Implicit Learn ratings from user actions

bull Eg purchase implies high rating What about low ratings

RECSYS 15

(2) Extrapolating Utilities Key problem matrix U is sparse

Most people have not rated most items Cold start

bull New items have no ratingsbull New users have no history

Three approaches to recommender systems 1) Content-based 2) Collaborative Filtering

bull Memory-basedbull User-based Collaborative Filteringbull Item-based Collaborative Filtering

bull Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites blogs news Recommend other sites with ldquosimilarrdquo content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild

1162019

RECSYS 19

Item Profiles For each item create an item profile

Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document

How to pick important features Usual heuristic from text mining is TF-IDF

(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item

1162019

RECSYS 20

Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j

ni = number of docs that mention term iN = total number of docs

TF-IDF score wij = TFij times IDFi

Doc profile = set of words with highest TF-IDF scores together with their scores

1162019

Note we normalize TFto discount for ldquolongerrdquo documents

RECSYS 21

User Profiles and Prediction

User profile possibilities Weighted average of rated item profiles Variation weight by difference from average

rating for item hellip

Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946

| 119961119961 |sdot| 119946119946 |

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations Can provide explanations of recommended items by listing content-

features that caused an item to be recommended

1162019

RECSYS 23

Cons Content-based Approach ndash Finding the appropriate features is hard

Eg images movies music

ndash Overspecialization Never recommends items outside userrsquos

content profile People might have multiple interests Unable to exploit quality judgments of other users

ndash Recommendations for new users How to build a user profile

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1 Consider user x

2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)

3 Recommend items to xbased on the weighted ratings of items by users in N

RECSYS 28

Similar Users Let rx be the vector of user xrsquos ratings

Jaccard similarity measure Problem Ignores the value of the rating

Cosine similarity measure

sim(x y) = cos(rx ry) = 119903119903119909119909sdot119903119903119910119910||119903119903119909119909||sdot||119903119903119910119910||

Problem Treats missing ratings as ldquonegativerdquo Pearson correlation coefficient

Sxy = items rated by both users x and y

1162019

rx = [ _ _ ]ry = [ _ _]

rx ry as setsrx = 1 4 5ry = 1 3 4

rx ry as pointsrx = 1 0 0 1 3ry = 1 0 2 2 0

rx ry hellip avgrating of x y

RECSYS 29

Similarity Metric

Intuitively we want sim(A B) gt sim(A C)

Jaccard similarity 15 lt 24

Cosine similarity 0386 gt 0322 Considers missing ratings as ldquonegativerdquo Solution subtract the (row) mean

1162019

sim AB vs AC0092 gt -0559Notice cosine sim is correlation when data is centered at 0

Cosine sim

RECSYS 30

Rating Predictions

Let rx be the vector of user xrsquos ratings

Let N be the set of k users most similar to xwho have rated item i

Prediction for item i of user x 119903119903119909119909119909119909 = 1

119896119896sum119910119910isin119873119873 119903119903119910119910119909119909

119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910

Other options

Many other tricks possiblehellip1162019

Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model

1162019

sumsum

isin

isinsdot

=)(

)(

xiNj ij

xiNj xjijxi s

rsr

sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

121110987654321

455311

3124452

534321423

245424

5224345

423316

users

mov

ies

- unknown rating - rating between 1 to 5

RECSYS 33

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

- estimate rating of movie 1 by user 5

1162019

mov

ies

RECSYS 34

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Neighbor selectionIdentify movies similar to movie 1 rated by user 5

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i

m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]

2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Compute similarity weightss13=041 s16=059

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

RECSYS 36

Item-Item CF (|N|=2)

121110987654321

45526311

3124452

534321423

245424

5224345

423316

users

Predict by taking weighted average

r15 = (0412 + 0593) (041+059) = 26

mov

ies

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

sumsum

isin

isinsdot

=)(

)(ˆxiNj ij

xiNj xjijxi s

rsr

N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j

RECSYS 38

Item-Item vs User-User

04180109

0305081

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

1162019

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i

Precision at top 10 bull of those in top 10

Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete

rankings Another approach 01 model

Coveragebull Number of itemsusers for which system can make predictions

Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve

Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

[Utility matrix example: known ratings 1–5 with many missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data and test data, each with columns (user, movie, score); the test-set scores are withheld]

Training data | Test data

Movie rating data

Overall rating distribution

• Third of ratings are 4s • Average rating is 3.68

From TimelyDevelopmentcom

ratings per movie

• Avg #ratings/movie: 5,627

ratings per user

• Avg #ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i (baseline predictor: user bias + movie bias); q_i · p_x^T = user-movie interaction

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2
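The baseline arithmetic is simple enough to express directly; a trivial sketch (the function name is mine):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: overall mean plus user bias plus item bias."""
    return mu + b_x + b_i

# The slide's example: mu = 3.7, critical reviewer b_x = -1, Star Wars b_i = +0.5
print(round(baseline(3.7, -1.0, 0.5), 1))  # 3.2
```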

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )² + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Figure: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones.

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time: (1) parameterize time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

1162019

[Figure: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors in the ensemble; the error falls steadily as predictors are added]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team. Relies on combining their models. Quickly also gets a qualifying score over 10%.

BellKor: continue to get small improvements in their scores. Realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions.

• This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day. Only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

[Chart data: ensemble RMSE vs. number of predictors, falling from 0.8891 with 1 predictor to 0.8712 with 50]
RECSYS 12

Utility Matrix

[Utility matrix example: rows Alice, Bob, Carol, David; columns Avatar, LOTR, Matrix, Pirates; sparse ratings with most entries missing]

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: How to collect the data in the utility matrix?

(2) Extrapolate unknown ratings from the known ones: mainly interested in high unknown ratings.

• We are not interested in knowing what you don't like, but what you like.

(3) Evaluating extrapolation methods: How to measure success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings

Explicit: ask people to rate items. Doesn't work well in practice — people can't be bothered.

Implicit: learn ratings from user actions.

• E.g., purchase implies a high rating. What about low ratings?

RECSYS 15

(2) Extrapolating Utilities

Key problem: matrix U is sparse. Most people have not rated most items. Cold start:

• New items have no ratings
• New users have no history

Three approaches to recommender systems: 1) Content-based; 2) Collaborative filtering

• Memory-based
  • User-based collaborative filtering
  • Item-based collaborative filtering
• Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild

1162019

RECSYS 19

Item Profiles: For each item, create an item profile.

A profile is a set (vector) of features: Movies: author, title, actor, director, … Text: set of "important" words in the document.

How to pick important features? The usual heuristic from text mining is TF-IDF
(Term Frequency × Inverse Document Frequency)
• Term … Feature
• Document … Item

1162019

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j

TF_ij = f_ij / max_k f_kj (note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i; N = total number of docs

IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores
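The profile construction can be sketched in Python. This assumes the common normalization TF_ij = f_ij / max_k f_kj and IDF_i = log(N / n_i) with natural log; the slides do not fix the log base, and the document/term names below are invented.

```python
import math
from collections import Counter

def tfidf_profiles(docs):
    """Build a TF-IDF profile {term: weight} for each doc (a list of terms)."""
    N = len(docs)
    df = Counter()                       # n_i: number of docs mentioning term i
    for doc in docs:
        df.update(set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)                 # f_ij: raw term frequencies
        max_f = max(f.values())          # normalize by the doc's most frequent term
        profiles.append({t: (c / max_f) * math.log(N / df[t]) for t, c in f.items()})
    return profiles

docs = [["star", "wars", "space"], ["star", "trek", "space"], ["cooking", "pasta"]]
prof = tfidf_profiles(docs)
# "wars" appears in only one doc, so it outweighs "space" in doc 0
```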

RECSYS 21

User Profiles and Prediction

User profile possibilities: weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, …

Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (‖x‖ · ‖i‖)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard: e.g., images, movies, music

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1. Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let r_x be the vector of user x's ratings.

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|, treating r_x, r_y as sets of rated items. Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (‖r_x‖ · ‖r_y‖). Problem: treats missing ratings as "negative".

Pearson correlation coefficient: S_xy = items rated by both users x and y:
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √Σ_{s∈S_xy}(r_xs − r̄_x)² · √Σ_{s∈S_xy}(r_ys − r̄_y)² )

Example: r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}; as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]. r̄_x, r̄_y … avg rating of x, y.

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C).

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322. Considers missing ratings as "negative". Solution: subtract the (row) mean.

Centered cosine sim(A, B) vs. sim(A, C): 0.092 > −0.559. Notice: cosine sim is correlation when the data is centered at 0.
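The three similarity measures can be sketched on ratings stored as `{item: rating}` dicts; the function names and the toy ratings in the usage are my own.

```python
import math

def jaccard(rx, ry):
    """Similarity of the *sets* of rated items; ignores the rating values."""
    a, b = set(rx), set(ry)
    return len(a & b) / len(a | b)

def cosine(rx, ry, items):
    """Raw cosine: a missing rating is filled with 0, i.e. implicitly 'negative'."""
    x = [rx.get(i, 0) for i in items]
    y = [ry.get(i, 0) for i in items]
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def centered_cosine(rx, ry, items):
    """Pearson-style: subtract each user's mean over their rated items first."""
    mx = sum(rx.values()) / len(rx)
    my = sum(ry.values()) / len(ry)
    cx = {i: r - mx for i, r in rx.items()}
    cy = {i: r - my for i, r in ry.items()}
    return cosine(cx, cy, items)
```

After centering, an agreeing neighbor gets a positive score and a disagreeing one a negative score, which is exactly the sharper separation the slide describes.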

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.

Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:
r̂_xi = (1/k) Σ_{y∈N} r_yi
r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Other options? Many other tricks possible…

Shorthand: s_xy = sim(x, y)
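The weighted-average prediction above can be sketched directly (the function name and toy numbers are mine):

```python
def predict_user_user(sims, neighbor_ratings):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy over the neighbor set N.

    sims: {neighbor: s_xy}; neighbor_ratings: {neighbor: r_yi}."""
    num = sum(sims[y] * neighbor_ratings[y] for y in sims)
    den = sum(sims.values())
    return num / den

# Two equally similar neighbors who rated the item 4 and 2 predict the midpoint.
print(predict_user_user({"a": 0.5, "b": 0.5}, {"a": 4.0, "b": 2.0}))  # 3.0
```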

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: user-user collaborative filtering. Another view: item-item:
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Ratings matrix: 6 movies (rows) × 12 users (columns); blank = unknown rating, numbers = ratings between 1 and 5]

RECSYS 33

Item-Item CF (|N|=2)

[Same 6 movies × 12 users matrix; task: estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

[Same matrix, annotated with a sim(1, m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6, so row 1 becomes [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[Same matrix with the sim(1, m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same matrix, with the estimate 2.6 filled in for (movie 1, user 5)]

Predict by taking the weighted average:

r_15 = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.6
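The final step can be checked numerically; this sketch uses the slide's similarity weights and user 5's ratings of the two neighbor movies:

```python
# r_15 = (s_13 * r_35 + s_16 * r_65) / (s_13 + s_16)
s = {3: 0.41, 6: 0.59}   # similarities of movies 3 and 6 to movie 1
r_user5 = {3: 2, 6: 3}   # user 5's ratings of those neighbor movies

r_15 = sum(s[j] * r_user5[j] for j in s) / sum(s.values())
print(round(r_15, 1))  # 2.6
```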

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Utility matrix example: Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

− Cold start: need enough users in the system to find a match

− Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

− First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

− Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: users × movies with known ratings 1–5 and many missing entries]

1162019

RECSYS 43

Evaluation

[Same users × movies matrix with some known ratings withheld as the test data set]

1162019

RECSYS 44

Evaluating Predictions: compare predictions with known ratings.

• Root-mean-square error (RMSE): √( Σ_{(x,i)} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
• Precision at top 10: % of relevant items among those in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR), X-axis: false positive rate (FPR)
    • TPR (aka Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
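RMSE over a test set of (user, item) pairs can be sketched as follows; the toy predictions are invented for illustration:

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the (user, item) pairs in the test set."""
    se = sum((pred[k] - truth[k]) ** 2 for k in truth)
    return math.sqrt(se / len(truth))

truth = {("x", "i"): 5, ("x", "j"): 3, ("y", "i"): 4}
pred  = {("x", "i"): 4, ("x", "j"): 3, ("y", "i"): 2}
print(round(rmse(pred, truth), 3))  # 1.291, i.e. sqrt((1 + 0 + 4) / 3)
```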

RECSYS 45

Problems with Error Measures: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions.

In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others.

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.

Too expensive to do at runtime. Could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.

1162019

RECSYS 47

Tip: Add Data. Leverage all the data:

Don't try to reduce data size in an effort to make fancy algorithms work.

Simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

[Figure: movies placed in a 2D latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny"; example movies include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[The Netflix utility matrix R: 480,000 users × 17,700 movies, with sparse known ratings 1–5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies) with some known ratings withheld as the test data set; r̂_36 marks one predicted rating]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²

where r̂_xi is the predicted rating and r_xi the true rating of user x on item i.

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T.

R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[Illustration: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]

SVD: A = U Σ V^T

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of P^T.

[Illustration, shown over three frames: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users); the estimated missing rating comes out to 2.4]

RECSYS 57

Latent Factor Models

[Figure, shown over two frames: movies placed in a 2D latent factor space; Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny"; example movies include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap SVD

Remember SVD: A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values.

SVD gives the minimum reconstruction error (SSE):

min_{U,Σ,V} Σ_{ij} ( A_ij − [U Σ V^T]_ij )²

So in our case, "SVD" on Netflix data: R ≈ Q · P^T

with A = R, Q = U, P^T = Σ V^T.

But we are not done yet! R has missing entries.

The sum goes over all entries, but our R has missing entries.

r̂_xi = q_i · p_x^T

A (m × n) ≈ U Σ V^T
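For a fully observed matrix, the rank-f truncation of the SVD — keeping Q = U and P^T = Σ V^T as above — can be sketched with NumPy; the small matrix A is an invented example:

```python
import numpy as np

# Full matrix (no missing entries): rank-f SVD gives the best SSE approximation.
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

f = 2
Q = U[:, :f]                  # "items" side
Pt = np.diag(s[:f]) @ Vt[:f]  # Sigma V^T, the "users" side
A2 = Q @ Pt                   # rank-2 reconstruction of A

# The SSE of the truncation equals the squared dropped singular value.
print(float(np.sum((A - A2) ** 2)))
```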

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²

Note:
• We don't require the cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Illustration: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]

r̂_xi = q_i · p_x^T

RECSYS 61

Dealing with Missing Entries: We want to minimize the SSE for unseen test data.

Idea: Minimize SSE on training data. We want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

"error" + λ · "length"; λ … regularization parameter

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: movies and users (Gus, Dave) placed together in the 2D factor space: "geared towards females" ↔ "geared towards males" vs. "serious" ↔ "escapist"]

RECSYS 63


RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

The Effect of Regularization

[Scatter plot: movies in the latent factor space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

min_factors ["error" + λ·"length"]

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

RECSYS 67

The Effect of Regularization

[Scatter plot: movies in the latent factor space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

min_factors ["error" + λ·"length"]

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!
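As a concrete illustration, the batch update can be sketched in a few lines of NumPy. This is a hedged sketch on a tiny invented ratings matrix (the matrix, η, λ, and the step count are made up for the example, not the author's code); missing entries are masked out of the error before the full-batch gradients of the regularized SSE are applied:

```python
import numpy as np

# Toy ratings matrix; 0 marks a missing entry (invented data)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 4.0],
              [0.0, 1.0, 5.0, 4.0]])
mask = R > 0                         # only known ratings enter the error
n_users, n_items = R.shape
f, lam, eta = 2, 0.1, 0.01           # factors, regularization, learning rate

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors

for step in range(2000):
    E = mask * (R - P @ Q.T)         # error on known entries only
    gP = -2 * E @ Q + 2 * lam * P    # full (batch) gradient wrt P
    gQ = -2 * E.T @ P + 2 * lam * Q  # full (batch) gradient wrt Q
    P -= eta * gP
    Q -= eta * gQ

sse = np.sum((mask * (R - P @ Q.T)) ** 2)
```

Every step touches all known ratings at once, which is exactly why the slides call batch gradients slow on Netflix-scale data.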

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q
GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
SGD: Q ← Q − η·∇Q(r_xi)
Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Plot: value of the objective function vs. iteration/step for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors:

For each rxi:
• εxi = rxi − qi·px^T (the "error")
• qi ← qi + η·(εxi·px − λ·qi) (update equation)
• px ← px + η·(εxi·qi − λ·px) (update equation)

2 for loops:
For until convergence:
• For each rxi
• Compute gradient, do a "step"

1162019

η … learning rate
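The two update equations translate almost line-for-line into code. A minimal sketch on invented (user, item, rating) triples (all constants here are illustrative, not tuned Netflix settings):

```python
import numpy as np

# Known ratings as (user, item, rating) triples -- invented toy data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
           (1, 0, 4.0), (1, 3, 1.0),
           (2, 0, 1.0), (2, 1, 1.0), (2, 3, 5.0),
           (3, 2, 5.0), (3, 3, 4.0)]
n_users, n_items, f = 4, 4, 2
lam, eta = 0.1, 0.02                  # regularization and learning rate

rng = np.random.default_rng(1)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

for epoch in range(200):              # iterate over ratings multiple times
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]                     # eps_xi = r_xi - q_i . p_x
        Q[i] += eta * (err * P[x] - lam * Q[i])   # update equation for q_i
        P[x] += eta * (err * Q[i] - lam * P[x])   # update equation for p_x

sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
```

Each inner-loop step costs O(f), which is why SGD scales to 100 million ratings while batch GD does not.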

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen.

Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply Optimization methods.

1162019

[Utility matrix with known ratings and missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
• Competition
– 2,700+ teams
– $1 million grand prize for 10% improvement on the Cinematch result
– $50,000 2007 progress prize for 8.43% improvement

[Tables: training data as (user, movie, score) triples; test data as (user, movie) pairs whose scores must be predicted]

Training data | Test data

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopmentcom

ratings per movie

• Avg #ratings/movie: 5627

ratings per user

• Avg #ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
– Scalability
– Keeping data in memory
• Missing data
– 99 percent missing
– Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

r_xi = μ + b_x + b_i + q_i·p_x^T   (user bias + movie bias + user-movie interaction)

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7
You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
Star Wars gets a mean rating of 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both biases b_u, b_i as well as the interactions q_i, p_u are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))² + λ·[ Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² ]
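The same SGD loop fits this richer model; now the biases b_x and b_i get update equations of the same form as the factors. A minimal sketch on invented data (the triples, η, λ, and epoch count are illustrative, not the Netflix settings):

```python
import numpy as np

# (user, item, rating) triples -- invented toy data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 4.0), (3, 2, 2.0)]
n_users, n_items, f = 4, 3, 2
lam, eta = 0.1, 0.02                      # picked for the example, not tuned

mu = np.mean([r for _, _, r in ratings])  # overall mean rating (fixed)
b_u = np.zeros(n_users)                   # user biases (learned)
b_i = np.zeros(n_items)                   # item biases (learned)
rng = np.random.default_rng(2)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

for epoch in range(400):
    for x, i, r in ratings:
        err = r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])
        b_u[x] += eta * (err - lam * b_u[x])       # user-bias update
        b_i[i] += eta * (err - lam * b_i[i])       # item-bias update
        Q[i] += eta * (err * P[x] - lam * Q[i])    # factor updates
        P[x] += eta * (err * Q[i] - lam * P[x])

sse = sum((r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])) ** 2
          for x, i, r in ratings)
```

In practice λ would be chosen by grid search on a validation set, as the slide notes.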

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined, local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; meaning of rating changed

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make parameters b_u and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) on the y-axis vs. number of predictors in the ensemble (1 to 50) on the x-axis; RMSE falls from about 0.889 to about 0.871 as the ensemble grows]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble

Strategy: Both teams carefully monitoring the leaderboard. The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day. Only 1 final submission could be made in the last 24h

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams: much computer time on final optimization; carefully calibrated to end about an hour before the deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later; …and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Underlying chart data: ensemble RMSE decreases monotonically from 0.8891 with 1 predictor to 0.8712 with 50 predictors]

RECSYS 12

Utility Matrix

[Sparse example ratings; most entries missing]

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: how to collect the data in the utility matrix?

(2) Extrapolate unknown ratings from the known ones. Mainly interested in high unknown ratings

• We are not interested in knowing what you don't like, but what you like

(3) Evaluating extrapolation methods How to measure successperformance of

recommendation methods

RECSYS 14

(1) Gathering Ratings

Explicit: ask people to rate items. Doesn't work well in practice – people can't be bothered

Implicit: learn ratings from user actions

• E.g., purchase implies high rating. What about low ratings?

RECSYS 15

(2) Extrapolating Utilities Key problem matrix U is sparse

Most people have not rated most items
Cold start:

• New items have no ratings
• New users have no history

Three approaches to recommender systems: 1) Content-based; 2) Collaborative Filtering

• Memory-based
  • User-based Collaborative Filtering
  • Item-based Collaborative Filtering
• Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites blogs news Recommend other sites with ldquosimilarrdquo content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild

1162019

RECSYS 19

Item Profiles: For each item, create an item profile

Profile is a set (vector) of features. Movies: author, title, actor, director, … Text: set of "important" words in the document

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
• Term … Feature
• Document … Item

1162019

RECSYS 20

Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj

n_i = number of docs that mention term i; N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores together with their scores

1162019

Note we normalize TFto discount for ldquolongerrdquo documents
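The scoring can be sketched on a toy corpus. This is an illustrative sketch (the corpus is invented; the normalization by the most frequent term follows the heuristic above, and log base 2 is an assumption since the slide leaves the base unspecified):

```python
import math
from collections import Counter

# Toy corpus: each "document" is an item profile (invented data)
docs = [["action", "hero", "explosion", "hero"],
        ["romance", "hero", "love"],
        ["love", "comedy", "romance"]]
N = len(docs)

def tf_idf(doc):
    counts = Counter(doc)
    max_f = max(counts.values())                  # normalize by most frequent term
    scores = {}
    for term, f_ij in counts.items():
        tf = f_ij / max_f                         # TF_ij = f_ij / max_k f_kj
        n_i = sum(1 for d in docs if term in d)   # docs mentioning the term
        idf = math.log2(N / n_i)                  # IDF_i = log(N / n_i)
        scores[term] = tf * idf                   # w_ij = TF_ij x IDF_i
    return scores

scores = tf_idf(docs[0])
```

Note that "hero" appears most often in the first document but also occurs in another document, so a rarer term like "action" can end up with the higher TF-IDF score.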

RECSYS 21

User Profiles and Prediction

User profile possibilities: weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, …

Prediction heuristic: Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x·i) / (‖x‖·‖i‖)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations: can provide explanations of recommended items by listing content-features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard

E.g., images, movies, music

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1 Consider user x

2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)

3 Recommend items to xbased on the weighted ratings of items by users in N

RECSYS 28

Similar Users Let rx be the vector of user xrsquos ratings

Jaccard similarity measure Problem Ignores the value of the rating

Cosine similarity measure

sim(x, y) = cos(r_x, r_y) = (r_x·r_y) / (‖r_x‖·‖r_y‖)

Problem: treats missing ratings as "negative". Pearson correlation coefficient:

sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √Σ_{s∈S_xy}(r_xs − r̄_x)² · √Σ_{s∈S_xy}(r_ys − r̄_y)² )
S_xy = items rated by both users x and y

1162019

rx = [ _ _ ]ry = [ _ _]

rx, ry as sets: rx = {1, 4, 5}; ry = {1, 3, 4}

rx, ry as points: rx = (1, 0, 0, 1, 3); ry = (1, 0, 2, 2, 0)

r̄x, r̄y … avg rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322 — considers missing ratings as "negative". Solution: subtract the (row) mean

sim(A,B) vs. sim(A,C) after centering: 0.092 > −0.559. Notice: centered cosine similarity is correlation when data is centered at 0

Cosine sim
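The three measures can be checked in code on rating rows reconstructed to reproduce the slide's numbers (A, B, C are the users from the example; 0 marks a missing rating — a sketch, not the course's code):

```python
import numpy as np

# Rating rows reconstructed from the slide's example; 0 = missing
A = np.array([4.0, 0, 0, 5, 1, 0, 0])
B = np.array([5.0, 5, 4, 0, 0, 0, 0])
C = np.array([0.0, 0, 0, 2, 4, 5, 0])

def jaccard(x, y):
    sx, sy = set(np.nonzero(x)[0]), set(np.nonzero(y)[0])
    return len(sx & sy) / len(sx | sy)      # rated-item sets only

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def centered(x):
    out = x.copy()
    known = x != 0
    out[known] -= x[known].mean()           # subtract row mean over known ratings
    return out

j_ab, j_ac = jaccard(A, B), jaccard(A, C)           # 1/5 vs 2/4
c_ab, c_ac = cosine(A, B), cosine(A, C)             # raw cosine
p_ab = cosine(centered(A), centered(B))             # ~0.092
p_ac = cosine(centered(A), centered(C))             # ~-0.559
```

After centering, A and C (who disagree on the movies they share) correctly come out strongly dissimilar.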

RECSYS 30

Rating Predictions

Let rx be the vector of user xrsquos ratings

Let N be the set of k users most similar to xwho have rated item i

Prediction for item i of user x:
r̂_xi = (1/k)·Σ_{y∈N} r_yi

Other options:
r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
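The similarity-weighted average is essentially a one-liner; the neighbor similarities and ratings below are hypothetical numbers, not from the slides:

```python
import numpy as np

def predict(sims, neighbor_ratings):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy over the k most similar users."""
    s = np.array(sims, dtype=float)
    r = np.array(neighbor_ratings, dtype=float)
    return s @ r / s.sum()

# Hypothetical: two neighbors with similarities 0.9 and 0.4 rated item i as 5 and 3
r_hat = predict([0.9, 0.4], [5, 3])
```

The more similar neighbor's rating of 5 pulls the prediction well above the plain average of 4.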

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model

1162019

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12
movie 1: 1  .  3  .  .  5  .  .  5  .  4  .
movie 2: .  .  5  4  .  .  4  .  .  2  1  3
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .
movie 4: .  2  4  .  5  .  .  4  .  .  2  .
movie 5: .  .  4  3  4  2  .  .  .  .  2  5
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; numbers = rating between 1 to 5

RECSYS 33

Item-Item CF (|N|=2)

(same ratings matrix as above; movies × users)

- estimate rating of movie 1 by user 5

1162019


RECSYS 34

Item-Item CF (|N|=2)

(same ratings matrix as above)

Neighbor selection: identify movies similar to movie 1, rated by user 5

1162019

sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]

2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

(same ratings matrix as above)

Compute similarity weights: s₁₃ = 0.41, s₁₆ = 0.59

1162019

sim(1, m): 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

RECSYS 36

Item-Item CF (|N|=2)

(same ratings matrix as above, with the estimated rating 2.6 for movie 1 by user 5 filled in)

Predict by taking weighted average

r₁₅ = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

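The whole worked example can be replayed in code: center each movie's row, compute cosine similarities to movie 1, keep the two most similar movies that user 5 rated, and take the weighted average (the matrix is reconstructed from the slide; this is an illustrative sketch, not the course's code):

```python
import numpy as np

# Ratings matrix from the slide (rows: movies 1-6, cols: users 1-12, 0 = missing)
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def centered(row):
    out = row.copy()
    known = row != 0
    out[known] -= row[known].mean()      # subtract the movie's mean rating
    return out

def sim(i, j):                           # centered cosine (Pearson-style)
    a, b = centered(R[i]), centered(R[j])
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

user, target = 4, 0                      # user 5, movie 1 (0-indexed)
rated = [j for j in range(6) if j != target and R[j, user] > 0]
sims = {j: sim(target, j) for j in rated}
top2 = sorted(rated, key=lambda j: sims[j], reverse=True)[:2]
pred = sum(sims[j] * R[j, user] for j in top2) / sum(sims[j] for j in top2)
# slide's answer: top-2 neighbors are movies 6 and 3, prediction ~2.6
```

The code recovers the slide's similarity weights (~0.41 and ~0.59) and the prediction of about 2.6.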

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Sparse example ratings; most entries missing]

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

1162019

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

– Cold start: need enough users in the system to find a match

– Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

– First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

– Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: implement two or more different recommenders and combine predictions, perhaps using a linear model

Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions: compare predictions with known ratings

• Root-mean-square error (RMSE): √( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
• Precision at top 10: % of those in top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
  – TPR (aka Recall) = TP / P = TP / (TP + FN)
  – FPR = FP / N = FP / (FP + TN)
  – See https://en.wikipedia.org/wiki/Precision_and_recall
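These measures reduce to short formulas; a sketch with invented numbers (the predictions, truths, and confusion-matrix counts are made up for illustration):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over N known ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def tpr_fpr(tp, fp, tn, fn):
    """True-positive rate (recall) and false-positive rate, one ROC point."""
    return tp / (tp + fn), fp / (fp + tn)

err = rmse([3.5, 4.0, 2.0], [4, 4, 1])       # invented predicted vs true ratings
tpr, fpr = tpr_fpr(tp=8, fp=2, tn=5, fn=5)   # invented confusion-matrix counts
```

Sweeping a decision threshold and re-computing (TPR, FPR) at each setting traces out the ROC curve.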

RECSYS 45

Problems with Error Measures: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions

In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N·|C|); |C| = # of clusters (= k in k-means); N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Scatter plot: movies placed in a 2-D latent factor space (geared towards females ↔ geared towards males; serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

r̂₃₆

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q·P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" Q·P^T

R has missing entries, but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Illustration: the ratings matrix R (items × users, with missing entries) is approximated as the product of two "thin" matrices: Q (items × f factors) and P^T (f factors × users)]

SVD: A = U·Σ·V^T

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

[Illustration: R ≈ Q·P^T]

r̂_xi = q_i·p_x^T = Σ_f q_if·p_xf

q_i = row i of Q; p_x = column x of P^T

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

[Illustration: R ≈ Q·P^T]

r̂_xi = q_i·p_x^T = Σ_f q_if·p_xf

q_i = row i of Q; p_x = column x of P^T

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Latent Factor Models

[Figure: movies plotted along two latent factors; Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap SVD

Remember SVD: A = U Σ V^T
- A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: diagonal matrix of singular values

SVD gives the minimum reconstruction error (SSE):
min_{U,Σ,V} Σ_{ij} (A_ij − [U Σ V^T]_ij)²

So in our case, "SVD" on Netflix data R ≈ Q · P^T means: A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T

But we are not done yet! The sum above goes over all entries, and our R has missing entries

[Figure: A (m × n) ≈ U Σ V^T]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)²

Note:
- We don't require the columns of P, Q to be orthogonal or unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants

[Figure: R ≈ Q · P^T with f latent factors; r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
- Want large f (# of factors) to capture all the signals
- But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting
- Allow a rich model where there are sufficient data
- Shrink aggressively where data are scarce

min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

λ … regularization parameter; the first term is the "error", the second the "length"
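A minimal sketch of evaluating this regularized objective on a toy 2-factor problem (all ratings and factor values are illustrative):

```python
# Regularized objective: squared error on known ratings plus lambda times the
# squared lengths of all factor vectors.

ratings = {(0, 0): 5.0, (1, 1): 3.0}   # known r_xi, keyed by (user, item)
P = [[1.0, 0.5], [0.2, 1.0]]           # user factor vectors p_x (illustrative)
Q = [[2.0, 1.0], [1.0, 1.5]]           # item factor vectors q_i (illustrative)
lam = 0.1                              # lambda, the regularization parameter

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

error = sum((r - dot(Q[i], P[x])) ** 2 for (x, i), r in ratings.items())
length = sum(dot(v, v) for v in P) + sum(dot(v, v) for v in Q)
objective = error + lam * length
print(round(objective, 4))  # "error" + lambda * "length"
```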

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor] team)

[Figure: users (e.g. Gus, Dave) and movies embedded in the same latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "escapist". Movies shown include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 64

The Effect of Regularization

[Figure: movies and users in the latent space ("geared towards females" to "geared towards males", "serious" to "funny"); the following slides animate how the fitted factors move as regularization takes effect]

min_{factors} "error" + λ · "length"
= min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
- Initialize P and Q (e.g. using SVD, pretending missing ratings are 0)
- Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  - here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

(Recap: find P and Q by batch gradient descent, P ← P − η·∇P and Q ← Q − η·∇Q, with ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if, where q_if is entry f of row q_i of Q. Computing these full gradients is slow!)

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
- here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η · ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

Faster convergence!
- Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (e.g. using SVD, pretending missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors

For each r_xi:
- ε_xi = r_xi − q_i · p_x^T  (the prediction "error")
- q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
- p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

2 nested loops:
- For until convergence:
  - For each r_xi: compute the gradient, do a "step"

η … learning rate
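The two nested loops above can be sketched end to end; this is a toy run on a handful of made-up ratings (every rating, size, and hyperparameter value is illustrative), not the slide's data:

```python
import random

# SGD for latent factor models: sweep over the known ratings and nudge the
# matching user/item factor vectors after each one.

random.seed(0)
n_users, n_items, f = 4, 5, 2
ratings = {(0, 1): 4.0, (0, 3): 1.0, (1, 0): 5.0, (2, 2): 3.0, (3, 4): 2.0}

P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]  # p_x
Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]  # q_i
eta, lam = 0.05, 0.01  # learning rate eta and regularization lambda

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

for _ in range(200):                       # passes over the ratings
    for (x, i), r in ratings.items():
        err = r - dot(Q[i], P[x])          # eps_xi = r_xi - q_i . p_x^T
        for k in range(f):
            qik, pxk = Q[i][k], P[x][k]
            Q[i][k] = qik + eta * (err * pxk - lam * qik)   # q_i update
            P[x][k] = pxk + eta * (err * qik - lam * pxk)   # p_x update

train_sse = sum((r - dot(Q[i], P[x])) ** 2 for (x, i), r in ratings.items())
print(round(train_sse, 4))
```

After a couple hundred passes the training SSE is close to zero, with a small residual left by the regularization term.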

RECSYS 75
Koren, Bell & Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations
- Quantify goodness using SSE: lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen

Let's set the values of P and Q so that they work well on the known (user, item) ratings, and hope these values of P and Q will also predict the unknown ratings well

This is a case where we apply Optimization methods

[Figure: the small example rating matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
- Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
- Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

Movie rating data

[Tables: sample (user, movie, score) triples for the training data and the test data, with scores hidden in the test set]

Overall rating distribution
- A third of ratings are 4s
- Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
- Avg #ratings/movie: 5,627

#ratings per user
- Avg #ratings/user: 208

Average movie rating by movie count
- More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies (count / avg rating / title)
- 137,812 / 4.593 / The Shawshank Redemption
- 133,597 / 4.545 / Lord of the Rings: The Return of the King
- 180,883 / 4.306 / The Green Mile
- 150,676 / 4.460 / Lord of the Rings: The Two Towers
- 139,050 / 4.415 / Finding Nemo
- 117,456 / 4.504 / Raiders of the Lost Ark
- 180,736 / 4.299 / Forrest Gump
- 147,932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
- 149,199 / 4.325 / The Sixth Sense
- 144,027 / 4.333 / Indiana Jones and the Last Crusade

Challenges
- Size of data
  - Scalability
  - Keeping data in memory
- Missing data
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g. movie 16322)
- Use an ensemble of complementing predictors
- Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
- μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction

Baseline predictor (μ + b_x + b_i)
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

User-movie interaction
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1
- Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
- Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
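The worked example above is just three additions; as a sketch:

```python
# Baseline predictor from the slide's example: b_xi = mu + b_x + b_i
mu = 3.7    # overall mean rating
b_x = -1.0  # user bias: a critical reviewer
b_i = 0.5   # movie bias: Star Wars rated above the average movie

prediction = mu + b_x + b_i
print(prediction)  # 3.2
```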

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))²   (goodness of fit)
            + λ [ Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² ]   (regularization)

Stochastic gradient descent to find the parameters
- Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
- λ is selected via grid-search on a validation set

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

[Figure: global effects, then factorization, then collaborative filtering]

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000) for CF (no time bias), basic latent factors, and latent factors with biases]

RECSYS 108

Performance of Various Methods (RMSE)
- Global average: 1.1296
- User average: 1.0651
- Movie average: 1.0533
- Netflix (Cinematch): 0.9514
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Grand Prize: 0.8563

RECSYS 109

Temporal Biases of Users

Sudden rise in the average movie rating (early 2004)
- Improvements in the Netflix GUI
- Meaning of rating changed

Movie age
- Users prefer new movies without any reasons
- Older movies are just inherently better than newer ones

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
- p_x(t) … user preference vector on day t

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10,000) for CF (no time bias), basic latent factors, CF (time bias), latent factors with biases, + linear time factors, + per-day user biases, + CF]

RECSYS 112

Performance of Various Methods (RMSE)
- Global average: 1.1296
- User average: 1.0651
- Movie average: 1.0533
- Netflix (Cinematch): 0.9514
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Latent factors + biases + time: 0.876
- Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors (1 to 50); RMSE falls from ~0.889 with one predictor to ~0.871 with 50]

RECSYS 115

Standing on June 26th, 2009

[Figure: leaderboard screenshot] The June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed
- A group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score over 10%

BellKor
- Continues to get small improvements in their scores
- Realizes that they are in direct competition with Ensemble

Strategy
- Both teams carefully monitor the leaderboard
- The only sure way to check for improvement is to submit a set of predictions
  - This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day
- Only 1 final submission could be made in the last 24h

24 hours before the deadline…
- A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams
- Much computer time spent on final optimization
- Carefully calibrated to end about an hour before the deadline

Final submissions
- BellKor submits a little early (on purpose), 40 mins before the deadline
- Ensemble submits their final entry 20 mins later
- … and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
- Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com/

[Chart data: ensemble RMSE falls from 0.8891 with 1 predictor to 0.8712 with 50 predictors]

RECSYS 12

Utility Matrix

        Avatar  LOTR  Matrix  Pirates
Alice      1             0.2
Bob       0.5                    0.3
Carol            0.2      1
David                            0.4

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix
- How to collect the data in the utility matrix?

(2) Extrapolating unknown ratings from the known ones
- Mainly interested in high unknown ratings
  - We are not interested in knowing what you don't like, but what you like!

(3) Evaluating extrapolation methods
- How to measure the success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings

Explicit
- Ask people to rate items
- Doesn't work well in practice: people can't be bothered

Implicit
- Learn ratings from user actions
  - E.g. purchase implies a high rating
- What about low ratings?

RECSYS 15

(2) Extrapolating Utilities

Key problem: matrix U is sparse
- Most people have not rated most items
- Cold start:
  - New items have no ratings
  - New users have no history

Three approaches to recommender systems:
1) Content-based
2) Collaborative filtering
   - Memory-based
     - User-based collaborative filtering
     - Item-based collaborative filtering
   - Latent factor based

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations

Main idea: Recommend items to customer x similar to previous items rated highly by x

Examples:
- Movie recommendations: recommend movies with the same actor(s), director, genre, …
- Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Figure: from the items the user likes, build item profiles (red circles, triangles); from those, build a user profile; match the user profile against item profiles to recommend new items]

RECSYS 19

Item Profiles

For each item, create an item profile

Profile is a set (vector) of features
- Movies: author, title, actor, director, …
- Text: set of "important" words in the document

How to pick important features?
- Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  - Term … feature
  - Document … item

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j; TF_ij = f_ij / max_k f_kj
- Note: we normalize TF to discount for "longer" documents

n_i = number of docs that mention term i; N = total number of docs; IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with the highest TF-IDF scores, together with their scores
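A minimal sketch of the scoring above over a toy corpus (documents and words are illustrative; log base 2 is one common choice for the IDF):

```python
import math

# TF-IDF: w_ij = TF_ij * IDF_i, with TF normalized by the most frequent
# term in the document and IDF_i = log2(N / n_i).

docs = [
    ["star", "wars", "star"],
    ["star", "trek"],
    ["the", "godfather"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / max(doc.count(t) for t in doc)   # normalized TF
    n_i = sum(1 for d in docs if term in d)                 # docs mentioning term
    idf = math.log2(len(docs) / n_i)
    return tf * idf

print(round(tf_idf("star", docs[0], docs), 3))  # TF = 1, IDF = log2(3/2)
```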

RECSYS 21

User Profiles and Prediction

User profile possibilities:
- Weighted average of rated item profiles
- Variation: weight by difference from the average rating for the item, …

Prediction heuristic:
- Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

RECSYS 22

Pros: Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can list the content features that caused an item to be recommended

RECSYS 23

Cons: Content-based Approach

- Finding the appropriate features is hard (e.g. images, movies, music)
- Overspecialization
  - Never recommends items outside the user's content profile
  - People might have multiple interests
  - Unable to exploit quality judgments of other users
- Recommendations for new users: how to build a user profile?

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering

Recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant
- Domain-free: user/item attributes are not necessary
- Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g. using k-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by the users in N

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings

Jaccard similarity measure: treat r_x, r_y as sets of rated items, e.g. r_x = {1, 4, 5}, r_y = {1, 3, 4}
- Problem: ignores the value of the rating

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||), treating r_x, r_y as points, e.g. r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
- Problem: treats missing ratings as "negative"

Pearson correlation coefficient:
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) ]
- S_xy = items rated by both users x and y; r̄_x, r̄_y … average ratings of x, y
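The three measures can be sketched on the slide's own example vectors (0 marks a missing rating):

```python
# Jaccard, cosine, and Pearson similarity between two rating vectors.

r_x = [1, 0, 0, 1, 3]
r_y = [1, 0, 2, 2, 0]

def jaccard(a, b):
    sa = {i for i, v in enumerate(a) if v}      # items rated by a
    sb = {i for i, v in enumerate(b) if v}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    num = sum(u * v for u, v in zip(a, b))
    na = sum(u * u for u in a) ** 0.5
    nb = sum(v * v for v in b) ** 0.5
    return num / (na * nb)

def pearson(a, b):
    common = [i for i in range(len(a)) if a[i] and b[i]]   # S_xy
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    da = sum((a[i] - ma) ** 2 for i in common) ** 0.5
    db = sum((b[i] - mb) ** 2 for i in common) ** 0.5
    return num / (da * db) if da and db else 0.0

print(jaccard(r_x, r_y), round(cosine(r_x, r_y), 3))  # 0.5 0.302
```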

RECSYS 29

Similarity Metric

[Table: three users A, B, C with ratings on shared movies]

Intuitively we want: sim(A, B) > sim(A, C)
- Jaccard similarity: 1/5 < 2/4 (fails)
- Cosine similarity: 0.386 > 0.322 (works, but considers missing ratings as "negative")
- Solution: subtract the (row) mean first; then sim(A, B) vs. sim(A, C) gives 0.092 > −0.559
- Notice: cosine similarity is correlation when the data is centered at 0

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings, and N the set of k users most similar to x who have rated item i

Prediction for item i of user x:
- Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi
- Similarity-weighted average: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
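The weighted-average prediction can be sketched on a toy neighborhood (similarities and ratings are illustrative):

```python
# User-user prediction: r-hat_xi = sum_y s_xy * r_yi / sum_y s_xy
# over the k most similar users y who rated item i.

neighbors = [          # (similarity s_xy, neighbor's rating r_yi)
    (0.9, 4.0),
    (0.5, 3.0),
    (0.2, 5.0),
]

num = sum(s * r for s, r in neighbors)
den = sum(s for s, _ in neighbors)
r_hat = num / den
print(round(r_hat, 2))  # 3.81
```

Note how the most similar neighbor (0.9) pulls the estimate toward its own rating.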

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering

Another view: item-item
- For item i, find other similar items
- Estimate the rating for item i based on the target user's ratings on items similar to item i
- Can use the same similarity metrics and prediction functions as in the user-user model:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

- s_ij … similarity of items i and j
- r_xj … rating of user x on item j
- N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N| = 2): a worked example

[Figure: a 6-movie × 12-user ratings matrix; unknown ratings blank, known ratings between 1 and 5. Goal: estimate the rating of movie 1 by user 5]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5. Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i, e.g. m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows, giving sim(1, m) = [1.00, -0.18, 0.41, -0.10, -0.31, 0.59]

Compute the similarity weights of the two nearest neighbors: s_13 = 0.41, s_16 = 0.59

Predict by taking the weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j

Select the k nearest neighbors N(i; x)
- the set of items most similar to item i that were rated by x

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

- N(i; x) = set of items similar to item i that were rated by x
- s_ij = similarity of items i and j
- r_xj = rating of user x on item j
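As a sketch, the slide's own worked example (movie 1, user 5, neighbor movies 3 and 6) plugs straight into this formula:

```python
# Item-item prediction with the example's two nearest neighbors.

sims    = [0.41, 0.59]   # s_1j for the two most similar movies rated by user 5
ratings = [2.0, 3.0]     # user 5's ratings r_xj of those movies

r_hat = sum(s * r for s, r in zip(sims, ratings)) / sum(sims)
print(round(r_hat, 2))  # 2.59, i.e. the slide's 2.6 after rounding
```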

RECSYS 38

Item-Item vs. User-User

[Table: example utility matrix, users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item
  - No feature selection needed

- Cold start
  - Need enough users in the system to find a match
- Sparsity
  - The user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater
  - Cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias
  - Cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods

 Implement two or more different recommenders and combine predictions, perhaps using a linear model

 Add content-based methods to collaborative filtering:
   Item profiles for the new-item problem
   Demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix figure: users × movies, with known ratings (1-5) scattered through the matrix.]

1162019

RECSYS 43

Evaluation

[Same utility matrix: a subset of the known ratings is withheld as the test data set.]

1162019

RECSYS 44

Evaluating Predictions

 Compare predictions with known ratings:

 Root-mean-square error (RMSE):
   RMSE = sqrt( Σ_(x,i) (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi the true rating of x on i

 Precision at top 10:
   % of relevant items among the top 10

 Rank correlation:
   Spearman's correlation between the system's and the user's complete rankings
   Another approach: 0/1 model

 Coverage:
   Number of items/users for which the system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)

 Receiver Operating Characteristic (ROC) curve:
   Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
   TPR (aka recall) = TP / P = TP / (TP + FN)
   FPR = FP / N = FP / (FP + TN)
   See https://en.wikipedia.org/wiki/Precision_and_recall

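RMSE is straightforward to compute over a held-out test set; a small sketch (the `pred`/`truth` values below are made up for illustration):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over aligned (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

pred  = [3.5, 2.0, 4.5, 1.0]   # hypothetical predicted ratings
truth = [4.0, 2.0, 4.0, 2.0]   # hypothetical held-out true ratings
print(round(rmse(pred, truth), 3))  # -> 0.612
```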
RECSYS 45

Problems with Error Measures

 Narrow focus on accuracy sometimes misses the point:
   Prediction diversity
   Prediction context
   Order of predictions

 In practice, we care only to predict high ratings:
   RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering: Complexity

 Expensive step is finding the k most similar customers: O(|X|)
   Recall that X = set of customers in the system

 Too expensive to do at runtime
   Could pre-compute using clustering as an approximation

 Naïve pre-computation takes time O(N · |C|)
   |C| = # of clusters = k in k-means; N = # of data points

 We already know how to do this:
   Near-neighbor search in high dimensions (LSH)
   Clustering
   Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

 Leverage all the data
   Don't try to reduce data size in an effort to make fancy algorithms work
   Simple methods on large data do best

 Add more data, e.g., add IMDB data on genres

 More data beats better algorithms:
   http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a 2-D latent space. Horizontal axis: geared towards females ↔ geared towards males; vertical axis: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Utility matrix R: 480,000 users × 17,700 movies; known ratings (1-5) scattered through the matrix.]

RECSYS 52

Utility Matrix R: Evaluation

[Same 480,000-user × 17,700-movie matrix; some known ratings (e.g., r_3,6) are withheld as the test data set, and the rest form the training data set.]

SSE = Σ_{(x,i) ∈ R} ( r̂_xi − r_xi )²

where r̂_xi is the predicted rating and r_xi the true rating of user x on item i.

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT

 R has missing entries, but let's ignore that for now
   Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); cf. SVD: A = U Σ VT.]

RECSYS 54

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

   r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

   q_i = row i of Q
   p_x = column x of PT

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users).]

RECSYS 55

Ratings as Products of Factors

 (Same factorization figure: the missing rating of user x for item i is estimated as the dot product r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf of item i's factor row and user x's factor column.)

RECSYS 56

Ratings as Products of Factors

 (Same figure; here the dot product q_i · p_x^T fills in the missing entry with the estimate 2.4.)

   r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

   q_i = row i of Q
   p_x = column x of PT

RECSYS 57

Latent Factor Models

[Figure: the same movies plotted against latent Factor 1 (geared towards females ↔ geared towards males) and Factor 2 (serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 58

Latent Factor Models

[Same two-factor movie plot as on the previous slide.]

RECSYS 59

Recap: SVD

 Remember SVD:
   A: input data matrix (m × n)
   U: left singular vectors
   V: right singular vectors
   Σ: singular values
   A ≈ U Σ VT

 SVD gives minimum reconstruction error (SSE):
   min_{U,V,Σ} Σ_{xi} ( A_xi − (U Σ VT)_xi )²

 So in our case, "SVD" on Netflix data: R ≈ Q · PT
   A = R, Q = U, PT = Σ VT
   r̂_xi = q_i · p_x^T

 But we are not done yet! R has missing entries.
   The sum above goes over all entries, but our R has missing entries.

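The optimality claim can be checked numerically with NumPy's SVD: truncating to the top f singular values gives the best rank-f approximation, and the SSE it leaves behind equals the sum of the squared dropped singular values (the matrix A below is an arbitrary illustration, not Netflix data):

```python
import numpy as np

A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

f = 2                                             # keep the top-2 factors
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]       # best rank-f approximation
sse = ((A - A_f) ** 2).sum()
print(sse)  # equals (s[f:] ** 2).sum()
```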
RECSYS 60

Latent Factor Models

 SVD isn't defined when entries are missing!

 Use specialized methods to find P, Q:
   min_{P,Q} Σ_{(x,i) ∈ R} ( r_xi − q_i · p_x^T )²

 Note:
   We don't require the columns of P, Q to be orthogonal/unit length
   P, Q map users/movies to a latent space
   The most popular model among Netflix contestants

[Figure: R ≈ Q · PT, with r̂_xi = q_i · p_x^T.]

RECSYS 61

Dealing with Missing Entries

 Want to minimize SSE for unseen test data

 Idea: minimize SSE on training data
   Want large f (# of factors) to capture all the signals
   But SSE on test data begins to rise for f > 2

 Regularization is needed to avoid overfitting:
   Allow a rich model where there are sufficient data
   Shrink aggressively where data are scarce

   min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

   λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: movies and users (e.g., Gus, Dave) embedded in the same 2-D latent space; axes: geared towards females ↔ geared towards males, serious ↔ escapist. Movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 63

Dealing with Missing Entries

 Want to minimize SSE for unseen test data

 Idea: minimize SSE on training data
   Want large f (# of factors) to capture all the signals
   But SSE on test data begins to rise for f > 2

 Regularization is needed:
   Allow a rich model where there are sufficient data
   Shrink aggressively where data are scarce

   min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

   λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 64

The Effect of Regularization

[Factor plot: movies on Factor 1 (geared towards females ↔ geared towards males) vs. Factor 2 (serious ↔ funny).]

 min_factors "error" + λ · "length":

   min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

RECSYS 65

The Effect of Regularization

[Same factor plot: the next animation frame as the regularization term takes effect.]

 min_factors "error" + λ · "length"

RECSYS 66

The Effect of Regularization

[Same factor plot: a further animation frame under regularization.]

 min_factors "error" + λ · "length"

RECSYS 67

The Effect of Regularization

[Same factor plot: the final animation frame under regularization.]

 min_factors "error" + λ · "length"

RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:

   min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

 Gradient descent:
   Initialize P and Q (using SVD; pretend missing ratings are 0)
   Do gradient descent:
     P ← P − η · ∇P
     Q ← Q − η · ∇Q
     where ∇Q is the gradient/derivative of matrix Q:
       ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
     Here q_if is entry f of row q_i of matrix Q

   How to compute the gradient of a matrix? Compute the gradient of every element independently!

 Observation: Computing gradients is slow!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:

   min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

 Gradient descent:
   Initialize P and Q (using SVD; pretend missing ratings are 0)
   Do gradient descent:
     P ← P − η · ∇P
     Q ← Q − η · ∇Q
     where ∇Q is the gradient/derivative of matrix Q:
       ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
     Here q_if is entry f of row q_i of matrix Q

   How to compute the gradient of a matrix? Compute the gradient of every element independently!

 Observation: Computing gradients is slow!

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD (SGD)

 Observation: ∇Q = [∇q_if], where
   ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if = Σ_{x,i} ∇Q( r_xi )
   Here q_if is entry f of row q_i of matrix Q

 GD: Q ← Q − η Σ_{r_xi} ∇Q( r_xi )

 Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

 SGD: Q ← Q − μ · ∇Q( r_xi )

 Faster convergence!
   Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD

 Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step. GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!]

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
   Initialize P and Q (using SVD; pretend missing ratings are 0)
   Then iterate over the ratings (multiple times if necessary) and update the factors:

   For each r_xi:
     ε_xi = r_xi − q_i · p_x^T      (derivative of the "error")
     q_i ← q_i + η ( ε_xi p_x − λ q_i )   (update equation)
     p_x ← p_x + η ( ε_xi q_i − λ p_x )   (update equation)

 2 for loops:
   For until convergence:
     For each r_xi:
       Compute gradient, do a "step"

 η … learning rate

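The two nested loops above can be sketched directly in NumPy. Everything below (the toy matrix, f = 3, η = 0.02, λ = 0.1, 200 epochs, the random initialization) is an illustrative assumption, not the tuned Netflix setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy utility matrix, 0 = missing.
R = np.array([[1, 0, 3, 0, 5],
              [2, 4, 0, 1, 0],
              [0, 2, 4, 0, 5],
              [1, 0, 3, 3, 0]], dtype=float)
ratings = [(x, i, R[x, i]) for x in range(R.shape[0])
           for i in range(R.shape[1]) if R[x, i] > 0]

f, eta, lam = 3, 0.02, 0.1
P = 0.1 * rng.standard_normal((R.shape[0], f))   # user factors p_x
Q = 0.1 * rng.standard_normal((R.shape[1], f))   # item factors q_i

def sse():
    """SSE over the known ratings only."""
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

before = sse()
for _ in range(200):                # outer loop: epochs until "convergence"
    for x, i, r in ratings:         # inner loop: one SGD step per known rating
        err = r - Q[i] @ P[x]       # eps_xi
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])
after = sse()
print(before, "->", after)          # training SSE drops as the factors fit
```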
RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
   Quantify goodness using SSE: lower SSE means better recommendations
   We want to make good recommendations on items that some user has not yet seen

 So: set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well

 This is a case where we apply optimization methods

[Utility matrix figure.]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize

 Training data:
   100 million ratings
   480,000 users
   17,770 movies
   6 years of data: 2000-2005

 Test data:
   Last few ratings of each user (2.8 million)
   Evaluation criterion: root mean squared error (RMSE)
   Netflix Cinematch RMSE: 0.9514

 Competition:
   2,700+ teams
   $1 million grand prize for 10% improvement on the Cinematch result
   $50,000 2007 progress prize for 8.43% improvement

[Tables: (user, movie, score) triples for the training data and, with scores withheld, the test data.]

Movie rating data

 Overall rating distribution:
   A third of ratings are 4s
   Average rating is 3.68
   (From TimelyDevelopment.com)

 #ratings per movie: avg #ratings/movie is 5627
 #ratings per user: avg #ratings/user is 208
 Average movie rating by movie count: more ratings go to better movies

 Most loved movies:

   Count    Avg rating   Title
   137,812  4.593        The Shawshank Redemption
   133,597  4.545        Lord of the Rings: The Return of the King
   180,883  4.306        The Green Mile
   150,676  4.460        Lord of the Rings: The Two Towers
   139,050  4.415        Finding Nemo
   117,456  4.504        Raiders of the Lost Ark
   180,736  4.299        Forrest Gump
   147,932  4.433        Lord of the Rings: The Fellowship of the Ring
   149,199  4.325        The Sixth Sense
   144,027  4.333        Indiana Jones and the Last Crusade

Challenges

 Size of data:
   Scalability
   Keeping data in memory

 Missing data:
   99 percent missing
   Very imbalanced

 Avoiding overfitting

 Test and training data differ significantly (movie #16322)

 Use an ensemble of complementing predictors
   Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

 r̂_xi = μ + b_x + b_i + q_i · p_x^T
   (user bias + movie bias + user-movie interaction)

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

 User-movie interaction:
   Characterizes the matching between users and movies
   Attracts most research in the field
   Benefits from algorithmic and mathematical innovations

 Baseline predictor:
   Separates users and movies
   Benefits from insights into users' behavior
   Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
   Rating scale of user x
   Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
   (Recent) popularity of movie i
   Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

 Example:
   Mean rating: μ = 3.7
   You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = -1
   Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
   Predicted rating for you on Star Wars:
     = 3.7 - 1 + 0.5 = 3.2

1162019

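The slide's arithmetic as a trivially runnable snippet (values taken from the slide):

```python
# Baseline (bias-only) prediction: r_xi = mu + b_x + b_i
mu  = 3.7   # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars: half a star above the average movie

r_hat = mu + b_x + b_i
print(round(r_hat, 1))  # -> 3.2
```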
RECSYS 105

Fitting the New Model

 Solve:

   min_{Q,P} Σ_{(x,i) ∈ R} ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²
           + λ [ Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² ]

   (first term: goodness of fit; second term: regularization)

 Stochastic gradient descent to find the parameters
   Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)

 λ is selected via grid-search on a validation set

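The objective can be evaluated directly for a given parameter setting; a sketch (the tiny ratings list and the bias values below are made-up illustrations; in practice b_x, b_i, P, Q are all fit by SGD and λ by grid search):

```python
import numpy as np

def objective(ratings, mu, bx, bi, P, Q, lam):
    """Regularized SSE of the biases-plus-factors model."""
    fit = sum((r - (mu + bx[x] + bi[i] + Q[i] @ P[x])) ** 2
              for x, i, r in ratings)
    reg = lam * ((P ** 2).sum() + (Q ** 2).sum()
                 + (bx ** 2).sum() + (bi ** 2).sum())
    return fit + reg

# Hypothetical parameters for 2 users x 2 items.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]
mu = 4.0
bx = np.array([0.5, -0.5])
bi = np.array([0.2, -0.8])
P = np.zeros((2, 2))   # user factors (here zero, so only biases contribute)
Q = np.zeros((2, 2))   # item factors
print(objective(ratings, mu, bx, bi, P, Q, lam=0.1))
```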
RECSYS 106

BellKor Recommender System

 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
   Global: overall deviations of users/movies
   Factorization: addressing "regional" effects
   Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for three methods: CF (no time bias), basic latent factors, latent factors with biases.]

1162019

RECSYS 108

Performance of Various Methods

 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + biases: 0.89
 Grand Prize: 0.8563

1162019

RECSYS 109

Temporal Biases of Users

 Sudden rise in the average movie rating (early 2004):
   Improvements in the Netflix GUI
   Meaning of rating changed

 Movie age:
   Users prefer new movies without any reasons
   Older movies are just inherently better than newer ones

 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

 Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

 Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
   Make parameters b_x and b_i depend on time:
   (1) Parameterize time-dependence by linear trends
   (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to factors:
   p_x(t) … user preference vector on day t

 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for: CF (no time bias), basic latent factors, CF (time bias), latent factors with biases, + linear time factors, + per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + biases: 0.89
 Latent factors + biases + time: 0.876
 Grand Prize: 0.8563

 Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of Ensemble Size

[Plot: error (RMSE) vs. number of predictors in the ensemble (1-50): RMSE falls from about 0.889 with one predictor to about 0.871 with fifty.]

RECSYS 115

Standing on June 26th, 2009

The June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
   Group of other teams on the leaderboard forms a new team
   Relies on combining their models
   Quickly also gets a qualifying score over 10%

 BellKor:
   Continue to get small improvements in their scores
   Realize that they are in direct competition with Ensemble

 Strategy:
   Both teams carefully monitoring the leaderboard
   The only sure way to check for improvement is to submit a set of predictions
     This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day:
   Only 1 final submission could be made in the last 24h

 24 hours before deadline…
   BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams:
   Much computer time on final optimization
   Carefully calibrated to end about an hour before the deadline

 Final submissions:
   BellKor submits a little early (on purpose), 40 mins before deadline
   Ensemble submits their final entry 20 mins later
   …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further Reading

 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019


RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites blogs news Recommend other sites with ldquosimilarrdquo content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild

1162019

RECSYS 19

Item Profiles For each item create an item profile

Profile is a set (vector) of features Movies author title actor directorhellip Text Set of ldquoimportantrdquo words in document

How to pick important features Usual heuristic from text mining is TF-IDF

(Term frequency Inverse Doc Frequency)bull Term hellip Featurebull Document hellip Item

1162019

RECSYS 20

Sidenote TF-IDFfij = frequency of term (feature) i in doc (item) j

ni = number of docs that mention term iN = total number of docs

TF-IDF score wij = TFij times IDFi

Doc profile = set of words with highest TF-IDF scores together with their scores

1162019

Note we normalize TFto discount for ldquolongerrdquo documents

RECSYS 21

User Profiles and Prediction

User profile possibilities Weighted average of rated item profiles Variation weight by difference from average

rating for item hellip

Prediction heuristic Given user profile x and item profile i estimate 119906119906(119961119961 119946119946) = cos(119961119961 119946119946) = 119961119961middot119946119946

| 119961119961 |sdot| 119946119946 |

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations Can provide explanations of recommended items by listing content-

features that caused an item to be recommended

1162019

RECSYS 23

Cons Content-based Approach ndash Finding the appropriate features is hard

Eg images movies music

ndash Overspecialization Never recommends items outside userrsquos

content profile People might have multiple interests Unable to exploit quality judgments of other users

ndash Recommendations for new users How to build a user profile

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let rx be the vector of user x's ratings

Jaccard similarity measure: sim(x, y) = |rx ∩ ry| / |rx ∪ ry| Problem: Ignores the value of the rating

Cosine similarity measure:

sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)

Problem: Treats missing ratings as "negative" Pearson correlation coefficient:

sim(x, y) = Σ_{s∈Sxy} (rxs − r̄x)(rys − r̄y) / ( sqrt(Σ_{s∈Sxy} (rxs − r̄x)²) · sqrt(Σ_{s∈Sxy} (rys − r̄y)²) )

Sxy = items rated by both users x and y

rx, ry as sets: rx = {1, 4, 5}, ry = {1, 3, 4}

rx, ry as points: rx = [1, 0, 0, 1, 3], ry = [1, 0, 2, 2, 0]

r̄x, r̄y … avg rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322 Considers missing ratings as "negative" Solution: subtract the (row) mean

sim(A,B) vs. sim(A,C): 0.092 > −0.559 Notice: cosine sim is correlation when data is centered at 0

Cosine sim: sim(x, y) = Σ_i rxi · ryi / ( sqrt(Σ_i rxi²) · sqrt(Σ_i ryi²) )

RECSYS 30
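The three measures can be compared directly on the A/B/C example; this is an illustrative sketch (the 7-item utility rows below are the standard textbook numbers for this example):

```python
import math

def jaccard(rx, ry):
    """Similarity of the *sets* of rated items (rating values ignored)."""
    sx = {i for i, r in enumerate(rx) if r > 0}
    sy = {i for i, r in enumerate(ry) if r > 0}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    dot = sum(a * b for a, b in zip(rx, ry))
    return dot / (math.sqrt(sum(a * a for a in rx)) *
                  math.sqrt(sum(b * b for b in ry)))

def center(rx):
    """Subtract the user's mean rating from rated entries; missing stay 0.
    Cosine of centered rows is the Pearson correlation."""
    rated = [r for r in rx if r > 0]
    mean = sum(rated) / len(rated)
    return [r - mean if r > 0 else 0.0 for r in rx]

# Utility-matrix rows for users A, B, C over 7 items (0 = missing)
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

print(jaccard(A, B), jaccard(A, C))               # 0.2 (=1/5), 0.5 (=2/4)
print(round(cosine(center(A), center(B)), 3),     # ≈ 0.092
      round(cosine(center(A), center(C)), 3))     # ≈ -0.559
```

After centering, A and C come out strongly dissimilar even though plain cosine ranked them close, which is exactly the slide's point.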

Rating Predictions

Let rx be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x: r̂xi = (1/k) Σ_{y∈N} ryi

Other option (similarity-weighted average): r̂xi = Σ_{y∈N} sxy · ryi / Σ_{y∈N} sxy

Many other tricks possible…

Shorthand: sxy = sim(x, y)

RECSYS 31
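As a minimal sketch, the similarity-weighted prediction r̂xi = Σ sxy·ryi / Σ sxy looks like this (the neighbor similarities and ratings are hypothetical numbers, not from the slides):

```python
def predict_user_user(sims, ratings):
    """Similarity-weighted average over the k most similar users who
    rated item i: r_hat = sum(s_xy * r_yi) / sum(s_xy)."""
    num = sum(s * r for s, r in zip(sims, ratings))
    den = sum(sims)
    return num / den

# hypothetical neighborhood N of k=3 similar users and their ratings of item i
print(round(predict_user_user([0.9, 0.5, 0.3], [4, 5, 3]), 2))  # 4.12
```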

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: User-user collaborative filtering

Another view: Item-item For item i, find other similar items Estimate the rating for item i based on the target user's ratings on items similar to item i Can use the same similarity metrics and prediction functions as in the user-user model

r̂xi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij

sij … similarity of items i and j; rxj … rating of user x on item j; N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

users 1–12 (columns), movies 1–6 (rows); '-' = unknown rating, numbers = rating between 1 to 5

movie 1:  1  -  3  -  -  5  -  -  5  -  4  -
movie 2:  -  -  5  4  -  -  4  -  -  2  1  3
movie 3:  2  4  -  1  2  -  3  -  4  3  5  -
movie 4:  -  2  4  -  5  -  -  4  -  -  2  -
movie 5:  -  -  4  3  4  2  -  -  -  -  2  5
movie 6:  1  -  3  -  3  -  -  2  -  -  4  -

RECSYS 33

Item-Item CF (|N|=2)

users 1–12 (columns), movies 1–6 (rows); estimate the rating of movie 1 by user 5 (marked '?')

movie 1:  1  -  3  -  ?  5  -  -  5  -  4  -
movie 2:  -  -  5  4  -  -  4  -  -  2  1  3
movie 3:  2  4  -  1  2  -  3  -  4  3  5  -
movie 4:  -  2  4  -  5  -  -  4  -  -  2  -
movie 5:  -  -  4  3  4  2  -  -  -  -  2  5
movie 6:  1  -  3  -  3  -  -  2  -  -  4  -

RECSYS 34

Item-Item CF (|N|=2)

users 1–12 (columns), movies 1–6 (rows); sim(1,m) in the last column

movie 1:  1  -  3  -  ?  5  -  -  5  -  4  -   | 1.00
movie 2:  -  -  5  4  -  -  4  -  -  2  1  3   | -0.18
movie 3:  2  4  -  1  2  -  3  -  4  3  5  -   | 0.41
movie 4:  -  2  4  -  5  -  -  4  -  -  2  -   | -0.10
movie 5:  -  -  4  3  4  2  -  -  -  -  2  5   | -0.31
movie 6:  1  -  3  -  3  -  -  2  -  -  4  -   | 0.59

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract mean rating mi from each movie i
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

users 1–12 (columns), movies 1–6 (rows); sim(1,m) in the last column

movie 1:  1  -  3  -  ?  5  -  -  5  -  4  -   | 1.00
movie 2:  -  -  5  4  -  -  4  -  -  2  1  3   | -0.18
movie 3:  2  4  -  1  2  -  3  -  4  3  5  -   | 0.41
movie 4:  -  2  4  -  5  -  -  4  -  -  2  -   | -0.10
movie 5:  -  -  4  3  4  2  -  -  -  -  2  5   | -0.31
movie 6:  1  -  3  -  3  -  -  2  -  -  4  -   | 0.59

Compute similarity weights: s13 = 0.41, s16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

users 1–12 (columns), movies 1–6 (rows); predicted entry filled in

movie 1:  1  -  3  -  2.6  5  -  -  5  -  4  -
movie 2:  -  -  5  4  -    -  4  -  -  2  1  3
movie 3:  2  4  -  1  2    -  3  -  4  3  5  -
movie 4:  -  2  4  -  5    -  -  4  -  -  2  -
movie 5:  -  -  4  3  4    2  -  -  -  -  2  5
movie 6:  1  -  3  -  3    -  -  2  -  -  4  -

Predict by taking the weighted average:

r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37
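The whole worked example (center rows, compute cosine similarities, take the weighted average) can be reproduced end-to-end; this sketch uses the slide's 6×12 matrix and recovers s13 ≈ 0.41, s16 ≈ 0.59, and r15 ≈ 2.6:

```python
import math

# Utility matrix from the slides: 6 movies x 12 users, 0 = unknown
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],  # movie 1 (user 5 unknown)
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def centered(row):
    """Subtract the movie's mean from its known ratings (Pearson prep)."""
    known = [r for r in row if r > 0]
    m = sum(known) / len(known)
    return [r - m if r > 0 else 0.0 for r in row]

def sim(a, b):
    """Cosine of mean-centered rows = Pearson correlation."""
    ca, cb = centered(a), centered(b)
    dot = sum(x * y for x, y in zip(ca, cb))
    na = math.sqrt(sum(x * x for x in ca))
    nb = math.sqrt(sum(y * y for y in cb))
    return dot / (na * nb)

def predict(R, i, x, k=2):
    """Item-item CF: weighted average over the k movies most similar
    to movie i among those rated by user x."""
    sims = sorted(((sim(R[i], R[j]), j) for j in range(len(R))
                   if j != i and R[j][x] > 0), reverse=True)[:k]
    return sum(s * R[j][x] for s, j in sims) / sum(s for s, _ in sims)

s13, s16 = sim(R[0], R[2]), sim(R[0], R[5])
r15 = predict(R, i=0, x=4, k=2)   # rating of movie 1 by user 5, ≈ 2.6
```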

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i; x): set of items most similar to item i that were rated by x

Estimate rating rxi as the weighted average:

r̂xi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij

N(i;x) = set of items similar to item i that were rated by x; sij = similarity of items i and j; rxj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

(example utility matrix over items Avatar, LOTR, Matrix, Pirates for users Alice, Bob, Carol, David)

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity: The user/ratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

(utility matrix of movie ratings, movies × users, with known entries 1–5 and many missing)

RECSYS 43

Evaluation

(same utility matrix with a subset of the known entries withheld as a Test Data Set)

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

Root-mean-square error (RMSE): sqrt( (1/|T|) Σ_{(x,i)∈T} (r̂xi − rxi)² ), where r̂xi is the predicted and rxi the true rating of x on i

Precision at top 10: % of relevant items among those in top 10

Rank Correlation: Spearman's correlation between system's and user's complete rankings Another approach: 0/1 model

Coverage: Number of items/users for which the system can make predictions

Precision = TP / (TP + FP); Accuracy = (TP+TN) / (TP + FP + TN + FN) Receiver Operating Characteristic (ROC) Curve:

Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR) • TPR (aka Recall) = TP / P = TP / (TP+FN) • FPR = FP / N = FP / (FP + TN) • See https://en.wikipedia.org/wiki/Precision_and_recall

RECSYS 45
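The error metrics above are one-liners in code; a sketch with a hypothetical held-out test set and a hypothetical top-k recommendation list:

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the test ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def precision_recall(recommended, relevant):
    """Set-based precision/recall: TP = recommended ∩ relevant."""
    tp = len(set(recommended) & set(relevant))
    return tp / len(recommended), tp / len(relevant)

# hypothetical predicted vs. true ratings on a held-out test set
pred  = [3.5, 4.0, 2.0, 5.0]
truth = [4,   4,   1,   5  ]
print(round(rmse(pred, truth), 3))

p, r = precision_recall(recommended=[1, 2, 3], relevant=[2, 3, 4, 5])
print(p, r)   # precision 2/3, recall 2/4
```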

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others


RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as an approximation

Naïve pre-computation takes time O(N · |C|) |C| = # of clusters = k in k-means; N = # of data points

We already know how to do this: Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

(Matrix R: ~480,000 users × 17,700 movies, sparsely filled with ratings 1–5)

RECSYS 52

Utility Matrix R Evaluation

(Matrix R split into a Training Data Set and a withheld Test Data Set; e.g. the true rating r36 of user 3 on item 6 is hidden and must be predicted)

SSE = Σ_{(x,i)∈R} (r̂xi − rxi)²

r̂xi = predicted rating; rxi = true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT

R has missing entries, but let's ignore that for now • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

(figure: the items × users matrix R factored into an items × f matrix Q times an f × users matrix PT, with f latent factors; compare SVD: A = U Σ VT)

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂xi = qi · pxT = Σf qif · pxf

qi = row i of Q; px = column x of PT

(figure: R ≈ Q · PT, with f latent factors)


RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂xi = qi · pxT = Σf qif · pxf = 2.4 for the highlighted missing entry

qi = row i of Q; px = column x of PT

(figure: R ≈ Q · PT, with f latent factors)

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2


RECSYS 59

Recap: SVD

Remember SVD: A: Input data matrix U: Left singular vecs V: Right singular vecs Σ: Singular values

SVD gives minimum reconstruction error (SSE): min_{U,Σ,V} Σij (Aij − (UΣVT)ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT with A = R, Q = U, PT = Σ VT

But we are not done yet! R has missing entries

The sum goes over all entries, but our R has missing entries

r̂xi = qi · pxT

(figure: A (m×n) ≈ U Σ VT)

RECSYS 60
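On a fully-known matrix, plain SVD gives exactly this minimum-SSE factorization. A small demo of the Q = U, PT = Σ·VT mapping (the toy matrix is made up; this is not the Netflix procedure, which must handle missing entries):

```python
import numpy as np

# A fully-known toy "utility" matrix, so plain SVD applies
A = np.array([[1., 3., 4., 2.],
              [3., 5., 5., 4.],
              [4., 5., 5., 3.],
              [2., 2., 2., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 truncation: among all rank-2 matrices this minimizes SSE
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Map to the slide's notation: Q = U, PT = Sigma * VT
Q = U[:, :k]
PT = np.diag(s[:k]) @ Vt[:k, :]
assert np.allclose(A2, Q @ PT)

# reconstruction error = sum of squared discarded singular values
sse = ((A - A2) ** 2).sum()
```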

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(i,x)∈R} (rxi − qi · pxT)²

Note: • We don't require cols of P, Q to be orthogonal/unit length • P, Q map users/movies to a latent space • The most popular model among Netflix contestants

r̂xi = qi · pxT

(figure: items × users matrix R ≈ Q · PT, with f factors)

RECSYS 61

Dealing with Missing Entries: Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data Want large f (# of factors) to capture all the signals But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting: Allow rich model where there are sufficient data Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (rxi − qi pxT)² + λ ( Σx ||px||² + Σi ||qi||² )

λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] Team)


Lethal Weapon

Dumb and Dumber


RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

minfactors "error" + λ "length"

min_{P,Q} Σ_{(x,i)∈training} (rxi − qi pxT)² + λ ( Σx ||px||² + Σi ||qi||² )


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (rxi − qi pxT)² + λ ( Σx ||px||² + Σi ||qi||² )

Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Do gradient descent:

• P ← P − η · ∇P • Q ← Q − η · ∇Q • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σ_{x,i} −2 (rxi − qi pxT) pxf + 2λ qif

• Here qif is entry f of row qi of matrix Q Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

(same objective and update rules as on the previous slide)

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD: Observation: ∇Q = [∇qif], where

∇qif = Σ_{x,i} ( −2 (rxi − qi pxT) pxf + 2λ qif ) = Σ_{x,i} ∇Q(rxi)

• Here qif is entry f of row qi of matrix Q

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

GD: Q ← Q − η Σ_{rxi} ∇Q(rxi)

SGD: Q ← Q − μ · ∇Q(rxi) Faster convergence!

• Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

(figure: value of the objective function vs. iteration/step)

GD improves the value of the objective function at every step. SGD improves the value but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0) Then iterate over the ratings (multiple times if necessary) and update the factors:

For each rxi:
• εxi = rxi − qi · pxT (the "error")
• qi ← qi + η (εxi px − λ qi) (update equation)
• px ← px + η (εxi qi − λ px) (update equation)

2 for loops: For until convergence:
• For each rxi
• Compute gradient, do a "step"

η … learning rate
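A runnable sketch of this SGD loop on a made-up set of ratings (the factor count f, learning rate η, regularization λ, and epoch count are arbitrary illustrative choices, not the course's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse ratings: list of (user x, item i, rating rxi)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 4), (3, 1, 3)]
n_users, n_items, f = 4, 3, 2

P = 0.1 * rng.standard_normal((n_users, f))   # user factors p_x
Q = 0.1 * rng.standard_normal((n_items, f))   # item factors q_i

eta, lam = 0.02, 0.01   # learning rate, regularization

def sse():
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

before = sse()
for epoch in range(1000):          # iterate over the ratings repeatedly
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]      # eps_xi
        qi_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])      # q_i update
        P[x] += eta * (err * qi_old - lam * P[x])    # p_x update
after = sse()
```

Training SSE drops steadily; in a real setting one would instead watch the SSE on held-out ratings to pick f and λ.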

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is a case where we apply optimization methods

(utility matrix of known ratings used to fit P and Q)

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize • Training data

– 100 million ratings – 480,000 users – 17,770 movies – 6 years of data: 2000-2005

• Test data – Last few ratings of each user (2.8 million) – Evaluation criterion: root mean squared error (RMSE) – Netflix Cinematch RMSE: 0.9514

• Competition – 2,700+ teams – $1 million grand prize for 10% improvement on Cinematch result – $50,000 2007 progress prize for 8.43% improvement

(tables: training data rows of (user, movie, score); test data rows of (user, movie, ?))

Movie rating data

Overall rating distribution

• Third of ratings are 4s • Average rating is 3.68

(From TimelyDevelopment.com)

#ratings per movie

• Avg #ratings/movie: 5627

#ratings per user

• Avg #ratings/user: 208

Average movie rating by movie count

• More ratings go to better movies

(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data – Scalability – Keeping data in memory

• Missing data – 99 percent missing – Very imbalanced

• Avoiding overfitting • Test and training data differ significantly (e.g., movie 16322)

• Use an ensemble of complementing predictors • Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂xi = μ + bx + bi + qi · pxT

μ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · pxT = user-movie interaction

User-Movie interaction: Characterizes the matching between users and movies Attracts most research in the field Benefits from algorithmic and mathematical innovations

Baseline predictor: Separates users and movies Benefits from insights into user's behavior Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i

– Rating scale of user x – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)

– (Recent) popularity of movie i – Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: Mean rating: μ = 3.7 You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1 Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5 Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2

RECSYS 105
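The baseline b_xi = μ + bx + bi can be estimated from data; the sketch below uses simple independent averages of residuals over a made-up ratings table (the full model fits the biases jointly with regularization, so this is only an approximation):

```python
# hypothetical ratings table: (user, movie, rating)
ratings = [("alice", "sw", 5), ("alice", "titanic", 4),
           ("bob", "sw", 4), ("bob", "matrix", 2),
           ("carol", "titanic", 3), ("carol", "matrix", 3)]

mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean rating

def bias(key_idx):
    """Average residual (rating - mu) per user (key_idx=0) or movie (key_idx=1)."""
    sums = {}
    for row in ratings:
        k, r = row[key_idx], row[2]
        sums.setdefault(k, []).append(r - mu)
    return {k: sum(v) / len(v) for k, v in sums.items()}

b_user, b_movie = bias(0), bias(1)

def baseline(x, i):
    """b_xi = mu + b_x + b_i; unseen users/movies fall back to bias 0."""
    return mu + b_user.get(x, 0.0) + b_movie.get(i, 0.0)
```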

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( rxi − (μ + bx + bi + qi · pxT) )² + λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )

"goodness of fit" + "regularization"; λ is selected via grid-search on a validation set

Stochastic gradient descent to find the parameters Note: Both biases bx, bi as well as interactions qi, px are treated as parameters (we estimate them)

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

Multi-scale modeling of the data: Combine top-level "regional" modeling of the data with a refined local view: Global:

• Overall deviations of users/movies Factorization:

• Addressing "regional" effects Collaborative filtering:

• Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

(figure: RMSE, 0.885 to 0.92, vs. millions of parameters, 1 to 1000, for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases)

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533 User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): Improvements in Netflix GUI Meaning of rating changed

Movie age: Users prefer new movies without any reasons Older movies are just inherently better than newer ones

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: rxi = μ + bx + bi + qi · pxT

Add time dependence to biases: rxi = μ + bx(t) + bi(t) + qi · pxT

Make parameters bx and bi depend on time: (1) Parameterize time-dependence by linear trends (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

(figure: RMSE, 0.875 to 0.92, vs. millions of parameters, 1 to 10,000, for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF)

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533 User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

(figure: error (RMSE) vs. number of predictors in the ensemble, 1 to 50; RMSE falls from about 0.889 with 1 predictor to about 0.871 with 50, with sharply diminishing returns)

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: Group of other teams on the leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10%

BellKor: Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy: Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of predictions

• This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day: Only 1 final submission could be made in the last 24h

24 hours before deadline… BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams: Much computer time on final optimization Carefully calibrated to end about an hour before the deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before deadline Ensemble submits their final entry 20 mins later …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09 http://www2.research.att.com/~volinsky/netflix/bpc.html http://www.the-ensemble.com


RECSYS 8

The Long Tail

Source Chris Anderson (2004)

1162019 8

RECSYS 9

Physical vs Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more

RECSYS 11

Formal Model

X = set of Customers

S = set of Items

Utility function u: X × S → R
- R = set of ratings
- R is a totally ordered set
- e.g., 0-5 stars, real number in [0,1]

RECSYS 12

Utility Matrix

[sparse utility matrix: rows = users Alice, Bob, Carol, David; columns = movies Avatar, LOTR, Matrix, Pirates; a few known ratings in [0,1] (values 1, 0.2, 0.5, 0.3, 0.4), most entries unknown]

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: how to collect the data in the utility matrix?

(2) Extrapolating unknown ratings from the known ones: mainly interested in high unknown ratings
- We are not interested in knowing what you don't like, but what you like

(3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings
Explicit: Ask people to rate items
- Doesn't work well in practice - people can't be bothered
Implicit: Learn ratings from user actions
- E.g., purchase implies high rating
- What about low ratings?

RECSYS 15

(2) Extrapolating Utilities
Key problem: matrix U is sparse
- Most people have not rated most items
- Cold start:
  • New items have no ratings
  • New users have no history
Three approaches to recommender systems:
1) Content-based
2) Collaborative Filtering
  • Memory-based
    - User-based Collaborative Filtering
    - Item-based Collaborative Filtering
  • Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations
Main idea: Recommend items to customer x similar to previous items rated highly by x

Examples:
- Movie recommendations: Recommend movies with same actor(s), director, genre, …
- Websites, blogs, news: Recommend other sites with "similar" content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild

1162019

RECSYS 19

Item Profiles
For each item, create an item profile
Profile is a set (vector) of features
- Movies: author, title, actor, director, …
- Text: set of "important" words in document
How to pick important features?
- Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency)
  • Term … Feature
  • Document … Item

1162019

RECSYS 20

Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj
(Note: we normalize TF to discount for "longer" documents)
n_i = number of docs that mention term i; N = total number of docs
IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with highest TF-IDF scores, together with their scores
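The score above can be sketched directly in Python (a minimal sketch: the `tf_idf` helper and the tiny two-document corpus are illustrative, not from the slides; log base 2 is an arbitrary choice, since any base only rescales the scores):

```python
import math

def tf_idf(doc_term_counts, term, doc):
    # TF: term count normalized by the most frequent term in the doc
    counts = doc_term_counts[doc]
    tf = counts.get(term, 0) / max(counts.values())
    # IDF: log(total #docs / #docs mentioning the term)
    n_docs = len(doc_term_counts)
    n_i = sum(1 for c in doc_term_counts.values() if term in c)
    idf = math.log(n_docs / n_i, 2)
    return tf * idf

docs = {"d1": {"apple": 3, "pie": 1}, "d2": {"car": 2, "apple": 1}}
print(round(tf_idf(docs, "pie", "d1"), 3))   # 0.333 -- distinctive term
print(tf_idf(docs, "apple", "d1"))           # 0.0   -- appears in every doc
```

A term that appears in every document gets IDF 0, so it never enters a profile, which is exactly the "important words" filtering the slide describes.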

RECSYS 21

User Profiles and Prediction

User profile possibilities:
- Weighted average of rated item profiles
- Variation: weight by difference from average rating for item, …

Prediction heuristic:
- Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: No first-rater problem
+ Able to provide explanations: Can provide explanations of recommended items by listing the content-features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
- Finding the appropriate features is hard: e.g., images, movies, music
- Overspecialization:
  • Never recommends items outside user's content profile
  • People might have multiple interests
  • Unable to exploit quality judgments of other users
- Recommendations for new users: How to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: Recommend items based on past transactions of users
Analyze relations between users and/or items
Specific data characteristics are irrelevant:
- Domain-free: user/item attributes are not necessary
- Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x
2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users
Let r_x be the vector of user x's ratings
Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|
- Problem: ignores the value of the rating
- r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)
- Problem: treats missing ratings as "negative"
- r_x, r_y as points: r_x = (1, 0, 0, 1, 3), r_y = (1, 0, 2, 2, 0)
Pearson correlation coefficient:
- S_xy = items rated by both users x and y
- sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) )
- r̄_x, r̄_y … avg. rating of x, y
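The three measures can be compared on the r_x, r_y example above (a small sketch; `centered` implements the "subtract the row mean" fix, which makes cosine behave like a Pearson-style correlation):

```python
import math

def jaccard(a, b):
    # similarity of the *sets* of rated items (ignores rating values)
    return len(a & b) / len(a | b)

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

def centered(x):
    # subtract the user's mean rating from the rated (non-zero) entries;
    # unrated entries stay 0
    rated = [v for v in x if v != 0]
    mean = sum(rated) / len(rated)
    return [v - mean if v != 0 else 0.0 for v in x]

rx = [1, 0, 0, 1, 3]   # user x's ratings, 0 = not rated
ry = [1, 0, 2, 2, 0]
print(jaccard({1, 4, 5}, {1, 3, 4}))                   # 0.5
print(round(cosine(rx, ry), 3))                        # 0.302
print(round(cosine(centered(rx), centered(ry)), 3))    # 0.167
```

Centering shrinks the similarity here because the two users' high ratings land on different items once the "missing = 0" distortion is removed.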

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)
- Jaccard similarity: 1/5 < 2/4
- Cosine similarity: 0.386 > 0.322
  • Considers missing ratings as "negative"
  • Solution: subtract the (row) mean
- After centering: sim(A, B) vs. sim(A, C): 0.092 > -0.559
Notice: cosine similarity is correlation when the data is centered at 0

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings
Let N be the set of k users most similar to x who have rated item i
Prediction for item i of user x:
- Option 1: r̂_xi = (1/k) Σ_{y∈N} r_yi
- Option 2: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy
  (shorthand: s_xy = sim(x, y))
- Many other tricks possible…
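Option 2 can be sketched as a tiny helper (the function name and the example similarity/rating pairs are illustrative only):

```python
def predict_rating(neighbors):
    # neighbors: (similarity s_xy, neighbor's rating r_yi) pairs for users in N
    num = sum(s * r for s, r in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

print(predict_rating([(0.41, 2), (0.59, 3)]))   # ≈ 2.59
```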

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering
So far: User-user collaborative filtering
Another view: Item-item
- For item i, find other similar items
- Estimate rating for item i based on the target user's ratings on items similar to item i
- Can use the same similarity metrics and prediction functions as in the user-user model

1162019

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
- s_ij … similarity of items i and j
- r_xj … rating of user x on item j
- N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

movies \ users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:         1  .  3  .  .  5  .  .  5  .  4  .
movie 2:         .  .  5  4  .  .  4  .  .  2  1  3
movie 3:         2  4  .  1  2  .  3  .  4  3  5  .
movie 4:         .  2  4  .  5  .  .  4  .  .  2  .
movie 5:         .  .  4  3  4  2  .  .  .  .  2  5
movie 6:         1  .  3  .  3  .  .  2  .  .  4  .
(. = unknown rating; numbers = ratings between 1 to 5)

RECSYS 33

Item-Item CF (|N|=2)

[same utility matrix as above]
Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[same utility matrix as above]
Neighbor selection: identify movies similar to movie 1, rated by user 5
Here we use Pearson correlation as similarity:
1) Subtract mean rating m_i from each movie i; e.g., m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
sim(1, m) for m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

[same utility matrix as above; sim(1, m) = 1.00, -0.18, 0.41, -0.10, -0.31, 0.59]
Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[same utility matrix, with the estimated rating 2.6 filled in for movie 1, user 5]
Predict by taking a weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j
Select K nearest neighbors (KNN) N(i; x): items most similar to item i that were rated by x
Estimate rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

- N(i; x) = set of items similar to item i that were rated by x
- s_ij = similarity of items i and j
- r_xj = rating of user x on item j
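The whole item-item pipeline on the worked example above fits in a few lines (a sketch; `sim` centers each movie's row by its mean before taking the cosine, as the slides do with Pearson correlation):

```python
import math

# the utility matrix from the worked example (movies x users, 0 = unknown)
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def center(row):
    # subtract the movie's mean rating from its rated entries (zeros stay zero)
    rated = [v for v in row if v != 0]
    m = sum(rated) / len(rated)
    return [v - m if v != 0 else 0.0 for v in row]

def sim(a, b):
    # Pearson-style similarity = cosine of the mean-centered rows
    ca, cb = center(a), center(b)
    dot = sum(x * y for x, y in zip(ca, cb))
    na = math.sqrt(sum(x * x for x in ca))
    nb = math.sqrt(sum(y * y for y in cb))
    return dot / (na * nb)

s13 = sim(R[0], R[2])            # similarity of movies 1 and 3
s16 = sim(R[0], R[5])            # similarity of movies 1 and 6
# estimate rating of movie 1 by user 5 (index 4) from the |N| = 2 neighbors
r15 = (s13 * R[2][4] + s16 * R[5][4]) / (s13 + s16)
print(round(s13, 2), round(s16, 2), round(r15, 1))   # 0.41 0.59 2.6
```

Note that the similarities are computed on centered ratings, but the final weighted average uses the raw neighbor ratings, matching the slides.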

RECSYS 38

Item-Item vs User-User

[utility matrix over movies Avatar, LOTR, Matrix, Pirates and users Alice, Bob, Carol, David, with both known and estimated ratings]
In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods
Implement two or more different recommenders and combine predictions
- Perhaps using a linear model
Add content-based methods to collaborative filtering
- Item profiles for the new-item problem
- Demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[utility matrix of known ratings (users × movies), values 1-5, most entries missing]

1162019

RECSYS 43

Evaluation

[same utility matrix, with a subset of the known ratings withheld as the test data set]

1162019

RECSYS 44

Evaluating Predictions
Compare predictions with known ratings:
- Root-mean-square error (RMSE): sqrt( (1/N) Σ_{xi} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
- Precision at top 10: % of those in top 10
- Rank correlation: Spearman's correlation between system's and user's complete rankings
- Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP+TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
    - TPR (a.k.a. Recall) = TP / P = TP / (TP+FN)
    - FPR = FP / N = FP / (FP + TN)
    - See https://en.wikipedia.org/wiki/Precision_and_recall
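RMSE is a one-liner (the three example predictions below are made up for illustration):

```python
import math

def rmse(predicted, actual):
    # root-mean-square error over the held-out test ratings
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(round(rmse([3.5, 2.0, 4.0], [4, 2, 5]), 3))   # 0.645
```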

RECSYS 45

Problems with Error Measures
Narrow focus on accuracy sometimes misses the point:
- Prediction diversity
- Prediction context
- Order of predictions
In practice, we care only to predict high ratings:
- RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding the k most similar customers: O(|X|)
- Recall that X = set of customers in the system
Too expensive to do at runtime
- Could pre-compute using clustering as an approximation
Naïve pre-computation takes time O(N·|C|)
- |C| = # of clusters = k in the k-means; N = # of data points
We already know how to do this:
- Near-neighbor search in high dimensions (LSH)
- Clustering
- Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data
Leverage all the data
- Don't try to reduce data size in an effort to make fancy algorithms work
- Simple methods on large data do best
Add more data
- E.g., add IMDB data on genres
More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The Princess Diaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: ~480,000 users × 17,700 movies; sparse known ratings 1-5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies), split into a training data set and a held-out test data set; r̂_36 denotes the predicted rating of user 3 on item 6]

SSE = Σ_{(x,i)∈Test} (r̂_xi − r_xi)²
- r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT
For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT
R has missing entries, but let's ignore that for now
- Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values of the missing ones

[figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); cf. SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
- q_i = row i of Q
- p_x = column x of PT

[figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]

RECSYS 55

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, with q_i = row i of Q and p_x = column x of PT

RECSYS 56

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf (here the missing entry is estimated as 2.4)
- q_i = row i of Q, p_x = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The Princess Diaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The Princess Diaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 59

Recap SVD

Remember SVD:
- A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values
- SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [U Σ V^T]_ij)²
So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ V^T
But we are not done yet! The sum goes over all entries, while our R has missing entries; r̂_xi = q_i · p_x^T

[figure: A (m × n) ≈ U Σ V^T]

RECSYS 60

Latent Factor Models

"SVD" isn't defined when entries are missing!
Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries
Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data
- Want large f (# of factors) to capture all the signals
- But, SSE on test data begins to rise for f > 2
Regularization is needed to avoid overfitting:
- Allow a rich model where there are sufficient data
- Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

- λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The Princess Diaries

The Lion King

Braveheart

Independence Day

Amadeus, The Color Purple

Ocean's 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries
Want to minimize SSE for unseen test data
Idea: Minimize SSE on training data
- Want large f (# of factors) to capture all the signals
- But, SSE on test data begins to rise for f > 2
Regularization is needed:
- Allow a rich model where there are sufficient data
- Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

- λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Princess Diaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

The Princess Diaries
min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

The Princess Diaries
min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

Amadeus, The Color Purple

Dumb and Dumber

Ocean's 11

Sense and Sensibility

Factor 1

Factor 2

The Princess Diaries
min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 68

Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent, by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD (SGD)
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
- Here q_if is entry f of row q_i of matrix Q
GD: Q ← Q − η · Σ_{x,i} ∇Q(r_xi)
SGD: Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step: Q ← Q − η · ∇Q(r_xi)
Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD
Convergence of GD vs. SGD
[plot: value of the objective function vs. iteration/step]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i · p_x^T (the "error")
  • q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)
- 2 for loops: for until convergence: for each r_xi: compute gradient, do a "step"
- η … learning rate
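The two update equations above can be sketched in pure Python (a toy sketch: the data, the random initialization instead of an SVD start, and the values of η, λ, and the epoch count are illustrative assumptions, not values from the slides):

```python
import math
import random

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.02, epochs=3000):
    # Fit R ~ Q . P^T on the known ratings only, using the SGD updates above.
    random.seed(0)                      # deterministic toy run
    P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - sum(Q[i][k] * P[x][k] for k in range(f))   # eps_xi
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (err * pxk - lam * qik)         # q_i update
                P[x][k] += eta * (err * qik - lam * pxk)         # p_x update
    return P, Q

# toy data: (user, item, rating) triples for the known entries only
known = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5), (1, 1, 2)]
P, Q = sgd_factorize(known, n_users=3, n_items=3)
train_rmse = math.sqrt(sum((r - sum(Q[i][k] * P[x][k] for k in range(2))) ** 2
                           for x, i, r in known) / len(known))
print(round(train_rmse, 2))   # small: the factors reproduce the known ratings well
```

Each rating triggers one cheap update of a single row of Q and column of P, which is exactly why SGD takes more steps than GD but finishes sooner in wall-clock time.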

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
- Quantify goodness using SSE: so, lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen
- Let's set values for P and Q such that they work well on known (user, item) ratings
- And hope these values for P and Q will predict well the unknown ratings
This is a case where we apply Optimization methods

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prize
• Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
• Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
• Competition
  - 2700+ teams
  - $1 million grand prize for 10% improvement on Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[tables: training data (user, movie, score) and test data (user, movie, ?)]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
- μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction
Baseline predictor (μ + b_x + b_i)
- Separates users and movies
- Benefits from insights into user's behavior
- Among the main practical contributions of the competition
User-movie interaction
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = -1
- Star Wars gets a mean rating of 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
= 3.7 - 1 + 0.5 = 3.2
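The slide's arithmetic as a trivial helper (the function name `baseline` is mine, not from the slides):

```python
def baseline(mu, b_x, b_i):
    # baseline estimate b_xi = mu + b_x + b_i (no interaction term)
    return mu + b_x + b_i

print(baseline(3.7, -1.0, 0.5))   # ≈ 3.2
```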

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²  [goodness of fit]
           + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )  [regularization]

Stochastic gradient descent to find the parameters
- Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
- λ is selected via grid-search on a validation set

RECSYS 106

BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods:
- Basic Collaborative filtering: 0.94
- Latent factors: 0.90
- Latent factors + Biases: 0.89
- Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
- Improvements in Netflix GUI
- Meaning of rating changed
Movie age:
- Users prefer new movies without any reasons
- Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make parameters b_x and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors:
- p_x(t) … user preference vector on day t

(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)

RECSYS 111

Adding Temporal Effects

1162019

[plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods:
- Basic Collaborative filtering: 0.94
- Latent factors: 0.90
- Latent factors + Biases: 0.89
- Collaborative filtering++: 0.91
- Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[plot: error (RMSE, 0.87-0.8925) vs. number of predictors (1-50); the error decreases steadily as the ensemble grows]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
- Group of other teams on leaderboard forms a new team
- Relies on combining their models
- Quickly also get a qualifying score over 10%
BellKor:
- Continue to get small improvements in their scores
- Realize that they are in direct competition with Ensemble
Strategy:
- Both teams carefully monitoring the leaderboard
- Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor’s

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09; http://www2.research.att.com/~volinsky/netflix/bpc.html; http://www.the-ensemble.com


[Data behind the “Effect of ensemble size” chart: RMSE falls from 0.8891 with 1 predictor to 0.8712 with 50 predictors, with strongly diminishing returns after about 20 predictors]

RECSYS 9

Physical vs Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more

RECSYS 11

Formal Model

X = set of Customers

S = set of Items

Utility function u: X × S → R, where R = set of ratings; R is a totally ordered set, e.g., 0–5 stars, or a real number in [0,1]

RECSYS 12

Utility Matrix

[Sparse matrix of example ratings in [0,1]; most entries unknown]

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

RECSYS 13

Key Problems

(1) Gathering “known” ratings for the matrix: how to collect the data in the utility matrix?

(2) Extrapolate unknown ratings from the known ones. Mainly interested in high unknown ratings
• We are not interested in knowing what you don’t like, but what you like

(3) Evaluating extrapolation methods How to measure successperformance of

recommendation methods

RECSYS 14

(1) Gathering Ratings. Explicit:

Ask people to rate items. Doesn’t work well in practice: people can’t be bothered

Implicit: learn ratings from user actions
• E.g., a purchase implies a high rating. What about low ratings?

RECSYS 15

(2) Extrapolating Utilities Key problem matrix U is sparse

Most people have not rated most items. Cold start:
• New items have no ratings
• New users have no history

Three approaches to recommender systems: 1) Content-based, 2) Collaborative Filtering
• Memory-based
  • User-based Collaborative Filtering
  • Item-based Collaborative Filtering
• Latent factor based


RECSYS 16

Content-based Recommender Systems


RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations: recommend movies with same actor(s), director, genre, …

Websites, blogs, news: recommend other sites with “similar” content

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild


RECSYS 19

Item Profiles. For each item, create an item profile

Profile is a set (vector) of features. Movies: author, title, actor, director, … Text: set of “important” words in document

How to pick important features? Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency)
• Term … Feature
• Document … Item


RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j; TF_ij = f_ij / max_k f_kj

n_i = number of docs that mention term i; N = total number of docs; IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores


Note: we normalize TF to discount for “longer” documents
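A minimal sketch of this scoring (the tiny corpus and the base-2 logarithm are illustrative choices, not from the slides):

```python
import math
from collections import Counter

def tf_idf(docs):
    """w_ij = TF_ij * IDF_i, with TF_ij = f_ij / max_k f_kj (normalizing by the
    doc's most frequent term discounts longer docs) and IDF_i = log2(N / n_i)."""
    N = len(docs)
    n = Counter()                        # n_i: number of docs mentioning term i
    for doc in docs:
        n.update(set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)                 # f_ij: raw term frequencies in this doc
        max_f = max(f.values())
        profiles.append({t: (c / max_f) * math.log2(N / n[t])
                         for t, c in f.items()})
    return profiles

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
profiles = tf_idf(docs)
# "apple" is in 2 of 3 docs, so IDF = log2(3/2); in doc 0 its TF is 2/2 = 1
```

The doc profile would then keep only the highest-scoring terms of each dictionary.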

RECSYS 21

User Profiles and Prediction

User profile possibilities: weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, …

Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)


RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations: can provide explanations of recommended items by listing the content features that caused an item to be recommended


RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)

– Overspecialization: never recommends items outside the user’s content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?


RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are “similar” to x’s ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users. Let r_x be the vector of user x’s ratings

Jaccard similarity measure. Problem: ignores the value of the rating

Cosine similarity measure:

sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)

Problem: treats missing ratings as “negative”. Pearson correlation coefficient:

sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) )

S_xy = items rated by both users x and y


rx = [ _ _ ]ry = [ _ _]

r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}

r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]

r̄_x, r̄_y … avg rating of x, y
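These measures can be sketched directly on the example vectors above (a minimal sketch; the function names are ours, and `pearson` centers on co-rated items only):

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| on the sets of rated items (ignores rating values)."""
    return len(a & b) / len(a | b)

def cosine(x, y):
    """Cosine similarity; treats missing ratings (zeros) as 'negative'."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def pearson(x, y):
    """Center each user's ratings on their mean over co-rated items S_xy."""
    s = [(xi, yi) for xi, yi in zip(x, y) if xi and yi]   # co-rated items
    mx = sum(xi for xi, _ in s) / len(s)
    my = sum(yi for _, yi in s) / len(s)
    num = sum((xi - mx) * (yi - my) for xi, yi in s)
    den = math.sqrt(sum((xi - mx) ** 2 for xi, _ in s)) * \
          math.sqrt(sum((yi - my) ** 2 for _, yi in s))
    return num / den

rx, ry = [1, 0, 0, 1, 3], [1, 0, 2, 2, 0]
jaccard({1, 4, 5}, {1, 3, 4})   # 2/4 = 0.5
cosine(rx, ry)                  # 3 / (3·√11) ≈ 0.302
```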

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322. Considers missing ratings as “negative”. Solution: subtract the (row) mean

sim(A,B) vs. sim(A,C): 0.092 > −0.559. Notice: cosine similarity is correlation when the data is centered at 0

Cosine sim

RECSYS 30

Rating Predictions

Let r_x be the vector of user x’s ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x: r̂_xi = (1/k) Σ_{y∈N} r_yi

Other options: weight by similarity, r̂_xi = ( Σ_{y∈N} s_xy · r_yi ) / ( Σ_{y∈N} s_xy )

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
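A minimal sketch of the weighted-average prediction (the similarity values and user names are made up for illustration):

```python
def predict(x_sims, neighbor_ratings):
    """r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy over the users N, most
    similar to x, who rated item i.

    x_sims: {user y: s_xy}; neighbor_ratings: {user y: r_yi}."""
    num = sum(x_sims[y] * r for y, r in neighbor_ratings.items())
    den = sum(x_sims[y] for y in neighbor_ratings)
    return num / den

sims = {"bob": 0.9, "carol": 0.3}      # similarities of x to its neighbors
ratings_on_i = {"bob": 4, "carol": 2}  # the neighbors' ratings of item i
predict(sims, ratings_on_i)            # (0.9·4 + 0.3·2) / 1.2 = 3.5
```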

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view: Item-item. For item i, find other similar items. Estimate the rating for item i based on the target user’s ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model


r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x that are similar to i

RECSYS 32

Item-Item CF (|N|=2)

movies \ users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:         1     3        5        5     4
movie 2:               5  4        4        2  1  3
movie 3:         2  4     1  2     3     4  3  5
movie 4:            2  4     5        4        2
movie 5:               4  3  4  2              2  5
movie 6:         1     3     3        2        4

blank = unknown rating; known ratings are between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix as above; the goal is to estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix as above]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5

sim(1,m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix as above]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix, with the predicted value 2.6 filled in for movie 1, user 5]

Predict by taking the weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j

Select K nearest neighbors (KNN), N(i; x): the set of items most similar to item i that were rated by x

Estimate the rating r_xi as the weighted average:


r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Sparse example ratings; most entries unknown]

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David


In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users


RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies


RECSYS 44

Evaluating Predictions. Compare predictions with known ratings:

Root-mean-square error (RMSE): √( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i

Precision at top 10:
• % of relevant items among those in the top 10

Rank correlation:
• Spearman’s correlation between the system’s and the user’s complete rankings. Another approach: 0/1 model

Coverage:
• Number of items/users for which the system can make predictions

Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN). Receiver Operating Characteristic (ROC) curve:
• Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
• TPR (aka Recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
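A few of these measures as a minimal sketch (the sample numbers are illustrative, not from the slides):

```python
import math

def rmse(pred, true):
    """Root-mean-square error over the test ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def precision(tp, fp):
    """TP / (TP + FP)."""
    return tp / (tp + fp)

def roc_point(tp, fp, tn, fn):
    """One point on the ROC curve: (FPR, TPR); TPR is also recall."""
    tpr = tp / (tp + fn)          # TP / P
    fpr = fp / (fp + tn)          # FP / N
    return fpr, tpr

rmse([3.5, 4.0, 1.0], [4, 4, 2])     # sqrt((0.25 + 0 + 1) / 3) ≈ 0.645
precision(tp=8, fp=2)                # 0.8
```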

RECSYS 45

Problems with Error Measures: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions

In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others


RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system

Too expensive to do at runtime. Could pre-compute, using clustering as an approximation

Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (the k in k-means) and N = # of data points

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction


RECSYS 47

Tip Add Data Leverage all the data

Don’t try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data, e.g., add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies


Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²


480000 users

17700 movies

Predicted rating

True rating of user x on item i

r₃₆

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

“SVD” on Netflix data: R ≈ Q · Pᵀ

For now, let’s assume we can approximate the rating matrix R as a product of “thin” Q · Pᵀ

R has missing entries, but let’s ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don’t care about the values on the missing ones

[Figure: R (items × users, sparse ratings 1–5) ≈ Q (items × f factors) · Pᵀ (f factors × users); compare SVD: A = U Σ Vᵀ]

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i


[Figure: R ≈ Q · Pᵀ]

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i


[Figure: R ≈ Q · Pᵀ]

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i


[Figure: R ≈ Q · Pᵀ; the missing entry is estimated as 2.4]

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 59

Recap SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (A_ij − (U Σ Vᵀ)_ij)²

So in our case, “SVD” on Netflix data: R ≈ Q · Pᵀ, with A = R, Q = U, Pᵀ = Σ Vᵀ, and r̂_xi = q_i · p_xᵀ

But we are not done yet! The sum above goes over all entries, and our R has missing entries

[Diagram: A (m×n) ≈ U Σ Vᵀ]
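On a small complete matrix (no missing entries), plain SVD and its rank-f truncation can be demonstrated with NumPy; the matrix below is made up for illustration:

```python
import numpy as np

# A small complete ratings matrix (no missing entries, so plain SVD applies)
A = np.array([[5.0, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 truncation: the best rank-2 approximation in the SSE sense,
# playing the role of Q · P^T with f = 2 factors
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
sse = ((A - A2) ** 2).sum()
```

Keeping all singular values reconstructs A exactly; truncating trades reconstruction error for a compact latent representation.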

RECSYS 60

Latent Factor Models

SVD isn’t defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_xᵀ)²

Note:
• We don’t require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants


[Figure: R (items × users) ≈ Q · Pᵀ with f factors]

r̂_xi = q_i · p_xᵀ

RECSYS 61

Dealing with Missing Entries. Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
          “error”                                   “length”

λ … regularization parameter

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])


Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce


min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

λ … regularization parameter; the first term is the “error”, the second the “length”

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

min_{factors} “error” + λ·“length”:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

The PrincessDiaries

min_{factors} “error” + λ·“length”:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

min_{factors} “error” + λ·“length”

The PrincessDiaries

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

The PrincessDiaries

min_{factors} “error” + λ·“length”:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_{x,i} −2(r_xi − q_i p_xᵀ)·p_xf + 2λ·q_if
  • here q_if is entry f of row q_i of matrix Q
• Observation: computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent from Andrew Ng’s Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_{x,i} −2(r_xi − q_i p_xᵀ)·p_xf + 2λ·q_if
  • here q_if is entry f of row q_i of matrix Q
• Observation: computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_xᵀ)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2(r_xi − q_i p_xᵀ)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)

SGD: Q ← Q − μ·∇Q(r_xi)

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

Faster convergence!
• Needs more steps, but each step is computed much faster


RECSYS 73

SGD vs GD Convergence of GD vs SGD

[Plot: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value, but in a “noisy” way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: initialize P and Q (using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i · p_xᵀ  (derivative of the “error”)
• q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
• p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

2 for loops: for … until convergence:
• for each r_xi: compute the gradient, do a “step”

η … learning rate
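A minimal sketch of this SGD loop on a toy rating list (the hyperparameters and data are illustrative choices, not tuned values from the slides):

```python
import random

def train_sgd(ratings, n_users, n_items, f=2, eta=0.01, lam=0.02, epochs=1000):
    """SGD for R ≈ Q·P^T on known ratings only, per the update equations:
        eps = r_xi − q_i·p_x
        q_i ← q_i + eta·(eps·p_x − lam·q_i)
        p_x ← p_x + eta·(eps·q_i − lam·p_x)
    ratings is a list of (user x, item i, rating r_xi) triples."""
    rnd = random.Random(0)
    P = [[rnd.uniform(0.0, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rnd.uniform(0.0, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            eps = r - sum(q * p for q, p in zip(Q[i], P[x]))
            for k in range(f):
                q, p = Q[i][k], P[x][k]
                Q[i][k] += eta * (eps * p - lam * q)
                P[x][k] += eta * (eps * q - lam * p)
    return P, Q

ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = train_sgd(ratings, n_users=3, n_items=3)
sse = sum((r - sum(q * p for q, p in zip(Q[i], P[x]))) ** 2
          for x, i, r in ratings)      # training SSE; should be small
```

Each rating triggers one cheap step, which is exactly the GD-vs-SGD trade-off described above.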

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations

We want to make good recommendations on items that some user has not yet seen. Let’s set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well

This is a case where we apply optimization methods


1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data, 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: samples of the training data (user, movie, score) and the test data (user, movie; score withheld)]

Movie rating data

Overall rating distribution

• A third of the ratings are 4s
• Average rating is 3.68

From TimelyDevelopmentcom

# ratings per movie

• Avg ratings/movie: 5627

# ratings per user

• Avg ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

Count  | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_xᵀ

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_xᵀ = user-movie interaction

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor: separates users and movies; benefits from insights into the user’s behavior; among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, bx = -1. Star Wars gets a mean rating 0.5 higher than the average movie, bi = +0.5. Predicted rating for you on Star Wars:

= 3.7 - 1 + 0.5 = 3.2
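The arithmetic above can be checked with a tiny helper; a minimal sketch (the function name is my own):

```python
# Baseline predictor sketch: r_hat = mu + b_x + b_i,
# using the slide's example numbers.

def baseline(mu, b_user, b_movie):
    """Baseline rating estimate: global mean plus user and movie biases."""
    return mu + b_user + b_movie

# Critical reviewer (b_x = -1) rating Star Wars (b_i = +0.5), mu = 3.7:
pred = baseline(3.7, -1.0, 0.5)
print(round(pred, 1))  # 3.2
```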

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )² + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )
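A hedged sketch of fitting this biased model by per-rating SGD; all names (eta, lam, n_factors) and the tiny demo data are my own choices, not part of the original slides:

```python
import random

# SGD sketch for r_hat = mu + b_x + b_i + q_i . p_x, minimizing squared
# error plus lambda times the squared norms of all parameters.

def sgd_biased(ratings, n_users, n_items, n_factors=2,
               eta=0.01, lam=0.1, epochs=200, seed=0):
    rnd = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    P = [[rnd.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rnd.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            pred = mu + b_u[x] + b_i[i] + sum(p * q for p, q in zip(P[x], Q[i]))
            e = r - pred
            b_u[x] += eta * (e - lam * b_u[x])
            b_i[i] += eta * (e - lam * b_i[i])
            for f in range(n_factors):
                p, q = P[x][f], Q[i][f]
                P[x][f] += eta * (e * q - lam * p)
                Q[i][f] += eta * (e * p - lam * q)
    return mu, b_u, b_i, P, Q

def predict(model, x, i):
    mu, b_u, b_i, P, Q = model
    return mu + b_u[x] + b_i[i] + sum(p * q for p, q in zip(P[x], Q[i]))

# Toy data: item 0 is rated much higher than item 1 by both users.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 2)]
model = sgd_biased(ratings, n_users=2, n_items=2)
# After training, item 0's bias should pull its predictions above item 1's.
```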

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:

Global effects: overall deviations of users/movies

Factorization: addressing "regional" effects

Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Figure: RMSE (y-axis, 0.885 to 0.92) vs. millions of parameters (x-axis, 1 to 1000, log scale) for three methods: CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones.

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
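The binned bias idea can be sketched as follows; a hedged illustration of b_i(t) = b_i + b_{i,Bin(t)} with 10-week bins, where all names and numbers are my own:

```python
# Time-binned movie bias sketch: the static bias b_i plus a per-bin
# correction, with each bin covering 10 consecutive weeks.

WEEKS_PER_BIN = 10

def time_bin(week):
    return week // WEEKS_PER_BIN

def movie_bias(b_i, bin_biases, week):
    """Static movie bias plus the correction for the rating's time bin."""
    return b_i + bin_biases.get(time_bin(week), 0.0)

# Hypothetical movie whose ratings drifted upward over time:
bins = {0: -0.2, 1: 0.0, 2: +0.3}
print(round(movie_bias(0.5, bins, week=25), 1))  # week 25 is in bin 2 -> 0.8
```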

RECSYS 111

Adding Temporal Effects

1162019

[Figure: RMSE (y-axis, 0.875 to 0.92) vs. millions of parameters (x-axis, 1 to 10,000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: effect of ensemble size; error (RMSE, y-axis, 0.87 to 0.8925) vs. number of predictors (x-axis, 1 to 50)]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team, relying on combining their models. They quickly also get a qualifying score of over 10% improvement.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions.

• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

Underlying data for the ensemble-size chart (RMSE by number of predictors, selected points):

predictors 1: 0.8891; 5: 0.8767; 10: 0.8742; 20: 0.8727; 30: 0.8719; 40: 0.8715; 50: 0.8712

RECSYS 11

Formal Model

X = set of Customers

S = set of Items

Utility function u X times S RR = set of ratingsR is a totally ordered seteg 0-5 stars real number in [01]

RECSYS 12

Utility Matrix

         Avatar  LOTR  Matrix  Pirates
Alice      1             0.2
Bob               0.5             0.3
Carol     0.2              1
David                             0.4

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: how to collect the data in the utility matrix?

(2) Extrapolate unknown ratings from the known ones. Mainly interested in high unknown ratings:

• We are not interested in knowing what you don't like, but what you like.

(3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings

Explicit: ask people to rate items. Doesn't work well in practice; people can't be bothered.

Implicit: learn ratings from user actions.

• E.g., a purchase implies a high rating. What about low ratings?

RECSYS 15

(2) Extrapolating Utilities

Key problem: the matrix U is sparse. Most people have not rated most items. Cold start:

• New items have no ratings
• New users have no history

Three approaches to recommender systems:
1) Content-based
2) Collaborative filtering
   • Memory-based
     • User-based collaborative filtering
     • Item-based collaborative filtering
   • Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Figure: from items the user likes, build item profiles (red circles, triangles); aggregate them into a user profile; match the profile against candidate items to recommend]

1162019

RECSYS 19

Item Profiles

For each item, create an item profile.

A profile is a set (vector) of features. Movies: author, title, actor, director, … Text: set of "important" words in the document.

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency):

• Term … feature
• Document … item

1162019

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj (note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with the highest TF-IDF scores, together with their scores
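The scoring above can be sketched in a few lines; a minimal illustration (doc contents and the base-2 log are my own choices) using TF normalized by the most frequent term and IDF = log2(N / n_i):

```python
import math
from collections import Counter

# Minimal TF-IDF sketch: TF_ij = f_ij / max_k f_kj, IDF_i = log2(N / n_i),
# score w_ij = TF_ij * IDF_i.

def tf_idf(docs):
    N = len(docs)
    df = Counter()                    # n_i: in how many docs each term appears
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        max_f = max(tf.values())
        scores.append({t: (c / max_f) * math.log2(N / df[t])
                       for t, c in tf.items()})
    return scores

docs = [["star", "wars", "space"], ["star", "trek", "space"], ["romance", "paris"]]
s = tf_idf(docs)
# "star" appears in 2 of 3 docs, "wars" only in doc 0,
# so "wars" gets the higher score in doc 0.
```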

RECSYS 21

User Profiles and Prediction

User profile possibilities: weighted average of rated item profiles; variation: weight by difference from the average rating for the item; …

Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
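The cosine match between a user profile and item profiles can be sketched as follows; the three-feature vectors are purely illustrative:

```python
import math

# Content-based matching sketch: score u(x, i) = cos(x, i) between a user
# profile vector and item profile vectors over the same feature space.

def cosine(x, i):
    dot = sum(a * b for a, b in zip(x, i))
    nx = math.sqrt(sum(a * a for a in x))
    ni = math.sqrt(sum(b * b for b in i))
    return dot / (nx * ni)

user = [1.0, 0.5, 0.0]    # e.g. weighted average of liked item profiles
item_a = [1.0, 1.0, 0.0]  # shares features with the user profile
item_b = [0.0, 0.0, 1.0]  # disjoint features
print(cosine(user, item_a) > cosine(user, item_b))  # True
```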

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations Can provide explanations of recommended items by listing content-

features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard (e.g., images, movies, music)

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering

Recommend items based on past transactions of users.

Analyze relations between users and/or items.

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects.

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x.

2. Find the set N of other users whose ratings are "similar" to x's ratings, e.g., using k-nearest neighbors (KNN).

3. Recommend items to x based on the weighted ratings of items by users in N.

RECSYS 28

Similar Users

Let rx be the vector of user x's ratings.

Jaccard similarity measure. Problem: ignores the value of the rating.

Cosine similarity measure:

sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)

Problem: treats missing ratings as "negative". Pearson correlation coefficient:

Sxy = items rated by both users x and y

1162019

rx, ry as sets: rx = {1, 4, 5}; ry = {1, 3, 4}

rx, ry as points: rx = (1, 0, 0, 1, 3); ry = (1, 0, 2, 2, 0)

r̄x, r̄y … average rating of x, y

RECSYS 29

Similarity Metric

Intuitively, we want sim(A, B) > sim(A, C).

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322. Considers missing ratings as "negative". Solution: subtract the (row) mean.

Centered cosine, sim(A,B) vs. sim(A,C): 0.092 > -0.559. Notice: cosine similarity is correlation when the data is centered at 0.
RECSYS 30

Rating Predictions

Let rx be the vector of user x's ratings.

Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:

r̂_xi = (1/k) Σ_{y∈N} r_yi

r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Other options? Many other tricks possible…

Shorthand: s_xy = sim(x, y)
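The similarity-weighted prediction can be sketched as a one-liner over (similarity, rating) pairs; the numbers below are my own toy example:

```python
# User-user prediction sketch: r_hat_xi = sum_y s_xy * r_yi / sum_y s_xy
# over the k most similar neighbors y of x that rated item i.

def predict_user_user(neighbors):
    """neighbors: list of (similarity, rating) pairs for item i."""
    num = sum(s * r for s, r in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

# A close neighbor rated the item 4, a weaker neighbor rated it 2:
print(round(predict_user_user([(0.9, 4), (0.3, 2)]), 2))  # 3.5
```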

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: user-user collaborative filtering.

Another view: item-item. For item i, find other similar items. Estimate the rating for item i based on the target user's ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

Utility matrix: movies 1-6 as rows, users 1-12 as columns ("-" = unknown rating; ratings between 1 and 5):

user:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1: 1  -  3  -  -  5  -  -  5  -  4  -
movie 2: -  -  5  4  -  -  4  -  -  2  1  3
movie 3: 2  4  -  1  2  -  3  -  4  3  5  -
movie 4: -  2  4  -  5  -  -  4  -  -  2  -
movie 5: -  -  4  3  4  2  -  -  -  -  2  5
movie 6: 1  -  3  -  3  -  -  2  -  -  4  -

Goal: estimate the rating of movie 1 by user 5.

Neighbor selection: identify movies similar to movie 1 that were rated by user 5. Here we use Pearson correlation as similarity:

1) Subtract the mean rating m_i from each movie i, e.g. m1 = (1+3+5+5+4)/5 = 3.6, so row 1 becomes [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows: sim(1, m) = [1.00, -0.18, 0.41, -0.10, -0.31, 0.59] for m = 1..6

Compute the similarity weights for the two nearest neighbors rated by user 5: s13 = 0.41, s16 = 0.59

Predict by taking the weighted average:

r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
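The whole worked example can be reproduced end to end; a sketch where the movie rows follow the utility matrix above (0 = unrated) and the helper names are my own:

```python
# Item-item CF: centered-cosine similarities between movie rows, then a
# weighted-average prediction of user 5's rating of movie 1.

def centered(row):
    rated = [v for v in row if v > 0]
    mean = sum(rated) / len(rated)
    return [v - mean if v > 0 else 0.0 for v in row]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(y * y for y in b) ** 0.5))

m1 = [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0]
m3 = [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0]
m6 = [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0]

s13 = cos(centered(m1), centered(m3))  # ~0.41
s16 = cos(centered(m1), centered(m6))  # ~0.59
r5_m3, r5_m6 = 2, 3                    # user 5's ratings of movies 3 and 6
r15 = (s13 * r5_m3 + s16 * r5_m6) / (s13 + s16)
print(round(s13, 2), round(s16, 2), round(r15, 1))  # 0.41 0.59 2.6
```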

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Utility matrix: Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates, now with the missing entries estimated]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods

Implement two or more different recommenders and combine their predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem.

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:

• Root-mean-square error (RMSE): sqrt( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i

• Precision at top 10: % of relevant items among the top 10

• Rank correlation: Spearman's correlation between the system's and the user's complete rankings. Another approach: 0/1 model.

• Coverage: number of items/users for which the system can make predictions

• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)

• Receiver operating characteristic (ROC) curve: y-axis: true positive rate (TPR); x-axis: false positive rate (FPR)
  • TPR (aka recall) = TP / P = TP / (TP + FN)
  • FPR = FP / N = FP / (FP + TN)
  • See https://en.wikipedia.org/wiki/Precision_and_recall
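RMSE over held-out pairs can be sketched directly from the definition; the three (predicted, actual) pairs below are my own toy data:

```python
import math

# RMSE sketch over held-out (predicted, actual) rating pairs.

def rmse(pairs):
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

preds = [(3.5, 4), (2.0, 1), (4.5, 5)]
print(round(rmse(preds), 3))  # 0.707
```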

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.

Too expensive to do at runtime. Could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N·|C|); |C| = # of clusters (= k in k-means), N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.

1162019

RECSYS 47

Tip: Add Data

Leverage all the data. Don't try to reduce the data size in an effort to make fancy algorithms work.

Simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

r̂36

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices, Q · PT.

R has missing entries, but let's ignore that for now.

• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[Figure: rating matrix R (users × items) approximated as Q · PT, where Q is items × f factors and PT is f factors × users; compare SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Figure: R ≈ Q · PT with the missing entry r_xi highlighted]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of PT


RECSYS 56

Ratings as Products of Factors

[Figure: the same factorization, with the missing entry estimated as r̂_xi = q_i · p_x^T = 2.4]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of PT

RECSYS 57

Latent Factor Models

[Figure: movies placed in a two-dimensional latent space; factor 1 runs from "geared towards females" to "geared towards males", factor 2 from "funny" to "serious"; example movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap: SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.

SVD gives the minimum reconstruction error (SSE):

min_{U,Σ,V} Σ_ij ( A_ij − (U Σ V^T)_ij )²

So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT.

But we are not done yet! R has missing entries. The SSE sum goes over all entries, but our R has missing entries.

r̂_xi = q_i · p_x^T

[Figure: A (m × n) ≈ U Σ VT]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[Figure: R ≈ Q · PT; r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: minimize SSE on the training data. We want large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

"error" term + λ · "length" term; λ … regularization parameter
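Evaluating this objective over known entries only can be sketched directly; the tiny P, Q, and ratings below are my own illustrative values:

```python
# Regularized objective over known ratings only:
#   sum over (x, i) in R of (r_xi - q_i . p_x)^2
#   + lambda * (sum ||p_x||^2 + sum ||q_i||^2)

def objective(ratings, P, Q, lam):
    err = sum((r - sum(qf * pf for qf, pf in zip(Q[i], P[x]))) ** 2
              for x, i, r in ratings)
    length = lam * (sum(v * v for row in P for v in row) +
                    sum(v * v for row in Q for v in row))
    return err + length

P = [[1.0, 0.0], [0.0, 1.0]]          # user factors
Q = [[2.0, 0.0], [0.0, 3.0]]          # item factors
ratings = [(0, 0, 2.0), (1, 1, 2.0)]  # (user, item, rating)
print(round(objective(ratings, P, Q, lam=0.0), 1))  # 1.0 (pure error)
print(round(objective(ratings, P, Q, lam=0.1), 1))  # 2.5 (error + 0.1*15)
```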

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: movies and users (e.g., Gus, Dave) embedded in the same latent space; the axes run from "geared towards females" to "geared towards males" and from "serious" to "escapist"]


RECSYS 64

The Effect of Regularization

[Figure (animation over increasing λ): movies' positions in the latent space ("geared towards females/males" × "serious/funny") shrink toward the origin where data are scarce]

min_factors "error" + λ · "length":

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q minimizing:

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

Gradient descent:
• Initialize P and Q (using SVD, pretending missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2( r_xi − q_i·p_x^T )·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q.
• Observation: computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Same setup as above: find P and Q by batch gradient descent, initializing from SVD (pretending missing ratings are 0) and repeatedly updating P ← P − η·∇P and Q ← Q − η·∇Q with the full gradients over all ratings. Observation: computing these full gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2( r_xi − q_i·p_x^T )·p_xf + 2λ·q_if = Σ_{x,i} ∇Q( r_xi )

Here q_if is entry f of row q_i of matrix Q.

GD: Q ← Q − η Σ_{r_xi} ∇Q( r_xi )

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − μ·∇Q( r_xi )

Faster convergence! Needs more steps, but each step is computed much faster.

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (e.g. using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i·p_x^T   (derivative of the "error")
• q_i ← q_i + η(ε_xi·p_x − λ·q_i)   (update equation)
• p_x ← p_x + η(ε_xi·q_i − λ·p_x)   (update equation)

2 for loops:
• For until convergence:
  • For each r_xi: compute gradient, do a "step"

η … learning rate
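The two-loop SGD procedure above can be sketched as follows; a minimal toy (made-up ratings, assumed η and λ), not the Netflix-scale code:

```python
import numpy as np

def sgd_epoch(ratings, Q, P, eta=0.02, lam=0.1):
    # One pass over the ratings: update the factors rating-by-rating
    for i, x, r in ratings:           # item i, user x, rating r_xi
        err = r - Q[i] @ P[x]         # eps_xi = r_xi - q_i . p_x
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])
    return Q, P

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
rng = np.random.default_rng(1)
Q = rng.normal(scale=0.1, size=(2, 2))   # 2 items x 2 factors
P = rng.normal(scale=0.1, size=(3, 2))   # 3 users x 2 factors

def sse(Q, P):
    return sum((r - Q[i] @ P[x]) ** 2 for i, x, r in ratings)

before = sse(Q, P)
for _ in range(300):
    Q, P = sgd_epoch(ratings, Q, P)
after = sse(Q, P)   # training SSE should drop after the passes
```

Each inner update costs O(f) instead of a full pass over all ratings, which is the point of SGD.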

RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is a case where we apply optimization methods.

1162019

[Example utility matrix with known ratings 1–5 and many missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: (user, movie, score) triples for the training data and test data]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

ratings per movie

• Avg ratings/movie: 5,627

ratings per user

• Avg ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count     Avg rating  Title
137,812   4.593       The Shawshank Redemption
133,597   4.545       Lord of the Rings: The Return of the King
180,883   4.306       The Green Mile
150,676   4.460       Lord of the Rings: The Two Towers
139,050   4.415       Finding Nemo
117,456   4.504       Raiders of the Lost Ark
180,736   4.299       Forrest Gump
147,932   4.433       Lord of the Rings: The Fellowship of the Ring
149,199   4.325       The Sixth Sense
144,027   4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

r_xi = μ + b_x + b_i + q_i·p_x^T   (user bias + movie bias + user-movie interaction)

User-movie interaction: Characterizes the matching between users and movies. Attracts most research in the field. Benefits from algorithmic and mathematical innovations.

Baseline predictor: Separates users and movies. Benefits from insights into users' behavior. Among the main practical contributions of the competition.

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2
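The slide's arithmetic in two lines of Python, just to make the baseline formula concrete:

```python
def baseline(mu, b_x, b_i):
    # b_xi = mu + b_x + b_i: the baseline predictor from the slide
    return mu + b_x + b_i

# The slide's numbers: mean 3.7, critical reviewer (-1 star), Star Wars (+0.5)
pred = baseline(3.7, -1.0, 0.5)   # 3.2
```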

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))² + λ(Σ_i‖q_i‖² + Σ_x‖p_x‖² + Σ_x b_x² + Σ_i b_i²)
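The biased model can be fit with the same SGD loop as before; a minimal sketch with toy ratings and assumed η and λ (not the BellKor code):

```python
import numpy as np

def sgd_bias_epoch(ratings, mu, bx, bi, Q, P, eta=0.02, lam=0.1):
    # One SGD pass for r_xi ~ mu + b_x + b_i + q_i . p_x
    for i, x, r in ratings:
        err = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
        bx[x] += eta * (err - lam * bx[x])    # user bias update
        bi[i] += eta * (err - lam * bi[i])    # item bias update
        qi = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * qi - lam * P[x])

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean rating
bx, bi = np.zeros(3), np.zeros(2)                   # user and item biases
rng = np.random.default_rng(2)
Q, P = rng.normal(scale=0.1, size=(2, 2)), rng.normal(scale=0.1, size=(3, 2))

def sse():
    return sum((r - (mu + bx[x] + bi[i] + Q[i] @ P[x])) ** 2
               for i, x, r in ratings)

before = sse()
for _ in range(300):
    sgd_bias_epoch(ratings, mu, bx, bi, Q, P)
after = sse()
```

In practice λ would be chosen by grid search on a validation set, as the slide notes.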

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: Combine top-level "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), basic latent factors, and latent factors with biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed

Movie age:
• Do users prefer new movies without any reason?
• Or are older movies just inherently better than newer ones?

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to the biases:
r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10000) for CF (no time bias), basic latent factors, CF (time bias), latent factors with biases, + linear time factors, + per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble (1–50); RMSE falls from about 0.889 with 1 predictor to about 0.871 with 50]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.

24 hours before the deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time spent on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Chart data for "Effect of ensemble size": RMSE by number of predictors, falling monotonically from 0.889067 (1 predictor) through 0.874209 (10) and 0.872220 (25) to 0.871197 (50)]

RECSYS 12

Utility Matrix

[Example utility matrix: ratings of Alice, Bob, Carol, and David for Avatar, LOTR, Matrix, and Pirates, with many missing entries]

RECSYS 13

Key Problems

(1) Gathering "known" ratings for the matrix: How to collect the data in the utility matrix?

(2) Extrapolating unknown ratings from the known ones: Mainly interested in high unknown ratings
• We are not interested in knowing what you don't like, but what you like

(3) Evaluating extrapolation methods: How to measure the success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings

Explicit:
• Ask people to rate items
• Doesn't work well in practice – people can't be bothered

Implicit:
• Learn ratings from user actions, e.g. purchase implies a high rating
• What about low ratings?

RECSYS 15

(2) Extrapolating Utilities

Key problem: the matrix U is sparse
• Most people have not rated most items
• Cold start:
  • New items have no ratings
  • New users have no history

Three approaches to recommender systems:
1) Content-based
2) Collaborative filtering
  • Memory-based
    • User-based collaborative filtering
    • Item-based collaborative filtering
  • Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations Main idea Recommend items to customer x similar to

previous items rated highly by x

Example

Movie recommendations Recommend movies with same actor(s)

director genre hellip

Websites blogs news Recommend other sites with ldquosimilarrdquo content

RECSYS 18

Plan of Action

[Diagram: from items the user likes, build item profiles; build a user profile from them; match the user profile against item profiles; recommend]

1162019

RECSYS 19

Item Profiles

For each item, create an item profile.

A profile is a set (vector) of features:
• Movies: author, title, actor, director, …
• Text: set of "important" words in the document

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency):
• Term … feature
• Document … item

1162019

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj   (note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores
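As a concrete illustration, a small sketch of this TF-IDF scheme in plain Python (the toy documents are made up; IDF uses log base 2, an assumption):

```python
import math

def tf_idf(docs):
    """w_ij = TF_ij * IDF_i with TF normalized by the doc's max term count
    and IDF_i = log2(N / n_i), following the slide's definitions."""
    N = len(docs)
    n = {}                                   # number of docs mentioning each term
    for d in docs:
        for t in set(d):
            n[t] = n.get(t, 0) + 1
    profiles = []
    for d in docs:
        counts = {}
        for t in d:
            counts[t] = counts.get(t, 0) + 1
        m = max(counts.values())             # max_k f_kj for this doc
        profiles.append({t: (c / m) * math.log2(N / n[t])
                         for t, c in counts.items()})
    return profiles

docs = [["apple", "apple", "pie"], ["apple", "cider"], ["pie", "crust"]]
w = tf_idf(docs)   # w[0]["apple"] = 1.0 * log2(3/2)
```

A term that appears in every document gets IDF 0, so it never enters a doc profile.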

RECSYS 21

User Profiles and Prediction

User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by the difference from the average rating for the item, …

Prediction heuristic: Given user profile x and item profile i, estimate
u(x, i) = cos(x, i) = (x · i) / (‖x‖·‖i‖)
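The prediction heuristic above is just a cosine between the two profile vectors; a minimal sketch (the 3-feature profiles below are hypothetical):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-feature profiles (e.g. action, romance, sci-fi weights)
user_profile = [0.8, 0.1, 0.6]
item_profile = [1.0, 0.0, 1.0]
score = cosine(user_profile, item_profile)   # close to 1 = good match
```

Items are then ranked by this score and the top ones recommended.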

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations: Can explain recommended items by listing the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard
  • E.g., images, movies, music
– Overspecialization
  • Never recommends items outside the user's content profile
  • People might have multiple interests
  • Unable to exploit quality judgments of other users
– Recommendations for new users
  • How to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1 Consider user x

2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)

3 Recommend items to xbased on the weighted ratings of items by users in N

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings.

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|, treating r_x, r_y as sets of rated items
• Problem: ignores the value of the rating

Cosine similarity measure:
sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (‖r_x‖·‖r_y‖)
• Problem: treats missing ratings as "negative"

Pearson correlation coefficient:
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ √(Σ_{s∈S_xy}(r_xs − r̄_x)²) · √(Σ_{s∈S_xy}(r_ys − r̄_y)²) ]
• S_xy = items rated by both users x and y; r̄_x, r̄_y … avg rating of x, y

Example: r_x = [1, _, _, 1, 3], r_y = [1, _, 2, 2, _]
• as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
• as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322
• Considers missing ratings as "negative"
• Solution: subtract the (row) mean before computing cosine similarity

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > −0.559
Notice: cosine similarity is correlation when the data is centered at 0
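The centered-cosine fix can be sketched directly; the rating vectors below reproduce the slide's A, B, C example (None = missing, which becomes 0 after centering):

```python
import math

def centered_cosine(rx, ry):
    """Pearson correlation via mean-centering: subtract each user's mean
    from their known ratings, treat missing entries as 0, then take cosine."""
    def center(r):
        vals = [v for v in r if v is not None]
        mu = sum(vals) / len(vals)
        return [(v - mu) if v is not None else 0.0 for v in r]
    cx, cy = center(rx), center(ry)
    dot = sum(a * b for a, b in zip(cx, cy))
    na = math.sqrt(sum(a * a for a in cx))
    nb = math.sqrt(sum(b * b for b in cy))
    return dot / (na * nb) if na and nb else 0.0

# The slide's example users over 7 items
A = [4, None, None, 5, 1, None, None]
B = [5, 5, 4, None, None, None, None]
C = [None, None, None, 2, 4, 5, None]
sim_ab = centered_cosine(A, B)   # ~ 0.092
sim_ac = centered_cosine(A, C)   # ~ -0.559
```

After centering, A and B (who agree on their shared item) come out similar, while A and C (who disagree) come out negatively correlated, matching the slide's numbers.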

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.

Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:
• Simple average: r_xi = (1/k) Σ_{y∈N} r_yi
• Weighted by similarity: r_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy
• Other options: many other tricks possible…

Shorthand: s_xy = sim(x, y)
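The similarity-weighted average can be sketched in a couple of lines (the neighbour similarities and ratings below are illustrative):

```python
def weighted_prediction(neighbours):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy over the chosen neighbours,
    given as (similarity, rating) pairs."""
    num = sum(s * r for s, r in neighbours)
    den = sum(s for s, _ in neighbours)
    return num / den

# Two neighbours with similarities 0.41 and 0.59 who rated the item 2 and 3
pred = weighted_prediction([(0.41, 2.0), (0.59, 3.0)])   # ~ 2.59
```

The same helper works for item-item CF, where the pairs are (item similarity, the user's rating of the similar item).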

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: User-user collaborative filtering.

Another view: Item-item
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model

1162019

r_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

("." – unknown rating; numbers – ratings between 1 and 5)

RECSYS 33

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

– estimate the rating of movie 1 by user 5 (marked "?")

RECSYS 34

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  . 2.6 5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

Predict by taking a weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select the k nearest neighbors N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
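The common practice above (KNN selection plus weighted average) can be sketched as follows; the similarities and ratings reproduce the slides' movie-1/user-5 walk-through:

```python
def item_item_predict(target, user_ratings, sim, k=2):
    """Estimate r_x,target from the k items most similar to `target`
    among those the user has rated (the weighted-average formula)."""
    rated = [(j, r) for j, r in user_ratings.items() if j != target]
    neighbours = sorted(rated, key=lambda jr: sim[target][jr[0]],
                        reverse=True)[:k]
    num = sum(sim[target][j] * r for j, r in neighbours)
    den = sum(sim[target][j] for j, _ in neighbours)
    return num / den

# Similarities of movie 1 to the others (from the Pearson computation)
sim = {1: {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}}
# User 5's known ratings (movie: rating)
user5 = {3: 2.0, 5: 4.0, 6: 3.0}
r15 = item_item_predict(1, user5, sim, k=2)   # ~ 2.6
```

With k = 2 the neighbours chosen are movies 6 and 3, giving (0.59·3 + 0.41·2) / (0.59 + 0.41) ≈ 2.6, as on the slide.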

RECSYS 38

Item-Item vs User-User

[Example utility matrix: ratings of Alice, Bob, Carol, and David for Avatar, LOTR, Matrix, and Pirates]

1162019

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item
  • No feature selection needed

- Cold start
  • Need enough users in the system to find a match
- Sparsity
  • The user/ratings matrix is sparse
  • Hard to find users that have rated the same items
- First rater
  • Cannot recommend an item that has not been previously rated
  • New items, esoteric items
- Popularity bias
  • Cannot recommend items to someone with unique taste
  • Tends to recommend popular items

RECSYS 40

Hybrid Methods
• Implement two or more different recommenders and combine their predictions, perhaps using a linear model
• Add content-based methods to collaborative filtering:
  • Item profiles for the new-item problem
  • Demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Training data: movies × users utility matrix with known ratings 1–5]

1162019

RECSYS 43

Evaluation

[The same movies × users matrix with some known ratings withheld as the test data set]

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:
• Root-mean-square error (RMSE): √( (1/N)·Σ_xi (r̂_xi − r_xi)² ), where r̂_xi is predicted and r_xi is the true rating of x on i
• Precision at top 10: % of relevant items among those in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve:
    • Y-axis: true positive rate (TPR, aka recall) = TP / P = TP / (TP + FN)
    • X-axis: false positive rate (FPR) = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
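A minimal sketch of two of these metrics on made-up predictions (not a full evaluation harness):

```python
import math

def rmse(pred, truth):
    # Root-mean-square error over paired predictions and true ratings
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

def precision_recall(recommended, relevant):
    # Precision = TP / (TP + FP), Recall (TPR) = TP / (TP + FN)
    rec, rel = set(recommended), set(relevant)
    tp = len(rec & rel)
    return tp / len(rec), tp / len(rel)

preds, truth = [3.5, 4.0, 2.0], [4.0, 4.0, 1.0]
err = rmse(preds, truth)                      # ~ 0.645
p, r = precision_recall([1, 2, 3], [2, 3, 4, 5])   # p = 2/3, r = 1/2
```

Sweeping a recommendation threshold and plotting TPR against FPR would trace the ROC curve described above.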

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|)
• Recall that X = set of customers in the system
• Too expensive to do at runtime
• Could pre-compute, using clustering as an approximation

Naïve pre-computation takes time O(N·|C|)
• |C| = # of clusters (= k in k-means); N = # of data points

We already know how to do this:
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction

RECSYS 47

Tip: Add Data

Leverage all the data:
• Don't try to reduce data size in an effort to make fancy algorithms work
• Simple methods on large data do best
• Add more data, e.g. add IMDB data on genres

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: 480,000 users × 17,700 movies, sparse ratings 1–5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R split into a training data set and a held-out test data set; r̂_36 marks a held-out rating to predict]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
(r̂_xi = predicted rating; r_xi = true rating of user x on item i)

480,000 users × 17,700 movies

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T. R has missing entries, but let's ignore that for now:
• Basically, we want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[Diagram: R (items × users) ≈ Q (items × f) · P^T (f × users), f = number of factors; cf. SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Diagram: R ≈ Q · P^T with the missing entry highlighted]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of P^T

RECSYS 55

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Diagram: R ≈ Q · P^T; the relevant row of Q and column of P^T are highlighted]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of P^T

RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Diagram: R ≈ Q · P^T; the missing entry is estimated as 2.4]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of P^T

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 59

Recap SVD

Remember SVD: A = U Σ V^T
• A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values

SVD gives the minimum reconstruction error (SSE):
min_{U,V,Σ} Σ_ij (A_ij − (U Σ V^T)_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T

But we are not done yet! R has missing entries.
• The sum above goes over all entries, but our R has missing entries.

r̂_xi = q_i · p_x^T

[Diagram: A (m × n) ≈ U Σ V^T]
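The minimum-SSE property can be checked numerically with NumPy's SVD; a toy sketch (made-up matrix) in which the rank-2 residual equals the dropped singular value squared:

```python
import numpy as np

A = np.array([[5., 3., 1.],
              [4., 2., 1.],
              [1., 1., 5.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]            # best rank-2 approximation
err = float(((A - A2) ** 2).sum())                 # SSE of the truncation
```

By the Eckart-Young theorem, no other rank-2 matrix achieves a smaller SSE; the error is exactly the squared third singular value.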

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_x^T)²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[Diagram: R (items × users) ≈ Q · P^T, f factors; r̂_xi = q_i·p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow a rich model where there is sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{training} (r_xi − q_i·p_x^T)² + λ(Σ_x‖p_x‖² + Σ_i‖q_i‖²)

λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed:
• Allow a rich model where there is sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{training} (r_xi − q_i·p_x^T)² + λ(Σ_x‖p_x‖² + Σ_i‖q_i‖²)

λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

min_{factors} "error" + λ·"length":
min_{P,Q} Σ_{training} (r_xi − q_i·p_x^T)² + λ(Σ_x‖p_x‖² + Σ_i‖q_i‖²)


RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:
  min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Do gradient descent:
   • P ← P − η · ∇P
   • Q ← Q − η · ∇Q
   • where ∇Q is the gradient/derivative of matrix Q:
     ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
   • Here q_if is entry f of row q_i of matrix Q

 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:
  min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Do gradient descent:
   • P ← P − η · ∇P
   • Q ← Q − η · ∇Q
   • where ∇Q is the gradient/derivative of matrix Q:
     ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
   • Here q_if is entry f of row q_i of matrix Q

 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
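The element-wise gradient just described can be sketched directly. The toy ratings and factor values below are made up for illustration; a real implementation would vectorize this:

```python
# One batch gradient step for Q, element by element, following
#   grad q_if = sum_{(x,i) in training} -2 (r_xi - q_i . p_x) p_xf + 2 lam q_if
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0}   # (user x, item i) -> r_xi
P = {0: [0.9, 0.2], 1: [0.4, 0.7]}                   # user factors p_x
Q = {0: [1.1, 0.3], 1: [0.5, 0.8]}                   # item factors q_i
lam, eta = 0.1, 0.01                                 # regularization, step size

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def grad_Q():
    g = {i: [0.0] * len(q) for i, q in Q.items()}
    for (x, i), r in ratings.items():
        err = r - dot(Q[i], P[x])
        for f in range(len(Q[i])):
            g[i][f] += -2.0 * err * P[x][f]          # data term
    for i, q in Q.items():
        for f in range(len(q)):
            g[i][f] += 2.0 * lam * q[f]              # regularization term
    return g

g = grad_Q()
for i in Q:                                          # Q <- Q - eta * grad Q
    Q[i] = [qf - eta * gf for qf, gf in zip(Q[i], g[i])]
```

Note that every gradient entry touches all known ratings of that item, which is exactly why the batch computation is slow on 100 million ratings.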

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
  • Here q_if is entry f of row q_i of matrix Q

 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 GD:  Q ← Q − η Σ_{x,i} ∇Q(r_xi)
 SGD: Q ← Q − η ∇Q(r_xi)
  Faster convergence!
  • Need more steps, but each step is computed much faster

RECSYS 73

SGD vs GD
 Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step — GD decreases smoothly; SGD decreases in a noisy way]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
  Initialize P and Q (using SVD, pretend missing ratings are 0)
  Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
   • ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
   • q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
   • p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

 2 for loops:
  For until convergence:
   • For each r_xi:
    • Compute gradient, do a "step"

η … learning rate
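The two-loop SGD procedure above can be sketched as follows (toy data and hyper-parameters are made up; random initialization is used here instead of the SVD-based one, for brevity):

```python
# Compact SGD loop for matrix factorization with regularization.
import random

random.seed(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (1, 2, 1.0)]  # (x, i, r_xi)
n_users, n_items, f = 2, 3, 2
eta, lam = 0.05, 0.1                      # learning rate, regularization

P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

for _ in range(200):                      # "for until convergence"
    for x, i, r in ratings:               # "for each r_xi"
        err = r - dot(Q[i], P[x])         # eps_xi, the per-rating error
        for k in range(f):
            qik, pxk = Q[i][k], P[x][k]
            Q[i][k] += eta * (err * pxk - lam * qik)   # q_i update
            P[x][k] += eta * (err * qik - lam * pxk)   # p_x update

sse = sum((r - dot(Q[i], P[x])) ** 2 for x, i, r in ratings)
print(round(sse, 3))   # small training SSE after convergence
```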

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
  Quantify goodness using SSE: lower SSE means better recommendations
  We want to make good recommendations on items that some user has not yet seen

 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict the unknown ratings well

 This is a case where we apply optimization methods

[Example: the small movies × users utility matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: (user, movie, score) triples for the training data; (user, movie, ?) pairs for the test data]

Training data / Test data

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T
       (user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

 Baseline predictor
  Separates users and movies
  Benefits from insights into users' behavior
  Among the main practical contributions of the competition

 User-Movie interaction
  Characterizes the matching between users and movies
  Attracts most research in the field
  Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

 Example:
  Mean rating: μ = 3.7
  You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
  Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
  Predicted rating for you on Star Wars:
   = 3.7 − 1 + 0.5 = 3.2
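The baseline arithmetic above is easy to reproduce. A small sketch that estimates μ, b_x, and b_i from a made-up list of (user, movie, rating) triples, using simple deviations from the mean (not the regularized fitting discussed next):

```python
# Baseline predictor b_xi = mu + b_x + b_i from toy (user, movie, rating) data.
ratings = [("alice", "starwars", 5), ("alice", "titanic", 3),
           ("bob", "starwars", 4), ("bob", "titanic", 2),
           ("carol", "titanic", 4)]

mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean = 3.6

def mean(xs):
    return sum(xs) / len(xs)

# User bias: how far the user's average rating deviates from the global mean.
b_user = {u: mean([r for uu, _, r in ratings if uu == u]) - mu
          for u in {u for u, _, _ in ratings}}
# Movie bias: how far the movie's average rating deviates from the global mean.
b_movie = {m: mean([r for _, mm, r in ratings if mm == m]) - mu
           for m in {m for _, m, _ in ratings}}

def baseline(u, m):
    return mu + b_user[u] + b_movie[m]

# Predict for a (user, movie) pair with no observed rating:
print(round(baseline("carol", "starwars"), 2))   # 4.9
```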


RECSYS 105

Fitting the New Model

 Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²        (goodness of fit)
          + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )         (regularization)

 λ is selected via grid-search on a validation set

 Stochastic gradient descent to find the parameters
  Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
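A sketch of the SGD updates for this extended model, with biases and factors updated together (toy data; all initial values are made up, and a single λ is shared across all parameter types for simplicity):

```python
# SGD for the biased model r_hat(x, i) = mu + b_x + b_i + q_i . p_x.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0)]   # (user x, item i, r_xi)
mu = sum(r for _, _, r in ratings) / len(ratings)    # overall mean rating
f, eta, lam = 2, 0.02, 0.1
b_user, b_item = [0.0, 0.0], [0.0, 0.0]              # learned biases
P = [[0.10, -0.10], [0.05, 0.10]]                    # p_x (made-up init)
Q = [[0.10, 0.10], [-0.10, 0.05]]                    # q_i (made-up init)

def predict(x, i):
    return mu + b_user[x] + b_item[i] + sum(q * p for q, p in zip(Q[i], P[x]))

for _ in range(100):
    for x, i, r in ratings:
        err = r - predict(x, i)
        b_user[x] += eta * (err - lam * b_user[x])   # bias updates mirror
        b_item[i] += eta * (err - lam * b_item[i])   # the factor updates
        for k in range(f):
            qik, pxk = Q[i][k], P[x][k]
            Q[i][k] += eta * (err * pxk - lam * qik)
            P[x][k] += eta * (err * qik - lam * pxk)

sse = sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
print(round(sse, 3))
```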

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
  Global
  • Overall deviations of users/movies
  Factorization
  • Addressing "regional" effects
  Collaborative filtering
  • Extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1,000) for three models: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average:                 1.1296
User average:                   1.0651
Movie average:                  1.0533
Netflix (Cinematch):            0.9514

Basic Collaborative filtering:  0.94
Collaborative filtering++:      0.91
Latent factors:                 0.90
Latent factors + Biases:        0.89

Grand Prize:                    0.8563

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004)
  Improvements in Netflix GUI
  Meaning of rating changed

 Movie age
  Users prefer new movies without any reasons
  Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

 Original model:
  r_xi = μ + b_x + b_i + q_i · p_x^T

 Add time dependence to biases:
  r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
  Make parameters b_x and b_i depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to factors:
  p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average:                   1.1296
User average:                     1.0651
Movie average:                    1.0533
Netflix (Cinematch):              0.9514

Basic Collaborative filtering:    0.94
Collaborative filtering++:        0.91
Latent factors:                   0.90
Latent factors + Biases:          0.89
Latent factors + Biases + Time:   0.876

Grand Prize:                      0.8563

Still no prize!  Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE, from ~0.889 down to ~0.871) vs. number of predictors (1–50) — RMSE falls steadily, with diminishing returns, as predictors are added to the ensemble]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed
  Group of other teams on the leaderboard forms a new team
  Relies on combining their models
  Quickly also get a qualifying score over 10%

 BellKor
  Continue to get small improvements in their scores
  Realize that they are in direct competition with Ensemble

 Strategy
  Both teams carefully monitoring the leaderboard
  Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day
  Only 1 final submission could be made in the last 24h

 24 hours before deadline…
  BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams
  Much computer time on final optimization
  Carefully calibrated to end about an hour before deadline

 Final submissions
  BellKor submits a little early (on purpose), 40 mins before deadline
  Ensemble submits their final entry 20 mins later
  …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

[Chart data: ensemble size vs. RMSE — 1 predictor: 0.8891; 5: 0.8767; 10: 0.8742; 20: 0.8727; 30: 0.8719; 40: 0.8715; 50: 0.8712]

RECSYS 13

Key Problems

 (1) Gathering "known" ratings for matrix
  How to collect the data in the utility matrix?

 (2) Extrapolate unknown ratings from the known ones
  Mainly interested in high unknown ratings
  • We are not interested in knowing what you don't like, but what you like

 (3) Evaluating extrapolation methods
  How to measure success/performance of recommendation methods?

RECSYS 14

(1) Gathering Ratings
 Explicit
  Ask people to rate items
  Doesn't work well in practice – people can't be bothered

 Implicit
  Learn ratings from user actions
  • E.g., purchase implies high rating
  What about low ratings?

RECSYS 15

(2) Extrapolating Utilities
 Key problem: matrix U is sparse
  Most people have not rated most items
  Cold start:
  • New items have no ratings
  • New users have no history

 Three approaches to recommender systems:
  1) Content-based
  2) Collaborative Filtering
   • Memory-based
    • User-based Collaborative Filtering
    • Item-based Collaborative Filtering
   • Latent factor based

RECSYS 16

Content-based Recommender Systems


RECSYS 17

Content-based Recommendations
 Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:

 Movie recommendations
  Recommend movies with same actor(s), director, genre, …

 Websites, blogs, news
  Recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Diagram — Plan of Action: items the user likes → build item profiles (e.g., red, circles, triangles) → infer user profile → match → recommend]

RECSYS 19

Item Profiles
 For each item, create an item profile

 Profile is a set (vector) of features
  Movies: author, title, actor, director, …
  Text: set of "important" words in document

 How to pick important features?
  Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency)
  • Term … Feature
  • Document … Item

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / (max_k f_kj)

n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

 Doc profile = set of words with highest TF-IDF scores, together with their scores

Note: we normalize TF to discount for "longer" documents
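A minimal sketch of these TF-IDF definitions, assuming the normalization TF_ij = f_ij / max_k f_kj and IDF_i = log2(N / n_i); the three documents are made up:

```python
# TF-IDF scoring of the terms of one document against a toy corpus.
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "cats and dogs".split(),
}

N = len(docs)
n = Counter()                     # n_i: number of docs mentioning term i
for words in docs.values():
    n.update(set(words))

def tf_idf(doc_id):
    counts = Counter(docs[doc_id])
    max_f = max(counts.values())  # frequency of the most frequent term
    return {t: (f / max_f) * math.log2(N / n[t]) for t, f in counts.items()}

scores = tf_idf("d1")
print(round(scores["mat"], 3))    # 0.792 — rare term, high score
```

Common words like "the" score low (high document frequency), while rarer words like "mat" dominate the document's profile.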

RECSYS 21

User Profiles and Prediction

 User profile possibilities:
  Weighted average of rated item profiles
  Variation: weight by difference from average rating for item, …

 Prediction heuristic:
  Given user profile x and item profile i, estimate
  u(x, i) = cos(x, i) = (x · i) / (‖x‖ · ‖i‖)

RECSYS 22

Pros: Content-based Approach

+ No need for data on other users
 No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items
 No first-rater problem
+ Able to provide explanations
 Can provide explanations of recommended items by listing the content-features that caused an item to be recommended

RECSYS 23

Cons: Content-based Approach
 – Finding the appropriate features is hard
  E.g., images, movies, music

 – Overspecialization
  Never recommends items outside user's content profile
  People might have multiple interests
  Unable to exploit quality judgments of other users

 – Recommendations for new users
  How to build a user profile?

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering
 Recommend items based on past transactions of users

 Analyze relations between users and/or items

 Specific data characteristics are irrelevant
  Domain-free: user/item attributes are not necessary
  Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

 1. Consider user x

 2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

 3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users
 Let r_x be the vector of user x's ratings

 Jaccard similarity measure
  sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|, treating r_x, r_y as sets of rated items
  Problem: Ignores the value of the rating

 Cosine similarity measure
  sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (‖r_x‖ · ‖r_y‖)
  Problem: Treats missing ratings as "negative"

 Pearson correlation coefficient
  S_xy = items rated by both users x and y
  sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √Σ_{s∈S_xy}(r_xs − r̄_x)² · √Σ_{s∈S_xy}(r_ys − r̄_y)² )

r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
r_x, r_y as points: r_x = (1, 0, 0, 1, 3), r_y = (1, 0, 2, 2, 0)
r̄_x, r̄_y … avg. rating of x, y

RECSYS 29

Similarity Metric

[Table: example utility matrix for users A, B, C over a handful of movies]

 Intuitively we want: sim(A, B) > sim(A, C)

 Jaccard similarity: 1/5 < 2/4 — fails
 Cosine similarity: 0.386 > 0.322 — works, but considers missing ratings as "negative"
  Solution: subtract the (row) mean
  sim(A, B) vs. sim(A, C): 0.092 > −0.559
  Notice: cosine similarity is correlation when data is centered at 0
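The three similarity measures can be sketched in a few lines; the vectors below are the slides' example points, with 0 standing for a missing rating:

```python
# Jaccard, cosine, and centered cosine (Pearson-style) similarity.
import math

def jaccard(rx, ry):
    sx = {i for i, r in enumerate(rx) if r}   # indices of rated items
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    dot = sum(a * b for a, b in zip(rx, ry))
    nx = math.sqrt(sum(a * a for a in rx))
    ny = math.sqrt(sum(b * b for b in ry))
    return dot / (nx * ny)

def centered_cosine(rx, ry):
    # Subtract each user's mean over *rated* items; missing entries stay 0.
    def center(r):
        rated = [v for v in r if v]
        mean = sum(rated) / len(rated)
        return [v - mean if v else 0.0 for v in r]
    return cosine(center(rx), center(ry))

rx = [1, 0, 0, 1, 3]   # the slides' example points
ry = [1, 0, 2, 2, 0]
print(round(jaccard(rx, ry), 2), round(cosine(rx, ry), 3))   # 0.5 0.302
```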

RECSYS 30

Rating Predictions

 Let r_x be the vector of user x's ratings

 Let N be the set of k users most similar to x who have rated item i

 Prediction for item i of user x:
  r̂_xi = (1/k) Σ_{y∈N} r_yi
  r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

 Other options?
 Many other tricks possible…

Shorthand: s_xy = sim(x, y)

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

 So far: User-user collaborative filtering

 Another view: Item-item
  For item i, find other similar items
  Estimate rating for item i based on the target user's ratings on items similar to item i
  Can use the same similarity metrics and prediction functions as in the user-user model

  r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

        users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:        1  .  3  .  .  5  .  .  5  .  4  .
movie 2:        .  .  5  4  .  .  4  .  .  2  1  3
movie 3:        2  4  .  1  2  .  3  .  4  3  5  .
movie 4:        .  2  4  .  5  .  .  4  .  .  2  .
movie 5:        .  .  4  3  4  2  .  .  .  .  2  5
movie 6:        1  .  3  .  3  .  .  2  .  .  4  .

. — unknown rating; numbers — ratings between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

[The same movies × users utility matrix, with entry (movie 1, user 5) highlighted]

 Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[The same utility matrix, annotated with sim(1, m) for each movie m]

 Neighbor selection: Identify movies similar to movie 1, rated by user 5

sim(1, m) for movies 1–6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[The same utility matrix; among the movies rated by user 5, the two nearest neighbors of movie 1 are movies 3 and 6]

 Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[The same utility matrix with the predicted value 2.6 filled in at (movie 1, user 5)]

 Predict by taking the weighted average:

 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

 Define similarity s_ij of items i and j

 Select K nearest neighbors (KNN) N(i; x)
  Set of items most similar to item i that were rated by x

 Estimate rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
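The whole worked example can be reproduced in a short script. The matrix below is reconstructed from the slides' example (treat the exact entries as illustrative); the prediction matches the slides' r_15 = 2.6:

```python
# Item-item CF: mean-centered cosine (Pearson-style) item similarity,
# then a similarity-weighted average over the |N| = k nearest items.
import math

# rows = movies 1..6, columns = users 1..12, None = unknown rating
R = [
    [1,    None, 3,    None, None, 5,    None, None, 5,    None, 4,    None],
    [None, None, 5,    4,    None, None, 4,    None, None, 2,    1,    3   ],
    [2,    4,    None, 1,    2,    None, 3,    None, 4,    3,    5,    None],
    [None, 2,    4,    None, 5,    None, None, 4,    None, None, 2,    None],
    [None, None, 4,    3,    4,    2,    None, None, None, None, 2,    5   ],
    [1,    None, 3,    None, 3,    None, None, 2,    None, None, 4,    None],
]

def centered(row):
    rated = [v for v in row if v is not None]
    m = sum(rated) / len(rated)
    return [(v - m) if v is not None else 0.0 for v in row]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def predict(R, i, x, k=2):
    ci = centered(R[i])
    # similarity of item i to every other item j that user x has rated
    sims = [(cos(ci, centered(R[j])), j) for j in range(len(R))
            if j != i and R[j][x] is not None]
    top = sorted(sims, reverse=True)[:k]     # the k nearest neighbors
    return sum(s * R[j][x] for s, j in top) / sum(s for s, _ in top)

print(round(predict(R, 0, 4), 1))   # movie 1, user 5 -> 2.6
```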

RECSYS 38

Item-Item vs. User-User

[Example matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates, with a few known ratings]

 In practice, it has been observed that item-item often works better than user-user

 Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item
 No feature selection needed

− Cold Start
 Need enough users in the system to find a match
− Sparsity
 The user/ratings matrix is sparse
 Hard to find users that have rated the same items
− First rater
 Cannot recommend an item that has not been previously rated
 New items, esoteric items
− Popularity bias
 Cannot recommend items to someone with unique taste
 Tends to recommend popular items

RECSYS 40

Hybrid Methods
 Implement two or more different recommenders and combine predictions
  Perhaps using a linear model

 Add content-based methods to collaborative filtering
  Item profiles for the new item problem
  Demographics to deal with the new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix: movies (rows) × users (columns), with known ratings 1–5 and many missing entries]

RECSYS 43

Evaluation

[The same utility matrix, with a subset of the known ratings withheld as the Test Data Set]

RECSYS 44

Evaluating Predictions
 Compare predictions with known ratings

 Root-mean-square error (RMSE)
  • RMSE = √( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
 Precision at top 10
  • % of those in the top 10
 Rank correlation
  • Spearman's correlation between system's and user's complete rankings
 Another approach: 0/1 model
  Coverage
  • Number of items/users for which the system can make predictions
  Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Receiver Operating Characteristic (ROC) curve
   Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
   • TPR (aka Recall) = TP / P = TP / (TP + FN)
   • FPR = FP / N = FP / (FP + TN)
   • See https://en.wikipedia.org/wiki/Precision_and_recall
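RMSE is the metric used throughout the Netflix discussion; a tiny sketch over hypothetical held-out ratings:

```python
# RMSE over a held-out test set of (user, item) -> rating pairs.
import math

def rmse(pred, truth):
    keys = truth.keys()                    # the held-out (user, item) pairs
    se = sum((pred[k] - truth[k]) ** 2 for k in keys)
    return math.sqrt(se / len(truth))

truth = {("x", "i1"): 5, ("x", "i2"): 3, ("y", "i1"): 4}
pred  = {("x", "i1"): 4, ("x", "i2"): 3, ("y", "i1"): 2}
print(round(rmse(pred, truth), 3))   # 1.291
```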

RECSYS 45

Problems with Error Measures
 Narrow focus on accuracy sometimes misses the point
  Prediction Diversity
  Prediction Context
  Order of predictions

 In practice, we care only to predict high ratings:
  RMSE might penalize a method that does well for high ratings and badly for others

RECSYS 46

Collaborative Filtering: Complexity

 Expensive step is finding the k most similar customers: O(|X|)
  Recall that X = set of customers in the system

 Too expensive to do at runtime
  Could pre-compute, using clustering as approximation

 Naïve pre-computation takes time O(N · |C|)
  |C| = # of clusters (= k in the k-means); N = # of data points

 We already know how to do this!
  Near-neighbor search in high dimensions (LSH)
  Clustering
  Dimensionality reduction

RECSYS 47

Tip: Add Data
 Leverage all the data
  Don't try to reduce data size in an effort to make fancy algorithms work
  Simple methods on large data do best

 Add more data
  e.g., add IMDB data on genres

 More data beats better algorithms:
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets — Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies embedded in a 2-D latent factor space — axis 1: geared towards females ↔ geared towards males; axis 2: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: 480,000 users (columns) × 17,700 movies (rows), sparsely filled with ratings 1–5]

RECSYS 52

Utility Matrix R: Evaluation

[Matrix R: 480,000 users × 17,700 movies, split into a Training Data Set and a withheld Test Data Set; e.g., the true rating r_36 is withheld and compared against the predicted rating]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²

 r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

 "SVD" on Netflix data: R ≈ Q · PT

 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
  R has missing entries, but let's ignore that for now!
  • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

R (items × users):
        users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:        1  .  3  .  .  5  .  .  5  .  4  .
movie 2:        .  .  5  4  .  .  4  .  .  2  1  3
movie 3:        2  4  .  1  2  .  3  .  4  3  5  .
movie 4:        .  2  4  .  5  .  .  4  .  .  2  .
movie 5:        .  .  4  3  4  2  .  .  .  .  2  5
movie 6:        1  .  3  .  3  .  .  2  .  .  4  .

≈ Q (items × f factors) · PT (f factors × users), e.g. with f = 3:

Q:
  .1  -.4   .2
 -.5   .6   .5
 -.2   .3   .5
 1.1  2.1   .3
 -.7  2.1  -2
 -1    .7   .3

PT:
 1.1  -.2   .3   .5  -2   -.5   .8  -.4   .3  1.4  2.4  -.9
 -.8   .7   .5  1.4   .3  -1   1.4  2.9  -.7  1.2  -.1  1.3
 2.1  -.4   .6  1.7  2.4   .9  -.3   .4   .8   .7  -.6   .1

SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?

[The same R ≈ Q · PT factorization: Q (items × f factors), PT (f factors × users)]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of PT


RECSYS 56

Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?

[The same R ≈ Q · PT factorization, now with the queried missing entry estimated as 2.4 = q_i · p_x^T]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of PT

RECSYS 57

Latent Factor Models

[Figure: movies placed in a 2D latent factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 from "serious" to "funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 58

Latent Factor Models

[Figure repeated from previous slide]

RECSYS 59

Recap SVD

Remember SVD: A = U Σ VT
• A: input data matrix, U: left singular vectors, V: right singular vectors, Σ: singular values

SVD gives the minimum reconstruction error (SSE):
min over U, V, Σ of Σij (Aij − [U Σ VT]ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT
A = R, Q = U, PT = Σ VT

But we are not done yet! R has missing entries.

The sum goes over all entries, but our R has missing entries.

r̂xi = qi · pxT

[Figure: A (m × n) ≈ U Σ VT]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min over P, Q of Σ(x,i)∈R (rxi − qi · pxT)²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users), with r̂xi = qi · pxT]
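As a quick illustration of the model above, here is a minimal sketch (with made-up toy factor matrices, not the Netflix data) of predicting r̂xi = qi · px and measuring SSE over the known ratings only:

```python
# Sketch: predict ratings from latent factors and score SSE on known entries only.
# Q, P and the ratings below are made-up toy values, not the Netflix data.

Q = {1: [1.2, 0.8], 2: [-0.5, 1.5]}          # item factors q_i (f = 2)
P = {"x": [1.0, 1.0], "y": [2.0, -0.2]}      # user factors p_x

def predict(Q, P, i, x):
    """r̂_xi = q_i · p_x^T = sum over f of q_if * p_xf"""
    return sum(qf * pf for qf, pf in zip(Q[i], P[x]))

def sse(Q, P, known):
    """Sum of squared errors over the known (user, item, rating) triples;
    missing entries of R are simply never visited."""
    return sum((r - predict(Q, P, i, x)) ** 2 for x, i, r in known)

known = [("x", 1, 2.0), ("x", 2, 1.0), ("y", 1, 2.2)]
err = sse(Q, P, known)
```

The key point mirrored from the slide: the error is summed only over ratings that exist, which is what distinguishes this from plain SVD.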

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But, SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow rich model where there are sufficient data
• Shrink aggressively where data are scarce

min over P, Q of Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)

λ … regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: users (e.g., Gus, Dave) and movies placed together in the 2D latent space spanned by "geared towards females/males" and "serious/escapist"; movies shown include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 63

Dealing with Missing Entries (repeated): minimize SSE on training data; regularization is needed:

min over P, Q of Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)

λ … regularization parameter

RECSYS 64

The Effect of Regularization

min over factors of "error" + λ · "length":
min over P, Q of Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)

[Figure: the 2D latent space (serious/funny vs. geared towards females/males) with the movies from before; the animation shows regularization shrinking the factor estimates of sparsely-rated movies toward the origin.]

RECSYS 65

The Effect of Regularization

[Figure repeated: second animation frame of the same latent-space plot]

RECSYS 66

The Effect of Regularization

[Figure repeated: third animation frame]

RECSYS 67

The Effect of Regularization

[Figure repeated: final animation frame]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min over P, Q of Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif], and ∇qif = Σx,i −2 (rxi − qi pxT) pxf + 2λ qif
  • Here qif is entry f of row qi of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min over P, Q of Σ(x,i)∈training (rxi − qi pxT)² + λ (Σx ||px||² + Σi ||qi||²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif], and ∇qif = Σx,i −2 (rxi − qi pxT) pxf + 2λ qif
  • Here qif is entry f of row qi of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently.
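The batch update above can be sketched as follows. This is my own small illustration on a made-up 2×2 rating set (not the Netflix data); the gradient of each element combines the error term and the 2λ regularization term exactly as in the formula above:

```python
import random

# Sketch of batch gradient descent for the regularized objective
# min_{P,Q} sum_(x,i) (r_xi - q_i . p_x)^2 + lam * (sum ||p_x||^2 + sum ||q_i||^2).
# The ratings here are a made-up toy set, not the Netflix data.

def batch_gd(ratings, n_users, n_items, f=2, eta=0.005, lam=0.05, steps=3000, seed=1):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(steps):
        gP = [[2 * lam * v for v in row] for row in P]   # gradient of the "length" term
        gQ = [[2 * lam * v for v in row] for row in Q]
        for (x, i), r in ratings.items():                # gradient of the "error" term
            err = r - sum(Q[i][k] * P[x][k] for k in range(f))
            for k in range(f):
                gQ[i][k] += -2 * err * P[x][k]
                gP[x][k] += -2 * err * Q[i][k]
        for row, grow in zip(P, gP):                     # P <- P - eta * grad_P
            for k in range(f):
                row[k] -= eta * grow[k]
        for row, grow in zip(Q, gQ):                     # Q <- Q - eta * grad_Q
            for k in range(f):
                row[k] -= eta * grow[k]
    return P, Q
```

Note the cost the slide complains about: every one of the `steps` iterations touches every known rating before a single parameter update is made.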

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇qif], where

∇qif = Σx,i −2 (rxi − qi pxT) pxf + 2λ qif = Σx,i ∇Q(rxi)

• Here qif is entry f of row qi of matrix Q

GD: Q ← Q − η Σx,i ∇Q(rxi)
SGD: Q ← Q − η ∇Q(rxi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors:

For each rxi:
• εxi = rxi − qi · pxT (the prediction "error")
• qi ← qi + η (εxi px − λ qi) (update equation)
• px ← px + η (εxi qi − λ px) (update equation)

2 for loops:
• For until convergence:
  • For each rxi:
    • Compute gradient, do a "step"

η … learning rate
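The two-loop SGD recipe above can be sketched end to end. The ratings are a made-up block-structured toy matrix (not the Netflix data), and the hyperparameters are illustrative choices, not tuned values:

```python
import random
import math

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.03, lam=0.05, epochs=600, seed=0):
    """Factor a sparse rating dict {(x, i): r} into user factors P and item
    factors Q via stochastic gradient descent with L2 regularization."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    entries = list(ratings.items())
    for _ in range(epochs):                 # outer loop: until "convergence"
        rng.shuffle(entries)
        for (x, i), r in entries:           # inner loop: one step per rating
            pred = sum(Q[i][k] * P[x][k] for k in range(f))
            err = r - pred                  # eps_xi = r_xi - q_i . p_x
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (err * pxk - lam * qik)  # q_i update equation
                P[x][k] += eta * (err * qik - lam * pxk)  # p_x update equation
    return P, Q

def rmse(ratings, P, Q):
    se = [(r - sum(qk * pk for qk, pk in zip(Q[i], P[x]))) ** 2
          for (x, i), r in ratings.items()]
    return math.sqrt(sum(se) / len(se))
```

Each inner-loop step is cheap (one rating), which is exactly the trade-off against batch GD described above.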

RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen

Let's set values for P and Q such that they work well on known (user, item) ratings
And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods

[Utility matrix with known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data (user, movie, score) and test data (user, movie, score withheld)]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂xi = μ + bx + bi + qi · pxT
(user bias + movie bias + user-movie interaction)

μ = overall mean rating, bx = bias of user x, bi = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into user's behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
• Star Wars gets a mean rating of 0.5 higher than the average movie: bi = +0.5

Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
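The baseline arithmetic above can be written out directly (the bias values are the slide's example numbers):

```python
# Baseline predictor: b_xi = mu + b_x + b_i (no interaction term yet).
mu = 3.7       # overall mean rating
b_x = -1.0     # user bias: a critical reviewer, 1 star below the mean
b_i = 0.5      # movie bias: Star Wars rated 0.5 above the average movie

def baseline(mu, b_x, b_i):
    return mu + b_x + b_i

pred = baseline(mu, b_x, b_i)   # 3.7 - 1 + 0.5 = 3.2
```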

1162019

RECSYS 105

Fitting the New Model

Solve:

min over Q, P, b of Σ(x,i)∈R (rxi − (μ + bx + bi + qi · pxT))²   ["goodness of fit"]
    + λ (Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi²)              ["regularization"]

λ is selected via grid-search on a validation set

Stochastic gradient descent to find the parameters
• Note: Both biases bx, bi as well as interactions qi, px are treated as parameters (we estimate them)
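A minimal sketch of the SGD updates for this extended model. The bias updates mirror the factor updates from the earlier SGD slide (same error term, same λ shrinkage); the data is a made-up toy set, and this is my own small illustration, not the BellKor code:

```python
import random

def sgd_biased(ratings, n_users, n_items, f=2, eta=0.02, lam=0.05, epochs=400, seed=0):
    """SGD for r̂_xi = mu + b_x + b_i + q_i . p_x with L2 regularization."""
    rng = random.Random(seed)
    mu = sum(ratings.values()) / len(ratings)           # overall mean rating
    bu = [0.0] * n_users                                # user biases b_x
    bi = [0.0] * n_items                                # item biases b_i
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    entries = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (x, i), r in entries:
            pred = mu + bu[x] + bi[i] + sum(Q[i][k] * P[x][k] for k in range(f))
            e = r - pred
            bu[x] += eta * (e - lam * bu[x])            # bias updates
            bi[i] += eta * (e - lam * bi[i])
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (e * pxk - lam * qik)  # factor updates
                P[x][k] += eta * (e * qik - lam * pxk)
    return mu, bu, bi, P, Q
```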

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: Combine top-level "regional" modeling of the data with a refined local view:
• Global
  – Overall deviations of users/movies
• Factorization
  – Addressing "regional" effects
• Collaborative filtering
  – Extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: rxi = μ + bx + bi + qi · pxT

Add time dependence to biases: rxi = μ + bx(t) + bi(t) + qi · pxT

Make parameters bx and bi depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
• px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
• Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors (1-50); RMSE falls from about 0.889 with 1 predictor to about 0.871 with 50]


RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

1162019

[Chart data for "Effect of ensemble size": RMSE decreases monotonically from 0.8891 with 1 predictor to 0.8712 with 50 predictors]

RECSYS 14

(1) Gathering Ratings

Explicit:
• Ask people to rate items
• Doesn't work well in practice – people can't be bothered

Implicit:
• Learn ratings from user actions
  • E.g., purchase implies high rating
• What about low ratings?

RECSYS 15

(2) Extrapolating Utilities

Key problem: matrix U is sparse
• Most people have not rated most items
• Cold start:
  • New items have no ratings
  • New users have no history

Three approaches to recommender systems:
1) Content-based
2) Collaborative Filtering
  • Memory-based
    • User-based Collaborative Filtering
    • Item-based Collaborative Filtering
  • Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations

Main idea: Recommend items to customer x similar to previous items rated highly by x

Examples:
• Movie recommendations: Recommend movies with same actor(s), director, genre, …
• Websites, blogs, news: Recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Diagram: items the user likes → build item profiles → infer user profile → match against candidate items → recommend]
1162019

RECSYS 19

Item Profiles

For each item, create an item profile

Profile is a set (vector) of features:
• Movies: author, title, actor, director, …
• Text: set of "important" words in document

How to pick important features?
• Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  • Term … Feature
  • Document … Item

RECSYS 20

Sidenote: TF-IDF

fij = frequency of term (feature) i in doc (item) j
TFij = fij / maxk fkj
(Note: we normalize TF to discount for "longer" documents)

ni = number of docs that mention term i
N = total number of docs
IDFi = log(N / ni)

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
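The scoring above can be sketched on a tiny made-up corpus (three toy "documents"; base-2 log is an arbitrary choice, any base only rescales the scores):

```python
import math

# Sketch of TF-IDF: w_ij = (f_ij / max_k f_kj) * log(N / n_i)
docs = [
    ["the", "matrix", "movie", "matrix"],
    ["the", "lion", "king"],
    ["the", "matrix", "sequel"],
]

def tf_idf(docs):
    N = len(docs)
    n = {}                                   # n_i: number of docs mentioning term i
    for d in docs:
        for term in set(d):
            n[term] = n.get(term, 0) + 1
    scores = []
    for d in docs:
        f = {}
        for term in d:                       # f_ij: frequency of term i in doc j
            f[term] = f.get(term, 0) + 1
        max_f = max(f.values())              # normalize by the most frequent term
        scores.append({t: (c / max_f) * math.log(N / n[t], 2) for t, c in f.items()})
    return scores

w = tf_idf(docs)
```

A term in every document ("the") scores 0, while rarer terms ("movie") score highest, which is exactly the behavior the heuristic is after.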

RECSYS 21

User Profiles and Prediction

User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from average rating for item
• …

Prediction heuristic:
• Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items
• No first-rater problem

+ Able to provide explanations Can provide explanations of recommended items by listing content-

features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard
• E.g., images, movies, music

– Overspecialization
• Never recommends items outside user's content profile
• People might have multiple interests
• Unable to exploit quality judgments of other users

– Recommendations for new users
• How to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1 Consider user x

2 Find set N of other userswhose ratings are ldquosimilarrdquo to xrsquos ratings eg using K-nearest neighbors (KNN)

3 Recommend items to xbased on the weighted ratings of items by users in N

RECSYS 28

Similar Users

Let rx be the vector of user x's ratings

Jaccard similarity measure
• sim(x, y) = |rx ∩ ry| / |rx ∪ ry| (treating rx, ry as sets of rated items)
• Problem: Ignores the value of the rating

Cosine similarity measure
• sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)
• Problem: Treats missing ratings as "negative"

Pearson correlation coefficient
• Sxy = items rated by both users x and y
• sim(x, y) = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / ( sqrt(Σs∈Sxy (rxs − r̄x)²) · sqrt(Σs∈Sxy (rys − r̄y)²) )

rx, ry as sets: rx = {1, 4, 5}, ry = {1, 3, 4}
rx, ry as points: rx = (1, 0, 0, 1, 3), ry = (1, 0, 2, 2, 0)

r̄x, r̄y … avg rating of x, y
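The three measures can be sketched as follows; the vectors A, B, C are the worked example used on the next slide, and centering each user's ratings before taking the cosine gives the Pearson-style correlation:

```python
import math

def jaccard(rx, ry):
    """Similarity of the *sets* of rated items (nonzero entries)."""
    sx = {i for i, r in enumerate(rx) if r}
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    dot = sum(a * b for a, b in zip(rx, ry))
    return dot / (math.sqrt(sum(a * a for a in rx)) * math.sqrt(sum(b * b for b in ry)))

def centered(rx):
    """Subtract the user's mean rating from rated entries (missing stay 0)."""
    rated = [r for r in rx if r]
    mean = sum(rated) / len(rated)
    return [r - mean if r else 0.0 for r in rx]

def pearson(rx, ry):
    return cosine(centered(rx), centered(ry))

# 0 marks a missing rating, as in the slide's example.
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]
```

With these vectors, plain cosine ranks B above C, but the centered version makes the disagreement with C explicit (a strongly negative correlation).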

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322
• Considers missing ratings as "negative"
• Solution: subtract the (row) mean

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > −0.559
Notice: cosine sim is correlation when data is centered at 0

RECSYS 30

Rating Predictions

Let rx be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
• r̂xi = (1/k) Σy∈N ryi
• r̂xi = Σy∈N sxy · ryi / Σy∈N sxy
• Other options?

Many other tricks possible…

Shorthand: sxy = sim(x, y)
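The similarity-weighted prediction can be sketched directly (the neighbor similarities and ratings below are hypothetical numbers for illustration, with k = 2):

```python
# Weighted-average prediction over the k most similar users who rated item i:
# r̂_xi = sum_y s_xy * r_yi / sum_y s_xy
def predict_user_user(neighbors):
    """neighbors: list of (similarity s_xy, rating r_yi) for y in N."""
    num = sum(s * r for s, r in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

# Hypothetical: two neighbors with similarities 0.9 and 0.3 rated the item 4 and 2.
pred = predict_user_user([(0.9, 4.0), (0.3, 2.0)])   # (3.6 + 0.6) / 1.2 = 3.5
```

The more similar neighbor pulls the prediction toward its rating, which is the whole point of weighting by sxy instead of taking a plain average.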

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

So far: User-user collaborative filtering

Another view: Item-item
• For item i, find other similar items
• Estimate rating for item i based on the target user's ratings on items similar to item i
• Can use same similarity metrics and prediction functions as in user-user model

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

sij … similarity of items i and j
rxj … rating of user x on item j
N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Figure: utility matrix, 6 movies × 12 users; '-' denotes an unknown rating, known ratings are between 1 and 5]

RECSYS 33

Item-Item CF (|N|=2)

[Figure: same utility matrix; goal – estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract mean rating mi from each movie i
   m1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

sim(1, m) = [1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

RECSYS 35

Item-Item CF (|N|=2)

Compute similarity weights: s13 = 0.41, s16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

Predict by taking the weighted average:

r15 = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i; x)
• Set of items most similar to item i, that were rated by x

Estimate rating rxi as the weighted average:

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

N(i; x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j
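The worked example from the preceding slides (s13 = 0.41, s16 = 0.59, user 5's ratings 2 and 3 on the two neighbor movies) can be sketched as:

```python
# Item-item weighted average: r̂_xi = sum_j s_ij * r_xj / sum_j s_ij
def predict_item_item(sims_and_ratings):
    """sims_and_ratings: list of (s_ij, r_xj) over the neighborhood N(i; x)."""
    num = sum(s * r for s, r in sims_and_ratings)
    den = sum(s for s, _ in sims_and_ratings)
    return num / den

# From the slides: movie 1, user 5; neighbors are movies 3 and 6
# with similarities 0.41 and 0.59 and user 5's ratings 2 and 3.
r15 = predict_item_item([(0.41, 2), (0.59, 3)])   # ≈ 2.6
```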

RECSYS 38

Item-Item vs. User-User

[Example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item
• No feature selection needed

– Cold Start:
• Need enough users in the system to find a match

– Sparsity:
• The user/ratings matrix is sparse
• Hard to find users that have rated the same items

– First rater:
• Cannot recommend an item that has not been previously rated
• New items, Esoteric items

– Popularity bias:
• Cannot recommend items to someone with unique taste
• Tends to recommend popular items

RECSYS 40

Hybrid Methods

Implement two or more different recommenders and combine predictions
• Perhaps using a linear model

Add content-based methods to collaborative filtering:
• Item profiles for new item problem
• Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Figure: users × movies utility matrix with known ratings]

1162019

RECSYS 43

Evaluation

[Figure: the same utility matrix with some known ratings withheld as a test data set]

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:
• Root-mean-square error (RMSE)
  • sqrt( Σxi (r̂xi − rxi)² / N ), where r̂xi is the predicted rating and rxi is the true rating of x on i
• Precision at top 10
  • % of those in top 10
• Rank Correlation
  • Spearman's correlation between system's and user's complete rankings
• Another approach: 0/1 model
  • Coverage
    • Number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) Curve
    • Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
    • TPR (aka Recall) = TP / P = TP / (TP + FN)
    • FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
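The metrics above can be sketched on made-up predictions and confusion-matrix counts:

```python
import math

def rmse(pairs):
    """pairs: list of (predicted r̂_xi, true r_xi)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def precision_recall(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # TPR
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, fpr, accuracy

# Hypothetical predictions vs. true ratings, and hypothetical counts:
err = rmse([(3.5, 4), (2.0, 1), (5.0, 5)])          # sqrt((0.25 + 1 + 0) / 3)
p, r, fpr, acc = precision_recall(tp=8, fp=2, fn=4, tn=6)
```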

RECSYS 45

Problems with Error Measures

Narrow focus on accuracy sometimes misses the point:
• Prediction Diversity
• Prediction Context
• Order of predictions

In practice, we care only to predict high ratings:
• RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system

Too expensive to do at runtime; could pre-compute, using clustering as an approximation

Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction

1162019

RECSYS 47

Tip: Add Data. Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data, e.g., add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Factor plot: movies placed on two latent dimensions, "geared towards females ↔ geared towards males" and "serious ↔ funny": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Utility matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings 1–5]

1/16/2019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

[Utility matrix R: 480,000 users × 17,700 movies; some known entries (e.g. r̂3,6) are withheld as a Test Data Set, the rest form the Training Data Set]

SSE = Σ(x,i)∈R ( r̂xi − rxi )²

r̂xi = predicted rating; rxi = true rating of user x on item i

1/16/2019

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT

R has missing entries, but let's ignore that for now: basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: the items × users matrix R, with its known entries, approximated as Q · PT, where Q is items × f and PT is f × users; f = # factors. Compare SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

1/16/2019

[Figure: R ≈ Q · PT, with the missing entry of R estimated from row i of Q and column x of PT]

r̂xi = qi · px^T = Σf qif · pxf

qi = row i of Q; px = column x of PT


RECSYS 57

Latent Factor Models

1/16/2019

[Factor plot: movies on Factor 1 ("geared towards females ↔ males") and Factor 2 ("serious ↔ funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap SVD

Remember SVD: A = input data matrix, U = left singular vecs, V = right singular vecs, Σ = singular values

SVD gives minimum reconstruction error (SSE): min over U, Σ, V of Σij ( Aij − [U Σ VT]ij )²

So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · px^T

But we are not done yet! R has missing entries. The sum goes over all entries, but our R has missing entries.

1/16/2019

[Figure: A (m × n) ≈ U · Σ · VT]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: minP,Q Σ(x,i)∈R ( rxi − qi · px^T )²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

1/16/2019

[Figure: R ≈ Q · PT, with Q items × f and PT f × users; r̂xi = qi · px^T]
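A small numpy sketch of this factorization view; the factor matrices below are made-up values with f = 2:

```python
import numpy as np

# Toy factors: 3 items x 2 factors, 4 users x 2 factors (assumed values)
Q = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.8, 0.8]])   # rows q_i
P = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # rows p_x

R_hat = Q @ P.T              # full reconstructed items x users matrix

# A single estimated rating: r_hat_xi = q_i . p_x
i, x = 0, 1
r_hat = Q[i] @ P[x]          # 1.0*0.5 + 0.5*1.0 = 1.0
print(R_hat.shape, r_hat)
```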

RECSYS 61

Dealing with Missing Entries: Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

1/16/2019

minP,Q Σ(x,i)∈training ( rxi − qi · px^T )² + λ ( Σx ||px||² + Σi ||qi||² )
("error" term + λ times "length" term)

λ … regularization parameter

[Figure: sparse utility matrix]
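The regularized objective above can be evaluated directly; a sketch on made-up ratings and 2-factor vectors (all values are assumptions for illustration):

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def objective(ratings, P, Q, lam):
    """Regularized SSE: squared error on known ratings plus
    lambda times the squared lengths of all factor vectors."""
    err = sum((r - dot(Q[i], P[x])) ** 2 for (x, i), r in ratings.items())
    length = sum(v * v for vec in list(P.values()) + list(Q.values()) for v in vec)
    return err + lam * length

ratings = {("u1", "i1"): 5.0, ("u2", "i1"): 3.0}   # known (user, item) ratings
P = {"u1": [1.0, 1.0], "u2": [1.0, 0.0]}
Q = {"i1": [2.0, 2.0]}
print(objective(ratings, P, Q, lam=0.1))   # 2.0 + 0.1 * 11.0 = 3.1
```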

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

1/16/2019

[Factor plot: movies on "geared towards females ↔ males" and "serious ↔ escapist" axes (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility), with users Gus and Dave embedded at their preference points]


RECSYS 64

The Effect of Regularization

1/16/2019

[Factor plot: movies on Factor 1 ("geared towards females ↔ males") and Factor 2 ("serious ↔ funny")]

min over factors of "error" + λ · "length":

minP,Q Σ(x,i)∈training ( rxi − qi · px^T )² + λ ( Σx ||px||² + Σi ||qi||² )


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q: minP,Q Σ(x,i)∈training ( rxi − qi · px^T )² + λ ( Σx ||px||² + Σi ||qi||² )

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
bull P ← P − η · ∇P
bull Q ← Q − η · ∇Q
bull where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σx,i −2 ( rxi − qi · px^T ) pxf + 2 λ qif
bull Here qif is entry f of row qi of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
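The gradient above, and one batch step on Q (P would be updated symmetrically), can be sketched as follows; the toy rating, factors, and step size η are assumptions:

```python
def grad_Q(ratings, P, Q, lam):
    """grad q_if = sum over known r_xi of -2 (r_xi - q_i . p_x) p_xf + 2 lam q_if"""
    g = {i: [2 * lam * v for v in q] for i, q in Q.items()}
    for (x, i), r in ratings.items():
        err = r - sum(qf * pf for qf, pf in zip(Q[i], P[x]))
        for f, pf in enumerate(P[x]):
            g[i][f] += -2 * err * pf
    return g

def gd_step_Q(ratings, P, Q, lam, eta):
    """One batch step: Q <- Q - eta * grad_Q."""
    g = grad_Q(ratings, P, Q, lam)
    return {i: [qf - eta * gf for qf, gf in zip(q, g[i])] for i, q in Q.items()}

ratings = {("u1", "i1"): 5.0}
P = {"u1": [1.0, 1.0]}
Q = {"i1": [1.0, 1.0]}
Q = gd_step_Q(ratings, P, Q, lam=0.0, eta=0.1)
print(Q["i1"])   # error = 5 - 2 = 3, grad = -6 per entry, so each q moves to 1.6
```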

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇qif], where ∇qif = Σ(x,i) −2 ( rxi − qi · px^T ) pxf + 2 λ qif = Σ(x,i) ∇Q( rxi )
• Here qif is entry f of row qi of matrix Q

GD: Q ← Q − η · Σ(x,i) ∇Q( rxi )

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

SGD: Q ← Q − μ · ∇Q( rxi )

Faster convergence!
• Need more steps, but each step is computed much faster

1/16/2019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors. For each rxi:

bull εxi = rxi − qi · px^T (derivative of the "error")
bull qi ← qi + η ( εxi px − λ qi ) (update equation)
bull px ← px + η ( εxi qi − λ px ) (update equation)

2 for loops: For until convergence:
bull For each rxi: compute gradient, do a "step"

1/16/2019

η … learning rate
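A compact SGD sketch of these update equations on a made-up 2 × 2 rating matrix; the initialization, η, λ, and epoch count are illustrative, not tuned:

```python
import random

def sgd(ratings, n_users, n_items, f=2, eta=0.05, lam=0.02, epochs=500, seed=0):
    """For each known r_xi, step q_i and p_x along the single-rating gradient."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    pairs = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(pairs)
        for (x, i), r in pairs:
            err = r - sum(q * p for q, p in zip(Q[i], P[x]))   # eps_xi
            for k in range(f):
                q, p = Q[i][k], P[x][k]
                Q[i][k] += eta * (err * p - lam * q)           # update equations
                P[x][k] += eta * (err * q - lam * p)
    return P, Q

ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 1): 1}   # (user, item) -> rating
P, Q = sgd(ratings, n_users=2, n_items=2)
pred = sum(q * p for q, p in zip(Q[0], P[0]))            # should approach 5
print(pred)
```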

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen.

Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods!

1/16/2019

[Figure: sparse utility matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data: 100 million ratings, 480,000 users, 17,770 movies, 6 years of data (2000–2005)
• Test data: last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix Cinematch RMSE: 0.9514
• Competition: 2700+ teams; $1 million grand prize for 10% improvement on Cinematch result; $50,000 2007 progress prize for 8.43% improvement

[Tables: training data as (user, movie, score) triples; test data as (user, movie, ?) pairs whose scores must be predicted]

Training data | Test data

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies

From TimelyDevelopment.com

Most loved movies (count / avg rating / title):
137812 / 4.593 / The Shawshank Redemption
133597 / 4.545 / Lord of the Rings: The Return of the King
180883 / 4.306 / The Green Mile
150676 / 4.460 / Lord of the Rings: The Two Towers
139050 / 4.415 / Finding Nemo
117456 / 4.504 / Raiders of the Lost Ark
180736 / 4.299 / Forrest Gump
147932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
149199 / 4.325 / The Sixth Sense
144027 / 4.333 / Indiana Jones and the Last Crusade

Challenges
• Size of data: scalability, keeping data in memory
• Missing data: 99 percent missing, very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂xi = μ + bx + bi + qi · px^T

μ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · px^T = user-movie interaction

User-Movie interaction: characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations

Baseline predictor: separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, bx = −1
• Star Wars gets a mean rating 0.5 higher than the average movie, bi = +0.5
• Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2

1162019
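The arithmetic above as a tiny helper (values taken from the example):

```python
def baseline(mu, b_user, b_item):
    """Baseline estimate: overall mean + user bias + item bias."""
    return mu + b_user + b_item

mu = 3.7     # mean rating
b_x = -1.0   # critical reviewer: 1 star below the mean
b_i = 0.5    # Star Wars: 0.5 above the average movie
print(round(baseline(mu, b_x, b_i), 2))   # 3.2
```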

RECSYS 105

Fitting the New Model

Solve:

minQ,P,b Σ(x,i)∈R ( rxi − (μ + bx + bi + qi · px^T) )²  [goodness of fit]
  + λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )  [regularization]

λ is selected via grid-search on a validation set

Stochastic gradient descent to find parameters. Note: both biases bx, bi as well as interactions qi, px are treated as parameters (we estimate them)

1/16/2019
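A sketch of fitting this model by SGD, with the biases bx, bi and the factors all treated as parameters; the toy ratings, μ, and hyperparameters here are assumptions:

```python
def sgd_biased(ratings, mu, f=2, eta=0.02, lam=0.1, epochs=300):
    """SGD for r_xi ~ mu + b_x + b_i + q_i . p_x."""
    users = {x for x, _ in ratings}
    items = {i for _, i in ratings}
    b_u = {x: 0.0 for x in users}
    b_i = {i: 0.0 for i in items}
    P = {x: [0.1] * f for x in users}
    Q = {i: [0.1] * f for i in items}
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            pred = mu + b_u[x] + b_i[i] + sum(q * p for q, p in zip(Q[i], P[x]))
            e = r - pred
            b_u[x] += eta * (e - lam * b_u[x])   # bias updates
            b_i[i] += eta * (e - lam * b_i[i])
            for k in range(f):                   # factor updates
                q, p = Q[i][k], P[x][k]
                Q[i][k] += eta * (e * p - lam * q)
                P[x][k] += eta * (e * q - lam * p)
    return b_u, b_i, P, Q

ratings = {("u1", "m1"): 5, ("u1", "m2"): 3, ("u2", "m1"): 4}
b_u, b_i, P, Q = sgd_biased(ratings, mu=4.0)
pred = 4.0 + b_u["u1"] + b_i["m1"] + sum(q * p for q, p in zip(Q["m1"], P["u1"]))
print(pred)   # approaches the observed rating 5 (shrunk a little by lambda)
```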

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (≈0.885–0.92) vs. millions of parameters (1 to 1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1/16/2019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in Netflix GUI; meaning of rating changed

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

1/16/2019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: rxi = μ + bx + bi + qi · px^T

Add time dependence to biases: rxi = μ + bx(t) + bi(t) + qi · px^T

Make parameters bx and bi depend on time: (1) parameterize time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
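One way to sketch the time-binned bias bi(t) in (2): a static bias plus a learned per-bin correction. The bin size (10 weeks ≈ 70 days) follows the slide; the correction values are assumed:

```python
def bias_at(base, bin_corrections, t, bin_size=70):
    """b_i(t) = static bias + correction for the time bin containing day t."""
    b = min(t // bin_size, len(bin_corrections) - 1)
    return base + bin_corrections[b]

b_i_bins = [0.0, 0.2, 0.5]             # per-bin corrections (assumed values)
print(bias_at(0.3, b_i_bins, t=100))   # day 100 -> bin 1 -> 0.3 + 0.2 = 0.5
```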

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (≈0.875–0.92) vs. millions of parameters (1 to 10000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1/16/2019

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: ensemble error (RMSE) vs. number of predictors; drops from ≈0.889 with 1 predictor to ≈0.871 with 50]

RECSYS 115

Standing on June 26th 2009

1/16/2019. June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: group of other teams on leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble

Strategy: both teams carefully monitoring the leaderboard. The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1/16/2019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h

24 hours before deadline… BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before deadline; Ensemble submits their final entry 20 mins later; …and everyone waits…

1/16/2019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019


RECSYS 15

(2) Extrapolating Utilities: Key problem: matrix U is sparse

Most people have not rated most items. Cold start:
• New items have no ratings
• New users have no history

Three approaches to recommender systems: 1) Content-based, 2) Collaborative Filtering
• Memory-based: user-based collaborative filtering, item-based collaborative filtering
• Latent factor based

1162019

RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations: Main idea: recommend items to customer x similar to previous items rated highly by x

Example:

Movie recommendations: recommend movies with same actor(s), director, genre, …

Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Diagram: from the items the user likes, build item profiles (red, circles, triangles); aggregate them into a user profile; match the profile against candidate items; recommend]

1/16/2019

RECSYS 19

Item Profiles: For each item, create an item profile

Profile is a set (vector) of features: Movies: author, title, actor, director, … Text: set of "important" words in document

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency)
• Term … feature
• Document … item

1162019

RECSYS 20

Sidenote: TF-IDF

fij = frequency of term (feature) i in doc (item) j

TFij = fij / maxk fkj (note: we normalize TF to discount for "longer" documents)

ni = number of docs that mention term i; N = total number of docs

IDFi = log( N / ni )

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores

1/16/2019
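A minimal sketch of these definitions on toy documents (TF normalized by the most frequent term in each doc, IDF as log(N/ni)):

```python
import math

def tf_idf(docs):
    """docs: {doc_id: list of terms} -> {doc_id: {term: w_ij}}."""
    N = len(docs)
    n = {}                                # number of docs mentioning each term
    for terms in docs.values():
        for t in set(terms):
            n[t] = n.get(t, 0) + 1
    scores = {}
    for d, terms in docs.items():
        f = {}                            # raw term frequencies f_ij
        for t in terms:
            f[t] = f.get(t, 0) + 1
        max_f = max(f.values())
        scores[d] = {t: (c / max_f) * math.log(N / n[t]) for t, c in f.items()}
    return scores

docs = {"d1": ["apple", "apple", "pie"], "d2": ["apple", "cider"]}
w = tf_idf(docs)
print(w["d1"])   # "apple" scores 0 because it appears in every doc
```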

RECSYS 21

User Profiles and Prediction

User profile possibilities: weighted average of rated item profiles; variation: weight by difference from average rating for item, …

Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / ( ||x|| · ||i|| )

1162019

RECSYS 22

Pros: Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can list the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard: e.g., images, movies, music

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: Recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let rx be the vector of user x's ratings

Jaccard similarity measure: sim(x, y) = |rx ∩ ry| / |rx ∪ ry| (treating rx, ry as sets of rated items). Problem: ignores the value of the rating

Cosine similarity measure:

sim(x, y) = cos(rx, ry) = ( rx · ry ) / ( ||rx|| · ||ry|| )

Problem: treats missing ratings as "negative". Pearson correlation coefficient:

sim(x, y) = Σs∈Sxy ( rxs − r̄x )( rys − r̄y ) / ( sqrt( Σs∈Sxy ( rxs − r̄x )² ) · sqrt( Σs∈Sxy ( rys − r̄y )² ) )

Sxy = items rated by both users x and y

1/16/2019

rx, ry as sets: rx = {1, 4, 5}, ry = {1, 3, 4}
rx, ry as points: rx = (1, 0, 0, 1, 3), ry = (1, 0, 2, 2, 0)

r̄x, r̄y … avg rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322. Considers missing ratings as "negative". Solution: subtract the (row) mean

1/16/2019

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > −0.559. Notice: cosine sim is correlation when data is centered at 0
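The three measures can be sketched on small made-up rating dictionaries (each user's ratings keyed by item):

```python
import math

def jaccard(rx, ry):
    """Similarity of the sets of rated items; rating values are ignored."""
    sx, sy = set(rx), set(ry)
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry, items):
    """Cosine over full vectors; a missing rating counts as 0 ("negative")."""
    vx = [rx.get(i, 0) for i in items]
    vy = [ry.get(i, 0) for i in items]
    num = sum(a * b for a, b in zip(vx, vy))
    return num / (math.sqrt(sum(a * a for a in vx)) *
                  math.sqrt(sum(b * b for b in vy)))

def pearson(rx, ry):
    """Centered cosine over the co-rated items S_xy, centering each user
    by their mean rating."""
    common = set(rx) & set(ry)
    mx = sum(rx.values()) / len(rx)
    my = sum(ry.values()) / len(ry)
    num = sum((rx[i] - mx) * (ry[i] - my) for i in common)
    dx = math.sqrt(sum((rx[i] - mx) ** 2 for i in common))
    dy = math.sqrt(sum((ry[i] - my) ** 2 for i in common))
    return num / (dx * dy) if dx and dy else 0.0

rx = {"HP1": 4, "TW": 5, "SW1": 1}
ry = {"HP1": 5, "SW1": 2}
items = ["HP1", "TW", "SW1"]
print(jaccard(rx, ry), cosine(rx, ry, items), pearson(rx, ry))
```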

RECSYS 30

Rating Predictions

Let rx be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:

• Simple average: r̂xi = (1/k) Σy∈N ryi

• Similarity-weighted: r̂xi = Σy∈N sxy · ryi / Σy∈N sxy

Other options? Many other tricks possible…

1/16/2019

Shorthand: sxy = sim(x, y)
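A sketch of the similarity-weighted prediction; the ratings and similarity values are made up:

```python
def predict(x, i, ratings, sim, k=2):
    """r_hat_xi = sum_{y in N} s_xy r_yi / sum_{y in N} s_xy, where N holds
    the k users most similar to x who have rated item i."""
    cands = [(sim[x][y], r[i]) for y, r in ratings.items()
             if y != x and i in r]
    cands.sort(reverse=True)          # keep the k most similar raters
    top = cands[:k]
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

ratings = {"a": {"m1": 5}, "b": {"m1": 4, "m2": 5}, "c": {"m2": 1}}
sim = {"a": {"b": 0.9, "c": 0.1}}     # similarities of other users to "a"
print(predict("a", "m2", ratings, sim, k=2))   # (0.9*5 + 0.1*1) / 1.0 = 4.6
```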

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering

Another view: item-item
• For item i, find other similar items
• Estimate rating for item i based on the target user's ratings on items similar to item i
• Can use same similarity metrics and prediction functions as in user-user model:

1/16/2019

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

sij … similarity of items i and j; rxj … rating of user x on item j; N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Utility matrix: 6 movies × 12 users, with ratings between 1 and 5; blank = unknown rating]

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix]

Goal: estimate the rating of movie 1 by user 5

1/16/2019

RECSYS 34

Item-Item CF (|N|=2)

            users
         1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Neighbor selection: identify movies similar to movie 1 that were rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
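The sim(1,m) column can be checked directly; a sketch using the slide's utility matrix, with 0 for unknown ratings:

```python
import numpy as np

R = np.array([[1,0,3,0,0,5,0,0,5,0,4,0],   # movie 1 (user 5's rating unknown)
              [0,0,5,4,0,0,4,0,0,2,1,3],   # movie 2
              [2,4,0,1,2,0,3,0,4,3,5,0],   # movie 3
              [0,2,4,0,5,0,0,4,0,0,2,0],   # movie 4
              [0,0,4,3,4,2,0,0,0,0,2,5],   # movie 5
              [1,0,3,0,3,0,0,2,0,0,4,0]],  # movie 6
             dtype=float)

def pearson_rows(R):
    """1) subtract each row's mean over its observed entries,
       2) cosine similarity between the centered rows."""
    C = np.where(R > 0, R - (R.sum(1) / (R > 0).sum(1))[:, None], 0.0)
    n = np.linalg.norm(C, axis=1)
    return C @ C.T / np.outer(n, n)

S = pearson_rows(R)
print(np.round(S[0], 2))  # sim(1,m) ≈ [1.0, -0.18, 0.41, -0.1, -0.31, 0.59]
```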

RECSYS 35

Item-Item CF (|N|=2)

            users
         1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Compute similarity weights: s_1,3 = 0.41, s_1,6 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

            users
         1  2  3  4  5   6  7  8  9 10 11 12
movie 1  1  .  3  . 2.6  5  .  .  5  .  4  .
movie 2  .  .  5  4  .   .  4  .  .  2  1  3
movie 3  2  4  .  1  2   .  3  .  4  3  5  .
movie 4  .  2  4  .  5   .  .  4  .  .  2  .
movie 5  .  .  4  3  4   2  .  .  .  .  2  5
movie 6  1  .  3  .  3   .  .  2  .  .  4  .

Predict by taking the weighted average:

r_1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j

Select K nearest neighbors (KNN) N(i; x): set of items most similar to item i that were rated by x

Estimate the rating r_xi as the weighted average:

 r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
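Putting the pieces together, a sketch of item-item prediction that reproduces the worked example (r_1,5 ≈ 2.6):

```python
import numpy as np

R = np.array([[1,0,3,0,0,5,0,0,5,0,4,0],
              [0,0,5,4,0,0,4,0,0,2,1,3],
              [2,4,0,1,2,0,3,0,4,3,5,0],
              [0,2,4,0,5,0,0,4,0,0,2,0],
              [0,0,4,3,4,2,0,0,0,0,2,5],
              [1,0,3,0,3,0,0,2,0,0,4,0]], dtype=float)

def pearson_rows(R):
    # mean-center each row over observed entries, then cosine between rows
    C = np.where(R > 0, R - (R.sum(1) / (R > 0).sum(1))[:, None], 0.0)
    n = np.linalg.norm(C, axis=1)
    return C @ C.T / np.outer(n, n)

def item_item_predict(R, S, x, i, k=2):
    """Weighted average over the k items most similar to i that x rated."""
    rated = [j for j in range(R.shape[0]) if j != i and R[j, x] > 0]
    top = sorted(rated, key=lambda j: S[i, j], reverse=True)[:k]
    return sum(S[i, j] * R[j, x] for j in top) / sum(S[i, j] for j in top)

S = pearson_rows(R)
print(round(item_item_predict(R, S, x=4, i=0), 1))  # 2.6 (user 5, movie 1)
```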

RECSYS 38

Item-Item vs. User-User

[Figure: a small utility matrix — users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold Start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine their predictions, perhaps using a linear model

 Add content-based methods to collaborative filtering:
   Item profiles for the new-item problem
   Demographics to deal with the new-user problem
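A hybrid can be as simple as a linear blend of the two recommenders' scores; the weight `w` below is an assumed, tunable parameter:

```python
def hybrid_score(content_pred, cf_pred, w=0.7):
    """Linear combination of a content-based and a CF prediction."""
    return w * content_pred + (1 - w) * cf_pred

print(round(hybrid_score(4.0, 3.0), 2))  # 3.7
```

In practice `w` would be fit on held-out ratings, like any other model parameter.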

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

 Root-mean-square error (RMSE):
   RMSE = sqrt( (1/|T|) · Σ_{(x,i)∈T} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i

 Precision at top 10:
   % of relevant items among those placed in the top 10

 Rank Correlation:
   Spearman's correlation between the system's and the user's complete rankings
   Another approach: 0/1 model

 Coverage:
   Number of items/users for which the system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)

 Receiver Operating Characteristic (ROC) Curve:
   Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
   TPR (aka Recall) = TP / P = TP / (TP + FN)
   FPR = FP / N = FP / (FP + TN)
   See https://en.wikipedia.org/wiki/Precision_and_recall
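The main error metric above, RMSE, takes only a few lines; the rating pairs here are illustrative:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

print(round(rmse([(3.5, 4), (2.0, 1), (5.0, 5)]), 3))  # 0.645
```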

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

 Naïve pre-computation takes time O(N·|C|); |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

[Plot: movies embedded in a 2-D latent space. Axes: geared towards females ↔ geared towards males, serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

r̂_3,6

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
 R has missing entries, but let's ignore that for now
  • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

R (items × users):            Q (items × f factors):
 1 . 3 . . 5 . . 5 . 4 .        .1  -.4   .2
 . . 5 4 . . 4 . . 2 1 3       -.5   .6   .5
 2 4 . 1 2 . 3 . 4 3 5 .       -.2   .3   .5
 . 2 4 . 5 . . 4 . . 2 .       1.1  2.1   .3
 . . 4 3 4 2 . . . . 2 5       -.7  2.1  -2
 1 . 3 . 3 . . 2 . . 4 .       -1    .7   .3

PT (f factors × users):
 1.1  -.2   .3   .5  -2   -.5   .8  -.4   .3  1.4  2.4  -.9
 -.8   .7   .5  1.4   .3  -1   1.4  2.9  -.7  1.2  -.1  1.3
 2.1  -.4   .6  1.7  2.4   .9  -.3   .4   .8   .7  -.6   .1

SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of PT

[R ≈ Q · PT with the matrices R, Q (items × f factors) and PT (f factors × users) from the previous slide; a missing entry of R is estimated from the corresponding row of Q and column of PT]


RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf — for the highlighted missing entry the slide's estimate comes out to 2.4

 q_i = row i of Q
 p_x = column x of PT

RECSYS 57

Latent Factor Models

[Plot: the same movies placed on Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap: SVD

 Remember SVD:
   A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values

 SVD gives the minimum reconstruction error (SSE):
   min_{U,V,Σ} Σ_{ij} (A_ij − [U Σ V^T]_ij)²

 So in our case, "SVD" on Netflix data: R ≈ Q · PT
   A = R, Q = U, PT = Σ VT

 But we are not done yet! R has missing entries
   The sum above goes over all entries, but our R has missing entries

 r̂_xi = q_i · p_x^T

[Diagram: A (m × n) ≈ U Σ VT]

RECSYS 60

Latent Factor Models

 SVD isn't defined when entries are missing!

 Use specialized methods to find P, Q:
   min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

 r̂_xi = q_i · p_x^T

 Note:
  • We don't require the cols of P, Q to be orthogonal/unit length
  • P, Q map users/movies to a latent space
  • The most popular model among Netflix contestants

[R ≈ Q · PT as before: R (items × users), Q (items × f factors), PT (f factors × users)]

RECSYS 61

Dealing with Missing Entries: Want to minimize SSE for unseen test data

 Idea: Minimize SSE on training data
   Want large f (# of factors) to capture all the signals
   But SSE on test data begins to rise for f > 2

 Regularization is needed to avoid overfitting:
   Allow a rich model where there are sufficient data
   Shrink aggressively where data are scarce

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
           "error"                                    "length"

 λ … regularization parameter

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Plot: users (Gus, Dave) and movies embedded in the same latent space. Axes: geared towards females ↔ males, serious ↔ escapist. Movies: The Princess Diaries, The Lion King, Braveheart, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Lethal Weapon, Dumb and Dumber]


RECSYS 64

The Effect of Regularization

 min_{factors} "error" + λ · "length"
 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

[Plot: movies on Factor 1 / Factor 2 (geared towards females ↔ males, serious ↔ funny); as λ grows, scarcely-rated movies such as The Princess Diaries are shrunk toward the origin]


RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:
   min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

 Gradient descent:
   Initialize P and Q (using SVD, pretend missing ratings are 0)
   Do gradient descent:
    • P ← P − η · ∇P
    • Q ← Q − η · ∇Q
    • where ∇Q is the gradient/derivative of matrix Q:
      ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
    • Here q_if is entry f of row q_i of matrix Q

 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:
   min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

 Gradient descent: initialize P and Q (using SVD, pretend missing ratings are 0), then repeat:
   P ← P − η · ∇P,  Q ← Q − η · ∇Q
   (∇Q = [∇q_if] with ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if, as before)

 Observation: Computing the full (batch) gradient is slow!

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
   ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
   Here q_if is entry f of row q_i of matrix Q

 GD:  Q ← Q − η · Σ_{x,i} ∇Q(r_xi)
 SGD: Q ← Q − μ · ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 Faster convergence!
  • Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
   Initialize P and Q (using SVD, pretend missing ratings are 0)
   Then iterate over the ratings (multiple times if necessary) and update the factors.
   For each r_xi:
     ε_xi = r_xi − q_i · p_x^T          (derivative of the "error")
     q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
     p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)

 2 for loops:
   For until convergence:
    • For each r_xi:
      • Compute the gradient, do a "step"

 η … learning rate
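The two update equations above can be run over a toy ratings list; the sizes, learning rate η, regularization λ, and epoch count below are illustrative choices, not tuned values:

```python
import numpy as np

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 4)]
n_users, n_items, f = 3, 3, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))    # user factors p_x
Q = rng.normal(scale=0.1, size=(n_items, f))    # item factors q_i
eta, lam = 0.05, 0.01                           # learning rate, regularization

for epoch in range(200):                        # "for until convergence"
    for x, i, r in ratings:                     # "for each r_xi"
        err = r - Q[i] @ P[x]                   # eps_xi
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])

train_sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
print(round(train_sse, 3))  # training SSE after fitting; should be small
```

Each pass touches one rating at a time, which is exactly why SGD steps are cheap compared to the batch gradient.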

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

 Goal: Make good recommendations
   Quantify goodness using SSE: lower SSE means better recommendations
   We want to make good recommendations on items that some user has not yet seen

 Let's set values for P and Q such that they work well on known (user, item) ratings

 And hope these values for P and Q will predict well the unknown ratings

 This is a case where we apply optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
 • Training data
   – 100 million ratings
   – 480,000 users
   – 17,770 movies
   – 6 years of data: 2000-2005
 • Test data
   – Last few ratings of each user (2.8 million)
   – Evaluation criterion: root mean squared error (RMSE)
   – Netflix Cinematch RMSE: 0.9514
 • Competition
   – 2700+ teams
   – $1 million grand prize for 10% improvement on the Cinematch result
   – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples for the training and test data]

Movie rating data

Overall rating distribution
 • A third of the ratings are 4s
 • Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
 • Avg. #ratings/movie: 5627

#ratings per user
 • Avg. #ratings/user: 208

Average movie rating by movie count
 • More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade

Challenges

 • Size of data
   – Scalability
   – Keeping data in memory
 • Missing data
   – 99 percent missing
   – Very imbalanced
 • Avoiding overfitting
 • Test and training data differ significantly (movie 16322)
 • Use an ensemble of complementing predictors
 • Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

 r̂_xi = μ + b_x + b_i + q_i · p_x^T
        (user bias + movie bias + user-movie interaction)

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

 User-Movie interaction:
   Characterizes the matching between users and movies
   Attracts most research in the field
   Benefits from algorithmic and mathematical innovations

 Baseline predictor:
   Separates users and movies
   Benefits from insights into users' behavior
   Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

 Example: Overall mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = -1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
   = 3.7 − 1 + 0.5 = 3.2
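The arithmetic of the example as a tiny baseline predictor:

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: global mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# the slide's numbers: mean 3.7, critical reviewer (-1), Star Wars (+0.5)
print(baseline(3.7, -1.0, 0.5))  # 3.2
```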

1162019

RECSYS 105

Fitting the New Model

 Solve:

   min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²          ← goodness of fit
           + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )        ← regularization

   λ is selected via grid-search on a validation set

 Stochastic gradient descent to find the parameters
   Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
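SGD for this extended model updates the biases with the same error signal as the factors. A sketch on toy data; the ratings list and hyperparameters are illustrative, not the competition setup:

```python
import numpy as np

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 4)]
mu = np.mean([r for _, _, r in ratings])        # overall mean rating
n_users = n_items = 3
f, eta, lam = 2, 0.05, 0.02
rng = np.random.default_rng(1)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))
bx = np.zeros(n_users)                          # user biases b_x
bi = np.zeros(n_items)                          # movie biases b_i

def sse():
    return sum((r - (mu + bx[x] + bi[i] + Q[i] @ P[x])) ** 2
               for x, i, r in ratings)

before = sse()
for epoch in range(200):
    for x, i, r in ratings:
        err = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
        bx[x] += eta * (err - lam * bx[x])      # biases share the error term
        bi[i] += eta * (err - lam * bi[i])
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])
print(sse() < before)  # True — training error shrinks
```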

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
   Global effects:
    • Overall deviations of users/movies
   Factorization:
    • Addressing "regional" effects
   Collaborative filtering:
    • Extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (≈0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; adding factors and biases lowers RMSE]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
   Improvements in Netflix GUI
   Meaning of rating changed

 Movie age:
   Users prefer new movies without any reasons
   Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

 Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

 Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

 Make the parameters b_x and b_i depend on time:
   (1) Parameterize the time-dependence by linear trends
   (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to the factors:
   p_x(t) … user preference vector on day t

(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (≈0.875-0.92) vs. millions of parameters (1-10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; each addition lowers RMSE]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors (1-50); RMSE drops from ≈0.889 with one predictor to ≈0.871 with 50]

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
   Group of other teams on the leaderboard forms a new team
   Relies on combining their models
   Quickly also get a qualifying score over 10%

 BellKor:
   Continue to get small improvements in their scores
   Realize that they are in direct competition with Ensemble

 Strategy:
   Both teams carefully monitor the leaderboard
   The only sure way to check for improvement is to submit a set of predictions
    • This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day:
   Only 1 final submission could be made in the last 24h

 24 hours before deadline…
   A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams:
   Much computer time on final optimization
   Carefully calibrated to end about an hour before the deadline

 Final submissions:
   BellKor submits a little early (on purpose), 40 mins before the deadline
   Ensemble submits their final entry 20 mins later
   … and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019


RECSYS 16

Content-based Recommender Systems

1162019

RECSYS 17

Content-based Recommendations: Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:

 Movie recommendations: recommend movies with the same actor(s), director, genre, …

 Websites, blogs, news: recommend other sites with "similar" content

RECSYS 18

Plan of Action

[Diagram: from the items the user likes (red circles, triangles), build item profiles; from those, build a user profile; match the user profile against the catalog; recommend]

1162019

RECSYS 19

Item Profiles: For each item, create an item profile

 Profile is a set (vector) of features
   Movies: author, title, actor, director, …
   Text: set of "important" words in the document

 How to pick important features?
   The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Doc Frequency)
    • Term … Feature
    • Document … Item

1162019

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
  TF_ij = f_ij / max_k f_kj
  (Note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i
N = total number of docs
  IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores
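The definitions above in code; the toy documents and the choice of natural log are illustrative (the log base is a convention):

```python
import math

def tfidf(docs):
    """w_ij = TF_ij * IDF_i with TF_ij = f_ij / max_k f_kj
    and IDF_i = log(N / n_i)."""
    N = len(docs)
    n = {}                                   # n_i: doc frequency of term i
    for d in docs:
        for t in set(d):
            n[t] = n.get(t, 0) + 1
    profiles = []
    for d in docs:
        f = {t: d.count(t) for t in set(d)}  # f_ij
        fmax = max(f.values())
        profiles.append({t: (c / fmax) * math.log(N / n[t])
                         for t, c in f.items()})
    return profiles

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "the", "mat"]]
w = tfidf(docs)
print(w[0]["the"])  # 0.0 — "the" is in every doc, so IDF = log(3/3) = 0
```

Terms appearing in every document get zero weight, which is exactly the filtering effect the IDF factor is there for.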

RECSYS 21

User Profiles and Prediction

 User profile possibilities:
   Weighted average of rated item profiles
   Variation: weight by difference from the average rating for the item, …

 Prediction heuristic:
   Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can provide explanations of recommended items by listing the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

 – Finding the appropriate features is hard
   E.g., images, movies, music

 – Overspecialization
   Never recommends items outside the user's content profile
   People might have multiple interests
   Unable to exploit quality judgments of other users

 – Recommendations for new users
   How to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1 Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using k-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let r_x be the vector of user x's ratings.

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y| (over the sets of rated items). Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||). Problem: treats missing ratings as "negative".

Pearson correlation coefficient: with S_xy = items rated by both users x and y,
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) ]
where r̄_x, r̄_y are the average ratings of x and y.

Example: r_x = [1, _, _, 1, 3], r_y = [1, _, 2, 2, _]
r_x, r_y as sets (indices of rated items): r_x = {1, 4, 5}, r_y = {1, 3, 4}
r_x, r_y as points: r_x = (1, 0, 0, 1, 3), r_y = (1, 0, 2, 2, 0)

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322. It considers missing ratings as "negative". Solution: subtract the (row) mean.

Centered cosine: sim(A,B) vs. sim(A,C): 0.092 > -0.559. Notice: cosine similarity is correlation when the data is centered at 0.
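The A/B/C comparison can be reproduced in code. The utility matrix below is an assumption on my part (reconstructed so that the Jaccard and centered-cosine numbers match the slide); the helper names are mine:

```python
import numpy as np

# Assumed utility matrix behind the A/B/C example (0 = missing)
R = np.array([
    [4, 0, 0, 5, 1, 0, 0],   # A
    [5, 5, 4, 0, 0, 0, 0],   # B
    [0, 0, 0, 2, 4, 5, 0],   # C
], float)

def jaccard(a, b):
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def centered(a):
    """Subtract the row mean over rated items only; missing entries stay 0."""
    out = a.copy()
    rated = a != 0
    out[rated] -= a[rated].mean()
    return out

A, B, C = R
print(jaccard(A, B), jaccard(A, C))      # 0.2 < 0.5: Jaccard prefers C
print(cosine(A, B), cosine(A, C))        # raw cosine already prefers B
print(cosine(centered(A), centered(B)),  # ~ 0.092
      cosine(centered(A), centered(C)))  # ~ -0.559
```

After centering, A and C come out strongly anti-correlated, matching the intuition that they disagree on the items both rated.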

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.
Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi
Similarity-weighted average: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Other options? Many other tricks possible...

Shorthand: s_xy = sim(x, y)
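The similarity-weighted average is one line of code; a minimal sketch (function name mine):

```python
import numpy as np

def predict_rating(sims, ratings):
    """Similarity-weighted average: r_xi = sum_y s_xy * r_yi / sum_y s_xy."""
    sims = np.asarray(sims, float)
    ratings = np.asarray(ratings, float)
    return float(sims @ ratings / sims.sum())
```

For example, with neighbor similarities 0.41 and 0.59 and neighbor ratings 2 and 3, the prediction is (0.41·2 + 0.59·3) / 1.0 = 2.59.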

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model

1162019

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij ... similarity of items i and j; r_xj ... rating of user x on item j; N(i; x) ... set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

movies × users utility matrix (rows = movies 1-6, columns = users 1-12):

m1:  1  -  3  -  -  5  -  -  5  -  4  -
m2:  -  -  5  4  -  -  4  -  -  2  1  3
m3:  2  4  -  1  2  -  3  -  4  3  5  -
m4:  -  2  4  -  5  -  -  4  -  -  2  -
m5:  -  -  4  3  4  2  -  -  -  -  2  5
m6:  1  -  3  -  3  -  -  2  -  -  4  -

"-" = unknown rating; known ratings are between 1 and 5.

RECSYS 33

Item-Item CF (|N|=2)

(same utility matrix) Task: estimate the rating of movie 1 by user 5, i.e., the unknown entry (m1, u5).

RECSYS 34

Item-Item CF (|N|=2)

(same utility matrix) Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1+3+5+5+4)/5 = 3.6, so centered row 1 = [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows: sim(1, m) for m = 1..6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

(same utility matrix) Compute similarity weights for the two neighbors of movie 1 rated by user 5: s13 = 0.41, s16 = 0.59.

RECSYS 36

Item-Item CF (|N|=2)

(same utility matrix) Predict by taking the weighted average of user 5's ratings on the two neighbor movies (r53 = 2, r56 = 3):

r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

This value fills the unknown entry (m1, u5).
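The whole item-item pipeline from these slides (center rows, compute similarities, take the top-2 weighted average) fits in a short script. This is a sketch: the matrix is the one from the slides, but the function names and the 0-for-missing convention are mine:

```python
import numpy as np

# Utility matrix from the slides (rows = movies 1..6, cols = users 1..12, 0 = missing)
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], float)

def pearson_sim(a, b):
    """Centered cosine: subtract each row's mean over its rated entries, then cosine."""
    def center(v):
        c = v.copy()
        rated = v != 0
        c[rated] -= v[rated].mean()
        return c
    ca, cb = center(a), center(b)
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

def predict(R, i, x, k=2):
    """Item-item CF: weighted average over the k most similar items rated by user x."""
    sims = [(pearson_sim(R[i], R[j]), j)
            for j in range(len(R)) if j != i and R[j, x] > 0]
    top = sorted(sims, reverse=True)[:k]
    return sum(s * R[j, x] for s, j in top) / sum(s for s, j in top)

print(round(predict(R, i=0, x=4), 1))  # movie 1, user 5 (0-indexed) -> 2.6
```

The computed similarities come out near 0.41 for movie 3 and 0.59 for movie 6, reproducing the slide's r15 ≈ 2.6.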

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

(figure: example utility matrix over movies Avatar, LOTR, Matrix, Pirates and users Alice, Bob, Carol, David)

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

(figure: a movies × users utility matrix with known ratings 1-5)

RECSYS 43

Evaluation

(figure: the same utility matrix with some entries withheld as the test data set)

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.

Root-mean-square error (RMSE) = sqrt( (1/|T|) Σ_{(x,i)∈T} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
Precision at top 10: % of relevant items among those in the top 10
Rank correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model:
  Coverage: number of items/users for which the system can make predictions
  Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR); TPR (a.k.a. Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN). See https://en.wikipedia.org/wiki/Precision_and_recall
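RMSE, the main metric of the Netflix competition, is simple to compute; a minimal sketch (function name mine):

```python
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error over a set of test ratings."""
    pred = np.asarray(pred, float)
    truth = np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))
```

Because the errors are squared before averaging, one badly wrong prediction dominates many small errors, which is exactly the RMSE critique raised on the next slide.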

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N · |C|): |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

(figure: movies placed in a 2-D latent space: geared towards females ↔ geared towards males, serious ↔ funny; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility)

RECSYS 50 [Koren, Bell, Volinsky, IEEE Computer, 2009]

RECSYS 51

The Netflix Utility Matrix R

(figure: matrix R with ratings of 480,000 users on 17,700 movies)

RECSYS 52

Utility Matrix R Evaluation

(figure: matrix R, 480,000 users × 17,700 movies, split into training entries and a held-out test data set)

SSE = Σ_{(x,i)∈Test} (r̂_xi − r_xi)², where r̂_xi is the predicted rating and r_xi the true rating of user x on item i (e.g. the held-out entry r̂_36 in matrix R)

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T. R has missing entries, but let's ignore that for now: basically we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones.

(figure: the users × items matrix R factored into a thin items × f matrix Q times an f × users matrix P^T; compare SVD: A = U Σ V^T)

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of P^T (f factors)

(figure: R ≈ Q · P^T)

RECSYS 55

Ratings as Products of Factors (continued): the missing rating is read off as the dot product of row i of Q and column x of P^T.

RECSYS 56

Ratings as Products of Factors (continued): e.g., the estimated missing entry is 2.4 = q_i · p_x^T.

RECSYS 57

Latent Factor Models

(figure: movies plotted in the latent factor space; Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility)


RECSYS 59

Recap: SVD

Remember SVD: A = U Σ V^T, where A is the input data matrix, U the left singular vectors, V the right singular vectors, and Σ the diagonal matrix of singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (A_ij − [U Σ V^T]_ij)²

So in our case, "SVD" on Netflix data R ≈ Q · P^T corresponds to A = R, Q = U, P^T = Σ V^T, with predictions r̂_xi = q_i · p_x^T.

But we are not done yet! The SSE sum goes over all entries of A, while our R has missing entries.

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)², with r̂_xi = q_i · p_x^T

Note: we don't require the columns of P, Q to be orthogonal or unit length; P, Q map users/movies to a latent space; this was the most popular model among Netflix contestants.

(figure: R ≈ Q · P^T with f factors)

RECSYS 61

Dealing with Missing Entries: We want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data, shrink aggressively where data are scarce.

min_{P,Q} Σ_{training} (r_xi − q_i p_x^T)² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )
          "error"                            "length"

λ ... regularization parameter

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor team])

(figure: users Gus and Dave placed among the movies in the latent factor space: geared towards females ↔ geared towards males, serious ↔ escapist)


RECSYS 64

The Effect of Regularization

(figure: movies and users plotted in the factor space: geared towards females ↔ geared towards males, serious ↔ funny)

min_{factors} "error" + λ · "length":
min_{P,Q} Σ_{training} (r_xi − q_i p_x^T)² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

(the following slides repeat the same figure and objective, showing the factors for sparsely-rated movies and users shrinking toward the origin as regularization takes effect)

RECSYS 68

Use Gradient Descent to search for the optimal settings.

Want to find matrices P and Q: min_{P,Q} Σ_{training} (r_xi − q_i p_x^T)² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

Gradient descent:
Initialize P and Q (using SVD; pretend missing ratings are 0)
Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x: r_xi known} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if. Here q_if is entry f of row q_i of matrix Q.

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.

RECSYS 70

(Batch) Gradient Descent

(repeats the objective and update rules from the previous slide; computing full gradients over all ratings at every step is slow)

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x: r_xi known} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi); here q_if is entry f of row q_i of matrix Q.

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

GD:  Q ← Q − η · Σ_{r_xi} ∇Q(r_xi)
SGD: Q ← Q − η · ∇Q(r_xi)

Faster convergence: SGD needs more steps, but each step is computed much faster.

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

(figure: value of the objective function vs. iteration/step)

GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
Initialize P and Q (using SVD; pretend missing ratings are 0)
Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)

2 nested loops: for each pass until convergence: for each r_xi: compute the gradient, do a "step".

η ... learning rate
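The SGD loop above can be sketched directly in Python. This is an illustrative implementation under my own assumptions (random initialization instead of SVD, a tiny made-up matrix, hyperparameters chosen for the demo):

```python
import numpy as np

def sgd_factorize(R, f=2, eta=0.01, lam=0.02, epochs=1000, seed=0):
    """Fit R ~ Q P^T by SGD over the known (nonzero) entries,
    using the per-rating update rules from the slide."""
    rng = np.random.default_rng(seed)
    n_items, n_users = R.shape
    Q = rng.normal(scale=0.1, size=(n_items, f))  # item factors (rows q_i)
    P = rng.normal(scale=0.1, size=(n_users, f))  # user factors (rows p_x)
    known = list(zip(*np.nonzero(R)))
    for _ in range(epochs):
        for i, x in known:
            err = R[i, x] - Q[i] @ P[x]           # eps_xi = r_xi - q_i . p_x
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * Q[i] - lam * P[x])
    return Q, P

# Tiny illustrative matrix (0 = missing), not the Netflix data
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], float)
Q, P = sgd_factorize(R)
print(np.round(Q @ P.T, 1))   # known entries reconstructed closely; missing ones imputed
```

Note the loop touches only known entries, which is exactly how the latent factor model sidesteps the missing-data problem that breaks plain SVD.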

RECSYS 75 [Koren, Bell, Volinsky, IEEE Computer, 2009]

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen. Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data: 100 million ratings; 480,000 users; 17,770 movies; 6 years of data (2000-2005)
• Test data: last few ratings of each user (2.8 million); evaluation criterion: root mean squared error (RMSE); Netflix Cinematch RMSE: 0.9514
• Competition: 2,700+ teams; $1 million grand prize for 10% improvement on the Cinematch result; $50,000 2007 progress prize for 8.43% improvement

(tables: training and test data as (user, movie, score) triples)

Overall rating distribution: a third of ratings are 4s; the average rating is 3.68. (From TimelyDevelopment.com)

#ratings per movie: avg #ratings/movie: 5,627
#ratings per user: avg #ratings/user: 208
Average movie rating by movie count: more ratings go to better movies. (From TimelyDevelopment.com)

Most loved movies

Count / Avg rating / Title:
137,812  4.593  The Shawshank Redemption
133,597  4.545  Lord of the Rings: The Return of the King
180,883  4.306  The Green Mile
150,676  4.460  Lord of the Rings: The Two Towers
139,050  4.415  Finding Nemo
117,456  4.504  Raiders of the Lost Ark
180,736  4.299  Forrest Gump
147,932  4.433  Lord of the Rings: The Fellowship of the Ring
149,199  4.325  The Sixth Sense
144,027  4.333  Indiana Jones and the Last Crusade

Challenges
• Size of data: scalability; keeping data in memory
• Missing data: 99 percent missing; very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T   (user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-movie interaction: characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations.

Baseline predictor: separates users and movies; benefits from insights into user behavior; among the main practical contributions of the competition.

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Overall mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2
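The baseline computation can be sketched in Python. This is an illustration under my own simplifying assumption that the biases are just user/item means minus the global mean (the real BellKor baselines are fit with regularization):

```python
import numpy as np

def baseline_predictor(R):
    """Fit simple baselines from a ratings matrix (0 = missing):
    mu = overall mean; b_x = user x's mean - mu; b_i = item i's mean - mu.
    Baseline estimate: b_xi = mu + b_x + b_i."""
    known = R != 0
    mu = R[known].mean()
    b_user = np.array([R[x][known[x]].mean() - mu if known[x].any() else 0.0
                       for x in range(R.shape[0])])
    b_item = np.array([R[:, i][known[:, i]].mean() - mu if known[:, i].any() else 0.0
                       for i in range(R.shape[1])])
    return mu, b_user, b_item

# Slide's example, plugged in by hand: mu = 3.7, b_x = -1, b_i = +0.5 -> 3.7 - 1 + 0.5 = 3.2
```

A critical user (negative b_x) rating a well-liked movie (positive b_i) lands a little below the global mean, matching the Star Wars example.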

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²  +  λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )
            goodness of fit                                         regularization

λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

(figure: RMSE (≈0.885-0.92) vs. millions of parameters for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE drops as factors and biases are added)

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533; User average: 1.0651; Global average: 1.1296

Performance of Various Methods:
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones.

[Y. Koren, Collaborative filtering with temporal dynamics, KDD '09]

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) ... user preference vector on day t.

[Y. Koren, Collaborative filtering with temporal dynamics, KDD '09]

RECSYS 111

Adding Temporal Effects

(figure: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; each addition lowers RMSE, roughly from 0.92 toward 0.875)

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533; User average: 1.0651; Global average: 1.1296

Performance of Various Methods:
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

(figure: error (RMSE) vs. number of predictors in the ensemble, falling from ≈0.889 with 1 predictor to ≈0.871 with 50)

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team; it relies on combining their models; they quickly also get a qualifying score over 10%.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, which alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.

24 hours before the deadline... a BellKor team member in Austria notices (by chance) that Ensemble has posted a score that is slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later... and everyone waits...

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09; http://www2.research.att.com/~volinsky/netflix/bpc.html; http://www.the-ensemble.com

1162019

(data behind the ensemble-size chart: RMSE drops from 0.889067 with 1 predictor to 0.871197 with 50 predictors)

RECSYS 17

Content-based Recommendations
Main idea: recommend items to customer x similar to previous items rated highly by x.

Example

Movie recommendations: recommend movies with the same actor(s), director, genre, ...
Websites, blogs, news: recommend other sites with "similar" content.

RECSYS 18

Plan of Action

likes

Item profiles

RedCircles

Triangles

User profile

match

recommendbuild


RECSYS 19

Item Profiles For each item create an item profile

Profile is a set (vector) of features. Movies: author, title, actor, director, ... Text: set of "important" words in the document.

How to pick important features Usual heuristic from text mining is TF-IDF

(Term Frequency × Inverse Document Frequency)
• Term ... Feature
• Document ... Item


RECSYS 20

Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc (item) j; TF_ij = f_ij / max_k f_kj
n_i = number of docs that mention term i; N = total number of docs; IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with the highest TF-IDF scores, together with their scores.


Note: we normalize TF to discount for "longer" documents.
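The TF-IDF scoring above can be sketched in a few lines; the max-frequency normalization follows the note about discounting longer documents (the function name and sample documents are illustrative):

```python
import math

def tf_idf(docs):
    """TF-IDF profiles: TF_ij = f_ij / max_k f_kj, IDF_i = log(N / n_i),
    w_ij = TF_ij * IDF_i.  docs is a list of token lists."""
    N = len(docs)
    df = {}                              # n_i: number of docs mentioning term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    profiles = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        max_f = max(counts.values())     # normalize by the most frequent term
        profiles.append({t: (c / max_f) * math.log(N / df[t])
                         for t, c in counts.items()})
    return profiles

docs = [["apple", "apple", "banana"], ["banana", "cherry"], ["cherry", "apple"]]
profiles = tf_idf(docs)
```

A doc profile would then keep only the highest-scoring terms.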

RECSYS 21

User Profiles and Prediction

User profile possibilities:
Weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, ...
Prediction heuristic: given user profile x and item profile i, estimate
u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)


RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended.


RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)
– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?


RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1 Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using k-nearest neighbors (KNN).

3. Recommend items to x based on the weighted ratings of items by users in N.

RECSYS 28

Similar Users: let r_x be the vector of user x's ratings.

Jaccard similarity measure. Problem: ignores the value of the rating.

Cosine similarity measure

sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)

Problem: treats missing ratings as "negative". Pearson correlation coefficient:

Sxy = items rated by both users x and y


r_x = [1, _, _, 1, 3]; r_y = [1, _, 2, 2, _]
r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
r̄_x, r̄_y ... avg rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want sim(A B) gt sim(A C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322, but it considers missing ratings as "negative". Solution: subtract the (row) mean.


Centered cosine: sim(A,B) vs. sim(A,C): 0.092 > −0.559.
Notice: cosine similarity is correlation when the data are centered at 0.
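The three measures can be sketched as follows; `centered_cosine` reproduces the centered sim(A, B) ≈ 0.092 from the running example (a 5-movie slice of it, with `None` marking missing ratings):

```python
import math

def jaccard(rx, ry):
    """Jaccard: compares only the *sets* of rated items, ignoring values."""
    sx, sy = set(rx), set(ry)
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    """Cosine on rating vectors; missing ratings appear as 0 ("negative")."""
    dot = sum(a * b for a, b in zip(rx, ry))
    norm = lambda v: math.sqrt(sum(a * a for a in v))
    return dot / (norm(rx) * norm(ry))

def centered_cosine(rx, ry):
    """Pearson-style: subtract each user's mean over rated items
    (None = missing), then take the cosine of the centered vectors."""
    def center(r):
        rated = [v for v in r if v is not None]
        mean = sum(rated) / len(rated)
        return [0.0 if v is None else v - mean for v in r]
    return cosine(center(rx), center(ry))

A = [4, None, None, 5, 1]    # user A's ratings (None = unrated)
B = [5, 5, 4, None, None]    # user B's ratings
sim_ab = centered_cosine(A, B)
```

Centering maps missing entries to 0 (the user's mean), which is why centered cosine stops treating them as "negative".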

RECSYS 30

Rating Predictions

Let rx be the vector of user xrsquos ratings

Let N be the set of k users most similar to xwho have rated item i

Prediction for item i of user x:
r̂_xi = (1/k) Σ_{y∈N} r_yi   (simple average)
r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy   (similarity-weighted average)

Other options

Many other tricks possiblehellip1162019

Shorthand: s_xy = sim(x, y)
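A minimal sketch of the similarity-weighted prediction above (the function name and data layout are assumptions):

```python
def predict_user_user(ratings, sim, x, i, k=2):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy, taken over the k users
    most similar to x who have rated item i.
    ratings: {user: {item: rating}};  sim(x, y) -> similarity score."""
    candidates = [(sim(x, y), r[i]) for y, r in ratings.items()
                  if y != x and i in r]
    top = sorted(candidates, reverse=True)[:k]   # k most similar raters of i
    den = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / den if den else None

ratings = {"a": {"m": 4}, "b": {"m": 2}, "x": {}}
sims = {("x", "a"): 0.8, ("x", "b"): 0.2}
r_xm = predict_user_user(ratings, lambda u, v: sims[(u, v)], "x", "m")
```

With similarities 0.8 and 0.2 this yields (0.8·4 + 0.2·2) / 1.0 = 3.6.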

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model


r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
s_ij ... similarity of items i and j
r_xj ... rating of user x on item j
N(i; x) ... set of items rated by x that are similar to i

RECSYS 32

Item-Item CF (|N|=2)

users 1–12 (columns) × movies 1–6 (rows); '-' = unknown rating; ratings between 1 and 5:

movie 1:  1  -  3  -  -  5  -  -  5  -  4  -
movie 2:  -  -  5  4  -  -  4  -  -  2  1  3
movie 3:  2  4  -  1  2  -  3  -  4  3  5  -
movie 4:  -  2  4  -  5  -  -  4  -  -  2  -
movie 5:  -  -  4  3  4  2  -  -  -  -  2  5
movie 6:  1  -  3  -  3  -  -  2  -  -  4  -

RECSYS 33

Item-Item CF (|N|=2)

(same utility matrix as above)
Estimate the rating of movie 1 by user 5.

RECSYS 34

Item-Item CF (|N|=2)

(same utility matrix, with a sim(1, m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59)
Neighbor selection: identify movies similar to movie 1 that were rated by user 5.
Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1+3+5+5+4)/5 = 3.6; centered row 1 = [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows.

RECSYS 35

Item-Item CF (|N|=2)

(same utility matrix)
Compute similarity weights: s_{1,3} = 0.41, s_{1,6} = 0.59.

RECSYS 36

Item-Item CF (|N|=2)

(same utility matrix, with the predicted value 2.6 filled in for movie 1, user 5)
Predict by taking the weighted average:
r_{1,5} = (0.41·2 + 0.59·3) / (0.41 + 0.59) ≈ 2.6
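The worked example above, checked numerically (a small arithmetic sketch):

```python
# Neighbors of movie 1 rated by user 5: movie 3 (s_13 = 0.41, rating 2)
# and movie 6 (s_16 = 0.59, rating 3); predict the weighted average.
sims = [0.41, 0.59]
neighbor_ratings = [2, 3]
r_15 = sum(s * r for s, r in zip(sims, neighbor_ratings)) / sum(sims)
```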

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average


r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
N(i; x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David (rows) × movies Avatar, LOTR, Matrix, Pirates (columns)]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: movies (rows) × users (columns); entries are known ratings 1–5]

RECSYS 43

Evaluation

[Same utility matrix, with some entries withheld as the Test Data Set]

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE):
  RMSE = sqrt( (1/N) Σ_{(x,i)} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
Precision at top 10: fraction of relevant items among the top 10
Rank correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model
Coverage: number of items/users for which the system can make predictions
Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
Receiver Operating Characteristic (ROC) curve:
  Y-axis: True Positive Rate (TPR, aka Recall) = TP / P = TP / (TP + FN)
  X-axis: False Positive Rate (FPR) = FP / N = FP / (FP + TN)
  See https://en.wikipedia.org/wiki/Precision_and_recall
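The two headline metrics can be sketched directly from the definitions above (names are illustrative):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the (user, item) pairs in `truth`."""
    errs = [(pred[key] - truth[key]) ** 2 for key in truth]
    return math.sqrt(sum(errs) / len(errs))

def precision_recall(recommended, relevant):
    """Precision = TP / (TP + FP);  Recall (TPR) = TP / (TP + FN)."""
    tp = len(set(recommended) & set(relevant))
    return tp / len(recommended), tp / len(relevant)
```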

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others


RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N · |C|): |C| = number of clusters (the k in k-means), N = number of data points.

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction


RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work.

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50. Koren, Bell & Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies


Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²


480000 users

17700 movies

Predicted rating

True rating of user x on item i

r̂_{3,6}

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T.

R has missing entries, but let's ignore that for now. Basically, we will want the reconstruction error to be small on the known ratings; we don't care about the values on the missing ones.

[Matrix picture: R (items × users, with missing entries) ≈ Q (items × f factors) · P^T (f factors × users)]
SVD: A = U Σ V^T

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i


[Matrix picture: R ≈ Q · P^T, with the missing entry r_xi highlighted]
r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
q_i = row i of Q; p_x = column x of P^T


RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

(same picture; the highlighted missing rating is estimated as r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf ≈ 2.4)
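The estimate is just a dot product of factor vectors; with made-up factor values:

```python
# r_xi = q_i . p_x = sum_f q_if * p_xf   (factor values are hypothetical)
q_i = [0.5, -1.2, 0.3]    # row i of Q: item i's latent factors
p_x = [2.0, 0.1, 1.0]     # column x of P^T: user x's latent factors
r_xi = sum(q * p for q, p in zip(q_i, p_x))
```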

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 59

Recap SVD

Remember SVD: A = input data matrix; U = left singular vectors; V = right singular vectors; Σ = singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{x,i} (A_xi − (U Σ V^T)_xi)²

So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T.

But we are not done yet! R has missing entries.

Note: the sum goes over all entries, but our R has missing entries.
r̂_xi = q_i · p_x^T
[Diagram: A (m × n) ≈ U · Σ · V^T]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants


[Matrix picture: R (items × users) ≈ Q · P^T]
r̂_xi = q_i · p_x^T
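The quantity being minimized, restricted to the known ratings, can be sketched as follows (the data layout is an assumption):

```python
def sse_known(ratings, Q, P):
    """SSE over the *known* entries only: sum over (x, i) in R of
    (r_xi - q_i . p_x)^2; missing entries are simply skipped.
    ratings: {(x, i): r_xi};  Q: {i: factors};  P: {x: factors}."""
    total = 0.0
    for (x, i), r in ratings.items():
        pred = sum(q * p for q, p in zip(Q[i], P[x]))
        total += (r - pred) ** 2
    return total
```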

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea: minimize SSE on the training data. We want large f (number of factors) to capture all the signals, but SSE on the test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there is sufficient data; shrink aggressively where data are scarce.


min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

[Training utility matrix]
λ ... regularization parameter; the first term is the "error", the second the "length"

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])


Lethal Weapon

Dumb and Dumber


RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2
min_{factors} "error" + λ · "length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)


RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

• P ← P − η · ∇P
• Q ← Q − η · ∇Q, where ∇Q is the gradient/derivative of matrix Q:
  ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) · p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q.
Observation: computing gradients is slow.

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.

RECSYS 70

(Batch) Gradient Descent

(same update rules and objective as on the previous gradient-descent slide)

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) · p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
  (q_if is entry f of row q_i of matrix Q)
Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
  GD:  Q ← Q − η · Σ_{r_xi} ∇Q(r_xi)
  SGD: Q ← Q − μ · ∇Q(r_xi)
Faster convergence: needs more steps, but each step is computed much faster.


RECSYS 73

SGD vs GD Convergence of GD vs SGD


[Plot: value of the objective function vs. iteration/step]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster.

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
Initialize P and Q (using SVD; pretend missing ratings are 0).
Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T   (the prediction "error")
  q_i ← q_i + η (ε_xi · p_x − λ q_i)   (update equation)
  p_x ← p_x + η (ε_xi · q_i − λ p_x)   (update equation)
2 for-loops:
  for until convergence:
    for each r_xi: compute gradient, do a "step"


η ... learning rate
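Putting the update equations together, a self-contained SGD loop for the regularized factorization might look like this (a sketch in pure Python; the hyperparameter values are hypothetical):

```python
import random

def sgd_factorize(ratings, n_factors=2, steps=20000, eta=0.03, lam=0.01, seed=0):
    """SGD for R ~ Q P^T on known ratings only.  For each known r_xi:
       e_xi = r_xi - q_i . p_x
       q_i += eta * (e_xi * p_x - lam * q_i)
       p_x += eta * (e_xi * q_i - lam * p_x)
    ratings: {(x, i): r_xi}.  Returns the factor dicts P, Q."""
    rng = random.Random(seed)
    P = {x: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for x in {x for x, _ in ratings}}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for i in {i for _, i in ratings}}
    entries = list(ratings.items())
    for _ in range(steps):
        (x, i), r = rng.choice(entries)          # one rating at a time
        e = r - sum(q * p for q, p in zip(Q[i], P[x]))
        qi, px = Q[i], P[x]                      # old values for both updates
        Q[i] = [q + eta * (e * p - lam * q) for q, p in zip(qi, px)]
        P[x] = [p + eta * (e * q - lam * p) for q, p in zip(qi, px)]
    return P, Q
```

On a toy rating matrix this should drive the training SSE down by orders of magnitude; note both updates use the factor values from before the step.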

RECSYS 75. Koren, Bell & Volinsky, IEEE Computer, 2009


RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is a case where we apply optimization methods.


[Training utility matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data, 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for a 10% improvement on Cinematch's result
  – $50,000 2007 progress prize for an 8.43% improvement

[Sample data: (user, movie, score) triples for the training set; the test set lists (user, movie) pairs with the scores withheld]

Movie rating data

Overall rating distribution

• A third of the ratings are 4s
• The average rating is 3.68

From TimelyDevelopmentcom

ratings per movie

• Avg. ratings/movie: 5,627

ratings per user

• Avg. ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

Count   | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
(user bias + movie bias + user–movie interaction)

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
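The arithmetic above, as a check:

```python
# Baseline prediction r_xi = mu + b_x + b_i for the Star Wars example.
mu = 3.7     # overall mean rating
b_x = -1.0   # critical reviewer: 1 star below the mean
b_i = 0.5    # Star Wars: 0.5 above the average movie
r_baseline = mu + b_x + b_i
```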


RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both biases b_u, b_i as well as interactions q_i, p_u are treated as parameters (we estimate them).


regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))² + λ (Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)
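Once the parameters are fitted, prediction under the full model is a one-liner; a sketch (names and values assumed):

```python
def predict_full(mu, b_user, b_item, P, Q, x, i):
    """Full model: r_xi = mu + b_x + b_i + q_i . p_x; the biases and
    the factors are all learned parameters."""
    interaction = sum(q * p for q, p in zip(Q[i], P[x]))
    return mu + b_user[x] + b_item[i] + interaction
```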

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view.
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), basic latent factors, and latent factors with biases]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: users prefer new movies without any obvious reason; older movies are just inherently better than newer ones.


Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_u and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) ... user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
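The binned, time-dependent bias b_i(t) can be sketched as a static bias plus a per-bin offset (the function name and offset values are hypothetical):

```python
def item_bias_t(b_i, bin_offsets, week, bin_weeks=10):
    """b_i(t) = b_i + b_{i,Bin(t)}: each bin covers `bin_weeks`
    consecutive weeks; `week` is the rating's week index."""
    return b_i + bin_offsets[week // bin_weeks]
```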

RECSYS 111

Adding Temporal Effects


[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000) for CF (no time bias), basic latent factors, CF (time bias), latent factors w/ biases, + linear time factors, + per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach.

RECSYS 113

Effect of ensemble size: [plot of error (RMSE) vs. number of predictors (1–50); RMSE falls from about 0.889 to 0.871]
RECSYS 115

Standing on June 26th 2009

June 26th submission triggers the 30-day "last call".

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team that relies on combining their models, and quickly also gets a qualifying score over 10%.

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, but this alerts the other team of your latest score.


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later; and everyone waits.


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


(data series behind the "Effect of ensemble size" chart: RMSE for ensembles of 1 to 50 predictors, decreasing from 0.889067 with 1 predictor to 0.871197 with 50)

RECSYS 18

Plan of Action

(diagram: from the items the user likes (red circles, triangles), build item profiles; infer a user profile; match it against item profiles to recommend new items)

1162019

RECSYS 19

Item Profiles: For each item, create an item profile.

A profile is a set (vector) of features. Movies: author, title, actor, director, … Text: the set of "important" words in the document.

How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency): term … feature; document … item.

1162019

RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j; TF_ij = f_ij / max_k f_kj

n_i = number of docs that mention term i; N = total number of docs; IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with the highest TF-IDF scores, together with their scores

1162019

Note we normalize TFto discount for ldquolongerrdquo documents
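The scoring above can be sketched in a few lines of Python; the toy documents below are made up for illustration.

```python
import math

def tfidf(docs):
    """TF-IDF profiles: TF is normalized by the doc's most frequent term
    (discounts longer documents), and IDF_i = log(N / n_i)."""
    N = len(docs)
    df = {}                                  # n_i: number of docs mentioning term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    profiles = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        max_f = max(counts.values())
        profiles.append({t: (f / max_f) * math.log(N / df[t])
                         for t, f in counts.items()})
    return profiles

docs = [["apple", "apple", "banana"], ["banana", "cherry"], ["cherry", "apple"]]
w = tfidf(docs)   # w[0]["apple"] = 1.0 * log(3/2) ≈ 0.405
```

The doc profile would then keep only the highest-weighted words from each dictionary.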

RECSYS 21

User Profiles and Prediction

User profile possibilities: a weighted average of rated item profiles; variation: weight by difference from the average rating for the item, …

Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new amp unpopular items No first-rater problem

+ Able to provide explanations: can list the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)
– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant Domain-free useritem attributes are not necessary Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x.

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN).

3. Recommend items to x based on the weighted ratings of items by users in N.

RECSYS 28

Similar Users: Let r_x be the vector of user x's ratings.

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|. Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||). Problem: treats missing ratings as "negative".

Pearson correlation coefficient: with S_xy = items rated by both users x and y,
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ sqrt(Σ_{s∈S_xy} (r_xs − r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys − r̄_y)²) ]

Examples: as sets, r_x = {1, 4, 5}, r_y = {1, 3, 4}; as points, r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]; r̄_x, r̄_y … average rating of x, y.

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4 (fails)

Cosine similarity: 0.386 > 0.322, but it considers missing ratings as "negative". Solution: subtract the (row) mean before computing cosine.

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > −0.559. Notice that cosine similarity is correlation when the data is centered at 0.
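These numbers can be verified directly. The item-by-item rating vectors below (items ordered HP1, HP2, HP3, TW, SW1, SW2) are an assumption reconstructed from the Jaccard values 1/5 and 2/4 on the slide; 0 marks a missing rating.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def center(u):
    # subtract the user's mean rating from rated entries; missing stay 0
    rated = [a for a in u if a != 0]
    mu = sum(rated) / len(rated)
    return [a - mu if a != 0 else 0.0 for a in u]

A = [4, 0, 0, 5, 1, 0]
B = [5, 5, 4, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5]

raw_ab, raw_ac = cosine(A, B), cosine(A, C)   # raw cosine treats 0 as "negative"
cen_ab = cosine(center(A), center(B))          # ≈ 0.092
cen_ac = cosine(center(A), center(C))          # ≈ -0.559
```

After centering, the ordering sim(A, B) > 0 > sim(A, C) matches the intuition that A and C disagree.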

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings. Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:

Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi

Similarity-weighted: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Other options and many other tricks possible…

Shorthand: s_xy = sim(x, y)
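The similarity-weighted prediction can be sketched directly; the ratings and the similarity matrix below are made up, with None marking a missing rating.

```python
def predict_user_user(r, s, x, i, k=2):
    """r̂_xi as a similarity-weighted average over the k users most
    similar to x who have rated item i; s[x][y] = sim(x, y)."""
    N = sorted((y for y in range(len(r)) if y != x and r[y][i] is not None),
               key=lambda y: s[x][y], reverse=True)[:k]
    return sum(s[x][y] * r[y][i] for y in N) / sum(s[x][y] for y in N)

# made-up toy data: 3 users, 2 items, precomputed similarities
r = [[None, 4], [5, 2], [3, 5]]
s = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
p = predict_user_user(r, s, 0, 0)   # (0.9*5 + 0.1*3) / (0.9 + 0.1) ≈ 4.8
```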

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model

1162019

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i; x) … set of items rated by x that are similar to i

RECSYS 32

Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; known ratings are between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

(same 6-movie × 12-user utility matrix; estimate the rating of movie 1 by user 5)

RECSYS 34

Item-Item CF (|N|=2)

(same utility matrix)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

sim(1, m) for movies m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6, so row 1 becomes [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

(same utility matrix)

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

(same utility matrix, with the estimate 2.6 filled in for movie 1, user 5)

Predict by taking the weighted average: r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
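The whole |N| = 2 walkthrough on the preceding slides can be reproduced in code: center each movie's row, use the cosine of centered rows (Pearson) as s_ij, then take the weighted average over the two most similar movies rated by user 5.

```python
import math

# 6 movies × 12 users; 0 means "not rated" (the matrix from the slides)
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def centered(row):
    rated = [v for v in row if v > 0]
    mu = sum(rated) / len(rated)
    return [v - mu if v > 0 else 0.0 for v in row]

def sim(i, j):
    """Pearson correlation = cosine of mean-centered rows."""
    a, b = centered(R[i]), centered(R[j])
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def predict(i, x, k=2):
    # top-k movies most similar to i that user x has rated
    neigh = sorted((j for j in range(len(R)) if j != i and R[j][x] > 0),
                   key=lambda j: sim(i, j), reverse=True)[:k]
    num = sum(sim(i, j) * R[j][x] for j in neigh)
    den = sum(sim(i, j) for j in neigh)
    return num / den

r15 = predict(0, 4)   # movie 1, user 5 (0-indexed): ≈ 2.59, the slide's 2.6
```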

RECSYS 38

Item-Item vs User-User

(small example utility matrix over users Alice, Bob, Carol, David and movies Avatar, LOTR, Matrix, Pirates)

1162019

In practice, it has been observed that item-item often works better than user-user. Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.

Root-mean-square error (RMSE): sqrt( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i.

Precision at top 10: % of relevant items among those in the top 10.

Rank correlation: Spearman's correlation between the system's and the user's complete rankings.

Another approach: 0/1 model.

Coverage: number of items/users for which the system can make predictions.

Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN). Receiver Operating Characteristic (ROC) curve:
Y-axis: true positive rate, TPR (a.k.a. recall) = TP / P = TP / (TP + FN)
X-axis: false positive rate, FPR = FP / N = FP / (FP + TN)
See https://en.wikipedia.org/wiki/Precision_and_recall
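RMSE and precision-at-k are easy to state precisely in code; the toy predictions below are made up.

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the (user, item) pairs in the test set."""
    se = [(pred[k] - truth[k]) ** 2 for k in truth]
    return math.sqrt(sum(se) / len(se))

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# made-up predictions for two held-out ratings of user x
truth = {("x", "i1"): 5, ("x", "i2"): 3}
pred = {("x", "i1"): 4, ("x", "i2"): 3}
err = rmse(pred, truth)                                  # sqrt((1 + 0) / 2) ≈ 0.707
p2 = precision_at_k(["a", "b", "c"], {"a", "c"}, k=2)    # 1 hit in top 2 = 0.5
```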

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N · |C|); |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

Leverage all the data. Don't try to reduce the data size in an effort to make fancy algorithms work; simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms! (http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)²

480,000 users; 17,700 movies

r̂_xi … predicted rating; r_xi … true rating of user x on item i (the highlighted held-out entry is r̂_36)

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q (items × f factors); p_x = column x of PT (f factors × users); R ≈ Q · PT, illustrative numbers omitted


RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf; here the missing entry is estimated as 2.4

q_i = row i of Q; p_x = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2


RECSYS 59

Recap SVD

Remember SVD: A = U Σ V^T, where A is the m × n input data matrix, U the left singular vectors, V the right singular vectors, and Σ the singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{xi} (A_xi − [U Σ V^T]_xi)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ V^T, and r̂_xi = q_i · p_x^T.

But we are not done yet! The SSE sum goes over all entries, and our R has missing entries.

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)²

Note: We don't require the columns of P, Q to be orthogonal or unit length. P, Q map users/movies to a latent space. This was the most popular model among Netflix contestants.

(R ≈ Q · PT: R is items × users with missing entries; Q is items × f factors; PT is f factors × users; illustrative numbers omitted)

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there is sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
("error" term plus λ times the "length" term; λ … regularization parameter)

(small utility matrix with missing entries omitted)

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber


RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

min_factors "error" + λ · "length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q that minimize: min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent: Initialize P and Q (using SVD; pretend missing ratings are 0). Then do gradient descent:
• P ← P − η · ∇P
• Q ← Q − η · ∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
• here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.

RECSYS 70

(Batch) Gradient Descent

(repeats the gradient-descent formulation from the previous slide)

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi); here q_if is entry f of row q_i of matrix Q.

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)  (sum over all ratings, then step)

SGD: Q ← Q − η ∇Q(r_xi)  (a step after each individual rating)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence: SGD needs more steps, but each step is computed much faster.

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

(plot: value of the objective function vs. iteration/step)

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
• q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
• p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

2 for loops: for … until convergence: for each r_xi, compute the gradient and do a "step".

η … learning rate
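The two update equations translate directly into code. Below is a minimal sketch on a made-up 3-user × 2-item toy matrix; η, λ, the epoch count, and the random initialization are arbitrary illustrative choices, not values from the slides.

```python
import random

def sgd_mf(ratings, n_users, n_items, f=2, eta=0.02, lam=0.1, epochs=500):
    """SGD for R ≈ Q · P^T over the known ratings only.
    ratings: list of (x, i, r_xi) triples; returns factor matrices P, Q."""
    rng = random.Random(0)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        rng.shuffle(ratings)
        for x, i, r in ratings:
            err = r - sum(Q[i][k] * P[x][k] for k in range(f))   # eps_xi
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (err * pxk - lam * qik)          # q_i update
                P[x][k] += eta * (err * qik - lam * pxk)          # p_x update
    return P, Q

# made-up toy data: 3 users × 2 items
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]
P, Q = sgd_mf(list(ratings), n_users=3, n_items=2)
pred = sum(Q[0][k] * P[0][k] for k in range(2))   # reconstructs r_00 ≈ 5
```

Regularization keeps the reconstruction slightly below the raw rating, which is exactly the shrinkage behavior the slides describe.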

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is a case where we apply optimization methods.

(small utility matrix with missing entries omitted)

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

(training and test data: tables of user, movie, score triples, omitted)

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg #ratings/movie: 5,627

# ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T  (user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

3.7 − 1 + 0.5 = 3.2

1162019
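A tiny sketch of the baseline-plus-interaction predictor, reproducing the Star Wars example above (the zero interaction vectors are a simplifying assumption):

```python
def predict(mu, b_x, b_i, q_i, p_x):
    """r̂_xi = μ + b_x + b_i + q_i · p_x  (baseline + interaction)."""
    return mu + b_x + b_i + sum(q * p for q, p in zip(q_i, p_x))

# slide's example: global mean 3.7, critical user (b_x = -1),
# Star Wars 0.5 above the average movie (b_i = +0.5), interaction ignored
r = predict(3.7, -1.0, 0.5, [0.0], [0.0])   # ≈ 3.2
```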

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x))²   [goodness of fit]
    + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )   [regularization]

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them). λ is selected via grid-search on a validation set.

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

(plot: RMSE, 0.885-0.92, vs. millions of parameters, 1-1000 on a log scale; curves for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases)

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + biases: 0.89
Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: users prefer new movies without any specific reason, while older movies are just inherently better than newer ones.

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) let each bin correspond to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
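A hedged sketch of the two time-dependent bias parameterizations: the 10-week bins follow the slide, while the drift form dev_x(t) = sign(t − t_x)·|t − t_x|^β with β ≈ 0.4 follows Koren's KDD '09 paper; all numeric values below are made up.

```python
def item_bias(b_i, b_i_bins, t_days, bin_days=70):
    """b_i(t): static item bias plus a per-bin correction,
    each bin covering 10 consecutive weeks (70 days)."""
    return b_i + b_i_bins[t_days // bin_days]

def user_bias(b_x, alpha_x, t_days, t_mean, beta=0.4):
    """b_x(t) = b_x + alpha_x * dev_x(t), with
    dev_x(t) = sign(t - t_x) * |t - t_x|**beta (assumed form)."""
    dev = t_days - t_mean
    sign = 1 if dev >= 0 else -1
    return b_x + alpha_x * sign * abs(dev) ** beta

bins = [0.1, -0.2, 0.0]                 # made-up per-bin corrections
b = item_bias(0.5, bins, t_days=75)     # day 75 falls in bin 1, so 0.5 - 0.2 ≈ 0.3
```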

RECSYS 111

Adding Temporal Effects

1162019

(plot: RMSE, 0.875-0.92, vs. millions of parameters, 1-10,000 on a log scale; curves for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF)

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate... Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE, 0.87-0.8925) vs. number of predictors (1-50); the RMSE falls steadily, with diminishing returns, as predictors are added to the ensemble]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed: a group of other teams on the leaderboard forms a new team. Relies on combining their models. Quickly also gets a qualifying score of over 10% improvement.

 BellKor: continue to get small improvements in their scores. Realize that they are in direct competition with Ensemble.

 Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, but this alerts the other team of your latest score.

1162019

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day. Only 1 final submission could be made in the last 24h.

 24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

 Frantic last 24 hours for both teams: much computer time spent on final optimization, carefully calibrated to end about an hour before the deadline.

 Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

 Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

[Data behind the ensemble-size chart: RMSE vs. number of predictors. RMSE drops from 0.8891 (1 predictor) to 0.8806 (2), 0.8742 (10), 0.8727 (20), 0.8718 (32), and 0.8712 (50 predictors), with steadily diminishing returns.]
RECSYS 19

Item Profiles: For each item, create an item profile

 Profile is a set (vector) of features: Movies: author, title, actor, director, … Text: set of "important" words in the document

 How to pick important features? The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency):
• Term … Feature
• Document … Item

1162019

RECSYS 20

Sidenote: TF-IDF

fij = frequency of term (feature) i in doc (item) j
TFij = fij / maxk fkj   (note: we normalize TF to discount for "longer" documents)

ni = number of docs that mention term i
N = total number of docs
IDFi = log(N / ni)

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
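The scoring above can be sketched in a few lines. This is an illustrative sketch only (whitespace tokenization and log base 2 are assumptions, not the lecture's exact choices):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each doc with TF-IDF.

    TF is normalized by the most frequent term in the doc (to discount
    longer documents); IDF_i = log2(N / n_i), where n_i is the number
    of docs containing term i.
    """
    N = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    df = Counter()                       # n_i: docs that mention term i
    for c in counts:
        df.update(c.keys())
    scores = []
    for c in counts:
        max_f = max(c.values())
        scores.append({t: (f / max_f) * math.log2(N / df[t])
                       for t, f in c.items()})
    return scores

docs = ["the matrix the matrix action",
        "romance drama the",
        "action thriller the action"]
s = tf_idf(docs)
# 'the' appears in every doc, so its IDF (and hence its score) is 0;
# 'matrix' appears in only one doc and gets the highest score in doc 0
```

A term common to all documents is useless as a feature, which is exactly what the zero IDF expresses.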

RECSYS 21

User Profiles and Prediction

 User profile possibilities: weighted average of rated item profiles. Variation: weight by difference from the average rating for the item, …

 Prediction heuristic: given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can list the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard: e.g., images, movies, music

– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

 Collaborative filtering: recommend items based on past transactions of users

 Analyze relations between users and/or items

 Specific data characteristics are irrelevant: domain-free, user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using k-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let rx be the vector of user x's ratings

 Jaccard similarity measure. Problem: ignores the value of the rating

 Cosine similarity measure:
sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)
Problem: treats missing ratings as "negative"

 Pearson correlation coefficient:
Sxy = items rated by both users x and y
sim(x, y) = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / ( √Σs∈Sxy (rxs − r̄x)² · √Σs∈Sxy (rys − r̄y)² )

rx, ry as sets: rx = {1, 4, 5}; ry = {1, 3, 4}
rx, ry as points: rx = [1, 0, 0, 1, 3]; ry = [1, 0, 2, 2, 0]
r̄x, r̄y … average rating of x, y

RECSYS 29

Similarity Metric

 Intuitively we want: sim(A, B) > sim(A, C)

 Jaccard similarity: 1/5 < 2/4 (the wrong way around)

 Cosine similarity: 0.386 > 0.322. But it considers missing ratings as "negative". Solution: subtract the (row) mean.

 Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > -0.559. Notice: cosine similarity is correlation when the data is centered at 0.
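These numbers can be checked in code. The rows below are an assumed reconstruction of the slide's A/B/C utility-matrix example (the table itself is not legible here); for these rows the raw cosine comes out at about 0.38:

```python
import math

def jaccard(a, b):
    """Jaccard similarity over the sets of rated items (0 = missing)."""
    sa = {i for i, x in enumerate(a) if x}
    sb = {i for i, x in enumerate(b) if x}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def center(r):
    """Subtract the user's mean rating from their rated entries."""
    rated = [x for x in r if x != 0]
    mu = sum(rated) / len(rated)
    return [x - mu if x != 0 else 0.0 for x in r]

# utility-matrix rows; 0 marks a missing rating
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

# Jaccard ranks C above B; cosine fixes that; centered cosine
# (= Pearson correlation) separates them even more sharply
```

Centering turns the "missing ratings look negative" problem into a neutral zero, which is why the centered values agree with intuition.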

RECSYS 30

Rating Predictions

 Let rx be the vector of user x's ratings

 Let N be the set of the k users most similar to x who have rated item i

 Prediction for item i of user x: rxi = (1/k) Σy∈N ryi

 Other option (weighted by similarity): rxi = ( Σy∈N sxy · ryi ) / ( Σy∈N sxy )

 Many other tricks possible…

Shorthand: sxy = sim(x, y)
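A minimal sketch of the similarity-weighted option (the similarities and ratings below are made up for illustration):

```python
def user_user_predict(sims, ratings):
    """Similarity-weighted average: r_xi = sum_y s_xy*r_yi / sum_y s_xy."""
    return sum(s * r for s, r in zip(sims, ratings)) / sum(sims)

# hypothetical neighbourhood N: similarities of x to three users
# and those users' ratings of item i
r_hat = user_user_predict([0.8, 0.5, 0.2], [5, 4, 1])
# the most similar user dominates, pulling the prediction toward 5
```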

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

 So far: user-user collaborative filtering

 Another view: item-item. For item i, find other similar items. Estimate the rating for item i based on the target user's ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model:

rxi = ( Σj∈N(i;x) sij · rxj ) / ( Σj∈N(i;x) sij )

sij … similarity of items i and j
rxj … rating of user x on item j
N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

        users:  1  2  3  4  5  6  7  8  9 10 11 12
movie 1:        1  -  3  -  -  5  -  -  5  -  4  -
movie 2:        -  -  5  4  -  -  4  -  -  2  1  3
movie 3:        2  4  -  1  2  -  3  -  4  3  5  -
movie 4:        -  2  4  -  5  -  -  4  -  -  2  -
movie 5:        -  -  4  3  4  2  -  -  -  -  2  5
movie 6:        1  -  3  -  3  -  -  2  -  -  4  -

- : unknown rating; numbers: ratings between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

(same utility matrix) Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating mi from each movie i:
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows:
   sim(1, m) for m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

Compute the similarity weights for the two nearest neighbors rated by user 5:
s1,3 = 0.41, s1,6 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

Predict by taking the weighted average:

r1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

 Define the similarity sij of items i and j

 Select the k nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x

 Estimate the rating rxi as the weighted average:

r̂xi = ( Σj∈N(i;x) sij · rxj ) / ( Σj∈N(i;x) sij )

N(i; x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j
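The worked example above can be reproduced end to end. The sketch below (not the lecture's code) uses the slides' utility matrix and implements Pearson similarity as centered cosine, as on the neighbor-selection slide:

```python
import math

# utility matrix from the slides: rows = movies 1..6, cols = users 1..12,
# 0 marks a missing rating
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def centered(row):
    """Subtract the movie's mean rating from its rated entries."""
    mu = sum(row) / sum(1 for x in row if x)
    return [x - mu if x else 0.0 for x in row]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def item_item_predict(R, i, x, k=2):
    """Predict r_xi from the k most similar items that x has rated."""
    ci = centered(R[i])
    sims = [(cos(ci, centered(R[j])), j) for j in range(len(R))
            if j != i and R[j][x] > 0]       # neighbours must be rated by x
    top = sorted(sims, reverse=True)[:k]
    return sum(s * R[j][x] for s, j in top) / sum(s for s, _ in top)

# movie 1, user 5 (0-indexed: i=0, x=4): the slide's answer is ~2.6
```

Running it picks movies 3 and 6 as the neighbours (similarities 0.41 and 0.59), matching the slides.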

RECSYS 38

Item-Item vs User-User

[Example: a small utility matrix of ratings by Alice, Bob, Carol, and David for Avatar, LOTR, Matrix, and Pirates]

 In practice, it has been observed that item-item often works better than user-user.

 Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine their predictions, perhaps using a linear model

 Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users, with the known ratings (1-5) filled in]

RECSYS 43

Evaluation

[The same utility matrix, with some of the known ratings withheld as a Test Data Set]

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

 Root-mean-square error (RMSE): sqrt( (1/N) Σxi ( r̂xi − rxi )² ), where r̂xi is the predicted and rxi the true rating of x on i

 Precision at top 10: % of relevant items among the top 10

 Rank correlation: Spearman's correlation between the system's and the user's complete rankings

 Another approach: 0/1 model
 Coverage: number of items/users for which the system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
• TPR (a.k.a. recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
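A sketch of the RMSE measure on a toy held-out set (the ratings are made up for illustration):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over held-out (user, item) ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth))
                     / len(truth))

# toy held-out ratings vs. the system's predictions
r = rmse([3.5, 4.0, 1.0], [4, 4, 2])
# the squared error punishes the 1-star miss far more than the 0.5-star one
```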

RECSYS 45

Problems with Error Measures

 A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions

 In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering: Complexity

 The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system. Too expensive to do at runtime: could pre-compute, using clustering as an approximation.

 Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points

 We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

 Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work: simple methods on large data do best.

 Add more data, e.g., add IMDB data on genres.

 More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[2-D map of movies: one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny"; The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility are placed on the map]

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Sparse matrix R: 480,000 users × 17,700 movies, with scattered known ratings 1-5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies) split into a Training Data Set and a held-out Test Data Set; e.g., the withheld entry r3,6]

SSE = Σ(i,x)∈R ( r̂xi − rxi )²

r̂xi … predicted rating; rxi … true rating of user x on item i

RECSYS 53

Latent Factor Models

 "SVD" on Netflix data: R ≈ Q · PT

 For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT. R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]

SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

r̂xi = qi · pxT = Σf qif · pxf

qi = row i of Q; px = column x of PT

[R ≈ Q · PT illustrated on the example matrix]


RECSYS 56

Ratings as Products of Factors

 The same example, with the estimate filled in: r̂xi = qi · pxT = Σf qif · pxf = 2.4

RECSYS 57

Latent Factor Models

[The same 2-D movie map, now interpreted as Factor 1 (geared towards females vs. geared towards males) and Factor 2 (serious vs. funny)]


RECSYS 59

Recap: SVD

 Remember SVD: A: input data matrix, U: left singular vectors, V: right singular vectors, Σ: singular values; A ≈ U Σ VT (A is m × n)

 SVD gives the minimum reconstruction error (SSE):
min U,V,Σ Σxi ( Axi − (U Σ VT)xi )²

 So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT

 But we are not done yet! The sum above goes over all entries, and our R has missing entries.

RECSYS 60

Latent Factor Models

 SVD isn't defined when entries are missing!

 Use specialized methods to find P, Q:
min P,Q Σ(i,x)∈R ( rxi − qi · pxT )²   with r̂xi = qi · pxT

 Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

RECSYS 61

Dealing with Missing Entries

 Want to minimize SSE for unseen test data

 Idea: minimize SSE on the training data. Want a large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2!

 Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

min P,Q Σ(x,i)∈training ( rxi − qi pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

λ … regularization parameter; "error" + λ · "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the BellKor Team)

[The 2-D movie map again, with users (e.g., Gus, Dave) embedded in the same latent space as the movies]


RECSYS 64-67

The Effect of Regularization

min factors "error" + λ · "length":
min P,Q Σ(x,i)∈training ( rxi − qi pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

[Four snapshots of the 2-D movie map (Factor 1 vs. Factor 2) as the regularized fit proceeds: sparsely-rated movies such as The Princess Diaries are shrunk toward the origin, while well-supported movies move out to their positions]

RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:
min P,Q Σ(x,i)∈training ( rxi − qi pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

 Gradient descent:
 Initialize P and Q (e.g., using SVD, pretending the missing ratings are 0)
 Do gradient descent:
• P ← P − η · ∇P
• Q ← Q − η · ∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif], and ∇qif = Σx,i −2 ( rxi − qi pxT ) pxf + 2 λ qif
• here qif is entry f of row qi of matrix Q

 How to compute the gradient of a matrix? Compute the gradient of every element independently.

 Observation: computing gradients is slow!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

(Same objective and updates as the previous Gradient Descent slide: initialize P and Q, then P ← P − η · ∇P and Q ← Q − η · ∇Q, with the gradients computed over all ratings. Observation: computing the full gradients is slow.)

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD

 Observation: ∇Q = [∇qif], where ∇qif = Σx,i −2 ( rxi − qi pxT ) pxf + 2 λ qif = Σx,i ∇Q(rxi)
• here qif is entry f of row qi of matrix Q

 GD: Q ← Q − η Σx,i ∇Q(rxi)
 SGD: Q ← Q − μ ∇Q(rxi)

 Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 Faster convergence!
• Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

 Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

 GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 Initialize P and Q (e.g., using SVD; pretend the missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors. For each rxi:
• εxi = rxi − qi · pxT   (derivative of the "error")
• qi ← qi + η ( εxi px − λ qi )   (update equation)
• px ← px + η ( εxi qi − λ px )   (update equation)

 2 for-loops: for each pass, until convergence:
• for each rxi: compute the gradient, do a "step"

η … learning rate
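The update equations above translate almost line for line into code. This is an illustrative sketch on a tiny made-up rating list (the function name, data, random initialization, and hyperparameters are all assumptions, not the lecture's):

```python
import random

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.02,
                  epochs=500):
    """Factor R ~ Q P^T with the per-rating SGD updates from the slide:
         eps_xi = r_xi - q_i . p_x
         q_i <- q_i + eta * (eps_xi * p_x - lam * q_i)
         p_x <- p_x + eta * (eps_xi * q_i - lam * p_x)
    `ratings` is a list of (user x, item i, r_xi) triples."""
    rng = random.Random(0)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            eps = r - sum(Q[i][k] * P[x][k] for k in range(f))
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (eps * pxk - lam * qik)
                P[x][k] += eta * (eps * qik - lam * pxk)
    return P, Q

# tiny made-up training set: (user, item, rating)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]
P, Q = sgd_factorize(ratings, n_users=3, n_items=2)
predict = lambda x, i: sum(Q[i][k] * P[x][k] for k in range(len(P[x])))
```

After training, the SSE on the training ratings is far below what predicting zero everywhere would give, which is all the toy example is meant to show.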

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

 Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

 We want to make good recommendations on items that some user has not yet seen.

 So: set the values of P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will also predict the unknown ratings well.

 This is a case where we apply optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root mean squared error (RMSE)
– Netflix Cinematch RMSE: 0.9514
• Competition
– 2,700+ teams
– $1 million grand prize for a 10% improvement on the Cinematch result
– $50,000 2007 progress prize for an 8.43% improvement

[Tables of (user, movie, score) triples for the training and test data; the test scores are withheld]

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• The average rating is 3.68

(From TimelyDevelopment.com)

# ratings per movie
• Avg # ratings/movie: 5,627

# ratings per user
• Avg # ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

Most loved movies

Count    Avg rating   Title
137,812  4.593   The Shawshank Redemption
133,597  4.545   Lord of the Rings: The Return of the King
180,883  4.306   The Green Mile
150,676  4.460   Lord of the Rings: The Two Towers
139,050  4.415   Finding Nemo
117,456  4.504   Raiders of the Lost Ark
180,736  4.299   Forrest Gump
147,932  4.433   Lord of the Rings: The Fellowship of the Ring
149,199  4.325   The Sixth Sense
144,027  4.333   Indiana Jones and the Last Crusade

Challenges

• Size of data
– Scalability
– Keeping data in memory
• Missing data
– 99 percent missing
– Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂xi = μ + bx + bi + qi · pxT

μ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · pxT = user-movie interaction

 Baseline predictor (the bias terms): separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition

 User-movie interaction: characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 - 1 + 0.5 = 3.2
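The baseline prediction is just three numbers added together; a minimal sketch using the slide's own figures:

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: expected rating before modeling any user-movie interaction."""
    return mu + b_x + b_i

# Critical reviewer (b_x = -1) rating Star Wars (b_i = +0.5), overall mean 3.7:
print(round(baseline(3.7, -1.0, 0.5), 2))  # 3.2
```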

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find parameters. Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi - (μ + b_x + b_i + q_i·p_x^T) )²  +  λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:

• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1-1000, log scale) for three models: CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE drops as factors and biases are added.]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

1162019

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the ensemble (1-50); RMSE falls from about 0.889 with one predictor to about 0.871 with 50.]

RECSYS 115

Standing on June 26th, 2009

June 26th submission triggers 30-day "last call".

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continues to get small improvements in their scores
• Realizes that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com/

1162019


RECSYS 20

Sidenote: TF-IDF

f_ij = frequency of term (feature) i in doc (item) j
TF_ij = f_ij / max_k f_kj   (note: we normalize TF to discount for "longer" documents)

n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)

TF-IDF score: w_ij = TF_ij × IDF_i

Doc profile = set of words with highest TF-IDF scores, together with their scores
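The definitions above fit in a few lines; a minimal sketch (the log base is a common choice, not fixed by the slide, and the toy documents are invented):

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns w[j][term] = TF_ij * IDF_i,
    with TF normalized by the doc's most frequent term and IDF_i = log2(N/n_i)."""
    N = len(docs)
    n = {}  # n_i: number of docs containing term i
    for doc in docs:
        for t in set(doc):
            n[t] = n.get(t, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for t in doc:
            counts[t] = counts.get(t, 0) + 1
        max_f = max(counts.values())
        weights.append({t: (f / max_f) * math.log2(N / n[t])
                        for t, f in counts.items()})
    return weights

w = tf_idf([["apple", "apple", "pie"], ["apple", "cider"]])
# "apple" appears in every doc, so its IDF -- and hence its weight -- is 0.
```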

RECSYS 21

User Profiles and Prediction

User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from average rating for item, …

Prediction heuristic: given user profile x and item profile i, estimate
u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can list the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach

– Finding the appropriate features is hard (e.g., images, movies, music)

– Overspecialization:
  • Never recommends items outside user's content profile
  • People might have multiple interests
  • Unable to exploit quality judgments of other users

– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users andor items

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings.

Jaccard similarity measure:
• Problem: ignores the value of the rating

Cosine similarity measure:
• sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||)
• Problem: treats missing ratings as "negative"

Pearson correlation coefficient:
• S_xy = items rated by both users x and y
• sim(x, y) = Σ_{s∈S_xy} (r_xs - r̄_x)(r_ys - r̄_y) / ( sqrt(Σ_{s∈S_xy} (r_xs - r̄_x)²) · sqrt(Σ_{s∈S_xy} (r_ys - r̄_y)²) ),
  where r̄_x, r̄_y … avg. rating of x, y

Example: r_x = [1, _, _, 1, 3], r_y = [1, _, 2, 2, _]
• as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
• as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322
• Considers missing ratings as "negative"
• Solution: subtract the (row) mean

After centering, sim(A, B) vs. sim(A, C): 0.092 > -0.559
Notice: cosine similarity is correlation when data is centered at 0
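These measures can be checked directly; a small sketch using the r_x, r_y example vectors from the previous slide (assuming a 0 entry marks a missing rating):

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two sets of rated-item ids."""
    return len(a & b) / len(a | b)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def centered_cosine(u, v):
    """Pearson-style: subtract each user's mean over *rated* items,
    leave missing entries at 0, then take cosine."""
    def center(r):
        rated = [x for x in r if x != 0]
        mu = sum(rated) / len(rated)
        return [x - mu if x != 0 else 0.0 for x in r]
    return cosine(center(u), center(v))

rx = [1, 0, 0, 1, 3]   # ratings; 0 = missing
ry = [1, 0, 2, 2, 0]
print(jaccard({1, 4, 5}, {1, 3, 4}))  # 0.5 (sets of rated item ids)
print(round(cosine(rx, ry), 3))       # 0.302
```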

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.

Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
• Simple average: r̂_xi = (1/k) Σ_{y∈N} r_yi
• Similarity-weighted: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy
  (shorthand: s_xy = sim(x, y))

Many other tricks possible…
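The similarity-weighted average is a one-liner; a minimal sketch (the neighborhood similarities and ratings are invented for illustration):

```python
def predict_user_user(sims, ratings_for_i):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy, over the neighborhood N of
    users similar to x who rated item i.
    sims: {user y: s_xy}; ratings_for_i: {user y: r_yi}."""
    num = sum(s * ratings_for_i[y] for y, s in sims.items())
    den = sum(sims.values())
    return num / den

# Hypothetical neighborhood: two similar users rated item i with 4 and 2.
pred = predict_user_user({"y1": 0.9, "y2": 0.3}, {"y1": 4, "y2": 2})
print(round(pred, 2))  # 3.5
```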

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering.

Another view: item-item:
• For item i, find other similar items
• Estimate rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

users:  1  2  3  4  5  6  7  8  9 10 11 12
m1:     1  .  3  .  .  5  .  .  5  .  4  .
m2:     .  .  5  4  .  .  4  .  .  2  1  3
m3:     2  4  .  1  2  .  3  .  4  3  5  .
m4:     .  2  4  .  5  .  .  4  .  .  2  .
m5:     .  .  4  3  4  2  .  .  .  .  2  5
m6:     1  .  3  .  3  .  .  2  .  .  4  .

("." = unknown rating; ratings are between 1 and 5)

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix as above.]

Estimate the rating of movie 1 by user 5.

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix as above.]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows:
   sim(1, m) for m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix as above.]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Utility matrix as above, with the predicted 2.6 filled in at (movie 1, user 5).]

Predict by taking the weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
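The whole worked example can be recomputed from the matrix; a self-contained sketch that reproduces the slide's similarities and the 2.6 prediction:

```python
import math

M = {  # movie -> {user: rating}, transcribed from the slide's utility matrix
    1: {1: 1, 3: 3, 6: 5, 9: 5, 11: 4},
    2: {3: 5, 4: 4, 7: 4, 10: 2, 11: 1, 12: 3},
    3: {1: 2, 2: 4, 4: 1, 5: 2, 7: 3, 9: 4, 10: 3, 11: 5},
    4: {2: 2, 3: 4, 5: 5, 8: 4, 11: 2},
    5: {3: 4, 4: 3, 5: 4, 6: 2, 11: 2, 12: 5},
    6: {1: 1, 3: 3, 5: 3, 8: 2, 11: 4},
}

def pearson(a, b):
    """Mean-center each movie's ratings, then take the cosine of the overlap."""
    ma = sum(a.values()) / len(a)
    mb = sum(b.values()) / len(b)
    ca = {u: r - ma for u, r in a.items()}
    cb = {u: r - mb for u, r in b.items()}
    dot = sum(ca[u] * cb[u] for u in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# Candidate neighbors: movies rated by user 5; keep the 2 most similar to movie 1.
sims = {j: pearson(M[1], M[j]) for j in M if j != 1 and 5 in M[j]}
top2 = sorted(sims, key=sims.get, reverse=True)[:2]          # movies 3 and 6
pred = sum(sims[j] * M[j][5] for j in top2) / sum(sims[j] for j in top2)
print(sorted(top2), round(pred, 1))  # [3, 6] 2.6
```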

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j.

Select K nearest neighbors (KNN) N(i; x):
• set of items most similar to item i that were rated by x

Estimate rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David rating movies Avatar, LOTR, Matrix, Pirates.]

1162019

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity: the user-ratings matrix is sparse; hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods

Implement two or more different recommenders and combine predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering:
• Item profiles for the new item problem
• Demographics to deal with the new user problem

1162019
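A linear combination of two recommenders can be fit by least squares on held-out ratings; a hypothetical sketch (all numbers invented for illustration):

```python
import numpy as np

p1 = np.array([3.0, 4.0, 2.0, 5.0])    # predictions from recommender 1
p2 = np.array([2.5, 4.5, 2.5, 4.0])    # predictions from recommender 2
truth = np.array([3.0, 4.2, 2.2, 4.8]) # held-out true ratings

# Fit blend weights w minimizing ||[p1 p2] w - truth||^2.
X = np.column_stack([p1, p2])
w, *_ = np.linalg.lstsq(X, truth, rcond=None)
blended = X @ w

# The blend can never be worse (in SSE) than either input predictor,
# since w = (1, 0) and w = (0, 1) are both feasible.
```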

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: users × movies, with known ratings 1-5 scattered among mostly missing entries.]

1162019

RECSYS 43

Evaluation

[Same utility matrix, with a subset of the known ratings withheld as the test data set.]

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:
• Root-mean-square error (RMSE): sqrt( (1/|T|) Σ_{(x,i)∈T} (r̂_xi - r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
• Precision at top 10: % of relevant items among the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve: y-axis True Positive Rate (TPR), x-axis False Positive Rate (FPR)
    • TPR (a.k.a. Recall) = TP / P = TP / (TP + FN)
    • FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
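RMSE is the metric used throughout the Netflix discussion later in the deck; a minimal sketch (the predictions and true ratings are made up):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the ratings in the test set."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

print(round(rmse([3.5, 4.0, 1.0], [4.0, 4.0, 2.0]), 4))  # 0.6455
```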

RECSYS 45

Problems with Error Measures

Narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only to predict high ratings:
• RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime:
• Could pre-compute using clustering as an approximation
• Naïve pre-computation takes time O(N · |C|)
  (|C| = # of clusters = k in k-means; N = # of data points)

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

Leverage all the data:
• Don't try to reduce data size in an effort to make fancy algorithms work
• Simple methods on large data do best

Add more data:
• e.g., add IMDB data on genres

More data beats better algorithms!
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a 2-D latent factor space, one axis running from "geared towards females" to "geared towards males", the other from "serious" to "funny"; example movies include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: ~480,000 users × 17,700 movies, sparsely filled with ratings 1-5.]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R: 480,000 users × 17,700 movies; some known ratings (e.g., the entry r_36 = 5) are withheld as the test data set, the rest form the training data set.]

SSE = Σ_{(i,x)∈R} (r̂_xi - r_xi)², where r̂_xi is the predicted and r_xi the true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T:
• R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: the 6-movies × 12-users rating matrix R factored as Q (items × f factors) times P^T (f factors × users), analogous to SVD: A = U Σ V^T.]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of P^T

[Figure: R ≈ Q · P^T, highlighting row q_i of Q and column p_x of P^T.]


RECSYS 56

Ratings as Products of Factors

[Same figure; the missing entry is now filled in with the estimate r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf = 2.4.]

q_i = row i of Q
p_x = column x of P^T

RECSYS 57

Latent Factor Models

[Figure: movies plotted in the 2-D latent factor space (factor 1: "geared towards females" vs. "geared towards males"; factor 2: "serious" vs. "funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]


RECSYS 59

Recap: SVD

Remember SVD:
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

SVD gives minimum reconstruction error (SSE):
min_{U,Σ,V} Σ_{ij} ( A_ij - [U Σ V^T]_ij )²

So, in our case, "SVD" on Netflix data: R ≈ Q · P^T
• A = R, Q = U, P^T = Σ V^T
• r̂_xi = q_i · p_x^T

But we are not done yet! The sum goes over all entries, and our R has missing entries.

[Figure: A (m × n) ≈ U Σ V^T.]
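The "minimum reconstruction error" claim is easy to check numerically when no entries are missing; a sketch with an invented fully-known matrix (the leftover SSE of the best rank-k approximation equals the sum of squared discarded singular values, by Eckart-Young):

```python
import numpy as np

A = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-2 approximation in SSE
sse = ((A - A_k) ** 2).sum()                 # equals s[2]**2, the discarded value
```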

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} ( r_xi - q_i · p_x^T )², with r̂_xi = q_i · p_x^T

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · P^T (items × f factors times f factors × users).]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: minimize SSE on training data:
• Want large f (# of factors) to capture all the signals
• But, SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_training ( r_xi - q_i·p_x^T )²  +  λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )
          ("error")                            ("length")
λ … regularization parameter

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: the 2-D latent space again, now with users Gus and Dave placed among the movies; recommendations come from nearby movies.]


RECSYS 64

The Effect of Regularization

[Figure, repeated over several animation frames: movies in the 2-D latent space as the regularized objective is optimized; factors for sparsely rated movies such as The Princess Diaries are shrunk toward the origin.]

min_factors "error" + λ "length":
min_{P,Q} Σ_training ( r_xi - q_i·p_x^T )²  +  λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_training ( r_xi - q_i·p_x^T )²  +  λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P - η · ∇P
  • Q ← Q - η · ∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
  • here q_if is entry f of row q_i of matrix Q
• Observation: computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

[Same content as the "Use Gradient Descent" slide above, repeated after the digression.]

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
(here q_if is entry f of row q_i of matrix Q)

GD:  Q ← Q - η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q - η ∇Q(r_xi)

• Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
• Faster convergence: need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD:

[Chart: value of the objective function vs. iteration/step; GD decreases smoothly, SGD decreases in a noisy way.]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi - q_i · p_x^T   (derivative of the "error")
• q_i ← q_i + η (ε_xi p_x - λ q_i)   (update equation)
• p_x ← p_x + η (ε_xi q_i - λ p_x)   (update equation)

2 for loops:
• For until convergence:
  • For each r_xi:
    • Compute gradient, do a "step"

η … learning rate
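The two update equations above are enough for a tiny, self-contained sketch (toy ratings, plain Python; η, λ, the factor count, and the data are all made up for illustration):

```python
import random

def sgd_factor(ratings, n_users, n_items, f=2, eta=0.02, lam=0.1, epochs=500):
    """ratings: list of (x, i, r_xi) triples. Learns user factors P and item
    factors Q with the slide's SGD updates:
      eps = r - q.p ; q += eta*(eps*p - lam*q) ; p += eta*(eps*q - lam*p)."""
    random.seed(0)  # small random init instead of an SVD init
    P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            eps = r - sum(Q[i][k] * P[x][k] for k in range(f))
            for k in range(f):
                q, p = Q[i][k], P[x][k]
                Q[i][k] += eta * (eps * p - lam * q)
                P[x][k] += eta * (eps * q - lam * p)
    return P, Q

# 3 users x 2 items: users 0 and 1 like item 0, user 2 likes item 1.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 1), (2, 0, 1), (2, 1, 5)]
P, Q = sgd_factor(ratings, n_users=3, n_items=2)
```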

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: make good recommendations:
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen

Let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Utility matrix with known ratings.]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

[Figure: sample (user, movie, score) tables for the training data and the test data]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie
• Avg ratings/movie: 5627

#ratings per user
• Avg ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count / Avg rating / Title
137812 / 4.593 / The Shawshank Redemption
133597 / 4.545 / Lord of the Rings: The Return of the King
180883 / 4.306 / The Green Mile
150676 / 4.460 / Lord of the Rings: The Two Towers
139050 / 4.415 / Finding Nemo
117456 / 4.504 / Raiders of the Lost Ark
180736 / 4.299 / Forrest Gump
147932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
149199 / 4.325 / The Sixth Sense
144027 / 4.333 / Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementary predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

rxi = μ + bx + bi + qi · pxT
μ = overall mean rating, bx = bias of user x, bi = bias of movie i

User-Movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, bx = -1. Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5.

Predicted rating for you on Star Wars:
= 3.7 - 1 + 0.5 = 3.2
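In code, the baseline prediction is just the sum of the three terms; a trivial sketch:

```python
def baseline(mu, b_user, b_item):
    """Baseline predictor: global mean plus user bias plus item bias."""
    return mu + b_user + b_item

# Critical reviewer (b_x = -1) rating Star Wars (b_i = +0.5), global mean 3.7
print(round(baseline(3.7, -1.0, 0.5), 1))  # 3.2
```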

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: Both the biases bx, bi as well as the interactions qi, px are treated as parameters (we estimate them).

1162019

goodness of fit + regularization:

min over Q, P, b:  Σ_{(x,i)∈R} ( rxi − (μ + bx + bi + qi · pxT) )²  +  λ ( Σ_i ||qi||² + Σ_x ||px||² + Σ_x bx² + Σ_i bi² )

λ is selected via grid-search on a validation set
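One SGD step for this biased model can be sketched as follows. This is a hedged illustration: μ is held fixed, η and λ are placeholder values, and all of bx, bi, qi, px are updated from the same prediction error, as the slide describes.

```python
def sgd_biased_step(mu, b_x, b_i, q_i, p_x, r_xi, eta=0.01, lam=0.1):
    """One stochastic-gradient step for r_xi ~= mu + b_x + b_i + q_i . p_x."""
    pred = mu + b_x + b_i + sum(q * p for q, p in zip(q_i, p_x))
    err = r_xi - pred
    b_x2 = b_x + eta * (err - lam * b_x)                       # user bias
    b_i2 = b_i + eta * (err - lam * b_i)                       # item bias
    q2 = [q + eta * (err * p - lam * q) for q, p in zip(q_i, p_x)]
    p2 = [p + eta * (err * q - lam * p) for q, p in zip(q_i, p_x)]
    return b_x2, b_i2, q2, p2

# A single step should move the prediction toward the observed rating
b_x, b_i, q, p = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]
b_x2, b_i2, q2, p2 = sgd_biased_step(3.7, b_x, b_i, q, p, r_xi=5.0)
```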

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE decreases as model size grows]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004). Possible causes:
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Do users prefer new movies without any reason?
• Or are older movies just inherently better than newer ones?

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: rxi = μ + bx + bi + qi · pxT

Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · pxT

Make parameters bx and bi depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
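A sketch of one simple way to realize bx(t): the linear-trend parameterization mentioned in (1). The trend value and the dates below are invented for illustration.

```python
def bias_t(b, trend, t, t_ref):
    """Time-dependent bias b(t) = b + trend * (t - t_ref) (linear-trend form).
    Alternatively, t could be mapped to one of the 10-week bins from (2)."""
    return b + trend * (t - t_ref)

def predict(mu, b_x_t, b_i_t, q_i, p_x_t):
    """r_xi = mu + b_x(t) + b_i(t) + q_i . p_x(t)"""
    return mu + b_x_t + b_i_t + sum(q * p for q, p in zip(q_i, p_x_t))

b_x = bias_t(-1.0, 0.002, t=150, t_ref=100)  # user's bias drifts up over 50 days
r = predict(3.7, b_x, 0.5, [0.5], [1.0])
```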

RECSYS 111

Adding Temporal Effects

1162019

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate... Try a "kitchen sink" approach

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE, from ~0.889 down to ~0.871) vs. number of predictors (1-50); the ensemble's RMSE drops steeply for the first few predictors, then flattens]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10% improvement

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.

24 hours before the deadline... a BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's.

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• ...and everyone waits...

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

1162019

[Chart data: ensemble RMSE by number of predictors; 1: 0.8891, 5: 0.8767, 10: 0.8742, 20: 0.8727, 30: 0.8719, 40: 0.8715, 50: 0.8712]

RECSYS 21

User Profiles and Prediction

User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from average rating for item
• …

Prediction heuristic: Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)

1162019

RECSYS 22

Pros Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard (e.g., images, movies, music)
– Overspecialization:
  • Never recommends items outside the user's content profile
  • People might have multiple interests
  • Unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: Recommend items based on the past transactions of users. Analyze relations between users and/or items. Specific data characteristics are irrelevant:
• Domain-free: user/item attributes are not necessary
• Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x
2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let rx be the vector of user x's ratings.

Jaccard similarity measure: treat rx, ry as sets of rated items, e.g. rx = {1, 4, 5}, ry = {1, 3, 4}.
Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||), treating rx, ry as points with missing ratings as 0, e.g. rx = [1, 0, 0, 1, 3], ry = [1, 0, 2, 2, 0].
Problem: treats missing ratings as "negative".

Pearson correlation coefficient: let Sxy = items rated by both users x and y, and let r̄x, r̄y be the average ratings of x and y:
sim(x, y) = Σ_{s∈Sxy} (rxs − r̄x)(rys − r̄y) / ( sqrt(Σ_{s∈Sxy} (rxs − r̄x)²) · sqrt(Σ_{s∈Sxy} (rys − r̄y)²) )

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322
• Considers missing ratings as "negative"
• Solution: subtract the (row) mean before computing the cosine

Centered cosine: sim(A, B) vs. sim(A, C): 0.092 > -0.559
Notice: cosine similarity is correlation when the data is centered at 0
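These numbers can be reproduced in a few lines. A sketch assuming the standard 7-movie example matrix these slides are adapted from (the rows A, B, C below are that assumed data; 0 encodes a missing rating):

```python
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity of the sets of rated items (ignores rating values)."""
    A = {i for i, r in enumerate(a) if r}
    B = {i for i, r in enumerate(b) if r}
    return len(A & B) / len(A | B)

def cosine(a, b):
    """Cosine similarity; missing ratings stay 0 and act as 'negative'."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def center(a):
    """Subtract the row mean from rated entries; missing entries stay 0."""
    rated = [r for r in a if r]
    m = sum(rated) / len(rated)
    return [r - m if r else 0.0 for r in a]

A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

assert jaccard(A, B) < jaccard(A, C)   # 1/5 < 2/4: the wrong ordering
assert cosine(A, B) > cosine(A, C)     # ~0.38 > ~0.32: right, for bad reasons
sim_ab = cosine(center(A), center(B))  # ~0.092
sim_ac = cosine(center(A), center(C))  # ~-0.559
```

The centered variant recovers the intuitive ordering with a wide margin, which is why Pearson/centered cosine is the common default.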

RECSYS 30

Rating Predictions

Let rx be the vector of user x's ratings. Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
• r̂xi = (1/k) Σ_{y∈N} ryi
• Better option: r̂xi = Σ_{y∈N} sxy · ryi / Σ_{y∈N} sxy
• Many other tricks possible...

Shorthand: sxy = sim(x, y)
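The weighted-average rule can be sketched directly; the similarity weights and neighbor ratings below are illustrative values, not taken from the slides:

```python
def predict_user_user(sims, neighbor_ratings):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy over the neighbor set N."""
    return (sum(s * r for s, r in zip(sims, neighbor_ratings))
            / sum(sims))

# Two neighbors rated item i with 4 and 2; the first is more similar,
# so the prediction lands closer to 4
print(round(predict_user_user([0.8, 0.5], [4, 2]), 2))  # 3.23
```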

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view: Item-item. For item i, find other similar items. Estimate the rating for item i based on the target user's ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model:

r̂xi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij

sij … similarity of items i and j
rxj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Figure: utility matrix; movies 1-6 (rows) × users 1-12 (columns); ratings 1-5, blanks are unknown]

RECSYS 33

Item-Item CF (|N|=2)

[Figure: the same utility matrix; goal: estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

[Figure: the same matrix; neighbor selection: identify movies similar to movie 1 that were rated by user 5]

sim(1, m) for m = 1..6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating mi from each movie i:
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the rows

RECSYS 35

Item-Item CF (|N|=2)

[Figure: the same matrix with the sim(1, m) column]

Compute similarity weights: s13 = 0.41, s16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Figure: the same matrix with the predicted entry r15 = 2.6 filled in]

Predict by taking the weighted average:
r15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
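The whole worked example can be checked in code. The three movie rows below are read off the slide's matrix figure; they are an assumption about the figure's exact values, chosen so the slide's numbers (0.41, 0.59, 2.6) come out:

```python
from math import sqrt

# Ratings {user: rating} for movies 1, 3, 6 (assumed from the slide's figure)
movie = {
    1: {1: 1, 3: 3, 6: 5, 9: 5, 11: 4},
    3: {1: 2, 2: 4, 4: 1, 5: 2, 7: 3, 9: 4, 10: 3, 11: 5},
    6: {1: 1, 3: 3, 5: 3, 8: 2, 11: 4},
}

def pearson(ri, rj):
    """Pearson similarity: center each movie's row by its mean, then cosine."""
    mi = sum(ri.values()) / len(ri)
    mj = sum(rj.values()) / len(rj)
    common = ri.keys() & rj.keys()
    dot = sum((ri[u] - mi) * (rj[u] - mj) for u in common)
    ni = sqrt(sum((r - mi) ** 2 for r in ri.values()))
    nj = sqrt(sum((r - mj) ** 2 for r in rj.values()))
    return dot / (ni * nj)

s13 = pearson(movie[1], movie[3])   # ~0.41
s16 = pearson(movie[1], movie[6])   # ~0.59
# User 5 rated movie 3 with 2 and movie 6 with 3; weighted average:
r15 = (s13 * movie[3][5] + s16 * movie[6][5]) / (s13 + s16)   # ~2.6
```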

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

r̂xi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij

N(i;x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Figure: example utility matrix; users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

1162019

In practice, it has been observed that item-item often works better than user-user.
Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine their predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering:
• Item profiles for the new-item problem
• Demographics to deal with the new-user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Figure: utility matrix of user-movie ratings (movies × users)]

1162019

RECSYS 43

Evaluation

[Figure: the same utility matrix with some known ratings held out as a test data set]

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.

• Root-mean-square error (RMSE): sqrt( (1/N) Σ_{xi} (r̂xi − rxi)² ), where r̂xi is the predicted rating and rxi is the true rating of x on i
• Precision at top 10: % of relevant items among the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve: Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
    • TPR (aka Recall) = TP / P = TP / (TP + FN)
    • FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall

RECSYS 45

Problems with Error Measures: A narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for the others.

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|). (Recall that X = set of customers in the system.) Too expensive to do at runtime; could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N · |C|) (|C| = # of clusters = k in k-means, N = # of data points).

We already know how to do this:
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data. Leverage all the data:
• Don't try to reduce the data size in an effort to make fancy algorithms work
• Simple methods on large data do best
• Add more data, e.g., add IMDB data on genres
• More data beats better algorithms:
  http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a 2D latent space; axis 1: geared towards females ↔ geared towards males; axis 2: serious ↔ funny. Examples: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Figure: utility matrix R; 480,000 users × 17,700 movies; ratings 1-5, mostly missing]

RECSYS 52

Utility Matrix R Evaluation

[Figure: utility matrix R (480,000 users × 17,700 movies) split into a training data set and a held-out test data set; example predicted rating r̂36]

SSE = Σ_{(x,i)∈R} (r̂xi − rxi)²
(r̂xi = predicted rating, rxi = true rating of user x on item i)

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT. R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); compare SVD: A = U Σ VT]
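The shape bookkeeping is easy to get wrong, so here is a tiny sketch of reading a rating out of the two factor matrices (the factor values are made up for illustration, not learned from Netflix data):

```python
# Rank-2 factors: 3 items x 4 users
Q = [[1.0, 0.2],
     [0.8, -0.5],
     [0.3, 1.1]]               # items x f
PT = [[0.5, 1.2, 0.0, 2.0],
      [1.0, 0.0, 0.7, -0.4]]   # f x users

def rating(Q, PT, i, x):
    """r_xi = q_i . p_x = sum over factors f of Q[i][f] * PT[f][x]."""
    return sum(Q[i][f] * PT[f][x] for f in range(len(PT)))

r01 = rating(Q, PT, i=0, x=1)  # 1.0*1.2 + 0.2*0.0 = 1.2
```

Every entry of the reconstructed matrix, missing or not, is defined this way; the modeling question is which entries the fit should care about.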

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Figure: R ≈ Q · PT with the missing entry highlighted]

r̂xi = qi · pxT = Σ_f qif · pxf
(qi = row i of Q, px = column x of PT)


RECSYS 57

Latent Factor Models

[Figure: movies placed on two latent factors; Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]


RECSYS 59

Recap: SVD

Remember SVD:
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (Aij − (UΣVT)ij)²

So in our case, "SVD" on the Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT

But we are not done yet! The SSE sum goes over all entries, and our R has missing entries.

[Figure: A (m × n) ≈ U Σ VT]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (rxi − qi · pxT)²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · PT (items × users), with r̂xi = qi · pxT]

RECSYS 61

Dealing with Missing Entries: We want to minimize the SSE for the unseen test data.

Idea: Minimize SSE on the training data. We want a large f (# of factors) to capture all the signals, but SSE on the test data begins to rise for f > 2. Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{training} (rxi − qi · pxT)² + λ (Σ_x ||px||² + Σ_i ||qi||²)

λ … regularization parameter; the objective is "error" + λ · "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the BellKor Team)

[Figure: users (e.g., Gus, Dave) and movies placed together in the latent space; serious ↔ escapist, geared towards females ↔ geared towards males]


RECSYS 64

The Effect of Regularization

[Figure: movies on Factor 1 / Factor 2; regularization shrinks poorly supported factors toward the origin]

min_factors "error" + λ · "length":
min_{P,Q} Σ_{training} (rxi − qi · pxT)² + λ (Σ_x ||px||² + Σ_i ||qi||²)


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{training} (rxi − qi · pxT)² + λ (Σ_x ||px||² + Σ_i ||qi||²)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif], and ∇qif = Σ_{x,i} −2 (rxi − qi · pxT) pxf + 2λ qif
  • here qif is entry f of row qi of matrix Q
• Observation: Computing the gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{training} (rxi − qi · pxT)² + λ (Σ_x ||px||² + Σ_i ||qi||²)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • where ∇Q = [∇qif] and ∇qif = Σ_{x,i} −2 (rxi − qi · pxT) pxf + 2λ qif
• Observation: Computing the gradients is slow!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
Observation: ∇Q = [∇qif], where
∇qif = Σ_{x,i} −2 (rxi − qi · pxT) pxf + 2λ qif = Σ_{x,i} ∇Q(rxi)
• here qif is entry f of row qi of matrix Q

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
• GD: Q ← Q − η Σ_{rxi} ∇Q(rxi)
• SGD: Q ← Q − μ ∇Q(rxi)

Faster convergence!
• Need more steps, but each step is computed much faster
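The GD-vs-SGD distinction can be seen with a single scalar factor q shared by a few ratings. The numbers below are toy values; both one-step variants should lower the regularized objective, but the SGD variant only ever touches one rating at a time:

```python
def grad_q(q, p, r, lam):
    """Per-rating gradient of (r - q*p)^2 + lam*q^2 with respect to q."""
    return -2 * (r - q * p) * p + 2 * lam * q

def objective(q, ratings, lam):
    return sum((r - q * p) ** 2 for p, r in ratings) + lam * q * q

ratings = [(1.0, 4.0), (0.5, 2.0), (2.0, 5.0)]  # (p_x, r_xi) pairs
q0, lam, eta = 0.1, 0.05, 0.01

# GD: one step using the gradient summed over all ratings
q_gd = q0 - eta * sum(grad_q(q0, p, r, lam) for p, r in ratings)

# SGD: one cheap step per rating, gradient taken at the current iterate
q_sgd = q0
for p, r in ratings:
    q_sgd -= eta * grad_q(q_sgd, p, r, lam)
```

Over a full pass, SGD does the same amount of arithmetic as one GD step here, but on a dataset with 100 million ratings it makes 100 million small corrections per pass instead of one large one.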

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (e.g. using SVD, pretending missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i·p_xᵀ (derivative of the "error")
• q_i ← q_i + η(ε_xi·p_x − λ·q_i) (update equation)
• p_x ← p_x + η(ε_xi·q_i − λ·p_x) (update equation)

2 for loops: for until convergence: for each r_xi: compute gradient, do a "step"

η … learning rate
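The two-loop procedure above can be sketched in NumPy. This is a minimal illustration, not the contestants' code: the toy ratings, matrix sizes, and hyperparameters (eta, lam, epochs) are all invented.

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.05, lam=0.02, epochs=500, seed=0):
    """SGD for R ~= Q . P^T over (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_items, f))  # item factors, rows q_i
    P = rng.normal(scale=0.1, size=(n_users, f))  # user factors, rows p_x
    for _ in range(epochs):                       # "for until convergence"
        for x, i, r in ratings:                   # "for each r_xi"
            err = r - Q[i] @ P[x]                 # eps_xi = r_xi - q_i . p_x^T
            qi = Q[i].copy()                      # keep the old q_i for the p_x update
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * qi - lam * P[x])
    return P, Q

# invented toy data: (user, item, rating)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]
P, Q = sgd_factorize(ratings, n_users=3, n_items=2)
```

After training, `Q[i] @ P[x]` approximates r_xi on the known ratings.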

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen. So: set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data and test data as (user, movie, score) triples]

Movie rating data

Overall rating distribution

• A third of ratings are 4s • Average rating is 3.68

From TimelyDevelopment.com

ratings per movie

• Avg. ratings/movie: 5627

ratings per user

• Avg. ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

r_xi = μ + b_x + b_i + q_i·p_xᵀ (user bias + movie bias + user-movie interaction)

User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
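The arithmetic of the example as a few lines of code, with the values taken straight from the slide:

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars: 0.5 above the average movie

baseline = mu + b_x + b_i   # baseline predictor b_xi = mu + b_x + b_i
print(round(baseline, 1))   # 3.2
```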

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_xᵀ))² + λ(Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)

(goodness of fit + regularization; λ is selected via grid-search on a validation set)

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
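A sketch of the corresponding SGD updates, now with learned biases. This is an illustration, not the BellKor implementation: the toy data and hyperparameters are invented, and a single λ is shared across all parameter types for brevity.

```python
import numpy as np

def sgd_with_biases(ratings, n_users, n_items, f=2, eta=0.02, lam=0.02, epochs=500, seed=0):
    """Fit r_xi ~= mu + b_x + b_i + q_i . p_x^T by SGD.
    Biases b_x, b_i are learned parameters alongside the factors."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])   # global mean rating
    bu = np.zeros(n_users)                     # user biases b_x
    bi = np.zeros(n_items)                     # item biases b_i
    P = rng.normal(scale=0.1, size=(n_users, f))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - (mu + bu[x] + bi[i] + Q[i] @ P[x])
            bu[x] += eta * (err - lam * bu[x])
            bi[i] += eta * (err - lam * bi[i])
            qi = Q[i].copy()
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * qi - lam * P[x])
    return lambda x, i: mu + bu[x] + bi[i] + Q[i] @ P[x]

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 0, 1)]
predict = sgd_with_biases(ratings, n_users=3, n_items=2)
```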

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age:
• Users prefer new movies, without any obvious reasons
• Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_xᵀ

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
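One minimal way to realize a binned, time-dependent bias b_i(t) as described above. The 70-day bin width follows the "10 consecutive weeks" remark; the class and helper names are invented for illustration.

```python
def bin_of(t, bin_days=70):
    """Map a day index t to a 10-week (70-day) bin."""
    return t // bin_days

class TimeBias:
    """b_i(t) = b_i + b_{i,Bin(t)}: a static item bias plus a per-bin offset."""
    def __init__(self, n_bins):
        self.b = 0.0                  # static part b_i
        self.b_bin = [0.0] * n_bins   # per-bin offsets
    def __call__(self, t):
        return self.b + self.b_bin[bin_of(t)]

bias = TimeBias(n_bins=6)
bias.b = 0.3
bias.b_bin[1] = 0.2    # the movie was more popular during weeks 10-20
print(bias(100))       # day 100 falls in bin 1 -> 0.5
```

In the full model these offsets would be fit by SGD together with the other parameters.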

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) vs. number of predictors in the ensemble]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions.
• This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09. http://www2.research.att.com/~volinsky/netflix/bpc.html http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Chart data for "Effect of ensemble size": RMSE decreases from 0.8891 with 1 predictor to 0.8712 with 50 predictors]

RECSYS 22

Pros Content-based Approach

+ No need for data on other users No cold-start or sparsity problems

+ Able to recommend to users with unique tastes

+ Able to recommend new & unpopular items: no first-rater problem

+ Able to provide explanations: can explain recommended items by listing the content features that caused an item to be recommended

1162019

RECSYS 23

Cons: Content-based Approach
– Finding the appropriate features is hard (e.g. images, movies, music)
– Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users
– Recommendations for new users: how to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: recommend items based on past transactions of users; analyze relations between users and/or items.

Specific data characteristics are irrelevant: domain-free; user/item attributes are not necessary; can identify elusive aspects.

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative FilteringUser-User Collaborative Filtering

1. Consider user x.

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g. using k-nearest neighbors (KNN).

3. Recommend items to x based on the weighted ratings of items by users in N.

RECSYS 28

Similar Users: Let r_x be the vector of user x's ratings.

Jaccard similarity measure. Problem: ignores the value of the rating.
• r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x||·||r_y||). Problem: treats missing ratings as "negative".
• r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]

Pearson correlation coefficient, with S_xy = items rated by both users x and y:

sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) )

r̄_x, r̄_y … avg. rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4

Cosine similarity: 0.386 > 0.322, but it considers missing ratings as "negative". Solution: subtract the (row) mean.

After centering, sim(A, B) vs. sim(A, C): 0.092 > −0.559. Notice: cosine similarity is correlation when the data is centered at 0.
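A sketch of the centered-cosine (Pearson) computation, assuming the standard CS246 toy matrix behind these numbers (users A, B, C; 0 marks a missing rating):

```python
import numpy as np

# Rows = users A, B, C; columns = items; 0 = missing rating
R = np.array([
    [4, 0, 0, 5, 1, 0, 0],   # A
    [5, 5, 4, 0, 0, 0, 0],   # B
    [0, 0, 0, 2, 4, 5, 0],   # C
], dtype=float)

def centered_cosine(u, v):
    """Subtract each user's mean over rated items, keep missing entries
    at 0, then take the cosine of the centered vectors."""
    def center(r):
        rated = r != 0
        c = np.zeros_like(r)
        c[rated] = r[rated] - r[rated].mean()
        return c
    cu, cv = center(u), center(v)
    return cu @ cv / (np.linalg.norm(cu) * np.linalg.norm(cv))

print(round(float(centered_cosine(R[0], R[1])), 3))  # sim(A, B) ~ 0.092
print(round(float(centered_cosine(R[0], R[2])), 3))  # sim(A, C) ~ -0.559
```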

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings. Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
• r̂_xi = (1/k) Σ_{y∈N} r_yi
• r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy

Other options: many other tricks possible…

Shorthand: s_xy = sim(x, y)
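The similarity-weighted average can be sketched as follows; the user names and similarity values here are invented for illustration:

```python
def predict(sim_x, ratings_i, k=2):
    """Weighted-average prediction of r_xi.
    sim_x: {user: s_xy} similarities to user x;
    ratings_i: {user: r_yi} for users who rated item i."""
    # keep the k neighbours most similar to x that rated item i
    neigh = sorted((y for y in ratings_i if y in sim_x),
                   key=lambda y: sim_x[y], reverse=True)[:k]
    num = sum(sim_x[y] * ratings_i[y] for y in neigh)
    den = sum(sim_x[y] for y in neigh)
    return num / den

sim_x = {"B": 0.9, "C": 0.4, "D": 0.1}   # invented similarities s_xy
ratings_i = {"B": 5, "C": 3, "D": 1}     # their ratings of item i
print(predict(sim_x, ratings_i))         # (0.9*5 + 0.4*3) / (0.9 + 0.4)
```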

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: user-user collaborative filtering. Another view: item-item.
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

Utility matrix, movies 1-6 (rows) × users 1-12 (columns):

movie 1: 1 - 3 - - 5 - - 5 - 4 -
movie 2: - - 5 4 - - 4 - - 2 1 3
movie 3: 2 4 - 1 2 - 3 - 4 3 5 -
movie 4: - 2 4 - 5 - - 4 - - 2 -
movie 5: - - 4 3 4 2 - - - - 2 5
movie 6: 1 - 3 - 3 - - 2 - - 4 -

- : unknown rating; numbers: ratings between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix of 6 movies × 12 users]

Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix of 6 movies × 12 users]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1+3+5+5+4)/5 = 3.6, so centered row 1 = [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the rows
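Step 1 can be checked in a few lines of NumPy (row 1 of the utility matrix, with 0 for missing ratings):

```python
import numpy as np

row1 = np.array([1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0], dtype=float)

rated = row1 != 0
m1 = row1[rated].mean()                      # (1+3+5+5+4)/5 = 3.6
centered = np.where(rated, row1 - m1, 0.0)   # [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
print(float(m1))                             # 3.6
```

Repeating this per row and taking cosines of the centered rows yields the sim(1, m) values.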

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix of 6 movies × 12 users]

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix, now with the predicted entry r_15 = 2.6 filled in]

Predict by taking the weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
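The prediction on this slide can be reproduced directly, using the two neighbour movies of movie 1 that user 5 rated:

```python
# neighbours N(1; 5) = {movie 3, movie 6} with s_13 = 0.41, s_16 = 0.59,
# and user 5's ratings r_53 = 2, r_56 = 3
sims = {3: 0.41, 6: 0.59}
ratings_user5 = {3: 2, 6: 3}

num = sum(sims[j] * ratings_user5[j] for j in sims)
den = sum(sims.values())
r_15 = num / den
print(round(r_15, 1))   # 2.6
```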

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select the k nearest neighbors N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Small example matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user. Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed.

− Cold start: need enough users in the system to find a match.

− Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items.

− First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).

− Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items.

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine their predictions, perhaps using a linear model. Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem.

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users with known ratings]

RECSYS 43

Evaluation

[Same utility matrix; some withheld entries form the test data set]

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.
• Root-mean-square error (RMSE): √( (1/N) Σ_{xi} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
• Precision at top 10: % of relevant items among the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  – Coverage: number of items/users for which the system can make predictions
  – Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  – Receiver operating characteristic (ROC) curve: y-axis: true positive rate (TPR), x-axis: false positive rate (FPR)
    • TPR (aka recall) = TP / P = TP / (TP + FN)
    • FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
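Minimal implementations of two of these measures (RMSE and precision@k), with invented toy predictions:

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the test ratings."""
    se = [(pred[k] - truth[k]) ** 2 for k in truth]
    return math.sqrt(sum(se) / len(se))

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top = ranked_items[:k]
    return sum(1 for i in top if i in relevant) / k

truth = {("x", "i1"): 4, ("x", "i2"): 2}
pred = {("x", "i1"): 3.5, ("x", "i2"): 2.5}
print(rmse(pred, truth))                                      # 0.5
print(precision_at_k(["i1", "i3", "i2"], {"i1", "i2"}, k=2))  # 0.5
```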

RECSYS 45

Problems with Error Measures: A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions.

In practice, we care only about predicting high ratings. RMSE might penalize a method that does well for high ratings and badly for others.

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.

Too expensive to do at runtime; could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N·|C|), where |C| = # of clusters (= k in k-means) and N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.

1162019

RECSYS 47

Tip: Add Data. Leverage all the data.

Don't try to reduce the data size in an effort to make fancy algorithms work: simple methods on large data do best.

Add more data, e.g. add IMDB data on genres.

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

[Scatter plot: movies placed in a 2-D latent space; axes: geared towards females vs. males, and serious vs. funny; example titles: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Utility matrix R: 480,000 users × 17,700 movies; entries are ratings 1-5, mostly missing]

RECSYS 52

Utility Matrix R Evaluation

[Utility matrix R, 480,000 users × 17,700 movies, with some entries withheld as the test data set]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

r̂_xi … predicted rating; r_xi … true rating of user x on item i (e.g. the test entry r_36)

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q·Pᵀ

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q·Pᵀ. R has missing entries, but let's ignore that for now.
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users); compare SVD: A = U Σ Vᵀ]

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

[R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

RECSYS 55

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

[R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

q_i = row i of Q; p_x = column x of Pᵀ

[R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users); the highlighted missing entry is estimated as 2.4]

RECSYS 57

Latent Factor Models

[Scatter plot: movies in the 2-D factor space (Factor 1 vs. Factor 2); axes: geared towards females vs. males, and serious vs. funny; titles: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 58

Latent Factor Models

[Same scatter plot: movies in the 2-D factor space (Factor 1 vs. Factor 2), now with users placed in the same space]

RECSYS 59

Recap SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_{ij} (A_ij − [UΣVᵀ]_ij)²

So, in our case, "SVD" on the Netflix data: R ≈ Q·Pᵀ, with A = R, Q = U, Pᵀ = ΣVᵀ, and r̂_xi = q_i·p_xᵀ.

But we are not done yet! The sum above goes over all entries, and our R has missing entries.

[A (m × n) ≈ U Σ Vᵀ]
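A NumPy sketch of the rank-f truncated SVD as the minimum-SSE low-rank approximation, on a small fully observed toy matrix (not the Netflix data):

```python
import numpy as np

A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = 2
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]   # best rank-2 approximation of A

# In the factor-model notation: Q = U, P^T = Sigma V^T
Q = U[:, :f]
Pt = np.diag(s[:f]) @ Vt[:f, :]
sse = np.sum((A - Q @ Pt) ** 2)   # equals the squared dropped singular value
```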

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_xᵀ)², with r̂_xi = q_i·p_xᵀ

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]
RECSYS 61

Dealing with Missing Entries: We want to minimize the SSE for the unseen test data.

Idea: Minimize the SSE on the training data. We want a large f (# of factors) to capture all the signals, but the SSE on the test data begins to rise for f > 2!

Regularization is needed to avoid overfitting: allow a rich model where there is sufficient data, and shrink aggressively where data are scarce:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

λ … regularization parameter ("error" + λ·"length")

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])

[Scatter plot: users (e.g. Gus, Dave) and movies placed together in the latent space (serious vs. escapist, geared towards females vs. males); recommend the movies closest to the user]

RECSYS 63

Dealing with Missing Entries: We want to minimize the SSE for the unseen test data.

Idea: Minimize the SSE on the training data; want a large f (# of factors) to capture all the signals, but the SSE on the test data begins to rise for f > 2!

Regularization is needed: allow a rich model where there is sufficient data, shrink aggressively where data are scarce:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

λ … regularization parameter ("error" + λ·"length")

RECSYS 64

The Effect of Regularization

[Scatter plot: movies and users in the factor space (Factor 1 vs. Factor 2; serious vs. funny, geared towards females vs. males)]

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 65

The Effect of Regularization

[Same scatter plot: the next step of the regularized fit; sparsely rated movies such as The Princess Diaries are pulled toward the origin]

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 66

The Effect of Regularization

[Same scatter plot: a further step of the regularized fit]

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 67

The Effect of Regularization

[Same scatter plot: the final configuration of the regularized fit]

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

Gradient descent: Initialize P and Q (e.g. using SVD, pretending missing ratings are 0). Do gradient descent:
• P ← P − η·∇P
• Q ← Q − η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

Observation: computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

Gradient descent: Initialize P and Q (e.g. using SVD, pretending missing ratings are 0). Do gradient descent:
• P ← P − η·∇P
• Q ← Q − η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

Observation: computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD: Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)

• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − μ·∇Q(r_xi)

Faster convergence!
• Needs more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors.
 For each r_xi:
 • ε_xi = r_xi − q_i·p_x^T (the prediction "error")
 • q_i ← q_i + η·(ε_xi·p_x − λ·q_i) (update equation)
 • p_x ← p_x + η·(ε_xi·q_i − λ·p_x) (update equation)

 2 for loops:
 For until convergence:
 • For each r_xi:
  • Compute gradient, do a "step"

 η … learning rate
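The two update equations above can be turned into a tiny runnable sketch. This is not the BellKor implementation, just a minimal SGD matrix factorization on a hypothetical toy rating list (all user/item indices and values below are made up):

```python
import numpy as np

# Hypothetical toy data: (user x, item i, rating r_xi) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.0)]
n_users, n_items, f = 3, 3, 2        # f ... number of latent factors
eta, lam = 0.02, 0.02                # η ... learning rate, λ ... regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_x
Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i

for _ in range(2000):                # "for until convergence"
    for x, i, r in ratings:          # "for each r_xi"
        eps = r - Q[i] @ P[x]        # ε_xi = r_xi − q_i·p_x^T
        q_old = Q[i].copy()          # use the old q_i in the p_x update
        Q[i] += eta * (eps * P[x] - lam * Q[i])   # q_i update equation
        P[x] += eta * (eps * q_old - lam * P[x])  # p_x update equation

# Reconstruction error on the known ratings after training.
sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
```

With a small λ the known ratings are fit almost exactly; with larger λ the factors shrink and the fit loosens, which is the regularization trade-off discussed earlier.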

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict the unknown ratings well

This is a case where we apply optimization methods

[Utility matrix with known ratings 1–5; blank entries are unknown]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000–2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: (user, movie, score) triples for the training data and the test data; test scores withheld]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie

• Avg #ratings/movie: 5,627

#ratings per user

• Avg #ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i·p_x^T = user–movie interaction

User–Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Overall mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2

1162019

RECSYS 105

Fitting the New Model

Solve:

 min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²   ← goodness of fit
    + λ·(Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)   ← regularization

 Stochastic gradient descent to find the parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

 λ is selected via grid-search on a validation set
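The biased model can be fit with the same SGD loop as before; differentiating the objective above with respect to b_x and b_i gives bias updates of the same shape as the factor updates. A minimal sketch on hypothetical toy data:

```python
import numpy as np

# Hypothetical toy data: (user x, item i, rating r_xi).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.0)]
n_users, n_items, f = 3, 3, 2
eta, lam = 0.02, 0.02

mu = np.mean([r for _, _, r in ratings])       # μ ... overall mean rating
b_x = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # movie biases
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

def predict(x, i):
    # r̂_xi = μ + b_x + b_i + q_i·p_x^T
    return mu + b_x[x] + b_i[i] + Q[i] @ P[x]

for _ in range(2000):
    for x, i, r in ratings:
        eps = r - predict(x, i)                # prediction error
        b_x[x] += eta * (eps - lam * b_x[x])   # biases are parameters too
        b_i[i] += eta * (eps - lam * b_i[i])
        q_old = Q[i].copy()
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])

sse = sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
```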

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: Combine top-level, "regional" modeling of the data with a refined local view:

 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (y-axis, 0.885–0.92) vs. millions of parameters (x-axis, 1–1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Reference points:
 Grand Prize: 0.8563
 Netflix (Cinematch): 0.9514
 Movie average: 1.0533
 User average: 1.0651
 Global average: 1.1296

Method RMSEs:
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in the Netflix GUI
 Meaning of rating changed

Movie age:
 Users prefer new movies without any reasons
 Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model:
 r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (y-axis, 0.875–0.92) vs. millions of parameters (x-axis, 1–10,000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Reference points:
 Grand Prize: 0.8563
 Netflix (Cinematch): 0.9514
 Movie average: 1.0533
 User average: 1.0651
 Global average: 1.1296

Method RMSEs:
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: Error (RMSE, y-axis 0.87–0.8925) vs. number of predictors in the ensemble (x-axis, 1–50); RMSE falls steeply over the first few predictors, then with diminishing returns]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before the deadline

Final submissions:
 BellKor submits a little early (on purpose), 40 mins before the deadline
 Ensemble submits their final entry 20 mins later
 … and everyone waits …

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

[Chart data for "Effect of ensemble size": RMSE falls from 0.8891 with 1 predictor to 0.8712 with 50 predictors, with most of the gain coming from the first ~10 predictors]

RECSYS 23

Cons: Content-based Approach
 – Finding the appropriate features is hard
  E.g., images, movies, music
 – Overspecialization
  Never recommends items outside the user's content profile
  People might have multiple interests
  Unable to exploit quality judgments of other users
 – Recommendations for new users
  How to build a user profile?

1162019

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: Recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant:
 Domain-free: user/item attributes are not necessary
 Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users: Let r_x be the vector of user x's ratings

Jaccard similarity measure:
 sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y| (treating r_x, r_y as sets of rated items)
 Problem: Ignores the value of the rating

Cosine similarity measure:
 sim(x, y) = cos(r_x, r_y) = (r_x·r_y) / (||r_x||·||r_y||)
 Problem: Treats missing ratings as "negative"

Pearson correlation coefficient:
 S_xy = items rated by both users x and y
 sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) ]

Example:
 r_x, r_y as sets: r_x = {1, 4, 5}; r_y = {1, 3, 4}
 r_x, r_y as points: r_x = {1, 0, 0, 1, 3}; r_y = {1, 0, 2, 2, 0}

 r̄_x, r̄_y … avg. rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4 (wrong order)

Cosine similarity: 0.380 > 0.322 (right order, but barely)
 Considers missing ratings as "negative"
 Solution: subtract the (row) mean

Centered cosine: sim(A,B) vs. sim(A,C): 0.092 > −0.559
 Notice: cosine similarity is correlation when the data is centered at 0
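The three measures can be compared directly in code. The rating vectors for users A, B, C are not legible on the slide, so the rows below are assumed values (0 = missing) chosen to reproduce the numbers above:

```python
import numpy as np

def jaccard(a, b):
    # Treat the vectors as sets of rated items; ignores rating values.
    sa = {i for i, v in enumerate(a) if v > 0}
    sb = {i for i, v in enumerate(b) if v > 0}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    # Missing ratings stay 0, i.e. they are treated as "negative".
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def centered_cosine(a, b):
    # Subtract each user's mean over rated items first (Pearson-style).
    a, b = np.asarray(a, float), np.asarray(b, float)
    a = np.where(a > 0, a - a[a > 0].mean(), 0.0)
    b = np.where(b > 0, b - b[b > 0].mean(), 0.0)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = [4, 0, 0, 5, 1, 0, 0]   # assumed ratings for user A
B = [5, 5, 4, 0, 0, 0, 0]   # assumed ratings for user B
C = [0, 0, 0, 2, 4, 5, 0]   # assumed ratings for user C

# Jaccard orders A,C above A,B; cosine barely gets the order right;
# centered cosine separates them clearly (≈ 0.092 vs. ≈ −0.559).
```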

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
 r̂_xi = (1/k)·Σ_{y∈N} r_yi
 Or, weighted by similarity: r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy
 Other options?

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
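A minimal sketch of the weighted-average prediction over the k nearest users. The matrix is hypothetical, and only positively-similar neighbors are kept, a common guard (not stated on the slide) since negative weights can push the weighted average outside the rating scale:

```python
import numpy as np

def centered(v):
    rated = v > 0
    return np.where(rated, v - v[rated].mean(), 0.0)

def sim(u, v):
    # s_xy: centered cosine similarity between two users' rating rows.
    cu, cv = centered(u), centered(v)
    d = np.linalg.norm(cu) * np.linalg.norm(cv)
    return (cu @ cv) / d if d else 0.0

def predict(R, x, i, k=2):
    # N = up to k most similar users to x who rated item i (with s_xy > 0).
    cands = sorted(((sim(R[x], R[y]), y) for y in range(len(R))
                    if y != x and R[y, i] > 0), reverse=True)
    N = [(s, y) for s, y in cands[:k] if s > 0]
    if not N:
        return R[x][R[x] > 0].mean()          # fall back to x's mean rating
    return sum(s * R[y, i] for s, y in N) / sum(s for s, _ in N)

# Hypothetical 4 users × 5 items matrix (0 = missing).
R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 1, 0, 2],
              [1, 2, 5, 4, 0],
              [0, 1, 4, 5, 3]], float)
r_hat = predict(R, 0, 2)   # only user 1 is positively similar to user 0
```

Here only user 1 passes the positive-similarity filter, so the estimate equals user 1's rating of item 2 (a low 1.0), reflecting that users like user 0 disliked this item.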

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: User-user collaborative filtering

Another view: Item-item
 For item i, find other similar items
 Estimate the rating for item i based on the target user's ratings on items similar to item i
 Can use the same similarity metrics and prediction functions as in the user-user model

 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

 s_ij … similarity of items i and j
 r_xj … rating of user x on item j
 N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

            users 1–12 →
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

 "." = unknown rating; numbers = ratings between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

[The same movies × users utility matrix]

Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[The same movies × users utility matrix]

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
 1) Subtract the mean rating m_i from each movie i:
  m_1 = (1+3+5+5+4)/5 = 3.6
  row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between rows:
  sim(1, m) for m = 1…6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

RECSYS 35

Item-Item CF (|N|=2)

[The same movies × users utility matrix]

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[The same utility matrix, with the estimated rating r_15 = 2.6 filled in]

Predict by taking the weighted average:

 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j

Select the K nearest neighbors (KNN) N(i; x):
 Set of items most similar to item i that were rated by x

Estimate the rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

 N(i;x) = set of items similar to item i that were rated by x
 s_ij = similarity of items i and j
 r_xj = rating of user x on item j
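The whole worked example (slides 32–36) fits in a few lines. The utility matrix below is the one from those slides, as best it can be recovered from the garbled text, and the code reproduces s_13 ≈ 0.41, s_16 ≈ 0.59 and r_15 ≈ 2.6:

```python
import numpy as np

# 6 movies × 12 users utility matrix from the worked example (0 = missing).
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],   # movie 1
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],   # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],   # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],   # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],   # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],   # movie 6
], float)

def centered(v):
    rated = v > 0
    return np.where(rated, v - v[rated].mean(), 0.0)

def pearson(i, j):
    # s_ij: cosine similarity of the mean-centered movie rows.
    ci, cj = centered(R[i]), centered(R[j])
    return (ci @ cj) / (np.linalg.norm(ci) * np.linalg.norm(cj))

x, i, k = 4, 0, 2                     # user 5, movie 1 (0-based), |N| = 2
N = sorted(((pearson(i, j), j) for j in range(len(R))
            if j != i and R[j, x] > 0), reverse=True)[:k]
r_hat = sum(s * R[j, x] for s, j in N) / sum(s for s, _ in N)
# N picks movies 6 and 3 (s ≈ 0.59 and 0.41), so r_hat ≈ 2.6
```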

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

1162019

In practice, it has been observed that item-item often works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item
 No feature selection needed

− Cold Start:
 Need enough users in the system to find a match

− Sparsity:
 The user/ratings matrix is sparse
 Hard to find users that have rated the same items

− First rater:
 Cannot recommend an item that has not been previously rated
 New items, esoteric items

− Popularity bias:
 Cannot recommend items to someone with unique taste
 Tends to recommend popular items

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine predictions
 Perhaps using a linear model

Add content-based methods to collaborative filtering:
 Item profiles for the new item problem
 Demographics to deal with the new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users, with known ratings 1–5]

1162019

RECSYS 43

Evaluation

[The same utility matrix, with part of the known ratings withheld as the test data set]

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

 Root-mean-square error (RMSE):
 • RMSE = √( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i

 Precision at top 10:
 • % of relevant items among those in the top 10

 Rank Correlation:
 • Spearman's correlation between the system's and the user's complete rankings

 Another approach: 0/1 model
 Coverage:
 • Number of items/users for which the system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve:
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
 • TPR (aka Recall) = TP / P = TP / (TP + FN)
 • FPR = FP / N = FP / (FP + TN)
 • See https://en.wikipedia.org/wiki/Precision_and_recall
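A quick sketch of two of these metrics on made-up numbers (the item names are hypothetical):

```python
import numpy as np

def rmse(pred, truth):
    # Root-mean-square error between predicted and true ratings.
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k recommended items that are actually relevant.
    topk = ranked[:k]
    return sum(item in relevant for item in topk) / len(topk)

err = rmse([3.5, 2.0, 4.5], [4.0, 2.0, 4.0])                    # ≈ 0.408
p = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=4)  # 2/4 = 0.5
```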

RECSYS 45

Problems with Error Measures: A narrow focus on accuracy sometimes misses the point
 Prediction Diversity
 Prediction Context
 Order of predictions

In practice, we care only to predict high ratings:
 RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

The expensive step is finding the k most similar customers: O(|X|)
 Recall that X = set of customers in the system

Too expensive to do at runtime:
 Could pre-compute using clustering as an approximation

Naïve pre-computation takes time O(N·|C|)
 |C| = # of clusters = k in the k-means; N = # of data points

We already know how to do this!
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data
 Leverage all the data
  Don't try to reduce data size in an effort to make fancy algorithms work
  Simple methods on large data do best
 Add more data
  e.g., add IMDB data on genres
 More data beats better algorithms
  http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[2D latent-factor plot: movies placed along "geared towards females ↔ geared towards males" and "serious ↔ funny" axes, e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Sparse matrix R: 480,000 users × 17,700 movies, partially filled with ratings 1–5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies) split into a training data set and a test data set; e.g., the true rating r_36 is withheld and must be predicted]

 SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)²

 r̂_xi = predicted rating; r_xi = true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
 R has missing entries, but let's ignore that for now!
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Illustration: the items × users matrix R factored as Q (items × f factors) times PT (f factors × users); compare SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of PT

[Illustration: R ≈ Q · PT; the missing entry r̂_xi is the dot product of row q_i of Q and column p_x of PT]

RECSYS 55

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of PT

[Illustration: R ≈ Q · PT; the missing entry r̂_xi is the dot product of row q_i of Q and column p_x of PT]

RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of PT

[Illustration: R ≈ Q · PT; multiplying row q_i by column p_x gives the estimate r̂_xi = 2.4]

RECSYS 57

Latent Factor Models

[2D plot: movies on Factor 1 ("geared towards females ↔ geared towards males") and Factor 2 ("serious ↔ funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 58

Latent Factor Models

[2D plot: movies on Factor 1 ("geared towards females ↔ geared towards males") and Factor 2 ("serious ↔ funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap: SVD

Remember SVD:
 A: Input data matrix
 U: Left singular vectors
 V: Right singular vectors
 Σ: Singular values

SVD gives the minimum reconstruction error (SSE):
 min_{U,Σ,V} Σ_ij (A_ij − [UΣV^T]_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT
 A = R, Q = U, PT = Σ·VT

 r̂_xi = q_i · p_x^T

But we are not done yet!
 The sum goes over all entries, but our R has missing entries

[Illustration: A (m × n) ≈ U Σ VT]
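On a fully observed matrix the recap can be checked directly with NumPy: truncating the SVD to f factors gives the best rank-f approximation in the SSE sense. The matrix below is a made-up example:

```python
import numpy as np

# Fully observed toy "ratings" matrix (no missing entries, so SVD applies).
A = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

f = 2                                   # keep f factors
A_f = (U[:, :f] * s[:f]) @ Vt[:f]       # rank-f reconstruction U Σ V^T

sse = float(np.sum((A - A_f) ** 2))     # minimal over all rank-f matrices
# Keeping all factors reconstructs A exactly: (U * s) @ Vt == A
```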

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i·p_x^T)²

 r̂_xi = q_i·p_x^T

Notes:
 • We don't require the columns of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • The most popular model among Netflix contestants

[Illustration: R ≈ Q · PT, with Q items × f factors and PT f factors × users]

RECSYS 61

Dealing with Missing Entries: Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

 λ … regularization parameter; objective = "error" + λ·"length"

[Utility matrix illustration]

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[2D latent plot: movies and users (Gus, Dave) embedded together along "geared towards females ↔ geared towards males" and "serious ↔ escapist" axes]

RECSYS 63

Dealing with Missing Entries: Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

Regularization is needed:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

 λ … regularization parameter; objective = "error" + λ·"length"

[Utility matrix illustration]

RECSYS 64

The Effect of Regularization

[2D factor plot of the movies under min_factors "error" + λ·"length"]

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

RECSYS 65

The Effect of Regularization

[2D factor plot of the movies under min_factors "error" + λ·"length"]

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

RECSYS 66

The Effect of Regularization

[2D factor plot of the movies under min_factors "error" + λ·"length"]

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

RECSYS 67

The Effect of Regularization

[2D factor plot of the movies under min_factors "error" + λ·"length"]

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q that minimize:

 min_{P,Q} Σ_training (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λq_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ·(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  - P ← P − η·∇P
  - Q ← Q − η·∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if
  - Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η·Σ_{r_xi} ∇Q(r_xi)
SGD: Q ← Q − η·∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence!
• Needs more steps, but each step is computed much faster.

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i·p_xᵀ (derivative of the "error")
• q_i ← q_i + η·(ε_xi·p_x − λ·q_i) (update equation)
• p_x ← p_x + η·(ε_xi·q_i − λ·p_x) (update equation)

2 for loops:
• For until convergence:
  - For each r_xi: compute gradient, do a "step"

η … learning rate
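The SGD loop above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy rating list (not the Netflix data); the factor size f = 2, η = 0.01, and λ = 0.1 are arbitrary choices:

```python
import numpy as np

# Hypothetical toy ratings (user x, item i, r_xi) -- NOT the Netflix data.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 4)]
n_users, n_items, f = 3, 3, 2
eta, lam = 0.01, 0.1             # learning rate and regularization (arbitrary)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))   # user factor vectors p_x
Q = rng.normal(scale=0.1, size=(n_items, f))   # item factor vectors q_i

def sse():
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

sse_before = sse()
for _ in range(2000):            # iterate over the ratings, multiple passes
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]    # epsilon_xi, the per-rating "error"
        q_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update equation
        P[x] += eta * (err * q_old - lam * P[x])  # p_x update equation
sse_after = sse()                # training SSE drops sharply
```

Each inner-loop pass touches only one rating, which is what makes the individual steps so much cheaper than a full batch gradient.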

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen. Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Utility matrix with known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
• Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root-mean-squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
• Competition
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[Training data and test data tables: (user, movie, score) triples; test-set scores are withheld]

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_xᵀ
μ … overall mean rating; b_x … bias of user x; b_i … bias of movie i; q_i·p_xᵀ … user-movie interaction

User-movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor
• Separates users and movies
• Benefits from insights into the user's behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5.

Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
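As a sanity check, the arithmetic above is just the baseline predictor r̂_xi = μ + b_x + b_i, evaluated with the slide's numbers:

```python
# Baseline predictor r_hat = mu + b_x + b_i, with the numbers from the slide.
mu  = 3.7    # overall mean rating
b_x = -1.0   # critical reviewer: rates 1 star below the mean
b_i = 0.5    # Star Wars: rated 0.5 above the average movie

r_hat = round(mu + b_x + b_i, 1)
print(r_hat)  # 3.2
```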

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_xᵀ) )²   (goodness of fit)
  + λ·( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   (regularization)

Stochastic gradient descent to find the parameters.
Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).

λ is selected via grid-search on a validation set.
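A sketch of fitting the bias-extended model with SGD on hypothetical toy data. The bias update form b ← b + η·(ε − λ·b) is one standard choice that mirrors the factor updates; it is an assumption here, not something spelled out on the slide:

```python
import numpy as np

# Hypothetical toy ratings (user x, item i, r_xi); biases and factors are
# all learned parameters, as stated on the slide.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 4)]
n_users, n_items, f = 3, 3, 2
eta, lam = 0.01, 0.1

mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean rating
b_user = [0.0] * n_users                            # b_x
b_item = [0.0] * n_items                            # b_i
rng = np.random.default_rng(1)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

def predict(x, i):
    return mu + b_user[x] + b_item[i] + Q[i] @ P[x]

def sse():
    return sum((r - predict(x, i)) ** 2 for x, i, r in ratings)

sse_before = sse()
for _ in range(2000):
    for x, i, r in ratings:
        err = r - predict(x, i)
        b_user[x] += eta * (err - lam * b_user[x])  # bias updates mirror
        b_item[i] += eta * (err - lam * b_item[i])  # the factor updates
        q_old = Q[i].copy()
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * q_old - lam * P[x])
sse_after = sse()
```

In practice λ would be chosen by grid search on a held-out validation set, as the slide notes.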

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data. Combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_xᵀ

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
• Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: ensemble error (RMSE), falling from about 0.889 to about 0.871, vs. number of predictors, 1 to 50]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions; this alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions were limited to 1 a day, so only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Data behind the "Effect of ensemble size" chart: RMSE by number of predictors. Selected points: 1 → 0.889067, 5 → 0.876668, 10 → 0.874209, 20 → 0.872654, 30 → 0.871945, 40 → 0.871482, 50 → 0.871197. Gains diminish rapidly beyond about 10-20 predictors.]

RECSYS 24

Collaborative Filtering

RECSYS 25

Collaborative filtering: Recommend items based on past transactions of users. Analyze relations between users and/or items.

Specific data characteristics are irrelevant:
• Domain-free: user/item attributes are not necessary
• Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x.

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN).

3. Recommend items to x based on the weighted ratings of items by users in N.

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings.

• Jaccard similarity measure. Problem: ignores the value of the rating.
  (r_x, r_y as sets: r_x = {1, 4, 5}; r_y = {1, 3, 4})
• Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = r_x·r_y / (‖r_x‖·‖r_y‖). Problem: treats missing ratings as "negative".
  (r_x, r_y as points: r_x = [1, 0, 0, 1, 3]; r_y = [1, 0, 2, 2, 0])
• Pearson correlation coefficient:
  sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) )
  S_xy = items rated by both users x and y; r̄_x, r̄_y … average rating of x, y

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

• Jaccard similarity: 1/5 < 2/4
• Cosine similarity: 0.386 > 0.322, but it considers missing ratings as "negative"
• Solution: subtract the (row) mean, then take the cosine:
  sim(A, B) vs. sim(A, C): 0.092 > −0.559
• Notice: cosine similarity is correlation when the data is centered at 0
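The centered-cosine numbers can be checked directly. A sketch using a utility matrix for users A, B, C over 7 movies that is consistent with the figures quoted above (the matrix values themselves are reconstructed from the standard version of this example, so treat them as an assumption):

```python
import numpy as np

# Utility matrix for users A, B, C (rows) over 7 movies; np.nan = missing.
R = np.array([
    [4, np.nan, np.nan, 5, 1, np.nan, np.nan],   # A
    [5, 5, 4, np.nan, np.nan, np.nan, np.nan],   # B
    [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],   # C
])

def centered_cosine(u, v):
    """Center each row by its mean over rated items, treat missing
    entries as 0, then take the cosine (i.e., Pearson-style similarity)."""
    cu = np.where(np.isnan(u), 0.0, u - np.nanmean(u))
    cv = np.where(np.isnan(v), 0.0, v - np.nanmean(v))
    return cu @ cv / (np.linalg.norm(cu) * np.linalg.norm(cv))

sim_ab = centered_cosine(R[0], R[1])   # ~ 0.092
sim_ac = centered_cosine(R[0], R[2])   # ~ -0.559
```

Centering turns the "missing ratings look negative" problem into neutral zeros, which is exactly why sim(A, B) comes out above sim(A, C).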

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.

Let N be the set of k users most similar to x who have rated item i.

Prediction for item i of user x:
• r̂_xi = (1/k)·Σ_{y∈N} r_yi
• r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy

Other options? Many other tricks possible…

Shorthand: s_xy = sim(x, y)

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering. Another view: item-item.
• For item i, find other similar items
• Estimate the rating for item i based on the target user's ratings on items similar to item i
• Can use the same similarity metrics and prediction functions as in the user-user model:

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij
s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Utility matrix: 6 movies (rows) × 12 users (columns); known ratings between 1 and 5, unknown entries blank]

RECSYS 33

Item-Item CF (|N|=2)

[Same utility matrix; goal: estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix, annotated with a sim(1, m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i: m_1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6; row 1 becomes [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix with the sim(1, m) column]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix, with the estimated entry r_15 = 2.6 filled in]

Predict by taking the weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j
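The weighted average is a one-liner; this sketch reproduces the walkthrough's numbers, where movie 1's two neighbors rated by user 5 are movie 3 (s_13 = 0.41, rating 2) and movie 6 (s_16 = 0.59, rating 3):

```python
# Item-item prediction as a similarity-weighted average over |N| = 2
# neighbors: (s_1j, r_xj) pairs taken from the walkthrough above.
neighbours = [(0.41, 2), (0.59, 3)]

num = sum(s * r for s, r in neighbours)   # sum of s_ij * r_xj
den = sum(s for s, _ in neighbours)       # sum of s_ij
r_15 = num / den
print(round(r_15, 1))  # 2.6
```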

RECSYS 38

Item-Item vs. User-User

[Example utility matrix: users Alice, Bob, Carol, David rating Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users, with known ratings between 1 and 5]

1162019

RECSYS 43

Evaluation

[Same utility matrix, with some ratings held out as a Test Data Set]

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:
• Root-mean-square error (RMSE): √( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
• Precision at top 10: % of relevant items in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  - Coverage: number of items/users for which the system can make predictions
  - Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  - Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
    TPR (a.k.a. recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
  - See https://en.wikipedia.org/wiki/Precision_and_recall
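A minimal RMSE computation on hypothetical predicted/true rating pairs (the numbers are made up for illustration):

```python
import math

# Hypothetical predicted vs. true ratings on a held-out test set.
predicted = [3.8, 2.1, 4.5, 1.9, 3.2]
actual    = [4,   2,   5,   1,   3]

# RMSE = sqrt( sum((r_hat - r)^2) / N )
rmse = math.sqrt(
    sum((p, a) and (p - a) ** 2 for p, a in zip(predicted, actual))
    / len(actual)
)
print(round(rmse, 2))  # 0.48
```

Lower RMSE means the predictions track the held-out ratings more closely, which is exactly the Netflix Prize criterion.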

RECSYS 45

Problems with Error Measures

Narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others.

1162019

RECSYS 46

Collaborative Filtering: Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system. Too expensive to do at runtime.

Could pre-compute, using clustering as an approximation. A naïve pre-computation takes time O(N·|C|) (|C| = # of clusters = k in k-means; N = # of data points).

We already know how to do this:
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work: simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Scatter plot: movies embedded in a 2-D latent space, with axes serious vs. funny and geared towards females vs. males: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Utility matrix R: 480,000 users × 17,700 movies; sparse ratings between 1 and 5]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (480,000 users × 17,700 movies), split into a Training Data Set and a held-out Test Data Set; e.g., the true rating r_36 is held out]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q·Pᵀ

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q·Pᵀ.

R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones.

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

SVD: A = U Σ Vᵀ

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i·p_xᵀ = Σ_f q_if·p_xf
where q_i = row i of Q and p_x = column x of Pᵀ

[Illustration, repeated across three animation frames (RECSYS 54-56): R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users); the final frame fills in the estimated entry, 2.4]

RECSYS 57

Latent Factor Models

[Scatter plot, repeated across two frames (RECSYS 57-58): movies on Factor 1 (geared towards females vs. males) × Factor 2 (serious vs. funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap: SVD

Remember SVD:
• A: input data matrix (m × n)
• U: left singular vectors
• V: right singular vectors
• Σ: singular values
A ≈ U Σ Vᵀ

SVD gives the minimum reconstruction error (SSE):
min_{U,V,Σ} Σ_{ij} (A_ij − [U Σ Vᵀ]_ij)²

So in our case, "SVD" on the Netflix data R ≈ Q·Pᵀ means: A = R, Q = U, Pᵀ = Σ Vᵀ, and r̂_xi = q_i·p_xᵀ.

But we are not done yet! The sum goes over all entries, and our R has missing entries.
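The "minimum reconstruction error" property is easy to see numerically: truncating the SVD to rank k gives the best rank-k fit, and the SSE shrinks as more factors are kept. A small sketch on a random (fully observed) matrix:

```python
import numpy as np

# Rank-k SVD truncation: reconstruction SSE is non-increasing in k,
# and the full-rank reconstruction recovers A (up to float error).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def sse_rank(k):
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation
    return np.sum((A - Ak) ** 2)

errors = [sse_rank(k) for k in range(1, 6)]
# errors decreases with k; errors[-1] is essentially 0
```

This is exactly why SVD is the starting point for the latent factor model; the complication is that R's missing entries keep us from applying it directly.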

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_xᵀ)²,   with r̂_xi = q_i·p_xᵀ

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[Illustration: R ≈ Q·Pᵀ, with Q items × f factors and Pᵀ f factors × users]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on the training data. Want large f (# of factors) to capture all the signals. But SSE on the test data begins to rise for f > 2.

Regularization is needed to avoid overfitting:
• Allow a rich model where there is sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{training} (r_xi − q_i·p_xᵀ)² + λ·(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
λ … regularization parameter; the two terms are the "error" and λ times the "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Scatter plot: movies and users (e.g., Gus, Dave) embedded together in the latent space, with axes serious vs. escapist and geared towards females vs. males: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 63


RECSYS 64

The Effect of Regularization

min_factors "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ·(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

[Scatter plot repeated across four animation frames (RECSYS 64-67): movies (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility) on Factor 1 (geared towards females vs. males) × Factor 2 (serious vs. funny); with regularization, movies with few ratings shrink toward the origin]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
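A minimal sketch of that element-wise batch gradient (the rating-triple format and toy numbers are my assumptions):

```python
import numpy as np

def grad_Q(P, Q, ratings, lam):
    """Batch gradient of the regularized SSE with respect to every entry of Q."""
    G = 2.0 * lam * Q                            # the 2*lambda*q_if term
    for x, i, r in ratings:                      # sum over observed ratings
        G[i] += -2.0 * (r - Q[i] @ P[x]) * P[x]  # -2(r_xi - q_i.p_x) p_xf
    return G

# One batch step (grad_P is symmetric): Q = Q - eta * grad_Q(P, Q, ratings, lam)
P = np.array([[1.0, 0.0]])
Q = np.array([[0.0, 0.0]])
print(grad_Q(P, Q, [(0, 0, 2.0)], lam=0.0))   # -> [[-4.  0.]]
```

Note that every known rating contributes to the gradient, which is exactly why the batch computation is slow on 100M ratings.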

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where

  ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)

 • Here q_if is entry f of row q_i of matrix Q

 GD: Q ← Q − η · Σ_{x,i} ∇Q(r_xi)

 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

 SGD: Q ← Q − η · ∇Q(r_xi)

 Faster convergence!
 • Needs more steps, but each step is computed much faster


RECSYS 73

SGD vs GD Convergence of GD vs SGD


[figure: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
  • q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

 2 for loops: For until convergence:
  • For each r_xi: compute the gradient, do a "step"

η … learning rate
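The two loops above can be sketched as follows (the toy ratings and hyper-parameters are assumptions, not from the slides):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.05, lam=0.01,
                  epochs=500, seed=0):
    """SGD for min sum (r_xi - q_i . p_x)^2 + lam * (||p_x||^2 + ||q_i||^2)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            err = r - Q[i] @ P[x]                  # epsilon_xi
            qi_old = Q[i].copy()
            Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update equation
            P[x] += eta * (err * qi_old - lam * P[x]) # p_x update equation
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=2)
sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
```

Keeping a copy of the old q_i before updating p_x matches the slide's simultaneous update of both factor vectors.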

RECSYS 75

Koren, Bell & Volinsky, IEEE Computer, 2009


RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE:
  So, lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict the unknown ratings well

 This is a case where we apply optimization methods.


[figure: utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[tables: sample (user, movie, score) rows from the training set and the test set]

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg. # ratings/movie: 5,627

# ratings per user
• Avg. # ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie #16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i·p_x^T

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i
(user bias + movie bias + user-movie interaction)

User-movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
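The baseline arithmetic above, as a one-line sketch:

```python
def baseline(mu, b_x, b_i):
    """Baseline estimate b_xi = mu + b_x + b_i (no interaction term yet)."""
    return mu + b_x + b_i

# The slide's example: global mean 3.7, critical user (-1), Star Wars (+0.5).
print(round(baseline(3.7, -1.0, 0.5), 1))  # -> 3.2
```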


RECSYS 105

Fitting the New Model

Solve:

 min_{Q,P}  Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²
      + λ [ Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x ||b_x||² + Σ_i ||b_i||² ]

(the first term measures goodness of fit; the second is regularization; λ is selected via grid-search on a validation set)

Stochastic gradient descent to find the parameters.
 Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
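A sketch of one SGD update for this bias-extended model (the function name and hyper-parameters are my own, not from the slides):

```python
import numpy as np

def sgd_step_biased(r, mu, bx, bi, q, p, eta=0.01, lam=0.1):
    """One stochastic step on rating r: update both biases and both factor rows."""
    e = r - (mu + bx + bi + q @ p)        # prediction error for this rating
    bx_new = bx + eta * (e - lam * bx)
    bi_new = bi + eta * (e - lam * bi)
    q_new = q + eta * (e * p - lam * q)
    p_new = p + eta * (e * q - lam * p)   # uses the old q, as is conventional
    return bx_new, bi_new, q_new, p_new

q, p = np.array([1.0, 0.0]), np.array([1.0, 0.0])
bx, bi, q, p = sgd_step_biased(r=2.0, mu=0.0, bx=0.0, bi=0.0, q=q, p=p)
```

After one step the prediction μ + b_x + b_i + q·p^T should move toward the observed rating.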

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

[diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[figure: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]


RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods:
 Basic collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in the Netflix GUI
 Meaning of rating changed

Movie age:
 Users prefer new movies, without any clear reason
 Older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model:
 r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to the biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make the parameters b_x and b_i depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects


[figure: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods:
 Basic collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate... Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[figure: error (RMSE) vs. number of predictors in the ensemble; RMSE drops from about 0.889 with 1 predictor to about 0.871 with 50]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also gets a qualifying score over 10%

BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

Strategy:
 Both teams carefully monitor the leaderboard
 The only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score!


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

24 hours before the deadline…
 A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams:
 Much computer time spent on final optimization
 Carefully calibrated to end about an hour before the deadline

Final submissions:
 BellKor submits a little early (on purpose), 40 mins before the deadline
 Ensemble submits their final entry 20 mins later
 … and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com


[chart data: ensemble RMSE by number of predictors; 1 → 0.8891, 10 → 0.8742, 20 → 0.8727, 30 → 0.8719, 40 → 0.8715, 50 → 0.8712]

RECSYS 25

Collaborative filtering Recommend items based on past transactions of users

Analyze relations between users and/or items

Specific data characteristics are irrelevant:
 Domain-free: user/item attributes are not necessary
 Can identify elusive aspects

RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1 Consider user x

2 Find set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)

3 Recommend items to xbased on the weighted ratings of items by users in N

RECSYS 28

Similar Users
 Let r_x be the vector of user x's ratings

 Jaccard similarity measure:
  Problem: ignores the value of the rating
 Cosine similarity measure:
  sim(x, y) = cos(r_x, r_y) = r_x·r_y / (||r_x||·||r_y||)
  Problem: treats missing ratings as "negative"
 Pearson correlation coefficient:
  S_xy = items rated by both users x and y
  sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [ √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) ]

 Example:
  r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
  r_x, r_y as points: r_x = {1, 0, 0, 1, 3}, r_y = {1, 0, 2, 2, 0}
  r̄_x, r̄_y … avg. rating of x, y

RECSYS 29

Similarity Metric

Intuitively, we want: sim(A, B) > sim(A, C)

 Jaccard similarity: 1/5 < 2/4
 Cosine similarity: 0.386 > 0.322
  Considers missing ratings as "negative"
  Solution: subtract the (row) mean

 sim(A, B) vs. sim(A, C): 0.092 > −0.559
 Notice: cosine similarity is correlation when the data is centered at 0
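The three measures can be sketched as follows (treating rating 0 as "missing" is an assumption consistent with the examples above):

```python
import numpy as np

def jaccard(rx, ry):
    """Jaccard similarity on the sets of rated items (ignores rating values)."""
    sx = {i for i, r in enumerate(rx) if r}
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    rx, ry = np.asarray(rx, float), np.asarray(ry, float)
    return (rx @ ry) / (np.linalg.norm(rx) * np.linalg.norm(ry))

def centered_cosine(rx, ry):
    """Pearson-style similarity: center each user's rated entries, keep missing at 0."""
    rx, ry = np.asarray(rx, float), np.asarray(ry, float)
    cx = np.where(rx > 0, rx - rx[rx > 0].mean(), 0.0)
    cy = np.where(ry > 0, ry - ry[ry > 0].mean(), 0.0)
    return cosine(cx, cy)

rx, ry = [1, 0, 0, 1, 3], [1, 0, 2, 2, 0]   # the slide's example vectors
print(jaccard(rx, ry))                      # -> 0.5
```

Subtracting the row mean before taking the cosine is exactly the "solution" step above: it turns the cosine into a correlation.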

RECSYS 30

Rating Predictions

 Let r_x be the vector of user x's ratings
 Let N be the set of the k users most similar to x who have rated item i
 Prediction for item i of user x:
  r̂_xi = (1/k) Σ_{y∈N} r_yi
 Other option:
  r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy

 Many other tricks possible…

Shorthand: s_xy = sim(x, y)
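The similarity-weighted variant, as a sketch (the example similarities and ratings are made up):

```python
import numpy as np

def predict_user_user(sims, neighbor_ratings):
    """r_xi = sum_y s_xy * r_yi / sum_y s_xy, over the k most similar users N."""
    s = np.asarray(sims, float)
    r = np.asarray(neighbor_ratings, float)
    return (s @ r) / s.sum()

# Two neighbors with similarities 0.9 and 0.3 who rated item i as 4 and 2:
print(predict_user_user([0.9, 0.3], [4, 2]))  # weighted toward the more similar user
```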

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

 So far: user-user collaborative filtering
 Another view: item-item
  For item i, find other similar items
  Estimate the rating for item i based on the target user's ratings on items similar to item i
  Can use the same similarity metrics and prediction functions as in the user-user model

 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

 s_ij … similarity of items i and j
 r_xj … rating of user x on item j
 N(i; x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[figure: utility matrix, 6 movies × 12 users; blank = unknown rating, numbers = ratings between 1 and 5]

RECSYS 33

Item-Item CF (|N|=2)

[figure: the same utility matrix; goal: estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

[figure: the same utility matrix, with the similarity sim(1, m) of each movie m to movie 1 shown: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5

Here we use Pearson correlation as the similarity:
 1) Subtract the mean rating m_i from each movie i:
  m_1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6
  row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[figure: the same utility matrix, with sim(1, m) values: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[figure: the same utility matrix, with the estimated entry r_15 = 2.6 filled in]

Predict by taking the weighted average:

 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
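The weighted-average step above, checked as a sketch:

```python
# Similarities of the two neighbor movies (3 and 6) to movie 1, from the slide.
s = {3: 0.41, 6: 0.59}
# User 5's ratings of those neighbor movies.
r = {3: 2, 6: 3}

pred = sum(s[j] * r[j] for j in s) / sum(s.values())
print(round(pred, 1))  # -> 2.6
```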

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j

Select the K nearest neighbors (KNN) N(i; x):
 Set of items most similar to item i that were rated by x

Estimate the rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[table: ratings of Alice, Bob, Carol, and David on Avatar, LOTR, Matrix, and Pirates]


In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item
 No feature selection needed

− Cold start
 Need enough users in the system to find a match

− Sparsity
 The user/ratings matrix is sparse
 Hard to find users that have rated the same items

− First rater
 Cannot recommend an item that has not been previously rated
 New items, esoteric items

− Popularity bias
 Cannot recommend items to someone with unique taste
 Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[figure: utility matrix of known ratings (movies × users)]

RECSYS 43

Evaluation

[figure: the same utility matrix (movies × users), with some entries withheld as the test data set]

RECSYS 44

Evaluating Predictions
Compare predictions with known ratings:

 Root-mean-square error (RMSE):
  RMSE = √( (1/N) Σ_xi (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i
 Precision at top 10:
  % of relevant items among the top 10
 Rank correlation:
  Spearman's correlation between the system's and the user's complete rankings

 Another approach: 0/1 model
  Coverage: number of items/users for which the system can make predictions
  Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Receiver Operating Characteristic (ROC) curve:
   Y-axis: true positive rate (TPR), X-axis: false positive rate (FPR)
   TPR (aka Recall) = TP / P = TP / (TP + FN)
   FPR = FP / N = FP / (FP + TN)
   See https://en.wikipedia.org/wiki/Precision_and_recall
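RMSE over a held-out test set can be sketched as (the example numbers are made up):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over paired (predicted, true) test ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

print(rmse([3.5, 2.0, 4.0], [4.0, 2.0, 5.0]))  # sqrt((0.25 + 0 + 1) / 3)
```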

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others


RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime:
 Could pre-compute, using clustering as an approximation

Naïve pre-computation takes time O(N·|C|)
 |C| = # of clusters = the k in k-means; N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction


RECSYS 47

Tip: Add Data
 Leverage all the data
  Don't try to reduce data size in an effort to make fancy algorithms work
  Simple methods on large data do best
 Add more data
  e.g., add IMDB data on genres

More data beats better algorithms!
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[figure: movies plotted in a two-factor space; "geared towards females" ↔ "geared towards males" on one axis, "serious" ↔ "funny" on the other; titles include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50

Koren, Bell & Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[figure: the Netflix utility matrix R, 480,000 users × 17,700 movies, with known ratings 1-5 and many missing entries]

RECSYS 52

Utility Matrix R Evaluation

[figure: matrix R (480,000 users × 17,700 movies) split into a training data set and a withheld test data set, e.g., the hidden entry r_36]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²

r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T
R has missing entries, but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[figure: R (items × users) factored into Q (items × f factors) times P^T (f factors × users)]

SVD: A = U Σ V^T
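A sketch of this zero-filled SVD factorization (the toy matrix is an assumption):

```python
import numpy as np

# Toy utility matrix; 0 marks a missing rating (ignored for now, as above).
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
f = 2                        # keep f latent factors
Q = U[:, :f] * s[:f]         # items-to-factors matrix Q
Pt = Vt[:f, :]               # factors-to-users matrix P^T
R_hat = Q @ Pt               # rank-f reconstruction: R ≈ Q · P^T
```

By the Eckart-Young theorem, this truncation is the best rank-f approximation in total squared error; here its Frobenius error equals the single discarded singular value.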

RECSYS 54

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of P^T

[figure: the entry r̂_xi of R computed from row i of Q (f factors) and column x of P^T]

RECSYS 55

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of P^T

[animation frame: same figure, highlighting row q_i and column p_x]

RECSYS 56

Ratings as Products of Factors
How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of P^T

[animation frame: same figure, with the estimated entry r̂_xi = 2.4 filled in]

RECSYS 57

Latent Factor Models

[figure: movies placed in a two-factor space; Factor 1: "geared towards females" ↔ "geared towards males"; Factor 2: "serious" ↔ "funny"]

RECSYS 58

Latent Factor Models

[figure: the same two-factor movie plot (next animation frame)]

RECSYS 59

Recap SVD

Remember SVD:
 A: input data matrix
 U: left singular vectors
 V: right singular vectors
 Σ: singular values

 A_{m×n} ≈ U Σ V^T

SVD gives the minimum reconstruction error (SSE):
 min_{U,Σ,V} Σ_{ij} ( A_ij − [U Σ V^T]_ij )²

So, in our case, "SVD" on Netflix data: R ≈ Q · P^T
 A = R, Q = U, P^T = Σ V^T

 r̂_xi = q_i · p_x^T

But we are not done yet! R has missing entries:
 the SVD sum goes over all entries, but our R has missing entries

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i·p_x^T )²

 r̂_xi = q_i · p_x^T

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]

RECSYS 61

Dealing with Missing Entries
 Want to minimize SSE for the unseen test data
 Idea: Minimize SSE on the training data
  • Want large f (# of factors) to capture all the signals
  • But, SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
  • Allow a rich model where there are sufficient data
  • Shrink aggressively where data are scarce

 min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

 λ … regularization parameter; the first term is the "error", the second the "length"

[figure: utility matrix of known ratings]

RECSYS 62

[figure: movies and users (e.g., Gus, Dave) plotted together in the two-factor space; "geared towards females/males" × "serious/escapist"]

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

RECSYS 63

Dealing with Missing Entries
 Want to minimize SSE for the unseen test data
 Idea: Minimize SSE on the training data
  • Want large f (# of factors) to capture all the signals
  • But, SSE on test data begins to rise for f > 2
 Regularization is needed:
  • Allow a rich model where there are sufficient data
  • Shrink aggressively where data are scarce

 min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

 λ … regularization parameter; the first term is the "error", the second the "length"

[figure: utility matrix of known ratings]

RECSYS 64

The Effect of Regularization

[figure: movies plotted on two latent factors; Factor 1: "geared towards females" ↔ "geared towards males"; Factor 2: "serious" ↔ "funny"]

min_factors "error" + λ·"length"

min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 65

The Effect of Regularization

[figure: the same two-factor movie plot (next animation frame)]

min_factors "error" + λ·"length"

min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 66

The Effect of Regularization

[figure: the same two-factor movie plot (next animation frame)]

min_factors "error" + λ·"length"

min_{P,Q}  Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 67

The Effect of Regularization

min_factors "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

(Figure: the same latent-space plot of the movies, shown at the final stage of the regularization animation.)

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently.
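The batch update above can be sketched in a few lines. This is a minimal illustration on a tiny dense example, not the Netflix-scale implementation; the data, hyper-parameters, and variable names are all illustrative assumptions.

```python
import numpy as np

# Toy ratings matrix; 0 marks a missing rating, so only known entries
# enter the error term via the mask.
R = np.array([[5, 4, 0, 1],
              [4, 0, 3, 1],
              [1, 1, 0, 5],
              [0, 2, 4, 4]], dtype=float)
mask = R > 0

f, eta, lam = 2, 0.01, 0.1                       # factors, learning rate, lambda
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((R.shape[1], f))   # users x factors
Q = 0.1 * rng.standard_normal((R.shape[0], f))   # items x factors

for _ in range(5000):
    E = mask * (R - Q @ P.T)                 # residuals r_xi - q_i . p_x on known entries
    Q += eta * (2 * E @ P - 2 * lam * Q)     # Q <- Q - eta * grad_Q (full gradient)
    P += eta * (2 * E.T @ Q - 2 * lam * P)   # P <- P - eta * grad_P

sse = float(((mask * (R - Q @ P.T)) ** 2).sum())
```

Note that each step touches every known rating, which is exactly why the slide calls computing gradients slow.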

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:

- Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi). Here q_if is entry f of row q_i of matrix Q
- GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
- SGD: Q ← Q − η·∇Q(r_xi)
- Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
- SGD gives faster convergence: it needs more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs GD: Convergence of GD vs. SGD

(Plot: value of the objective function vs. iteration/step, for GD and SGD.)

GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster.

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i·p_xᵀ  (derivative of the "error")
  • q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)
- Two for-loops: for until convergence: for each r_xi: compute the gradient, do a "step"

η … learning rate
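The per-rating update loop above can be sketched as follows — a minimal illustration assuming a tiny rating dictionary and illustrative hyper-parameters, not a tuned implementation.

```python
import numpy as np

def sgd_factorize(ratings, f=2, eta=0.01, lam=0.05, epochs=2000, seed=0):
    """Factor a sparse rating dict {(user, item): r} into user factors P and
    item factors Q by stochastic gradient descent, one step per rating."""
    rng = np.random.default_rng(seed)
    n_users = 1 + max(x for x, _ in ratings)
    n_items = 1 + max(i for _, i in ratings)
    P = 0.1 * rng.standard_normal((n_users, f))   # p_x = row x of P
    Q = 0.1 * rng.standard_normal((n_items, f))   # q_i = row i of Q
    for _ in range(epochs):                       # for until convergence
        for (x, i), r in ratings.items():        # for each r_xi: one "step"
            err = r - Q[i] @ P[x]                 # eps_xi = r_xi - q_i . p_x
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * Q[i] - lam * P[x])
    return P, Q

# toy data: (user, item) -> rating
ratings = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 1): 4,
           (2, 2): 1, (2, 3): 2, (3, 2): 2}
P, Q = sgd_factorize(ratings)
pred = Q[1] @ P[0]   # reconstructed rating of user 0 on item 1 (true value: 4)
```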

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

- Goal: Make good recommendations. Quantify goodness using SSE, so lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen
- Let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will predict the unknown ratings well
- This is a case where we apply optimization methods

(Figure: the sparse ratings matrix.)

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data, 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

(Tables: example (user, movie, score) triples for the training data; the test data lists (user, movie) pairs with the scores withheld.)

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i·p_xᵀ

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i·p_xᵀ = user-movie interaction

User-Movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
- Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
- Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2

1162019
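The slide's arithmetic can be checked directly; the three quantities below are just the example's values, not learned parameters.

```python
mu = 3.7          # overall mean rating
b_x = -1.0        # critical reviewer: rates 1 star below the mean
b_i = 0.5         # Star Wars: rated 0.5 above the average movie
baseline = mu + b_x + b_i   # baseline prediction, before the q_i . p_x term
print(round(baseline, 1))   # → 3.2
```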

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_xᵀ))² + λ (Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)

The first term is the goodness of fit; the second is the regularization. λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
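Fitting the bias model by SGD follows the same per-rating updates as before, with two extra scalar parameters per user and per item. A minimal sketch, assuming toy data and illustrative hyper-parameters:

```python
import numpy as np

# r_xi ≈ mu + b_u[x] + b_i[i] + q_i . p_x ; biases and factors are all learned.
ratings = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 2): 2, (2, 1): 1, (2, 2): 1}
mu = float(np.mean(list(ratings.values())))      # overall mean rating

n_u, n_i, f, eta, lam = 3, 3, 2, 0.01, 0.05
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_u, f))
Q = 0.1 * rng.standard_normal((n_i, f))
b_u = np.zeros(n_u)                              # user biases
b_i = np.zeros(n_i)                              # item biases

for _ in range(2000):
    for (x, i), r in ratings.items():
        err = r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])
        b_u[x] += eta * (err - lam * b_u[x])
        b_i[i] += eta * (err - lam * b_i[i])
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])
```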

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

(Plot: RMSE, roughly 0.885-0.92, vs. millions of parameters, 1 to 1000 on a log scale, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.)

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

- Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed
- Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_xᵀ

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

(Plot: RMSE, roughly 0.875-0.92, vs. millions of parameters, 1 to 10000 on a log scale, for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.)

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

(Chart: error (RMSE) vs. number of predictors in the ensemble, 1 to 50; RMSE falls from about 0.889 with one predictor to about 0.871 with fifty, with sharply diminishing returns.)

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
- A group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score of over 10% improvement

BellKor:
- Continue to get small improvements in their scores
- Realize that they are in direct competition with Ensemble

Strategy:
- Both teams carefully monitor the leaderboard
- The only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day: only 1 final submission could be made in the last 24h

24 hours before the deadline…
- A BellKor team member in Austria notices (by chance) that Ensemble has posted a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
- Much computer time spent on final optimization
- Carefully calibrated to end about an hour before the deadline

Final submissions:
- BellKor submits a little early (on purpose), 40 mins before the deadline
- Ensemble submits their final entry 20 mins later
- …and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
- Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com

1162019


RECSYS 26

Collaborative Filtering (CF)

RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x

2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g. using k-nearest neighbors (KNN)

3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings.

- Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|. Problem: ignores the value of the rating
- Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| · ||r_y||). Problem: treats missing ratings as "negative"
- Pearson correlation coefficient: let S_xy = items rated by both users x and y; then
  sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) )
  where r̄_x, r̄_y … average rating of x, y

Examples:
- r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
- r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

- Jaccard similarity: 1/5 < 2/4 — fails
- Cosine similarity: 0.386 > 0.322 — works, but considers missing ratings as "negative"
- Solution: subtract the (row) mean before taking the cosine; then sim(A, B) = 0.092 > sim(A, C) = −0.559
- Notice: cosine similarity is correlation when the data is centered at 0
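The centered-cosine numbers above can be reproduced with the lecture's three-user example (rows A, B, C over seven items, 0 meaning "not rated"); the arrays below are that example, everything else is a small sketch.

```python
import numpy as np

A = np.array([4, 0, 0, 5, 1, 0, 0], dtype=float)
B = np.array([5, 5, 4, 0, 0, 0, 0], dtype=float)
C = np.array([0, 0, 0, 2, 4, 5, 0], dtype=float)

def center(r):
    """Subtract the user's mean rating from rated items; missing stay 0."""
    return np.where(r > 0, r - r[r > 0].mean(), 0.0)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cos(center(A), center(B)), 3))   # →  0.092
print(round(cos(center(A), center(C)), 3))   # → -0.559
```

Centering turns the cosine into the Pearson correlation, which is why sim(A, B) now comes out above sim(A, C).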

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings, and let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:
- r̂_xi = (1/k) Σ_{y∈N} r_yi
- Other option: r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
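The two prediction rules can be sketched directly; the neighbour similarities and ratings below are made-up illustrative values.

```python
import numpy as np

# (s_xy, r_yi) for each neighbour y of user x that has rated item i
neighbours = {"y1": (0.9, 4.0), "y2": (0.5, 3.0), "y3": (0.3, 5.0)}
sims = np.array([s for s, _ in neighbours.values()])
rats = np.array([r for _, r in neighbours.values()])

r_simple = rats.mean()                     # unweighted average over N
r_weighted = (sims @ rats) / sims.sum()    # similarity-weighted average
print(round(r_simple, 2), round(r_weighted, 2))   # → 4.0 3.88
```

The weighted rule pulls the estimate toward the most similar neighbours, here down toward y1's rating of 4.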

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering

- So far: user-user collaborative filtering
- Another view: item-item
  • For item i, find other similar items
  • Estimate the rating for item i based on the target user's ratings on items similar to item i
  • Can use the same similarity metrics and prediction functions as in the user-user model

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x that are similar to i

RECSYS 32

Item-Item CF (|N|=2)

movies \ users   1  2  3  4  5  6  7  8  9 10 11 12
      1          1  .  3  .  .  5  .  .  5  .  4  .
      2          .  .  5  4  .  .  4  .  .  2  1  3
      3          2  4  .  1  2  .  3  .  4  3  5  .
      4          .  2  4  .  5  .  .  4  .  .  2  .
      5          .  .  4  3  4  2  .  .  .  .  2  5
      6          1  .  3  .  3  .  .  2  .  .  4  .

("." = unknown rating; numbers are ratings between 1 and 5)

RECSYS 33

Item-Item CF (|N|=2)

movies \ users   1  2  3  4  5  6  7  8  9 10 11 12
      1          1  .  3  .  ?  5  .  .  5  .  4  .
      2          .  .  5  4  .  .  4  .  .  2  1  3
      3          2  4  .  1  2  .  3  .  4  3  5  .
      4          .  2  4  .  5  .  .  4  .  .  2  .
      5          .  .  4  3  4  2  .  .  .  .  2  5
      6          1  .  3  .  3  .  .  2  .  .  4  .

- Estimate the rating of movie 1 by user 5 (the "?" entry)

RECSYS 34

Item-Item CF (|N|=2)

movies \ users   1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
      1          1  .  3  .  ?  5  .  .  5  .  4  .     1.00
      2          .  .  5  4  .  .  4  .  .  2  1  3    −0.18
      3          2  4  .  1  2  .  3  .  4  3  5  .     0.41
      4          .  2  4  .  5  .  .  4  .  .  2  .    −0.10
      5          .  .  4  3  4  2  .  .  .  .  2  5    −0.31
      6          1  .  3  .  3  .  .  2  .  .  4  .     0.59

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use the Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

movies \ users   1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
      1          1  .  3  .  ?  5  .  .  5  .  4  .     1.00
      2          .  .  5  4  .  .  4  .  .  2  1  3    −0.18
      3          2  4  .  1  2  .  3  .  4  3  5  .     0.41
      4          .  2  4  .  5  .  .  4  .  .  2  .    −0.10
      5          .  .  4  3  4  2  .  .  .  .  2  5    −0.31
      6          1  .  3  .  3  .  .  2  .  .  4  .     0.59

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

movies \ users   1  2  3  4   5   6  7  8  9 10 11 12
      1          1  .  3  .  2.6  5  .  .  5  .  4  .
      2          .  .  5  4   .   .  4  .  .  2  1  3
      3          2  4  .  1   2   .  3  .  4  3  5  .
      4          .  2  4  .   5   .  .  4  .  .  2  .
      5          .  .  4  3   4   2  .  .  .  .  2  5
      6          1  .  3  .   3   .  .  2  .  .  4  .

Predict by taking a weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
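The whole worked example — mean-centering, Pearson similarities, and the weighted-average prediction — can be reproduced in a short sketch over the same 6×12 matrix (0 marks a missing rating):

```python
import numpy as np

R = np.array([  # rows = movies 1..6, cols = users 1..12, 0 = missing
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def centered(row):
    """Subtract the row mean over observed entries; missing stay 0."""
    m = row[row > 0].mean()
    return np.where(row > 0, row - m, 0.0)

C = np.array([centered(r) for r in R])

def sim(i, j):  # Pearson correlation = cosine of the mean-centered rows
    return C[i] @ C[j] / (np.linalg.norm(C[i]) * np.linalg.norm(C[j]))

# the two nearest neighbours of movie 1 that user 5 (index 4) has rated
s13, s16 = sim(0, 2), sim(0, 5)      # ≈ 0.41 and 0.59
r15 = (s13 * R[2, 4] + s16 * R[5, 4]) / (s13 + s16)
print(round(r15, 1))                 # → 2.6
```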

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

- Define the similarity s_ij of items i and j
- Select the K nearest neighbors (KNN) N(i; x): the items most similar to item i that were rated by x
- Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i; x) = set of items similar to item i that were rated by x; s_ij = similarity of items i and j; r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

(Example utility matrix over the movies Avatar, LOTR, Matrix, and Pirates for users Alice, Bob, Carol, and David.)

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed
− Cold start: need enough users in the system to find a match
− Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
− First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
− Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

(Utility matrix: rows = movies, columns = users; entries are the known ratings, e.g. 1, 3, 4, 3, 5, 5, …)

1162019

RECSYS 43

Evaluation

(The same utility matrix, with some known ratings withheld as the test data set.)

1162019

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:

- Root-mean-square error (RMSE): √( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
- Precision at top 10: fraction of relevant items among the top 10 predictions
- Rank correlation: Spearman's correlation between the system's and the user's complete rankings
- Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve: Y-axis = true positive rate (TPR), X-axis = false positive rate (FPR)
    TPR (a.k.a. recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
    See https://en.wikipedia.org/wiki/Precision_and_recall

RECSYS 45

Problems with Error Measures

- A narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions
- In practice we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering: Complexity

- The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system
- Too expensive to do at runtime; could pre-compute, using clustering as an approximation
- Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points
- We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction

1162019

RECSYS 47

Tip: Add Data

- Leverage all the data: don't try to reduce data size in an effort to make fancy algorithms work
- Simple methods on large data do best
- Add more data, e.g. add IMDB data on genres
- More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

(Figure: movies placed in a two-dimensional latent space — one axis "geared towards females" ↔ "geared towards males", the other "serious" ↔ "funny": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.)

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

(Matrix R: roughly 480,000 users × 17,700 movies; a sparse scattering of known ratings such as 1, 3, 4, 3, 5, 5, …)

RECSYS 52

Utility Matrix R: Evaluation

(Matrix R: 480,000 users × 17,700 movies, split into a training data set and a test data set; e.g. the highlighted entry r₃₆ must be predicted.)

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i.
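Computing the SSE over test entries only is a one-liner with a mask; the tiny matrices below are illustrative, with NaN marking ratings that are not in the test set.

```python
import numpy as np

R_true = np.array([[5.0, np.nan],
                   [np.nan, 3.0]])      # NaN = not in the test set
R_pred = np.array([[4.5, 2.0],
                   [1.0, 3.5]])         # model predictions everywhere

mask = ~np.isnan(R_true)                # evaluate only where truth is known
sse = float(np.sum((R_pred[mask] - R_true[mask]) ** 2))
print(sse)   # → 0.5
```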

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · Pᵀ

For now, let's assume we can approximate the rating matrix R as a product of "thin" Q · Pᵀ. R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

R (items × users, the sparse ratings matrix) ≈ Q · Pᵀ with f = 3 factors:

Q (items × f):
  .1  −.4   .2
 −.5   .6   .5
 −.2   .3   .5
 1.1  2.1   .3
 −.7  2.1  −2
 −1    .7   .3

Pᵀ (f × users):
 1.1  −.2   .3   .5  −2   −.5   .8  −.4   .3  1.4  2.4  −.9
 −.8   .7   .5  1.4   .3  −1   1.4  2.9  −.7  1.2  −.1  1.3
 2.1  −.4   .6  1.7  2.4   .9  −.3   .4   .8   .7  −.6   .1

SVD: A = U Σ Vᵀ

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of Pᵀ.

(The R ≈ Q · Pᵀ factor matrices are as on the previous slide.)

RECSYS 55

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of Pᵀ.

(The R ≈ Q · Pᵀ factor matrices are as on the previous slide; the relevant row of Q and column of Pᵀ are highlighted.)

RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of Pᵀ.

(The R ≈ Q · Pᵀ factor matrices are as on the previous slide; for the highlighted entry the estimate comes out as ≈ 2.4.)

RECSYS 57

Latent Factor Models

(Figure: movies placed in the latent space — Factor 1: "geared towards females" ↔ "geared towards males"; Factor 2: "serious" ↔ "funny" — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.)

RECSYS 58

Latent Factor Models

(Figure: the same latent-space plot of the movies along Factor 1 and Factor 2.)

RECSYS 59

Recap: SVD

Remember SVD: A = U Σ Vᵀ, i.e. A (m × n) ≈ U Σ Vᵀ
- A: input data matrix
- U: left singular vectors
- V: right singular vectors
- Σ: singular values

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (A_ij − [U Σ Vᵀ]_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · Pᵀ, with A = R, Q = U, Pᵀ = Σ Vᵀ, and r̂_xi = q_i · p_xᵀ

But we are not done yet! The SSE sum goes over all entries, while our R has missing entries.
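The minimum-SSE property is easy to see numerically: truncating the SVD to rank r gives the best rank-r approximation, and the SSE equals the sum of the discarded squared singular values. A small sketch on an illustrative full (no missing entries) matrix:

```python
import numpy as np

A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

U, s, Vt = np.linalg.svd(A)          # singular values in decreasing order
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # best rank-2 approximation in SSE

sse = float(((A - A_r) ** 2).sum())  # equals the smallest singular value squared
print(round(sse, 6))                 # → 1.0
```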

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_xᵀ)²,  with r̂_xi = q_i·p_xᵀ

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

(R ≈ Q · Pᵀ, items × users, with f factors, as before.)

RECSYS 61

Dealing with Missing Entries

- Want to minimize SSE for unseen test data
- Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2
- Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

λ … regularization parameter; "error" + λ·"length"

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

The PrincessDiaries

min_factors "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization


The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

min_factors "error" + λ·"length"

The PrincessDiaries

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

The PrincessDiaries

min_factors "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent: Initialize P and Q (e.g., using SVD; pretend missing ratings are 0). Do gradient descent:

• P ← P − η·∇P
• Q ← Q − η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i·p_x^T ) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q.

Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera.

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent: Initialize P and Q (e.g., using SVD; pretend missing ratings are 0). Do gradient descent:

• P ← P − η·∇P
• Q ← Q − η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i·p_x^T ) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q.

Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

How to compute the gradient of a matrix? Compute the gradient of every element independently!

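The element-wise gradients above can be written as two matrix products; this is a sketch under the slide's objective (names `grads`/`gd_step` are mine, missing ratings encoded as NaN):

```python
import numpy as np

def grads(R, P, Q, lam):
    """Gradients of the regularized SSE w.r.t. P (users x f) and Q (items x f).
    Missing ratings (np.nan) contribute nothing to the error term."""
    known = ~np.isnan(R)                      # users x items mask
    err = np.where(known, R - P @ Q.T, 0.0)   # r_xi - p_x . q_i on known entries
    gP = -2 * err @ Q + 2 * lam * P           # stacks dJ/dp_xf
    gQ = -2 * err.T @ P + 2 * lam * Q         # stacks dJ/dq_if
    return gP, gQ

def gd_step(R, P, Q, lam, eta):
    """One batch gradient-descent step: P <- P - eta*gradP, Q <- Q - eta*gradQ."""
    gP, gQ = grads(R, P, Q, lam)
    return P - eta * gP, Q - eta * gQ
```

A small enough step along the negative gradient lowers the objective, which is exactly what "do gradient descent" on the slide means.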
RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 ( r_xi − q_i·p_x^T ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q.

GD: Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − η·∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster.


RECSYS 73

SGD vs GD Convergence of GD vs SGD


[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (e.g., using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i·p_x^T (the "error")
• q_i ← q_i + η ( ε_xi·p_x − λ·q_i ) (update equation)
• p_x ← p_x + η ( ε_xi·q_i − λ·p_x ) (update equation)

2 for-loops:
• For until convergence:
  • For each r_xi: compute gradient, do a "step"


η … learning rate

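The two-loop SGD above can be sketched in a few lines; this is a toy illustration (function name, hyperparameter values, and the triple-list input format are assumptions, not the slides' code):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.05,
                  epochs=500, seed=0):
    """SGD for R ~ P Q^T following the slide's update equations.
    ratings: list of (x, i, r_xi) triples for the known entries."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, f))   # user factors
    Q = 0.1 * rng.normal(size=(n_items, f))   # item factors
    for _ in range(epochs):                   # "for until convergence"
        for x, i, r in ratings:               # "for each r_xi"
            err = r - Q[i] @ P[x]             # eps_xi = r_xi - q_i . p_x
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * Q[i] - lam * P[x])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
P, Q = sgd_factorize(ratings, n_users=3, n_items=3)
print(round(Q[0] @ P[0], 1))   # reconstructed rating for (user 0, item 0), near 5
```

Note the regularization λ shrinks the reconstruction slightly below the raw ratings, as the slides intend.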
RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen.

Let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.


[sparse utility matrix R]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Example tables of (user, movie, score) triples: training data with known scores, test data with withheld scores]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly: movie 16322
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T
μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i
(user bias + movie bias + user–movie interaction)

User–movie interaction: Characterizes the matching between users and movies. Attracts most research in the field. Benefits from algorithmic and mathematical innovations.

Baseline predictor: Separates users and movies. Benefits from insights into users' behavior. Among the main practical contributions of the competition.

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5. Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2

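The baseline arithmetic above is trivially checkable; a one-function sketch (function name is mine):

```python
def baseline(mu, b_x, b_i):
    """Baseline estimate b_xi = mu + b_x + b_i (no interaction term)."""
    return mu + b_x + b_i

# the slide's example: critical reviewer (-1 star), Star Wars (+0.5 star)
print(round(baseline(3.7, -1.0, 0.5), 1))  # 3.2
```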

RECSYS 105

Fitting the New Model

Solve:

Stochastic gradient descent to find the parameters. Note: Both the biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).


regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²  +  λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )

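Extending the earlier SGD updates to the bias model is mechanical; this sketch applies the same ε-times-gradient pattern to b_x and b_i (the single-step helper and its argument order are assumptions for illustration):

```python
import numpy as np

def sgd_biased_step(r, mu, b_x, b_i, p_x, q_i, eta, lam):
    """One SGD update for the bias model r_xi ~ mu + b_x + b_i + q_i.p_x.
    b_x, b_i, q_i, p_x are all parameters; mu is the fixed global mean."""
    err = r - (mu + b_x + b_i + q_i @ p_x)    # eps_xi
    b_x = b_x + eta * (err - lam * b_x)       # user-bias update
    b_i = b_i + eta * (err - lam * b_i)       # movie-bias update
    q_i = q_i + eta * (err * p_x - lam * q_i)
    p_x = p_x + eta * (err * q_i - lam * p_x)
    return b_x, b_i, q_i, p_x
```

Repeating this step over the training ratings drives the "goodness of fit" term down while λ keeps the parameter lengths small.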
RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

Multi-scale modeling of the data: Combine top-level, "regional" modeling of the data with a refined local view:
• Global: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]


RECSYS 108

Grand Prize: 0.8563
Netflix (Cinematch): 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods (RMSE)
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: Users prefer new movies without any reasons; older movies are just inherently better than newer ones.


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

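A minimal sketch of the binned time-dependent bias b_i(t) described above (the helper name and the weeks-based time unit are assumptions; Koren's paper has richer parameterizations):

```python
def binned_bias(b_i, b_i_bins, t, bin_weeks=10):
    """Time-dependent movie bias b_i(t) = b_i + b_{i,Bin(t)}:
    a static part plus a per-bin offset, one bin per `bin_weeks` weeks.
    t: the rating's age in weeks; b_i_bins: list of bin offsets."""
    bin_idx = min(int(t) // bin_weeks, len(b_i_bins) - 1)
    return b_i + b_i_bins[bin_idx]

# e.g. a movie whose ratings drift upward over time
print(round(binned_bias(0.2, [0.0, 0.1, 0.3], t=25), 2))  # week 25 -> bin 2 -> 0.5
```

The bin offsets are extra parameters, fit by the same SGD as the static biases.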
RECSYS 111

Adding Temporal Effects


[Plot: RMSE vs. millions of parameters; adding time bias, linear time factors, per-day user biases, and CF successively lowers RMSE]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods (RMSE)
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) vs. number of predictors in the ensemble; RMSE falls from ~0.889 with 1 predictor to ~0.871 with 50]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: Group of other teams on the leaderboard forms a new team. Relies on combining their models. Quickly also get a qualifying score over 10%.

BellKor: Continue to get small improvements in their scores. Realize that they are in direct competition with Ensemble.

Strategy: Both teams carefully monitoring the leaderboard. The only sure way to check for improvement is to submit a set of predictions.
• This alerts the other team of your latest score!


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day. Only 1 final submission could be made in the last 24h.

24 hours before the deadline…
A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's.

Frantic last 24 hours for both teams: Much computer time on final optimization. Carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com



RECSYS 27

Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering

1. Consider user x
2. Find a set N of other users whose ratings are "similar" to x's ratings, e.g., using K-nearest neighbors (KNN)
3. Recommend items to x based on the weighted ratings of items by users in N

RECSYS 28

Similar Users

Let r_x be the vector of user x's ratings.

Jaccard similarity measure: sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y| (treating r_x, r_y as sets of rated items). Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(r_x, r_y) = r_x·r_y / (‖r_x‖·‖r_y‖). Problem: treats missing ratings as "negative".

Pearson correlation coefficient: with S_xy = items rated by both users x and y,
sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √Σ_{s∈S_xy} (r_xs − r̄_x)² · √Σ_{s∈S_xy} (r_ys − r̄_y)² )

Example:
r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
r_x, r_y as points: r_x = (1, 0, 0, 1, 3), r_y = (1, 0, 2, 2, 0)
r̄_x, r̄_y … avg rating of x, y

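The three similarity measures can be sketched directly on rating vectors that use 0 for a missing rating (function names are mine; the Pearson variant centers over the co-rated items S_xy, as on the slide):

```python
import numpy as np

def jaccard(rx, ry):
    """Similarity of the *sets* of rated items; ignores rating values."""
    sx = {i for i, r in enumerate(rx) if r > 0}
    sy = {i for i, r in enumerate(ry) if r > 0}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    """Treats a missing rating (0) as a 'negative' vote."""
    rx, ry = np.asarray(rx, float), np.asarray(ry, float)
    return rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry))

def pearson(rx, ry):
    """Centered cosine: subtract each user's mean over co-rated items S_xy."""
    rx, ry = np.asarray(rx, float), np.asarray(ry, float)
    both = (rx > 0) & (ry > 0)
    cx = rx[both] - rx[both].mean()
    cy = ry[both] - ry[both].mean()
    denom = np.linalg.norm(cx) * np.linalg.norm(cy)
    return cx @ cy / denom if denom else 0.0
```

On the slide's example vectors r_x = (1,0,0,1,3), r_y = (1,0,2,2,0), Jaccard gives 2/4 = 0.5.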
RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4 — fails.

Cosine similarity: 0.386 > 0.322 — works, but considers missing ratings as "negative".
Solution: subtract the (row) mean (centered cosine).

After centering, sim(A, B) vs. sim(A, C): 0.092 > −0.559 ✓
Notice: cosine similarity is correlation when the data is centered at 0.

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings.

Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x:
• r̂_xi = (1/k) Σ_{y∈N} r_yi
• Other option: r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy

Many other tricks possible…

Shorthand: s_xy = sim(x, y)

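The similarity-weighted option above is a one-liner; a sketch with an assumed (similarity, rating) pair input:

```python
def predict_user_user(neighbors):
    """Similarity-weighted average over the k most similar users N who rated i:
    r_xi = sum_y s_xy * r_yi / sum_y s_xy.
    neighbors: list of (s_xy, r_yi) pairs."""
    num = sum(s * r for s, r in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

print(predict_user_user([(0.9, 4), (0.5, 2)]))  # pulled toward the more similar user's 4
```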
RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: User-user collaborative filtering.

Another view: Item-item. For item i, find other similar items. Estimate the rating for item i based on the target user's ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model.


r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12
movie 1: 1  .  3  .  .  5  .  .  5  .  4  .
movie 2: .  .  5  4  .  .  4  .  .  2  1  3
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .
movie 4: .  2  4  .  5  .  .  4  .  .  2  .
movie 5: .  .  4  3  4  2  .  .  .  .  2  5
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; numbers = ratings between 1 and 5

RECSYS 33

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12
movie 1: 1  .  3  .  ?  5  .  .  5  .  4  .
movie 2: .  .  5  4  .  .  4  .  .  2  1  3
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .
movie 4: .  2  4  .  5  .  .  4  .  .  2  .
movie 5: .  .  4  3  4  2  .  .  .  .  2  5
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .

? = estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1: 1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2: .  .  5  4  .  .  4  .  .  2  1  3    −0.18
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4: .  2  4  .  5  .  .  4  .  .  2  .    −0.10
movie 5: .  .  4  3  4  2  .  .  .  .  2  5    −0.31
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .     0.59

Neighbor selection: Identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the centered rows.

RECSYS 35

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
movie 1: 1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2: .  .  5  4  .  .  4  .  .  2  1  3    −0.18
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4: .  2  4  .  5  .  .  4  .  .  2  .    −0.10
movie 5: .  .  4  3  4  2  .  .  .  .  2  5    −0.31
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .     0.59

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

users:   1  2  3  4  5  6  7  8  9 10 11 12
movie 1: 1  .  3  . 2.6  5  .  .  5  .  4  .
movie 2: .  .  5  4  .  .  4  .  .  2  1  3
movie 3: 2  4  .  1  2  .  3  .  4  3  5  .
movie 4: .  2  4  .  5  .  .  4  .  .  2  .
movie 5: .  .  4  3  4  2  .  .  .  .  2  5
movie 6: 1  .  3  .  3  .  .  2  .  .  4  .

Predict by taking the weighted average:
r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

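The weighted-average prediction is easy to reproduce; a sketch using the slide's |N| = 2 neighborhood (function name is mine):

```python
def predict_item_item(sims_and_ratings):
    """r_xi = sum_j s_ij * r_xj / sum_j s_ij over the neighborhood N(i; x).
    sims_and_ratings: list of (s_ij, r_xj) pairs."""
    num = sum(s * r for s, r in sims_and_ratings)
    den = sum(s for s, _ in sims_and_ratings)
    return num / den

# movie 1, user 5: s_13 = 0.41 with r_53 = 2, and s_16 = 0.59 with r_56 = 3
print(round(predict_item_item([(0.41, 2), (0.59, 3)]), 1))  # 2.6
```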
RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:


r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]


In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item: No feature selection needed.

− Cold start: Need enough users in the system to find a match.

− Sparsity: The user/ratings matrix is sparse. Hard to find users that have rated the same items.

− First rater: Cannot recommend an item that has not been previously rated (new items, esoteric items).

− Popularity bias: Cannot recommend items to someone with unique taste. Tends to recommend popular items.

RECSYS 40

Hybrid Methods: Implement two or more different recommenders and combine the predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem.


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[users × movies utility matrix with known ratings 1–5]


RECSYS 43

Evaluation

[the same utility matrix with some ratings withheld as a test set]


RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.

• Root-mean-square error (RMSE): sqrt( Σ_{(x,i)} ( r̂_xi − r_xi )² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i.
• Precision at top 10: % of relevant items among those in the top 10.
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings.

Another approach: a 0/1 model:
• Coverage: number of items/users for which the system can make predictions.
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve:
  – Y-axis: True Positive Rate (TPR), X-axis: False Positive Rate (FPR)
  – TPR (aka Recall) = TP / P = TP / (TP + FN)
  – FPR = FP / N = FP / (FP + TN)
  – See https://en.wikipedia.org/wiki/Precision_and_recall

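Two of the metrics above are short enough to sketch directly (function names and the (predicted, true) pair format are assumptions):

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations the user actually liked."""
    top = recommended[:k]
    return sum(1 for item in top if item in relevant) / len(top)

print(rmse([(3.5, 4), (2.0, 2), (5.0, 4)]))
print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=2))
```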
RECSYS 45

Problems with Error Measures: A narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for the others.


RECSYS 46

Collaborative Filtering Complexity

Expensive step: finding the k most similar customers, O(|X|). Recall that X = set of customers in the system. Too expensive to do at runtime: could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N·|C|), where |C| = # of clusters (= k in k-means) and N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.


RECSYS 47

Tip: Add Data. Leverage all the data.

Don't try to reduce data size in an effort to make fancy algorithms work. Simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering via Latent Factor Models (e.g., SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: sparse ratings 1–5 from 480,000 users on 17,700 movies]


Matrix R

RECSYS 52

Utility Matrix R Evaluation

[Matrix R, with some entries (e.g., r̂_36) held out as a test set; the rest is the training set]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²
r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT.

R has missing entries, but let's ignore that for now: basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[Illustration: items × users matrix R factored as Q (items × f factors) times PT (f factors × users)]

SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Illustration: R ≈ Q · PT]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of PT

RECSYS 55

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Illustration: R ≈ Q · PT]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of PT

RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Illustration: R ≈ Q · PT; the missing entry is estimated as 2.4]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models


The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Factor 2

RECSYS 59

Recap SVD

Remember SVD:
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

A ≈ U Σ VT

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} ( A_ij − [U Σ VT]_ij )²

So in our case, "SVD" on Netflix data R ≈ Q · PT: A = R, Q = U, PT = Σ VT, and r̂_xi = q_i · p_x^T

But we are not done yet! The SSE sum goes over all entries, while our R has missing entries.

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i·p_x^T )²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants


[Illustration: R ≈ Q · PT]

r̂_xi = q_i · p_x^T

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

The Effect of Regularization
[Figure: movies in the 2-D factor space — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]
 min_factors "error" + λ·"length":
 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]


RECSYS 68

Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q:
 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]
 Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • Where ∇Q is the gradient/derivative of matrix Q:
 ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow
How to compute gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent
by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent
 Want to find matrices P and Q:
 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ·[Σ_x ||p_x||² + Σ_i ||q_i||²]
 Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • Where ∇Q is the gradient/derivative of matrix Q:
 ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow
How to compute gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if] where
 ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
 • Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)
 SGD: Q ← Q − μ·∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 Faster convergence!
 • Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD
 Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and SGD]
 GD improves the value of the objective function at every step
 SGD improves the value, but in a "noisy" way
 GD takes fewer steps to converge, but each step takes much longer to compute
 In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent
 Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors.
 For each r_xi:
 • ε_xi = r_xi − q_i·p_x^T (derivative of the "error")
 • q_i ← q_i + η·(ε_xi·p_x − λ·q_i) (update equation)
 • p_x ← p_x + η·(ε_xi·q_i − λ·p_x) (update equation)
 2 for loops:
 For until convergence:
 • For each r_xi:
 • Compute gradient, do a "step"
 η … learning rate
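The update loop above can be sketched in a few lines of numpy. This is a minimal illustration, not the contestants' actual code; the function name, the toy ratings, and the hyperparameter values are all invented for the example.

```python
import numpy as np

def sgd_latent_factors(ratings, n_users, n_items, f=2, eta=0.01, lam=0.1,
                       epochs=300, seed=0):
    """Fit R ~= P Q^T by SGD over the known ratings only.

    ratings: list of (x, i, r_xi) triples; P is n_users x f, Q is n_items x f.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, f))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - Q[i] @ P[x]                    # eps_xi = r_xi - q_i . p_x
            qi = Q[i].copy()                         # use old q_i in both updates
            Q[i] += eta * (err * P[x] - lam * Q[i])  # q_i update equation
            P[x] += eta * (err * qi - lam * P[x])    # p_x update equation
    return P, Q

# Toy data: 5 known ratings of 3 items by 3 users
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
P, Q = sgd_latent_factors(ratings, n_users=3, n_items=3)
sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)  # training SSE shrinks
```

Note the copy of q_i before its update, so that p_x is updated from the old factor values, matching the simultaneous update equations above.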

RECSYS 75 [Koren, Bell, Volinsky, IEEE Computer, 2009]

1162019

RECSYS 76

Summary: Recommendations via Optimization
 Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings
 This is a case where we apply optimization methods
[Figure: utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample training data (user, movie, score) and test data (user, movie, score withheld)]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges
• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i·p_x^T
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
(terms: user bias + movie bias + user-movie interaction)

User-Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into user's behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to number of ratings user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

 Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
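A quick sketch of the baseline predictor μ + b_x + b_i on toy data, estimating each bias as the average deviation from the global mean. This is an illustration only — the full model below fits the biases jointly with regularization — and the user names, movie names, and ratings are invented.

```python
import numpy as np

# (user, movie, rating) triples -- invented toy data
data = [("you", "Amadeus", 2.0), ("you", "Braveheart", 3.0),
        ("ann", "StarWars", 5.0), ("bob", "StarWars", 4.0),
        ("ann", "Amadeus", 4.0), ("bob", "Braveheart", 3.0)]

mu = np.mean([r for _, _, r in data])        # overall mean rating

def user_bias(x):
    """Average deviation of user x's ratings from the global mean."""
    return np.mean([r for u, _, r in data if u == x]) - mu

def movie_bias(i):
    """Average deviation of movie i's ratings from the global mean."""
    return np.mean([r for _, m, r in data if m == i]) - mu

# Baseline prediction: mu + b_x + b_i, as in the Star Wars example above
pred = mu + user_bias("you") + movie_bias("StarWars")
```

Here μ = 3.5, "you" rates 1 star below the mean, and "StarWars" averages 1 star above it, so the baseline prediction comes out to 3.5.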

1162019

RECSYS 105

Fitting the New Model

 Solve:
 min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²
 + λ·[Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²]
 (goodness of fit + regularization)

 Stochastic gradient descent to find parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

RECSYS 107

Performance of Various Methods
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89

RECSYS 109

Temporal Biases Of Users
 Sudden rise in the average movie rating (early 2004):
 Improvements in Netflix GUI
 Meaning of rating changed
 Movie age:
 Users prefer new movies without any reasons
 Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors
 Original model: r_xi = μ + b_x + b_i + q_i·p_x^T
 Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to factors:
 p_x(t) … user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors+Biases: 0.89
Latent factors+Biases+Time: 0.876

Still no prize!  Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size
[Plot: error (RMSE) vs. number of predictors (1-50); RMSE drops from ~0.889 with one predictor to ~0.871 with 50]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days
 Ensemble team formed:
 Group of other teams on leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%
 BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble
 Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline
 Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h
 24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's
 Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline
 Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com


RECSYS 28

Similar Users
 Let r_x be the vector of user x's ratings
 Jaccard similarity measure:
 Problem: Ignores the value of the rating
 Cosine similarity measure:
 sim(x, y) = cos(r_x, r_y) = (r_x·r_y) / (||r_x||·||r_y||)
 Problem: Treats missing ratings as "negative"
 Pearson correlation coefficient:
 sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / [√(Σ_{s∈S_xy}(r_xs − r̄_x)²)·√(Σ_{s∈S_xy}(r_ys − r̄_y)²)]
 S_xy = items rated by both users x and y
 r̄_x, r̄_y … avg rating of x, y
 Example:
 r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
 r_x, r_y as points: r_x = {1, 0, 0, 1, 3}, r_y = {1, 0, 2, 2, 0}

RECSYS 29

Similarity Metric
 Intuitively we want: sim(A, B) > sim(A, C)
 Jaccard similarity: 1/5 < 2/4 — wrong ordering
 Cosine similarity: 0.386 > 0.322 — right ordering, but it considers missing ratings as "negative"
 Solution: subtract the (row) mean (centered cosine):
 sim(A, B) vs. sim(A, C): 0.092 > −0.559
 Notice: cosine similarity is correlation when the data is centered at 0
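The three measures can be checked on the slide's own example vectors (r_x = {1,4,5} → [1,0,0,1,3], r_y = {1,3,4} → [1,0,2,2,0], with 0 meaning unrated). A small sketch; the function names are illustrative.

```python
import numpy as np

rx = np.array([1.0, 0, 0, 1, 3])   # user x's ratings, 0 = unrated
ry = np.array([1.0, 0, 2, 2, 0])   # user y's ratings

def jaccard(a, b):
    # Treats ratings as sets of rated items; ignores the rating values
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    # Treats missing ratings as 0, i.e. as "negative"
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def centered_cosine(a, b):
    # Pearson-style fix: subtract each user's mean over rated items first
    ca, cb = a.copy(), b.copy()
    ca[a != 0] -= a[a != 0].mean()
    cb[b != 0] -= b[b != 0].mean()
    return cosine(ca, cb)

jaccard(rx, ry)          # 2 shared of 4 rated items -> 0.5
cosine(rx, ry)           # ~0.30
centered_cosine(rx, ry)  # 1/6 ~ 0.17
```

Centering shrinks the similarity here because the shared item 1 is rated below both users' means while the zeros stay neutral, illustrating why missing-as-zero overstates agreement.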

RECSYS 30

Rating Predictions
 Let r_x be the vector of user x's ratings
 Let N be the set of k users most similar to x who have rated item i
 Prediction for item i of user x:
 r̂_xi = (1/k)·Σ_{y∈N} r_yi
 r̂_xi = Σ_{y∈N} s_xy·r_yi / Σ_{y∈N} s_xy
 Other options?
 Many other tricks possible…
 Shorthand: s_xy = sim(x, y)
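The second (similarity-weighted) formula fits in one line; `sims` and `ratings` below are illustrative stand-ins for s_xy and r_yi over the neighbor set N.

```python
def predict(sims, ratings):
    """Weighted average: r_hat_xi = sum_y s_xy * r_yi / sum_y s_xy over neighbors y."""
    return sum(sims[y] * ratings[y] for y in sims) / sum(sims.values())

# Two neighbors with similarities 0.9 and 0.3 who rated item i as 5 and 2:
predict({"a": 0.9, "b": 0.3}, {"a": 5.0, "b": 2.0})  # (0.9*5 + 0.3*2) / 1.2 = 4.25
```

The prediction lands much closer to the 5 than the 2 because the more similar neighbor dominates the weighting.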

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
 So far: User-user collaborative filtering
 Another view: Item-item
 For item i, find other similar items
 Estimate rating for item i based on the target user's ratings on items similar to item i
 Can use the same similarity metrics and prediction functions as in the user-user model:
 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij
 s_ij … similarity of items i and j
 r_xj … rating of user x on item j
 N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)
 movies (rows 1-6) × users (columns 1-12); "." = unknown rating, numbers = ratings between 1 to 5:

 movie 1: 1 . 3 . . 5 . . 5 . 4 .
 movie 2: . . 5 4 . . 4 . . 2 1 3
 movie 3: 2 4 . 1 2 . 3 . 4 3 5 .
 movie 4: . 2 4 . 5 . . 4 . . 2 .
 movie 5: . . 4 3 4 2 . . . . 2 5
 movie 6: 1 . 3 . 3 . . 2 . . 4 .

RECSYS 33

Item-Item CF (|N|=2)
 (same utility matrix)
 Goal: estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Neighbor selectionIdentify movies similar to movie 1 rated by user 5

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i

m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]

2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Compute similarity weightss13=041 s16=059

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

RECSYS 36

Item-Item CF (|N|=2)

121110987654321

45526311

3124452

534321423

245424

5224345

423316

users

Predict by taking weighted average

r15 = (0412 + 0593) (041+059) = 26

mov

ies
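The whole worked example can be reproduced in a few lines. A sketch: the matrix is the 6-movie × 12-user one from the slides (re-typed with 0 for unknown), and the helper names are made up.

```python
import numpy as np

# 6 movies x 12 users; 0 = unknown rating
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],   # movie 1 (predict user 5)
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],   # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],   # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],   # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],   # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],   # movie 6
], dtype=float)

def centered(row):
    # Subtract the movie's mean rating from its rated entries only
    c = row.copy()
    c[row != 0] -= row[row != 0].mean()
    return c

def sim(i, j):
    # Pearson correlation = cosine similarity of the mean-centered rows
    a, b = centered(R[i]), centered(R[j])
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i, x = 0, 4                                  # movie 1, user 5 (0-indexed)
rated = [j for j in range(len(R)) if j != i and R[j, x] != 0]
top2 = sorted(rated, key=lambda j: sim(i, j), reverse=True)[:2]   # N(i; x)
r_hat = sum(sim(i, j) * R[j, x] for j in top2) / sum(sim(i, j) for j in top2)
# r_hat = (0.41*2 + 0.59*3) / (0.41 + 0.59) ~ 2.6
```

The two nearest neighbors rated by user 5 come out as movies 3 and 6 (s_13 ≈ 0.41, s_16 ≈ 0.59), matching the slide's hand computation.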

RECSYS 37

Common Practice for Item-Item Collaborative Filtering
 Define similarity s_ij of items i and j
 Select K nearest neighbors (KNN) N(i; x):
 Set of items most similar to item i, that were rated by x
 Estimate rating r_xi as the weighted average:
 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij
 N(i;x) = set of items similar to item i that were rated by x
 s_ij = similarity of items i and j
 r_xj = rating of user x on item j

RECSYS 38

Item-Item vs. User-User
[Utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]
 In practice, it has been observed that item-item often works better than user-user
 Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering
 + Works for any kind of item: no feature selection needed
 − Cold start: need enough users in the system to find a match
 − Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
 − First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
 − Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation
[Utility matrix (movies × users) with known ratings]

RECSYS 43

Evaluation
[Same utility matrix with some ratings withheld as a Test Data Set]

RECSYS 44

Evaluating Predictions
 Compare predictions with known ratings:
 Root-mean-square error (RMSE):
 • √( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
 Precision at top 10:
 • % of those in top 10
 Rank Correlation:
 • Spearman's correlation between system's and user's complete rankings
 Another approach: 0/1 model
 Coverage:
 • Number of items/users for which system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve:
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
 • TPR (aka Recall) = TP / P = TP / (TP + FN)
 • FPR = FP / N = FP / (FP + TN)
 • See https://en.wikipedia.org/wiki/Precision_and_recall
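RMSE itself is a one-liner; a sketch on made-up (predicted, true) pairs:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

rmse([(3.2, 3.0), (4.1, 5.0), (2.0, 2.0)])  # sqrt(0.85 / 3) ~ 0.53
```

Note how the single badly-missed rating (4.1 vs. 5.0) dominates the score — the squaring is exactly what the "Problems with Error Measures" slide below objects to.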

RECSYS 45

Problems with Error Measures
 Narrow focus on accuracy sometimes misses the point:
 Prediction diversity
 Prediction context
 Order of predictions
 In practice, we care only to predict high ratings:
 RMSE might penalize a method that does well for high ratings and badly for others

RECSYS 46

Collaborative Filtering: Complexity
 Expensive step is finding k most similar customers: O(|X|)
 Recall that X = set of customers in the system
 Too expensive to do at runtime
 Could pre-compute using clustering as approximation
 Naïve pre-computation takes time O(N·|C|)
 |C| = # of clusters = k in the k-means; N = # of data points
 We already know how to do this:
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

RECSYS 47

Tip: Add Data
 Leverage all the data
 Don't try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best
 Add more data: e.g., add IMDB data on genres
 More data beats better algorithms
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)
[Figure: movies in a 2-D latent space — geared towards females ↔ geared towards males; serious ↔ funny: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
[Koren, Bell, Volinsky, IEEE Computer, 2009]

RECSYS 51

The Netflix Utility Matrix R
[Matrix R: ~480,000 users × 17,700 movies, sparsely filled with ratings 1-5]

RECSYS 52

Utility Matrix R: Evaluation
[Matrix R split into a Training Data Set and a withheld Test Data Set; r_36 is a held-out rating to predict]
 SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
 r̂_xi = predicted rating; r_xi = true rating of user x on item i

RECSYS 53

Latent Factor Models
 "SVD" on Netflix data: R ≈ Q·P^T
 For now let's assume we can approximate the rating matrix R as a product of "thin" Q·P^T
 R has missing entries, but let's ignore that for now!
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
[Figure: R (users × items) ≈ Q (items × f factors) · P^T (f factors × users); cf. SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?
 r̂_xi = q_i·p_x^T = Σ_f q_if·p_xf
 q_i = row i of Q; p_x = column x of P^T
[Figure: the missing entry of R estimated from row i of Q and column x of P^T; in the example, the estimated rating is 2.4]


RECSYS 57

Latent Factor Models
[Figure: the movies placed in the 2-D factor space — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]


RECSYS 59

Recap: SVD
 Remember SVD:
 A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values
 A_{m×n} ≈ U Σ V^T
 SVD gives minimum reconstruction error (SSE):
 min_{U,V,Σ} Σ_{ij} (A_ij − [U Σ V^T]_ij)²
 (the sum goes over all entries — but our R has missing entries)
 So in our case, "SVD" on Netflix data: R ≈ Q·P^T
 A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i·p_x^T
 But we are not done yet! R has missing entries
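On a fully-observed toy matrix, the mapping A = R, Q = U, P^T = ΣV^T can be checked directly with numpy. The numbers are illustrative; real rating matrices have missing entries, which is exactly why the next slide switches to an optimization formulation.

```python
import numpy as np

R = np.array([[5.0, 4.0, 1.0],    # toy fully-observed rating matrix
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
f = 2                                 # keep f latent factors
Q = U[:, :f]                          # Q = U        (items -> factors)
Pt = np.diag(s[:f]) @ Vt[:f, :]       # P^T = Sigma V^T (factors -> users)
R_hat = Q @ Pt                        # best rank-f approximation in SSE

sse = ((R - R_hat) ** 2).sum()        # equals the sum of the dropped sigma^2
```

Truncating to f factors costs exactly the squared singular values that were dropped, which is the "minimum reconstruction error" claim above.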

RECSYS 60

Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_x^T)²
 Note:
 • We don't require cols of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • The most popular model among Netflix contestants
 r̂_xi = q_i·p_x^T
[Figure: R ≈ Q·P^T]

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63


RECSYS 64

The Effect of Regularization

min_factors "error" + λ "length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

[Figure: movies plotted on the two latent factors (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny); as λ grows, points with scarce data shrink toward the origin. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • Where ∇Q is the gradient/derivative of matrix Q:
    ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
  • P ← P − η · ∇P
  • Q ← Q − η · ∇Q
  • Where ∇Q is the gradient/derivative of matrix Q:
    ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
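The batch update above can be sketched as follows, assuming ratings live in a small dense NumPy array with np.nan for missing entries (all names, sizes, and hyperparameters here are illustrative):

```python
import numpy as np

def batch_gd_step(R, P, Q, eta, lam):
    """One full-gradient step for min sum_(x,i) (r_xi - p_x·q_i)^2
    + lam*(||P||^2 + ||Q||^2); missing ratings (np.nan) contribute
    nothing to the gradient."""
    E = np.nan_to_num(R - P @ Q.T)       # error matrix, 0 on missing entries
    grad_P = -2 * E @ Q + 2 * lam * P    # d/dP of the objective
    grad_Q = -2 * E.T @ P + 2 * lam * Q  # d/dQ of the objective
    return P - eta * grad_P, Q - eta * grad_Q

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(6, 5)).astype(float)
R[R < 2] = np.nan                        # knock out some ratings
P, Q = rng.normal(size=(6, 2)), rng.normal(size=(5, 2))
for _ in range(500):                     # every step touches ALL ratings: slow
    P, Q = batch_gd_step(R, P, Q, eta=0.002, lam=0.1)
```

Each step costs a full pass over every known rating, which is exactly the slowness the slide points out.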

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if] where
  ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
  • Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η · Σ_{(x,i)} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − η · ∇Q(r_xi)
 Faster convergence!
  • Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step.
SGD improves the value, but in a "noisy" way.
GD takes fewer steps to converge, but each step takes much longer to compute.
In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors.
 For each r_xi:
  • ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
  • q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

 2 for loops:
  For until convergence:
   • For each r_xi: compute gradient, do a "step"

η … learning rate
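The two loops above can be sketched directly. This is an illustrative toy (the ratings, sizes, and hyperparameters are made up, and a fixed epoch count stands in for a real convergence test):

```python
import numpy as np

def sgd_train(ratings, n_users, n_items, f=2, eta=0.02, lam=0.02,
              epochs=500, seed=0):
    """SGD with the slide's per-rating updates:
       eps_xi = r_xi - q_i·p_x
       q_i <- q_i + eta*(eps_xi*p_x - lam*q_i)
       p_x <- p_x + eta*(eps_xi*q_i - lam*p_x)
    ratings: list of (user x, item i, r_xi) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, f))   # small random init
    Q = 0.1 * rng.normal(size=(n_items, f))
    for _ in range(epochs):                   # outer loop: "until convergence"
        for x, i, r in ratings:               # inner loop: one cheap step per rating
            eps = r - Q[i] @ P[x]
            Q[i] += eta * (eps * P[x] - lam * Q[i])
            P[x] += eta * (eps * Q[i] - lam * P[x])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]
P, Q = sgd_train(ratings, n_users=3, n_items=3)
print(f"prediction for user 0, item 0: {Q[0] @ P[0]:.2f}")  # true rating 5.0
```

Note that each inner step touches only one rating, so steps are cheap even when the rating set is huge.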

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
 Quantify goodness using SSE: so lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods

[Figure: example utility matrix with known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data (user, movie, score) vs. test data (user, movie, ?)]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
 (user bias + movie bias + user-movie interaction)
 μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

User-Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
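In code the baseline prediction is just the sum of the three terms (values taken from the example above):

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # user bias: a critical reviewer
b_i = 0.5   # movie bias: Star Wars rated above the average movie
r_hat = mu + b_x + b_i
print(round(r_hat, 1))  # -> 3.2
```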

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))²   (goodness of fit)
          + λ [Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²]   (regularization)

 Stochastic gradient descent to find parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

λ is selected via grid-search on a validation set

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
  • Overall deviations of users/movies
 Factorization:
  • Addressing "regional" effects
 Collaborative filtering:
  • Extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in Netflix GUI
 Meaning of rating changed

Movie age: ratings rise with movie age; two candidate explanations:
 Users prefer new movies without any reason
 Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model:
 r̂_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases:
 r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make parameters b_x and b_i depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
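One way to read bullet (2): a movie's bias becomes a base bias plus a per-bin offset, with the bin chosen by the rating's week. A hypothetical sketch (the bin offsets and week numbers below are made up for illustration):

```python
def movie_bias(t_weeks, b_i, bin_bias):
    """Time-dependent movie bias b_i(t) = b_i + b_{i,Bin(t)},
    where each bin covers 10 consecutive weeks."""
    return b_i + bin_bias[t_weeks // 10]

bin_offsets = [0.0, 0.1, 0.3]  # hypothetical offsets for weeks 0-9, 10-19, 20-29
print(movie_bias(25, b_i=0.5, bin_bias=bin_offsets))  # week 25 falls in bin 2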

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

Grand Prize: 0.8563

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE, from ~0.889 down to ~0.871) vs. number of predictors (1-50); RMSE keeps falling, with diminishing returns, as predictors are added to the ensemble]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 … and everyone waits …

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 29

Similarity Metric

Intuitively we want: sim(A, B) > sim(A, C)

Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.386 > 0.322
 Considers missing ratings as "negative"
 Solution: subtract the (row) mean

Centered cosine: sim(A,B) vs. sim(A,C): 0.092 > −0.559
Notice: cosine similarity is correlation when the data is centered at 0

RECSYS 30

Rating Predictions

Let r_x be the vector of user x's ratings

Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
 r̂_xi = (1/k) Σ_{y∈N} r_yi
Other option:
 r̂_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Many other tricks possible…

Shorthand: s_xy = sim(x, y)
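The similarity-weighted variant can be sketched as follows (the neighbors, similarities, and ratings below are made-up values):

```python
def predict(sims, neighbor_ratings):
    """r_xi = sum_{y in N} s_xy * r_yi / sum_{y in N} s_xy
    sims: {neighbor y: s_xy}; neighbor_ratings: {neighbor y: r_yi}."""
    num = sum(s * neighbor_ratings[y] for y, s in sims.items())
    return num / sum(sims.values())

# hypothetical neighborhood N of k=3 users similar to x who rated item i
sims = {"u1": 0.9, "u2": 0.5, "u3": 0.1}
ratings_i = {"u1": 5, "u2": 3, "u3": 1}
print(round(predict(sims, ratings_i), 2))  # -> 4.07
```

Weighting by similarity lets near-duplicates of user x dominate the estimate instead of counting every neighbor equally.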

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far: User-user collaborative filtering

Another view: Item-item
 For item i, find other similar items
 Estimate the rating for item i based on the target user's ratings on items similar to item i
 Can use the same similarity metrics and prediction functions as in the user-user model:

 r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Example: a 6-movie × 12-user utility matrix with ratings 1-5 (unknown entries blank). Goal: estimate the rating of movie 1 by user 5.]

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
 1) Subtract the mean rating m_i from each movie i, e.g.
    m_1 = (1+3+5+5+4)/5 = 3.6
    row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
 2) Compute cosine similarities between rows:
    sim(1, m) = [1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Compute similarity weights:
 s_13 = 0.41, s_16 = 0.59

Predict by taking the weighted average:
 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j

Select K nearest neighbors N(i; x):
 Set of items most similar to item i that were rated by x

Estimate rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
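A sketch of this estimate, plugged with the numbers from the worked example a few slides back (s_13 = 0.41, s_16 = 0.59, and user 5's ratings r_53 = 2, r_56 = 3):

```python
def item_item_predict(sims, user_ratings):
    """r_xi = sum_{j in N(i;x)} s_ij * r_xj / sum_{j in N(i;x)} s_ij"""
    num = sum(s * user_ratings[j] for j, s in sims.items())
    return num / sum(sims.values())

sims = {3: 0.41, 6: 0.59}     # similarities of movies 3 and 6 to movie 1
ratings_user5 = {3: 2, 6: 3}  # user 5's ratings of those movies
print(round(item_item_predict(sims, ratings_user5), 1))  # -> 2.6
```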

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David × items Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item
 No feature selection needed

− Cold Start:
 Need enough users in the system to find a match
− Sparsity:
 The user/ratings matrix is sparse
 Hard to find users that have rated the same items
− First rater:
 Cannot recommend an item that has not been previously rated
 New items, esoteric items
− Popularity bias:
 Cannot recommend items to someone with unique taste
 Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Figure: movies × users utility matrix with known ratings 1-5]

RECSYS 43

Evaluation

[Figure: the same utility matrix with some ratings withheld as a test data set]

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings

 Root-mean-square error (RMSE):
  RMSE = sqrt( (1/N) Σ_{xi} (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of x on i

 Precision at top 10:
  • % of relevant items among those in the top 10

 Rank Correlation:
  • Spearman's correlation between the system's and the user's complete rankings

 Another approach: 0/1 model
  Coverage:
  • Number of items/users for which the system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve:
  Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
  • TPR (aka Recall) = TP / P = TP / (TP + FN)
  • FPR = FP / N = FP / (FP + TN)
  • See https://en.wikipedia.org/wiki/Precision_and_recall
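RMSE, the main Netflix Prize criterion, is simple to compute over a held-out test set (the toy predictions below are illustrative):

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the (user, item) pairs in the test set."""
    se = [(pred[k] - truth[k]) ** 2 for k in truth]
    return math.sqrt(sum(se) / len(se))

truth = {("x1", "i1"): 4.0, ("x1", "i2"): 2.0, ("x2", "i1"): 5.0}
pred  = {("x1", "i1"): 3.5, ("x1", "i2"): 2.5, ("x2", "i1"): 4.0}
print(round(rmse(pred, truth), 4))  # -> 0.7071
```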

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding the k most similar customers: O(|X|)
 Recall that X = set of customers in the system

Too expensive to do at runtime
 Could pre-compute using clustering as an approximation

Naïve pre-computation takes time O(N · |C|)
 |C| = # of clusters = k in the k-means; N = # of data points

We already know how to do this!
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

RECSYS 47

Tip: Add Data! Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data
 e.g., add IMDB data on genres

More data beats better algorithms!
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies embedded in a 2-D latent space; axis 1: geared towards females ↔ geared towards males, axis 2: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

[Figure: Matrix R, 480,000 users × 17,700 movies, sparsely filled with ratings 1-5]

RECSYS 52

Utility Matrix R Evaluation

[Figure: Matrix R (480,000 users × 17,700 movies) split into a training data set and a held-out test data set]

SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)²
 r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
 R has missing entries, but let's ignore that for now!
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (users × items) ≈ Q (items × f factors) · PT (f factors × users); cf. SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of PT

[Figure: the missing entry (estimated as 2.4) is obtained as the dot product of row i of Q and column x of PT]

RECSYS 57

Latent Factor Models

[Figure: movies placed on the two latent factors (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny). Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap SVD

Remember SVD:
 A: input data matrix (m × n)
 U: left singular vectors
 V: right singular vectors
 Σ: singular values
 A ≈ U Σ V^T

SVD gives minimum reconstruction error (SSE):
 min_{U,V,Σ} Σ_{ij} (A_ij − [U Σ V^T]_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT
 A = R, Q = U, PT = Σ V^T
 r̂_xi = q_i · p_x^T

But we are not done yet! R has missing entries.
 The sum goes over all entries, but our R has missing entries.
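On a fully observed toy matrix the claim is easy to check with NumPy: truncating the SVD to the top f singular values gives the best rank-f fit, so the reconstruction SSE can only drop as f grows (the matrix values here are illustrative):

```python
import numpy as np

# Toy fully observed matrix; with no missing entries, plain SVD
# gives the best rank-f approximation in the SSE sense.
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_sse(f):
    """SSE of the rank-f truncated reconstruction A ~ U_f S_f V_f^T."""
    A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]
    return float(np.sum((A - A_f) ** 2))

print(rank_sse(2) <= rank_sse(1))  # -> True: more factors, lower SSE
```

This only works because A is fully observed, which is exactly the gap the next slide's specialized methods address.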

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

The Effect of Regularization

[Figure: movies plotted in the latent space (Factor 1: "geared towards females" ↔ "geared towards males"; Factor 2: "serious" ↔ "funny"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

min over factors of "error" + λ·"length":
min_{P,Q} Σ_(x,i)∈training (r_xi − q_i·p_x^T)² + λ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 65

The Effect of Regularization

[Figure: second animation frame of the same latent-space plot (Factor 1 vs. Factor 2, "geared towards females/males" × "serious/funny") — the movie factor estimates pulled toward the origin by the regularizer]

min over factors of "error" + λ·"length"

RECSYS 66

The Effect of Regularization

[Figure: third animation frame of the latent-space plot — factor estimates continue to move as training proceeds under regularization]

min over factors of "error" + λ·"length"

RECSYS 67

The Effect of Regularization

[Figure: final animation frame of the latent-space plot — the settled positions of the movies in the regularized factor space]

min over factors of "error" + λ·"length"

RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:
 min_{P,Q} Σ_(x,i)∈training (r_xi − q_i·p_x^T)² + λ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • where ∇Q is the gradient/derivative of matrix Q:
   ∇Q = [∇q_if], and ∇q_if = Σ_(x,i) −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow

 How to compute the gradient of a matrix? Compute the gradient of every element independently!
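A minimal sketch of one batch gradient-descent step for this objective, assuming illustrative toy data and names (`gd_step`, the `ratings` dict, and the seed are not from the lecture):

```python
import numpy as np

# P is f x users, Q is items x f; only the known entries of R (stored in
# `ratings`) contribute to the error gradient, the "length" term touches all.

def gd_step(ratings, P, Q, lam, eta):
    gP, gQ = 2 * lam * P, 2 * lam * Q              # gradient of the "length" term
    for (x, i), r in ratings.items():
        err = r - Q[i] @ P[:, x]
        gQ[i] += -2 * err * P[:, x]                # d/dq_i of the squared error
        gP[:, x] += -2 * err * Q[i]                # d/dp_x of the squared error
    return P - eta * gP, Q - eta * gQ

rng = np.random.default_rng(1)
ratings = {(0, 0): 5.0, (1, 1): 4.0, (2, 0): 3.0, (0, 2): 1.0}
P, Q = rng.normal(size=(2, 3)), rng.normal(size=(4, 2))
for _ in range(500):
    P, Q = gd_step(ratings, P, Q, lam=0.01, eta=0.01)
print(Q[0] @ P[:, 0])   # moves toward the known rating r_00 = 5
```

Every step walks over all known ratings before updating — which is exactly the slowness the slide observes.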

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:
 min_{P,Q} Σ_(x,i)∈training (r_xi − q_i·p_x^T)² + λ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • where ∇Q is the gradient/derivative of matrix Q:
   ∇Q = [∇q_if], and ∇q_if = Σ_(x,i) −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow

 How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
   ∇q_if = Σ_(x,i) −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_(x,i) ∇Q(r_xi)
 • Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η·Σ_(x,i) ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − η·∇Q(r_xi)
 Faster convergence!
 • Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function (y-axis) vs. iteration/step (x-axis); GD decreases smoothly, SGD decreases in a noisy zig-zag]

GD improves the value of the objective function at every step.
SGD improves the value, but in a "noisy" way.
GD takes fewer steps to converge, but each step takes much longer to compute.
In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors.
 For each r_xi:
 • ε_xi = r_xi − q_i·p_x^T  (derivative of the "error")
 • q_i ← q_i + η(ε_xi·p_x − λ·q_i)  (update equation)
 • p_x ← p_x + η(ε_xi·q_i − λ·p_x)  (update equation)

 2 for loops:
 For until convergence:
 • For each r_xi:
 • Compute gradient, do a "step"

η … learning rate
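The two loops above can be sketched as follows — a minimal illustration with hypothetical toy data and names (`sgd_train` and the constants are assumptions, not from the slides):

```python
import numpy as np

# SGD following the update equations above:
#   eps_xi = r_xi - q_i . p_x
#   q_i <- q_i + eta * (eps_xi * p_x - lam * q_i)
#   p_x <- p_x + eta * (eps_xi * q_i - lam * p_x)

def sgd_train(ratings, n_users, n_items, f=2, eta=0.02, lam=0.05, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(f, n_users))   # small random init instead of SVD
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):                        # "for until convergence"
        for (x, i), r in ratings.items():          # "for each r_xi"
            eps = r - Q[i] @ P[:, x]
            qi = Q[i].copy()                       # use the old q_i in p_x's update
            Q[i] += eta * (eps * P[:, x] - lam * Q[i])
            P[:, x] += eta * (eps * qi - lam * P[:, x])
    return P, Q

ratings = {(0, 0): 5.0, (1, 1): 4.0, (2, 0): 3.0, (0, 2): 1.0}
P, Q = sgd_train(ratings, n_users=3, n_items=4)
print(Q[0] @ P[:, 0])   # prediction for (user 0, item 0); true rating is 5
```

Each step touches only one rating, which is why SGD's steps are so much cheaper than batch GD's.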

RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 So let's set the values of P and Q such that they work well on the known (user, item) ratings
 And hope these values of P and Q will predict the unknown ratings well

 This is a case where we apply Optimization methods

[Figure: the utility matrix with its known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data and test data, columns (user, movie, score); test-set scores withheld]

Movie rating data

Overall rating distribution

• A third of the ratings are 4s
• Average rating is 3.68

(From TimelyDevelopment.com)

#ratings per movie

• Avg #ratings/movie: 5627

#ratings per user

• Avg #ratings/user: 208

Average movie rating by movie count

• More ratings go to better movies

(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T

μ = overall mean rating;  b_x = bias of user x;  b_i = bias of movie i
(user bias + movie bias + user-movie interaction)

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

 Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
   = 3.7 − 1 + 0.5 = 3.2
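The worked example above, as a trivial sketch (values taken from the slide):

```python
mu, b_x, b_i = 3.7, -1.0, 0.5   # global mean, user bias, movie bias
baseline = mu + b_x + b_i       # baseline predictor: no interaction term yet
print(baseline)                 # 3.2
```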

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_(x,i)∈R ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²              ← goodness of fit
            + λ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )           ← regularization

 Stochastic gradient descent to find the parameters
 Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)

 λ is selected via grid-search on a validation set
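A sketch of SGD for this full model — biases and factors all updated per rating. The function name, toy data, and constants are illustrative assumptions, not the BellKor implementation:

```python
import numpy as np

def sgd_biased(ratings, n_users, n_items, f=2, eta=0.02, lam=0.05, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    mu = float(np.mean(list(ratings.values())))     # global mean rating
    bx, bi = np.zeros(n_users), np.zeros(n_items)   # user / item biases
    P = rng.normal(scale=0.1, size=(f, n_users))
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            eps = r - (mu + bx[x] + bi[i] + Q[i] @ P[:, x])
            bx[x] += eta * (eps - lam * bx[x])       # bias updates mirror the
            bi[i] += eta * (eps - lam * bi[i])       # factor updates
            qi = Q[i].copy()
            Q[i] += eta * (eps * P[:, x] - lam * Q[i])
            P[:, x] += eta * (eps * qi - lam * P[:, x])
    return mu, bx, bi, P, Q

ratings = {(0, 0): 5.0, (1, 1): 4.0, (2, 0): 3.0, (0, 2): 1.0}
mu, bx, bi, P, Q = sgd_biased(ratings, n_users=3, n_items=4)
print(mu + bx[0] + bi[0] + Q[0] @ P[:, 0])   # prediction for (user 0, item 0)
```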

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

[Diagram: Global effects → Factorization → Collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (y-axis, 0.885-0.92) vs. millions of parameters (x-axis, 1-1000, log scale) for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
 Improvements in the Netflix GUI
 Meaning of rating changed

 Movie age:
 Users prefer new movies, without any particular reason
 Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases &amp; Factors

 Original model:
 r_xi = μ + b_x + b_i + q_i·p_x^T

 Add time dependence to the biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

 Make the parameters b_x and b_i depend on time:
 (1) Parameterize the time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to the factors:
 p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
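A minimal sketch of the binned time-dependent item bias b_i(t) described above: each bin covers 10 consecutive weeks, and each (item, bin) pair gets its own bias parameter. Names and values are illustrative assumptions.

```python
WEEKS_PER_BIN = 10

def time_bin(week):
    """Map an absolute week index to its bias bin (10 consecutive weeks per bin)."""
    return week // WEEKS_PER_BIN

def item_bias(bi_static, bi_bins, i, week):
    """b_i(t) = static item bias + the bias of the bin containing week t."""
    return bi_static[i] + bi_bins[i][time_bin(week)]

bi_static = [0.5]                 # hypothetical: item 0 is rated 0.5 above average
bi_bins = [[0.0, -0.2, 0.1]]      # ...but dips in weeks 10-19, rises in weeks 20-29
print(item_bias(bi_static, bi_bins, i=0, week=14))   # 0.5 + (-0.2) = 0.3
```

The bin biases are extra parameters fit by the same SGD as everything else.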

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (y-axis, 0.875-0.92) vs. millions of parameters (x-axis, 1-10000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876

Still no prize!  Getting desperate...
Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: Error (RMSE, y-axis) vs. number of predictors blended (x-axis, 1-50); RMSE falls from about 0.889 with one predictor to about 0.871 with 50, with strongly diminishing returns]
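As a sketch, "blending" an ensemble amounts to a weighted average of the individual predictors' outputs (uniform weights here for illustration; the actual teams fit the weights):

```python
def blend(predictions, weights=None):
    """Combine an ensemble's rating predictions into one score."""
    if weights is None:
        weights = [1 / len(predictions)] * len(predictions)   # uniform blend
    return sum(w * p for w, p in zip(weights, predictions))

# three hypothetical predictors' ratings for the same (user, movie) pair
print(blend([3.8, 4.2, 4.0]))
```

Each added predictor contributes less than the previous one — matching the diminishing returns in the plot above.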

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also gets a qualifying score over 10%

 BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

 Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

 24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

 Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com/

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 28: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 30

Rating Predictions

Let rx be the vector of user xrsquos ratings

Let N be the set of k users most similar to xwho have rated item i

Prediction for item i of user x 119903119903119909119909119909119909 = 1

119896119896sum119910119910isin119873119873 119903119903119910119910119909119909

119903119903119909119909119909119909 =sum119910119910isin119873119873 119904119904119909119909119910119910sdot119903119903119910119910119910119910sum119910119910isin119873119873 119904119904119909119909119910119910

Other options

Many other tricks possiblehellip1162019

Shorthand119956119956119961119961119961119961 = 119956119956119946119946119956119956 119961119961119961119961

RECSYS 31

Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering

So far User-user collaborative filtering

Another view Item-item For item i find other similar items Estimate rating for item i based

on the target userrsquos ratings on items similar to item i Can use same similarity metrics and

prediction functions as in user-user model

1162019

sumsum

isin

isinsdot

=)(

)(

xiNj ij

xiNj xjijxi s

rsr

sijhellip similarity of items i and jrxjhelliprating of user x on item jN(ix)hellip set items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

121110987654321

455311

3124452

534321423

245424

5224345

423316

users

mov

ies

- unknown rating - rating between 1 to 5

RECSYS 33

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

- estimate rating of movie 1 by user 5

1162019

mov

ies

RECSYS 34

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Neighbor selectionIdentify movies similar to movie 1 rated by user 5

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i

m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]

2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Compute similarity weightss13=041 s16=059

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

RECSYS 36

Item-Item CF (|N|=2)

121110987654321

45526311

3124452

534321423

245424

5224345

423316

users

Predict by taking weighted average

r15 = (0412 + 0593) (041+059) = 26

mov

ies

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

sumsum

isin

isinsdot

=)(

)(ˆxiNj ij

xiNj xjijxi s

rsr

N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j

RECSYS 38

Item-Item vs User-User

04180109

0305081

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

1162019

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i

Precision at top 10 bull of those in top 10

Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete

rankings Another approach 01 model

Coveragebull Number of itemsusers for which system can make predictions

Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve

Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64-67

The Effect of Regularization (four animation frames)

min_factors "error" + λ "length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

[Figure, repeated across slides 64-67: movies plotted on Factor 1 (geared towards females to geared towards males) and Factor 2 (serious to funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0). Then do gradient descent:
• P ← P - η·∇P
• Q ← Q - η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
• here q_if is entry f of row q_i of matrix Q

Observation: computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0). Then do gradient descent:
• P ← P - η·∇P
• Q ← Q - η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
• here q_if is entry f of row q_i of matrix Q

Observation: computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• here q_if is entry f of row q_i of matrix Q

GD: Q ← Q - η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q - η ∇Q(r_xi)

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence!
• Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi - q_i·p_x^T (the "error")
• q_i ← q_i + η (ε_xi p_x - λ q_i) (update equation)
• p_x ← p_x + η (ε_xi q_i - λ p_x) (update equation)

2 for loops: for until convergence:
• for each r_xi:
• compute the gradient, do a "step"

η … learning rate
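The update loop above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the toy ratings, the learning rate eta and the regularization lam are made-up values, not numbers from the slides, and the two in-place updates use the already-updated q_i for the p_x step (a common simplification of the simultaneous update).

```python
import numpy as np

# Minimal SGD sketch for the latent-factor model (toy data; eta and lam
# are illustrative hyperparameters, not values from the slides).
rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user x, item i, r_xi)
n_users, n_items, n_factors = 3, 2, 2
eta, lam = 0.02, 0.1

P = 0.1 * rng.standard_normal((n_users, n_factors))  # user factors p_x
Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item factors q_i

for _ in range(1000):                 # iterate over the ratings many times
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]         # eps_xi = r_xi - q_i . p_x
        Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update equation
        P[x] += eta * (err * Q[i] - lam * P[x])   # p_x update equation

sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
```

After enough passes the training SSE drops far below its initial value; with regularization the fit deliberately does not become exact.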

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen. So let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Figure: the sparse utility matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

[Magazine cover: "We Know What You Ought To Be Watching This Summer"]

Netflix Prize
• Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
• Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
• Competition
  - 2700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[Tables: samples of the training data (user, movie, score) and the test data (user, movie, score withheld)]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68

(From TimelyDevelopment.com)

#ratings per movie
• Avg. #ratings/movie: 5627

#ratings per user
• Avg. #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T   (user bias + movie bias + user-movie interaction)
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

Baseline predictor (μ + b_x + b_i):
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

User-movie interaction (q_i·p_x^T):
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor: we have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example: overall mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

3.7 - 1 + 0.5 = 3.2
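The same arithmetic as a tiny function, using the slide's own example numbers (the function name `baseline` is just an illustrative label):

```python
# Baseline predictor b_xi = mu + b_x + b_i, with the slide's example values.
mu = 3.7     # overall mean rating
b_x = -1.0   # critical reviewer: 1 star below the mean
b_i = 0.5    # Star Wars: 0.5 above the average movie

def baseline(mu, b_user, b_item):
    """Baseline estimate before adding any user-movie interaction term."""
    return mu + b_user + b_item

prediction = baseline(mu, b_x, b_i)   # 3.7 - 1 + 0.5 = 3.2
```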

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{P,Q,b} Σ_{(x,i)∈R} (r_xi - (μ + b_x + b_i + q_i·p_x^T))²   [goodness of fit]
            + λ [Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²]   [regularization]

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).

λ is selected via grid-search on a validation set
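A hedged sketch of fitting this biased model with SGD, treating the biases as learned parameters as the slide says. The toy ratings and the hyperparameters eta and lam are illustrative assumptions, not values from the competition:

```python
import numpy as np

# SGD for r_xi ~ mu + b_x + b_i + q_i . p_x (toy data; eta, lam illustrative).
rng = np.random.default_rng(1)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
n_users, n_items, f = 3, 2, 2
eta, lam = 0.02, 0.05

mu = np.mean([r for _, _, r in ratings])   # overall mean rating
b_user = np.zeros(n_users)                 # user biases b_x (learned)
b_item = np.zeros(n_items)                 # item biases b_i (learned)
P = 0.1 * rng.standard_normal((n_users, f))
Q = 0.1 * rng.standard_normal((n_items, f))

def predict(x, i):
    return mu + b_user[x] + b_item[i] + Q[i] @ P[x]

for _ in range(1000):
    for x, i, r in ratings:
        e = r - predict(x, i)
        b_user[x] += eta * (e - lam * b_user[x])
        b_item[i] += eta * (e - lam * b_item[i])
        Q[i] += eta * (e * P[x] - lam * Q[i])
        P[x] += eta * (e * Q[i] - lam * P[x])

sse = sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
```

In practice λ would be chosen by grid-search on a held-out validation set, as the slide notes; here it is fixed for brevity.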

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods (RMSE)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
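One simple way to realize the "10 consecutive weeks per bin" idea is a binned bias lookup. This is a sketch under assumptions: the helper names and the per-bin offsets are made up for illustration, not Koren's actual parameterization:

```python
# Time-binned item bias b_i(t): each bin spans 10 consecutive weeks.
WEEKS_PER_BIN = 10
DAYS_PER_BIN = 7 * WEEKS_PER_BIN   # 70 days

def bin_index(day, first_day=0):
    """Map a rating's day number to its 10-week bias bin."""
    return (day - first_day) // DAYS_PER_BIN

def item_bias(b_i_static, b_i_bins, day):
    """b_i(t) = static item bias + the offset of the bin containing day t."""
    return b_i_static + b_i_bins[bin_index(day)]

bins = [0.0, 0.2, -0.1]            # made-up per-bin offsets
b = item_bias(0.5, bins, day=75)   # day 75 falls in bin 1, so 0.5 + 0.2
```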

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods (RMSE)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE), from about 0.889 down to about 0.871, vs. number of predictors (1 to 50)]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team. Relies on combining their models. Quickly also gets a qualifying score over 10%.

BellKor: continue to get small improvements in their scores. Realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions.
• This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day: only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

1162019

Data behind the "Effect of ensemble size" chart (number of predictors, RMSE):

1 0.889067    11 0.873997    21 0.872563    31 0.871898    41 0.871452
2 0.880586    12 0.873691    22 0.872468    32 0.871830    42 0.871428
3 0.878345    13 0.873569    23 0.872382    33 0.871756    43 0.871398
4 0.877715    14 0.873352    24 0.872275    34 0.871722    44 0.871346
5 0.876668    15 0.873212    25 0.872220    35 0.871665    45 0.871312
6 0.876179    16 0.873035    26 0.872201    36 0.871632    46 0.871287
7 0.875557    17 0.872942    27 0.872135    37 0.871585    47 0.871262
8 0.875013    18 0.872830    28 0.872086    38 0.871558    48 0.871242
9 0.874477    19 0.872748    29 0.872021    39 0.871524    49 0.871236
10 0.874209   20 0.872654    30 0.871945    40 0.871482    50 0.871197

RECSYS 31

Another type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering

So far: user-user collaborative filtering. Another view: item-item. For item i, find other similar items. Estimate the rating for item i based on the target user's ratings on items similar to item i. Can use the same similarity metrics and prediction functions as in the user-user model:

r_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

RECSYS 32

Item-Item CF (|N|=2)

[Figure: a 6-movie by 12-user utility matrix; known ratings between 1 and 5, unknown ratings blank]

RECSYS 33

Item-Item CF (|N|=2)

[Figure: the same utility matrix; goal: estimate the rating of movie 1 by user 5]

RECSYS 34

Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i:
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

[Figure: utility matrix annotated with sim(1,m) for each movie m: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59]
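The two-step similarity above (subtract each movie's mean over its known ratings, then take cosine) can be sketched directly; `None` marks an unknown rating and becomes 0 after centering. The helper names are illustrative:

```python
import math

# Mean-centered cosine (Pearson-style) similarity between item rows.
def centered(row):
    """Subtract the row's mean over its known entries; unknowns become 0."""
    rated = [v for v in row if v is not None]
    m = sum(rated) / len(rated)
    return [(v - m) if v is not None else 0.0 for v in row]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb)

# Row of movie 1 from the slide: ratings 1, 3, 5, 5, 4 (mean 3.6).
row1 = [1, None, 3, None, None, 5, None, None, 5, None, 4, None]
c1 = centered(row1)   # [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
```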

RECSYS 35

Item-Item CF (|N|=2)

Compute the similarity weights: s_13 = 0.41, s_16 = 0.59

[Figure: utility matrix with the sim(1,m) column: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59]

RECSYS 36

Item-Item CF (|N|=2)

Predict by taking a weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

[Figure: utility matrix with the predicted entry 2.6 filled in for movie 1, user 5]

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity s_ij of items i and j.

Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating r_xi as the weighted average:

r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
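The weighted-average formula can be written as a small function; the similarities and ratings below are the worked numbers from the earlier movie-1/user-5 example, and the function name is an illustrative choice:

```python
# Item-item prediction: r_xi = sum_j s_ij * r_xj / sum_j s_ij over N(i; x).
def predict_item_item(neighbor_sims, neighbor_ratings):
    """Weighted average of the user's ratings on the K most similar items."""
    num = sum(s * r for s, r in zip(neighbor_sims, neighbor_ratings))
    den = sum(neighbor_sims)
    return num / den

# From the worked example: s_13 = 0.41, s_16 = 0.59;
# user 5 rated movie 3 as 2 and movie 6 as 3.
r_15 = predict_item_item([0.41, 0.59], [2, 3])   # = 2.59, i.e. about 2.6
```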

RECSYS 38

Item-Item vs User-User

[Figure: example utility matrix over Avatar, LOTR, Matrix, Pirates for users Alice, Bob, Carol, David]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods: implement two or more different recommenders and combine their predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering: item profiles for the new-item problem; demographics to deal with the new-user problem.

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Figure: a users by movies utility matrix with known ratings 1 to 5]

RECSYS 43

Evaluation

[Figure: the same utility matrix with some entries withheld as the test data set]

RECSYS 44

Evaluating Predictions: compare predictions with known ratings

Root-mean-square error (RMSE):
• RMSE = sqrt( (1/|T|) Σ_{(x,i)∈T} (r̂_xi - r_xi)² ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i

Precision at top 10:
• % of relevant items among those in the top 10

Rank correlation:
• Spearman's correlation between the system's and the user's complete rankings

Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve: Y-axis is the True Positive Rate (TPR), X-axis is the False Positive Rate (FPR)
  - TPR (aka Recall) = TP / P = TP / (TP + FN)
  - FPR = FP / N = FP / (FP + TN)
  - See https://en.wikipedia.org/wiki/Precision_and_recall
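A small illustration of RMSE and precision on made-up predictions; the "relevant if rating >= 4" threshold is an assumption for the example, not part of the slide:

```python
import math

# RMSE and precision on toy predictions vs. true ratings.
true_r = [5.0, 3.0, 4.0, 1.0]
pred_r = [4.5, 3.5, 4.0, 2.0]

# RMSE = sqrt(mean squared error) over the test pairs.
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_r, true_r)) / len(true_r))

# Binarize: an item counts as "relevant" if its rating is >= 4 (assumed).
tp = sum(1 for p, t in zip(pred_r, true_r) if p >= 4 and t >= 4)  # true positives
fp = sum(1 for p, t in zip(pred_r, true_r) if p >= 4 and t < 4)   # false positives
precision = tp / (tp + fp)   # TP / (TP + FP)
```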

RECSYS 45

Problems with Error Measures: a narrow focus on accuracy sometimes misses the point
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others.

1162019

RECSYS 46

Collaborative Filtering: Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.

Too expensive to do at runtime: could pre-compute, using clustering as an approximation. Naive pre-computation takes time O(N·|C|), where |C| = # of clusters (the k in k-means) and N = # of data points.

We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction.

1162019

RECSYS 47

Tip: Add Data

Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work: simple methods on large data do best.

Add more data, e.g. add IMDB data on genres.

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

[Figure: movies in a two-factor space, "geared towards females" to "geared towards males" and "serious" to "funny": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Figure: the matrix R, 480,000 users by 17,700 movies, with sparse ratings 1 to 5]

RECSYS 52

Utility Matrix R: Evaluation

[Figure: matrix R (480,000 users by 17,700 movies) split into a training data set and a test data set; the highlighted test entry r_36 is to be predicted]

SSE = Σ_{(x,i)∈R} (r̂_xi - r_xi)²

r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q·P^T

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q·P^T. R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[Figure: the items by users matrix R factored into an items by f matrix Q times an f by users matrix P^T (f factors); SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors (slides 54-56)

How to estimate the missing rating of user x for item i?

r̂_xi = q_i·p_x^T = Σ_f q_if·p_xf

q_i = row i of Q; p_x = column x of P^T

[Figure, built up over three slides: R ≈ Q·P^T (items by users ≈ (items by f) · (f by users)); the unknown entry of R is estimated as 2.4 from the corresponding row of Q and column of P^T]

RECSYS 57

Latent Factor Models (slides 57-58)

[Figure: movies plotted on Factor 1 (geared towards females to geared towards males) and Factor 2 (serious to funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 59

Recap: SVD

Remember SVD: A is the input data matrix, U the left singular vectors, V the right singular vectors, Σ the singular values; A_{m×n} ≈ U Σ V^T

SVD gives the minimum reconstruction error (SSE):

min_{U,Σ,V} Σ_{ij} (A_ij - [U Σ V^T]_ij)²

So, in our case, "SVD" on Netflix data: R ≈ Q·P^T, with A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i·p_x^T

But we are not done yet! The sum above goes over all entries, and our R has missing entries.
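The recap can be checked numerically with NumPy. The toy matrix is made up; the identity used in the last line (truncation SSE equals the sum of the squared dropped singular values) is the standard SVD property the slide refers to:

```python
import numpy as np

# Full SVD and the best rank-k reconstruction in the SSE sense.
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 0.0],
              [1.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncation
sse = float(np.sum((A - A_k) ** 2))                # equals s[2]**2 here
```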

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} (r_xi - q_i·p_x^T)²

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

[Figure: R ≈ Q·P^T, items by users factored through f factors; r̂_xi = q_i·p_x^T]

RECSYS 61

Dealing with Missing Entries: want to minimize SSE for unseen test data

Idea: minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

λ … regularization parameter ("error" + λ "length")

[Figure: the sparse training utility matrix]

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Figure: movies and users (Gus, Dave) in a two-factor latent space; horizontal axis from "geared towards females" to "geared towards males", vertical axis from "serious" to "escapist"]

RECSYS 63

Dealing with Missing Entries: want to minimize SSE for unseen test data

Idea: minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

λ … regularization parameter

[Figure: the sparse training utility matrix]

RECSYS 64

The Effect of Regularization (four animation frames, slides 64-67)

min_factors "error" + λ "length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

[Figure, repeated across the frames: movies plotted on Factor 1 (geared towards females to geared towards males) and Factor 2 (serious to funny)]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi - q_i·p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent: Initialize P and Q (using SVD, pretend missing ratings are 0). Then do gradient descent:
• P ← P - η·∇P
• Q ← Q - η·∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} -2 (r_xi - q_i·p_x^T) p_xf + 2λ q_if
• here q_if is entry f of row q_i of matrix Q

Observation: computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

 Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
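The batch update above can be sketched in NumPy. This is a minimal illustration, not the slides' code: the toy matrix, learning rate η, regularization λ, and iteration count are my own choices.

```python
import numpy as np

# Toy utility matrix, items x users; 0 marks a missing rating (assumed data).
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
mask = R > 0                      # only known ratings contribute to the error

def objective(R, mask, P, Q, lam):
    # "error" + lambda * "length": SSE on known ratings plus regularization
    err = (((R - Q @ P.T) * mask) ** 2).sum()
    return err + lam * ((P ** 2).sum() + (Q ** 2).sum())

rng = np.random.default_rng(0)
f, lam, eta = 2, 0.1, 0.01        # factors, regularization, learning rate
n_items, n_users = R.shape
Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors (rows q_i)
P = rng.normal(scale=0.1, size=(n_users, f))   # user factors (rows p_x)

obj_start = objective(R, mask, P, Q, lam)
for _ in range(500):
    E = (R - Q @ P.T) * mask          # residuals on known ratings only
    grad_Q = -2 * E @ P + 2 * lam * Q # full (batch) gradient w.r.t. Q
    grad_P = -2 * E.T @ Q + 2 * lam * P
    Q -= eta * grad_Q                 # Q <- Q - eta * grad_Q
    P -= eta * grad_P                 # P <- P - eta * grad_P
obj_end = objective(R, mask, P, Q, lam)
```

Each iteration touches every known rating once, which is why the slide notes that computing gradients is slow on Netflix-scale data.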

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where

  ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)

  • Here q_if is entry f of row q_i of matrix Q
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
 SGD: Q ← Q − μ·∇Q(r_xi)
 Faster convergence!
  • Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

 Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

 GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way.
 GD takes fewer steps to converge, but each step takes much longer to compute.
 In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors.
 For each rxi:
  • ε_xi = r_xi − q_i·p_x^T  (derivative of the "error")
  • q_i ← q_i + η(ε_xi·p_x − λ·q_i)  (update equation)
  • p_x ← p_x + η(ε_xi·q_i − λ·p_x)  (update equation)

 2 for loops:
 For until convergence:
  • For each rxi:
   • Compute gradient, do a "step"

η … learning rate
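The two nested loops above can be sketched directly. A minimal illustration with assumed toy ratings and hyperparameters (none of these numbers come from the slides):

```python
import numpy as np

# Toy ratings as (user x, item i, rating) triples (assumed data).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
           (1, 0, 4.0), (1, 3, 1.0),
           (2, 0, 1.0), (2, 1, 1.0), (2, 3, 5.0),
           (3, 1, 1.0), (3, 2, 5.0), (3, 3, 4.0)]

rng = np.random.default_rng(0)
n_users, n_items, f = 4, 4, 2
eta, lam = 0.02, 0.1                          # learning rate, regularization
P = rng.normal(scale=0.1, size=(n_users, f))  # user factors p_x
Q = rng.normal(scale=0.1, size=(n_items, f))  # item factors q_i

def sse():
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

before = sse()
for epoch in range(200):           # outer loop: until "convergence"
    for x, i, r in ratings:        # inner loop: one step per individual rating
        e = r - Q[i] @ P[x]        # eps_xi, derivative of the "error"
        q_old = Q[i].copy()        # use the old q_i in both updates
        Q[i] += eta * (e * P[x] - lam * Q[i])
        P[x] += eta * (e * q_old - lam * P[x])
after = sse()
```

Each step looks at a single rating, so steps are cheap; as the slide notes, SGD needs more steps than GD but each one is much faster.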

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

 This is a case where we apply Optimization methods

[Figure: sparse utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample of training data (user, movie, score) and test data (user, movie; score withheld)]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

# ratings per movie
• Avg. ratings/movie: 5,627

# ratings per user
• Avg. ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

 r_xi = μ + b_x + b_i + q_i·p_x^T

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
 (user bias + movie bias + user-movie interaction)

 Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

 User-movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2

1162019
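The baseline predictor μ + b_x + b_i is easy to sketch. The toy ratings, user/movie names, and helper functions below are hypothetical, for illustration only; the final line reproduces the slide's worked example with the given biases.

```python
import statistics

# Hypothetical toy ratings: (user, movie) -> stars.
ratings = {("alice", "sw"): 5, ("alice", "titanic"): 2,
           ("bob", "sw"): 4, ("bob", "matrix"): 3,
           ("carol", "titanic"): 1, ("carol", "matrix"): 2}

mu = statistics.mean(ratings.values())   # overall mean rating

def user_bias(u):
    # b_x: how far user u's average rating sits from the global mean
    vals = [r for (x, _), r in ratings.items() if x == u]
    return statistics.mean(vals) - mu

def movie_bias(m):
    # b_i: how far movie m's average rating sits from the global mean
    vals = [r for (_, i), r in ratings.items() if i == m]
    return statistics.mean(vals) - mu

def baseline(u, m):
    return mu + user_bias(u) + movie_bias(m)

# The slide's example with given biases: mu = 3.7, b_x = -1, b_i = +0.5
pred = 3.7 - 1.0 + 0.5   # -> 3.2
```

Note that a baseline prediction can land outside the 1-5 star scale; in practice it is clipped.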

RECSYS 105

Fitting the New Model

 Solve:

 min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²   ["goodness of fit"]
            + λ(Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)   ["regularization"]

 λ is selected via grid-search on a validation set

 Stochastic gradient descent to find parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
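The SGD loop from earlier extends naturally to the bias terms: each update now also nudges b_x and b_i. A minimal sketch with assumed toy data and hyperparameters (not from the slides):

```python
import numpy as np

# Assumed toy ratings: (user x, item i, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users = n_items = 4
f, eta, lam = 2, 0.02, 0.1

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in ratings])   # overall mean rating
b_user = np.zeros(n_users)                 # b_x, estimated like any parameter
b_item = np.zeros(n_items)                 # b_i
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

def predict(x, i):
    return mu + b_user[x] + b_item[i] + Q[i] @ P[x]

def sse():
    return sum((r - predict(x, i)) ** 2 for x, i, r in ratings)

before = sse()
for _ in range(300):
    for x, i, r in ratings:
        e = r - predict(x, i)                       # residual
        b_user[x] += eta * (e - lam * b_user[x])    # bias updates
        b_item[i] += eta * (e - lam * b_item[i])
        q_old = Q[i].copy()
        Q[i] += eta * (e * P[x] - lam * Q[i])       # factor updates
        P[x] += eta * (e * q_old - lam * P[x])
after = sse()
```

In a real run λ would be chosen by the grid search on a held-out validation set mentioned above, not fixed in advance.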

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
  • Overall deviations of users/movies
 Factorization:
  • Addressing "regional" effects
 Collaborative filtering:
  • Extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
 Improvements in Netflix GUI
 Meaning of rating changed

 Movie age:
 Users prefer new movies without any reasons
 Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

 Original model:
 r_xi = μ + b_x + b_i + q_i·p_x^T

 Add time dependence to biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
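One common way to realize a time-dependent bias b_i(t) is a static bias plus a per-bin correction, with each bin covering 10 consecutive weeks as the slide describes. A minimal sketch; the epoch date, bin offsets, and function names are assumptions for illustration, not Koren's actual parameterization.

```python
from datetime import date

EPOCH = date(2000, 1, 1)   # hypothetical start of the rating period
BIN_DAYS = 70              # 10 consecutive weeks per bin

def time_bin(t: date) -> int:
    # Map a rating date to its 10-week bin index.
    return (t - EPOCH).days // BIN_DAYS

def movie_bias(b_i: float, bin_offsets: list, t: date) -> float:
    # b_i(t) = static bias + learned correction for the bin containing t
    return b_i + bin_offsets[time_bin(t)]
```

The bin offsets would be treated as extra parameters and fit by the same SGD that fits the other biases and factors.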

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize!  Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE, from ~0.889 down to ~0.871) vs. number of predictors (1-50) — error keeps decreasing as more predictors are added to the ensemble]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
 Group of other teams on leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

 BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

 Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

 24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

 Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data behind the "Effect of ensemble size" chart (predictors, RMSE):
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197

RECSYS 32

Item-Item CF (|N|=2)

movies × users (users 1-12):

movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; numbers = rating between 1 to 5

RECSYS 33

Item-Item CF (|N|=2)

[Same movies × users utility matrix]

- estimate the rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix, annotated with sim(1,m) for each movie m]

Neighbor selection: identify movies similar to movie 1, rated by user 5

sim(1,m): 1.00, −0.18, 0.41, −0.10, −0.31, 0.59

Here we use Pearson correlation as similarity:
1) Subtract mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix]

Compute similarity weights: s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix, with the estimated value 2.6 filled in for movie 1, user 5]

Predict by taking the weighted average:

r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

 Define similarity s_ij of items i and j

 Select K nearest neighbors (KNN) N(i; x):
 Set of items most similar to item i that were rated by x

 Estimate rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij·r_xj / Σ_{j∈N(i;x)} s_ij

 N(i;x) = set of items similar to item i that were rated by x
 s_ij = similarity of items i and j
 r_xj = rating of user x on item j
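The whole item-item pipeline — mean-centered (Pearson) similarity, neighbor selection, weighted average — fits in a short sketch. This is an illustration in my own code, reproducing the slides' worked example (movie 1, user 5) on the 6×12 toy matrix; function names are mine.

```python
import numpy as np

# The slides' 6-movie x 12-user utility matrix (0 = unknown rating).
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def pearson_sim(a, b):
    # Mean-center each row over its rated entries, then take cosine similarity.
    def center(v):
        rated = v > 0
        out = np.zeros_like(v)
        out[rated] = v[rated] - v[rated].mean()
        return out
    ca, cb = center(a), center(b)
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

def predict(R, i, x, k=2):
    # Weighted average over the k movies most similar to i that user x rated.
    sims = [(pearson_sim(R[i], R[j]), j) for j in range(R.shape[0])
            if j != i and R[j, x] > 0]
    top = sorted(sims, reverse=True)[:k]
    return sum(s * R[j, x] for s, j in top) / sum(s for s, _ in top)

r_15 = predict(R, i=0, x=4)   # movie 1, user 5 (0-indexed) -> about 2.6
```

Note the sketch assumes the top-k similarities sum to a positive value; production code would also clamp negative similarities and clip predictions to the rating scale.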

RECSYS 38

Item-Item vs User-User

[Example utility matrix: users Alice, Bob, Carol, David × movies Avatar, LOTR, Matrix, Pirates]

In practice it has been observed that item-itemoften works better than user-user

Why? Items are simpler; users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users, with the known ratings filled in]

RECSYS 43

Evaluation

[Same utility matrix, with some known ratings withheld as a test data set]

RECSYS 44

Evaluating Predictions
 Compare predictions with known ratings:
 Root-mean-square error (RMSE):
  • RMSE = √( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
 Precision at top 10:
  • % of those in top 10
 Rank Correlation:
  • Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
 Coverage:
  • Number of items/users for which the system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve:
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
  • TPR (aka Recall) = TP / P = TP / (TP + FN)
  • FPR = FP / N = FP / (FP + TN)
  • See https://en.wikipedia.org/wiki/Precision_and_recall
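The metrics above are one-liners in code. A minimal sketch with made-up predictions and relevance sets, just to make the formulas concrete:

```python
import math

def rmse(pred, true):
    # Root-mean-square error over N known ratings.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def precision_recall(recommended, relevant):
    # Precision = TP / (TP + FP); Recall (TPR) = TP / (TP + FN).
    rec, rel = set(recommended), set(relevant)
    tp = len(rec & rel)
    return tp / len(rec), tp / len(rel)

r = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])          # sqrt((0.25 + 0 + 1) / 3)
p, rc = precision_recall([1, 2, 3, 4], [2, 4, 6])   # precision 2/4, recall 2/3
```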

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

 Expensive step is finding the k most similar customers: O(|X|)
 Recall that X = set of customers in the system

 Too expensive to do at runtime
 Could pre-compute, using clustering as an approximation

 Naïve pre-computation takes time O(N·|C|)
 |C| = # of clusters = k in the k-means; N = # of data points

 We already know how to do this!
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

RECSYS 47

Tip: Add Data
 Leverage all the data
 Don't try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best

 Add more data
 e.g., add IMDB data on genres

 More data beats better algorithms
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

[Scatter plot: movies embedded in two latent dimensions (geared towards females ↔ males; serious ↔ funny) — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: ~480,000 users × 17,700 movies, sparsely filled with ratings 1-5]

RECSYS 52

Utility Matrix R: Evaluation

[Matrix R (480,000 users × 17,700 movies) split into a training data set and a test data set of withheld ratings, e.g. r_3,6]

SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)²

r̂_xi = predicted rating; r_xi = true rating of user x on item i
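The SSE on a held-out test set is just a sum of squared differences. A tiny sketch with invented (user, item) keys and ratings:

```python
# Hypothetical test set: ((x, i), true rating) pairs, plus a prediction table.
test = [((3, 6), 5.0), ((1, 2), 3.0), ((4, 9), 2.0)]
predicted = {(3, 6): 4.5, (1, 2): 3.0, (4, 9): 3.0}

# SSE = sum over test entries of (predicted - true)^2
sse = sum((predicted[key] - r) ** 2 for key, r in test)   # 0.25 + 0 + 1 = 1.25
```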

RECSYS 53

Latent Factor Models

 "SVD" on Netflix data: R ≈ Q·PT
 For now let's assume we can approximate the rating matrix R as a product of "thin" Q·PT
 R has missing entries, but let's ignore that for now!
  • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]

 SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

 r̂_xi = q_i·p_x^T = Σ_f q_if·p_xf

 q_i = row i of Q
 p_x = column x of PT

[Figure: R ≈ Q · PT, with row q_i of Q and column p_x of PT highlighted]

RECSYS 55


RECSYS 56

Ratings as Products of Factors

[Same figure; the missing entry is estimated as r̂_xi = q_i·p_x^T = 2.4]

RECSYS 57

Latent Factor Models

[Scatter plot: movies in the latent space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny) — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 58


RECSYS 59

Recap: SVD

 Remember SVD:
 A: input data matrix (m × n)
 U: left singular vectors
 V: right singular vectors
 Σ: singular values

 A ≈ U Σ VT

 SVD gives minimum reconstruction error (SSE):
 min_{U,Σ,V} Σ_ij (A_ij − [U Σ VT]_ij)²

 So, in our case, "SVD" on Netflix data: R ≈ Q·PT
 A = R, Q = U, PT = Σ VT

 r̂_xi = q_i·p_x^T

 But we are not done yet! R has missing entries.
 (The sum goes over all entries, but our R has missing entries.)
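On a fully observed matrix, the mapping Q = U, PT = Σ VT can be checked directly with NumPy's SVD. A minimal sketch on a made-up complete matrix (no missing entries, so plain SVD applies):

```python
import numpy as np

# Fully observed toy matrix (assumed data; no missing entries).
A = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 1.0],
              [5.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-f truncation: keep only the top f singular values/vectors.
f = 2
Q = U[:, :f]                      # "items to factors"
Pt = np.diag(s[:f]) @ Vt[:f, :]   # PT = Sigma * VT, "factors to users"
A_f = Q @ Pt                      # best rank-f approximation in SSE

err2 = ((A - A_f) ** 2).sum()     # SSE of the rank-2 reconstruction
A_1 = U[:, :1] @ np.diag(s[:1]) @ Vt[:1, :]
err1 = ((A - A_1) ** 2).sum()     # SSE of the rank-1 reconstruction
```

By the Eckart-Young theorem the rank-2 SSE can never exceed the rank-1 SSE, which the assertions below confirm; with missing entries this guarantee no longer applies, which is why the next slides switch to direct optimization of P and Q.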

RECSYS 60

Latent Factor Models

 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q:

 min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i·p_x^T)²

 r̂_xi = q_i·p_x^T

 Note:
  • We don't require cols of P, Q to be orthogonal/unit length
  • P, Q map users/movies to a latent space
  • The most popular model among Netflix contestants

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]

RECSYS 61

Dealing with Missing Entries

 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But, SSE on test data begins to rise for f > 2

 Regularization is needed to avoid overfitting:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ(Σ_x ||p_x||² + Σ_i ||q_i||²)

 λ … regularization parameter; "error" + λ·"length"

[Figure: sparse training matrix]

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])

[Scatter plot: movies and users (e.g. Gus, Dave) embedded in the same latent space (geared towards females ↔ males; serious ↔ escapist) — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; recommend a user the movies closest to them]

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where

 ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λq_if = Σ_{x,i} ∇Q(r_xi)

 • Here q_if is entry f of row q_i of matrix Q
 So far: Q ← Q − η·∇Q = Q − η·Σ_{x,i} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
 SGD: Q ← Q − η·∇Q(r_xi)
 Faster convergence!
 • Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors:

For each rxi:
 • ε_xi = r_xi − q_i·p_xᵀ (derivative of the "error")
 • q_i ← q_i + η(ε_xi·p_x − λ·q_i) (update equation)
 • p_x ← p_x + η(ε_xi·q_i − λ·p_x) (update equation)

 2 for loops:
 For until convergence:
 • For each rxi:
 • Compute gradient, do a "step"

η … learning rate
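The two update equations above can be written out in a few lines. This is a minimal illustrative sketch, not the course's reference implementation: the tiny rating list, factor count f, learning rate η, regularization λ, and epoch count are all assumed toy values.

```python
import numpy as np

# SGD for latent-factor matrix factorization, following the slide's updates:
#   eps_xi = r_xi - q_i . p_x ; q_i += eta*(eps*p_x - lam*q_i) ; p_x += eta*(eps*q_i - lam*p_x)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user x, item i, r_xi)
n_users, n_items, f = 3, 2, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))  # user factors p_x
Q = rng.normal(scale=0.1, size=(n_items, f))  # item factors q_i
eta, lam = 0.02, 0.1

for _ in range(200):                 # iterate over the ratings multiple times
    for x, i, r in ratings:
        eps = r - Q[i] @ P[x]        # derivative of the "error"
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * Q[i] - lam * P[x])
```

After a few hundred passes the reconstructed ratings Q[i] @ P[x] sit close to the observed ones, with the regularizer keeping the factor vectors short.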

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE: so, lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

 This is a case where we apply Optimization methods!

[Utility matrix of known ratings, with blank entries to predict]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000–2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Training data and test data tables: (user, movie, score) rows; test-set scores are withheld]

Movie rating data

Overall rating distribution

bull Third of ratings are 4s
bull Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie

bull Avg #ratings/movie: 5627

#ratings per user

bull Avg #ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

bull Size of data
 – Scalability
 – Keeping data in memory
bull Missing data
 – 99 percent missing
 – Very imbalanced
bull Avoiding overfitting
bull Test and training data differ significantly: movie #16322
bull Use an ensemble of complementing predictors
bull Two half-tuned models worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

 r_xi = μ + b_x + b_i + q_i·p_xᵀ
 (user bias + movie bias + user–movie interaction)

User–Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into user's behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

 – Rating scale of user x
 – Values of other ratings user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to number of ratings user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
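In code the slide's arithmetic is just a three-term sum; the numbers below mirror the example.

```python
# Baseline predictor: b_xi = mu + b_x + b_i (numbers from the example above)
mu = 3.7    # overall mean rating
b_x = -1.0  # user bias: critical reviewer, 1 star below the mean
b_i = 0.5   # movie bias: Star Wars rated 0.5 above the average movie
baseline = mu + b_x + b_i  # predicted rating, 3.2 up to float rounding
```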


RECSYS 105

Fitting the New Model

Solve:

 min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_xᵀ))²  ("goodness of fit")
  + λ[Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²]  ("regularization")

 Stochastic gradient descent to find parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

 λ is selected via grid-search on a validation set
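One way to fit this objective is the same SGD loop as before, extended with the bias parameters. This is an illustrative sketch (toy ratings, assumed rates η and λ), not the BellKor implementation.

```python
import numpy as np

# SGD for the biased model: r_xi ~ mu + b_x + b_i + q_i . p_x
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user, item, rating)
n_users, n_items, f = 3, 2, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))
b_user = np.zeros(n_users)
b_item = np.zeros(n_items)
mu = np.mean([r for _, _, r in ratings])   # overall mean rating
eta, lam = 0.02, 0.1

for _ in range(300):
    for x, i, r in ratings:
        e = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
        b_user[x] += eta * (e - lam * b_user[x])   # biases are learned parameters too
        b_item[i] += eta * (e - lam * b_item[i])
        Q[i] += eta * (e * P[x] - lam * Q[i])
        P[x] += eta * (e * Q[i] - lam * P[x])
```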

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
 Global:
 bull Overall deviations of users/movies
 Factorization:
 bull Addressing "regional" effects
 Collaborative filtering:
 bull Extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000), for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in Netflix GUI
 Meaning of rating changed

Movie age:
 Users prefer new movies, without any reasons
 Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model:
 r_xi = μ + b_x + b_i + q_i·p_xᵀ

Add time dependence to biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ

 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
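A bin-based b_i(t), as in item (2), can be sketched as a static bias plus a per-bin offset. The bin offsets and week numbering below are illustrative assumptions, not fitted values.

```python
WEEKS_PER_BIN = 10  # each bin covers 10 consecutive weeks, as on the slide

def time_bin(week):
    """Map a week index (since the start of the data) to its bias bin."""
    return week // WEEKS_PER_BIN

b_i_static = 0.3                        # static movie bias b_i (toy value)
b_i_bin = {0: -0.05, 1: 0.02, 2: 0.10}  # per-bin offsets (toy values)

def movie_bias(week):
    """Time-dependent movie bias b_i(t) = b_i + b_{i,Bin(t)}."""
    return b_i_static + b_i_bin.get(time_bin(week), 0.0)
```

For instance, movie_bias(25) looks up bin 2, so the bias that week is 0.3 + 0.10.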

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10000), for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE, 0.87–0.8925) vs. number of predictors (1–50); RMSE drops steadily as predictors are added]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed
 Group of other teams on leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day
 Only 1 final submission could be made in the last 24h

 24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

 Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

[Chart data: ensemble RMSE falls from 0.8891 with 1 predictor to 0.8712 with 50 predictors, with diminishing returns]

RECSYS 33

Item-Item CF (|N|=2)

[Utility matrix: movies 1–6 in rows, users 1–12 in columns; ratings 1–5, most entries missing]

- estimate rating of movie 1 by user 5

RECSYS 34

Item-Item CF (|N|=2)

[Same utility matrix, with a sim(1,m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract mean rating m_i from each movie i
 m_1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
 row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
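Steps 1) and 2) can be sketched as follows. Movie 1's row is taken from the slide; movie 6's row is an illustrative stand-in, since the user positions behind its ratings aren't recoverable here, so no particular similarity value is claimed.

```python
import numpy as np

def centered(row):
    """Subtract the mean of the observed (non-zero) ratings; keep missing entries at 0."""
    out = np.zeros_like(row, dtype=float)
    observed = row != 0
    out[observed] = row[observed] - row[observed].mean()
    return out

movie1 = np.array([1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0], dtype=float)  # from the slide
movie6 = np.array([1, 0, 3, 0, 4, 2, 0, 3, 0, 0, 0, 0], dtype=float)  # illustrative

c1, c6 = centered(movie1), centered(movie6)
# m_1 = 3.6, so c1 == [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
sim_16 = c1 @ c6 / (np.linalg.norm(c1) * np.linalg.norm(c6))  # centered cosine = Pearson
```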

RECSYS 35

Item-Item CF (|N|=2)

[Same utility matrix, with the sim(1,m) column: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59]

Compute similarity weights:
 s_13 = 0.41, s_16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix; the missing entry for movie 1, user 5 is filled in as 2.6]

Predict by taking weighted average:

 r_15 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity s_ij of items i and j

 Select K nearest neighbors (KNN) N(i; x)
 Set of items most similar to item i that were rated by x

 Estimate rating r_xi as the weighted average:

 r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

 N(i;x) = set of items similar to item i that were rated by x
 s_ij = similarity of items i and j
 r_xj = rating of user x on item j
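The weighted-average rule above can be written directly in code; the function and its toy inputs are illustrative, with the similarity weights and ratings taken from the running example.

```python
def predict_item_item(sims, user_ratings, k=2):
    """Estimate r_xi as a similarity-weighted average over the K most similar
    items that user x has rated.  sims: {item j: s_ij}; user_ratings: {item j: r_xj}."""
    neighbors = sorted((j for j in sims if j in user_ratings),
                       key=lambda j: sims[j], reverse=True)[:k]
    num = sum(sims[j] * user_ratings[j] for j in neighbors)
    den = sum(sims[j] for j in neighbors)
    return num / den

# s_13 = 0.41 (user rated movie 3 with 2), s_16 = 0.59 (movie 6 rated 3)
r_15 = predict_item_item({3: 0.41, 6: 0.59}, {3: 2.0, 6: 3.0})  # ~2.6, as on the slide
```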

RECSYS 38

Item-Item vs User-User

[Example utility matrix: Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem


RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix: users × movies, with known ratings]

RECSYS 43

Evaluation

[Same utility matrix; some known entries withheld as a test data set]

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:

 Root-mean-square error (RMSE)
 bull RMSE = √( Σ_{xi} (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
 Precision at top 10
 bull % of relevant items in top 10
 Rank Correlation
 bull Spearman's correlation between system's and user's complete rankings
 Another approach: 0/1 model
 Coverage
 bull Number of items/users for which system can make predictions

 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) curve
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
 bull TPR (aka Recall) = TP / P = TP / (TP + FN)
 bull FPR = FP / N = FP / (FP + TN)
 bull See https://en.wikipedia.org/wiki/Precision_and_recall
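Two of these measures are easy to sketch in code; the prediction pairs and recommendation lists below are toy inputs, not Netflix data.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top = recommended[:k]
    return sum(1 for item in top if item in relevant) / len(top)

err = rmse([(3.5, 4.0), (2.0, 2.0), (5.0, 4.0)])            # sqrt(1.25/3) ~ 0.645
p10 = precision_at_k(list(range(10)), {0, 2, 5, 99}, k=10)  # 3 of top 10 relevant -> 0.3
```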

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others


RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N·|C|)
 |C| = # of clusters = k in the k-means, N = # of data points

 We already know how to do this!
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

RECSYS 47

Tip: Add Data

 Leverage all the data
 Don't try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best

 Add more data
 e.g., add IMDB data on genres

 More data beats better algorithms
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Movies mapped onto two latent dimensions: geared towards females ↔ geared towards males, and serious ↔ funny: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: 480,000 users × 17,700 movies; sparse, with ratings 1–5 in the known entries]

RECSYS 52

Utility Matrix R: Evaluation

[Matrix R: 480,000 users × 17,700 movies; some known entries (e.g. r_36) are withheld from the training data set as a test data set]

 SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

 r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · Pᵀ

 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · Pᵀ
 R has missing entries, but let's ignore that for now!
 bull Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

SVD: A = U Σ Vᵀ

RECSYS 54

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

 r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of Pᵀ

RECSYS 55

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users); row q_i and column p_x highlighted]

 r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of Pᵀ

RECSYS 56

Ratings as Products of Factors

 How to estimate the missing rating of user x for item i?

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users); the missing entry is estimated as 2.4]

 r̂_xi = q_i · p_xᵀ = Σ_f q_if · p_xf

 q_i = row i of Q
 p_x = column x of Pᵀ

RECSYS 57

Latent Factor Models

[Movies plotted on Factor 1 (geared towards females ↔ geared towards males) vs. Factor 2 (serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 58

Latent Factor Models

[Same factor map of movies on Factor 1 (geared towards females ↔ males) vs. Factor 2 (serious ↔ funny)]

RECSYS 59

Recap: SVD

 Remember SVD:
 A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values

 A_{m×n} ≈ U Σ Vᵀ

 SVD gives minimum reconstruction error (SSE):
 min_{U,V,Σ} Σ_{ij} (A_ij − [U Σ Vᵀ]_ij)²

 So, in our case, "SVD" on Netflix data: R ≈ Q · Pᵀ
 A = R, Q = U, Pᵀ = Σ Vᵀ
 r̂_xi = q_i · p_xᵀ

 But we are not done yet! R has missing entries
 (The sum goes over all entries, but our R has missing entries)
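On a fully observed toy matrix, NumPy's SVD illustrates the minimum-SSE property: truncating to f factors leaves an SSE equal to the squared discarded singular values. The matrix here is an assumed example, not Netflix data.

```python
import numpy as np

A = np.array([[5.0, 4.0, 0.5],
              [4.0, 5.0, 1.0],
              [1.0, 0.5, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
f = 2
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]       # best rank-f approximation in SSE
sse = np.sum((A - A_f) ** 2)                      # equals the discarded s[2]**2
```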

RECSYS 60

Latent Factor Models

 SVD isn't defined when entries are missing!

 Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i·p_xᵀ)²
 r̂_xi = q_i·p_xᵀ

 Note:
 bull We don't require cols of P, Q to be orthogonal/unit length
 bull P, Q map users/movies to a latent space
 bull The most popular model among Netflix contestants

[Illustration: R (items × users) ≈ Q (items × f factors) · Pᵀ (f factors × users)]

RECSYS 61

Dealing with Missing Entries

 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But, SSE on test data begins to rise for f > 2

 Regularization is needed to avoid overfitting:
 Allow rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

 λ … regularization parameter; "error" + "length"

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Movies and users (Gus, Dave) mapped onto two latent dimensions: geared towards females ↔ geared towards males, and serious ↔ escapist: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 63

Dealing with Missing Entries

 Want to minimize SSE for unseen test data
 Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But, SSE on test data begins to rise for f > 2

 Regularization is needed:
 Allow rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

 λ … regularization parameter; "error" + "length"

RECSYS 64

The Effect of Regularization

[Movies plotted on Factor 1 (geared towards females ↔ males) vs. Factor 2 (serious ↔ funny)]

 min_{factors} "error" + λ·"length"

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

RECSYS 65

The Effect of Regularization

[Same factor map; as λ grows, sparsely rated movies shrink toward the origin]

 min_{factors} "error" + λ·"length"

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

RECSYS 66

The Effect of Regularization

[Same factor map; shrinkage continues for movies with few ratings]

 min_{factors} "error" + λ·"length"

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

RECSYS 67

The Effect of Regularization

[Same factor map; aggressively regularized positions for sparsely rated movies]

 min_{factors} "error" + λ·"length"

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

 Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 bull P ← P − η·∇P
 bull Q ← Q − η·∇Q
 bull Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λq_if
 bull Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

 How to compute gradient of a matrix? Compute gradient of every element independently!

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

[Example utility matrix with known and missing ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
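The evaluation criterion is simple to state in code. A minimal sketch (the function name is mine):

```python
import math

def rmse(preds, truths):
    """Root mean squared error, the Netflix Prize evaluation criterion."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(preds))

# predicting [3.0, 4.0] when the true ratings are [3.0, 5.0]
# gives RMSE = sqrt((0 + 1) / 2) ≈ 0.707
score = rmse([3.0, 4.0], [3.0, 5.0])
```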

[Tables: sample training data and test data — (user, movie, score) triples, with the test-set scores withheld]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

ratings per movie

• Avg ratings/movie: 5,627

ratings per user

• Avg ratings/user: 208

Average movie rating by movie count

• More ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count     Avg rating   Title
137,812   4.593        The Shawshank Redemption
133,597   4.545        Lord of the Rings: The Return of the King
180,883   4.306        The Green Mile
150,676   4.460        Lord of the Rings: The Two Towers
139,050   4.415        Finding Nemo
117,456   4.504        Raiders of the Lost Ark
180,736   4.299        Forrest Gump
147,932   4.433        Lord of the Rings: The Fellowship of the Ring
149,199   4.325        The Sixth Sense
144,027   4.333        Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model
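One common way to realize the "ensemble of complementing predictors" idea is a linear blend whose weight is tuned on a validation set. A toy sketch (the function name and weight are illustrative, not BellKor's actual blending scheme):

```python
def blend(preds_a, preds_b, w=0.5):
    """Linear blend of two predictors' outputs; w would be chosen on validation data."""
    return [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]

# two predictors that err in opposite directions average out
blended = blend([4.0, 2.0], [2.0, 4.0], w=0.5)   # [3.0, 3.0]
```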

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

rxi = μ + bx + bi + qi · pxᵀ
      (user bias + movie bias + user–movie interaction)

μ = overall mean rating, bx = bias of user x, bi = bias of movie i

User–movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
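The slide's arithmetic, spelled out (variable names are mine):

```python
mu  = 3.7    # overall mean rating
b_x = -1.0   # critical reviewer: 1 star below the mean
b_i = 0.5    # Star Wars: 0.5 above the average movie

baseline = mu + b_x + b_i
# baseline == 3.2 (up to floating-point rounding)
```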

1162019

RECSYS 105

Fitting the New Model

Solve

 Stochastic gradient descent to find the parameters
 Note: Both biases bx, bi as well as the interactions qi, px are treated as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b}  Σ_{(x,i)∈R} ( rxi − (μ + bx + bi + qi · pxᵀ) )²
             + λ ( Σi ‖qi‖² + Σx ‖px‖² + Σx bx² + Σi bi² )
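A single SGD update for this biased model follows the same pattern as the plain latent-factor update, with two extra bias steps. A hedged sketch (names, learning rate, and regularization value are mine; in practice each parameter group can get its own λ):

```python
import numpy as np

def sgd_step(x, i, r, mu, b_user, b_item, P, Q, eta=0.01, lam=0.1):
    """One SGD update for the model r_hat_xi = mu + b_x + b_i + q_i . p_x."""
    err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (err - lam * b_user[x])
    b_item[i] += eta * (err - lam * b_item[i])
    qi = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * qi - lam * P[x])
    return err

# repeatedly stepping on one observed rating drives the error down
b_u, b_i = np.zeros(1), np.zeros(1)
P, Q = np.full((1, 2), 0.1), np.full((1, 2), 0.1)
errs = [abs(sgd_step(0, 0, 4.5, 3.7, b_u, b_i, P, Q)) for _ in range(500)]
```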

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

 Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (roughly 0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

 Movie age:
• Users prefer new movies without any particular reason
• Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

 Original model:
  rxi = μ + bx + bi + qi · pxᵀ

 Add time dependence to the biases:
  rxi = μ + bx(t) + bi(t) + qi · pxᵀ

 Make the parameters bx and bi depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to the factors:
  px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
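The binned movie bias bi(t) from (2) can be sketched as a simple lookup. The names, bin width in days, and sample values below are illustrative, not Koren's actual parameters:

```python
def movie_bias_t(i, day, b_i, b_i_bin, bin_days=70):
    """b_i(t) = b_i + b_i,Bin(t): a static movie bias plus a per-bin offset;
    each bin covers 10 consecutive weeks (70 days), as on the slide."""
    return b_i[i] + b_i_bin[i][day // bin_days]

b_i = [0.5]                # static bias of movie 0
b_i_bin = [[0.0, -0.2]]    # per-bin corrections for movie 0

early = movie_bias_t(0, 10, b_i, b_i_bin)    # day 10  -> bin 0 -> 0.5
later = movie_bias_t(0, 100, b_i, b_i_bin)   # day 100 -> bin 1 -> 0.3
```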

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (roughly 0.875–0.92) vs. millions of parameters — CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + biases: 0.89

Collaborative filtering++: 0.91

1162019 112

Latent factors + biases + time: 0.876

Still no prize! Getting desperate... Try a "kitchen sink" approach

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the ensemble — RMSE falls from about 0.889 with 1 predictor to about 0.871 with 50]

RECSYS 115

Standing on June 26th 2009

1162019

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10% improvement

 BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

 Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

 24 hours before the deadline...
• A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

 Frantic last 24 hours for both teams:
• Much computer time spent on final optimization
• Carefully calibrated to end about an hour before the deadline

 Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• ...and everyone waits...

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

[Embedded chart data for "Effect of ensemble size": RMSE decreases monotonically from 0.889067 with 1 predictor to 0.871197 with 50 predictors]
Page 32: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 34

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Neighbor selectionIdentify movies similar to movie 1 rated by user 5

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

Here we use Pearson correlation as similarity1) Subtract mean rating mi from each movie i

m1 = (1+3+5+5+4)5 = 36row 1 [-26 0 -06 0 0 14 0 0 14 0 04 0]

2) Compute cosine similarities between rows

RECSYS 35

Item-Item CF (|N|=2)

121110987654321

455 311

3124452

534321423

245424

5224345

423316

users

Compute similarity weightss13=041 s16=059

1162019

mov

ies

100

-018

041

-010

-031

059

sim(1m)

RECSYS 36

Item-Item CF (|N|=2)

121110987654321

45526311

3124452

534321423

245424

5224345

423316

users

Predict by taking weighted average

r15 = (0412 + 0593) (041+059) = 26

mov

ies

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

sumsum

isin

isinsdot

=)(

)(ˆxiNj ij

xiNj xjijxi s

rsr

N(ix) = set of items similar to item i that were rated by xsij = similarity of items i and jrxj= rating of user x on item j

RECSYS 38

Item-Item vs User-User

04180109

0305081

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

1162019

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i

Precision at top 10 bull of those in top 10

Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete

rankings Another approach 01 model

Coveragebull Number of itemsusers for which system can make predictions

Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve

Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
Mean rating: μ = 3.7
You are a critical reviewer: your ratings are 1 star lower than the mean: bx = -1
Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
Predicted rating for you on Star Wars:
= 3.7 - 1 + 0.5 = 3.2
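The arithmetic above can be written as a one-line helper (a minimal sketch; the function name is ours, not from the slides):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: global mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# The slide's example: overall mean 3.7, a critical reviewer (b_x = -1),
# Star Wars rated 0.5 above the average movie (b_i = +0.5).
predicted = baseline(3.7, -1.0, 0.5)  # ≈ 3.2
```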

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: Both the biases bu, bi as well as the interactions qi, pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min over Q, P, b of
Σ(x,i)∈R ( rxi - (μ + bx + bi + qi · pxT) )²   [goodness of fit]
+ λ ( Σi ||qi||² + Σx ||px||² + Σx bx² + Σi bi² )   [regularization]
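The biased prediction and the regularized objective above can be sketched as follows (an illustration, not the BellKor code; all names are ours):

```python
import numpy as np

def predict(mu, b_x, b_i, q_i, p_x):
    # r̂_xi = mu + b_x + b_i + q_i · p_x
    return mu + b_x + b_i + float(np.dot(q_i, p_x))

def objective(ratings, mu, b_user, b_movie, Q, P, lam):
    """Regularized SSE over the observed (user x, movie i, r_xi) triples."""
    fit = sum((r - predict(mu, b_user[x], b_movie[i], Q[i], P[x])) ** 2
              for x, i, r in ratings)
    reg = lam * (np.sum(Q ** 2) + np.sum(P ** 2)
                 + np.sum(b_user ** 2) + np.sum(b_movie ** 2))
    return fit + reg
```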

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view.
Global:
• Overall deviations of users/movies
Factorization:
• Addressing "regional" effects
Collaborative filtering:
• Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885–0.92) vs. millions of parameters (1–1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

1162019

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed
Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: rxi = μ + bx + bi + qi · pxT

Add time dependence to the biases: rxi = μ + bx(t) + bi(t) + qi · pxT

Make the parameters bu and bi depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: px(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
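A sketch of the binned, time-dependent bias bi(t) (the 10-week bin width follows the slide; the day indexing and names are assumptions of ours):

```python
BIN_WEEKS = 10  # each bin covers 10 consecutive weeks (the slide's choice)

def time_bin(day):
    """Map a 0-based day index (from the start of the data) to its bias bin."""
    return day // (7 * BIN_WEEKS)

def movie_bias(b_i, b_i_bins, day):
    # b_i(t) = b_i + b_{i,Bin(t)}: a static bias plus a per-bin offset.
    return b_i + b_i_bins[time_bin(day)]
```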

RECSYS 111

Adding Temporal Effects

1162019

[Chart: RMSE (0.875–0.92) vs. millions of parameters (1–10,000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE, 0.87–0.8925) vs. number of predictors (1–50); the RMSE of the blend falls steadily as the ensemble grows.]

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%
BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble
Strategy: both teams carefully monitor the leaderboard; the only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h

24 hours before deadline…
A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams: much computer time on final optimization; carefully calibrated to end about an hour before the deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later … and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 35

Item-Item CF (|N|=2)

[Utility matrix: 6 movies × 12 users; we estimate the rating of movie 1 by user 5. Similarities of movie 1 to the other movies: sim(1,m) = 1.00, -0.18, 0.41, -0.10, -0.31, 0.59.]

Compute similarity weights: s13 = 0.41, s16 = 0.59

RECSYS 36

Item-Item CF (|N|=2)

[Same utility matrix, with the missing entry r15 filled in as 2.6.]

Predict by taking the weighted average:
r15 = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) ≈ 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

r̂xi = Σj∈N(i;x) sij · rxj / Σj∈N(i;x) sij

N(i;x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j
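The weighted average can be sketched directly (a minimal illustration rather than a production implementation; the names are ours):

```python
def predict_item_item(sims, ratings):
    """Weighted average of user x's ratings over item i's neighbors.

    sims    -- {item j: s_ij} similarities of item i to its neighbors N(i; x)
    ratings -- {item j: r_xj} user x's ratings of those same neighbors
    """
    num = sum(sims[j] * ratings[j] for j in sims)
    den = sum(sims[j] for j in sims)
    return num / den

# The slide's example: the rated neighbors of movie 1 are movies 3 and 6.
r15 = predict_item_item({3: 0.41, 6: 0.59}, {3: 2, 6: 3})  # ≈ 2.6
```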

RECSYS 38

Item-Item vs User-User

[Example: 4 users (Alice, Bob, Carol, David) × 4 movies (Avatar, LOTR, Matrix, Pirates) with a handful of known ratings.]

1162019

In practice it has been observed that item-itemoften works better than user-user

Why Items are simpler users have multiple tastes

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

[Utility matrix (movies × users), sparsely filled with ratings 1–5.]

1162019

RECSYS 43

Evaluation

[Same utility matrix, with some known entries withheld as the test data set.]

1162019

RECSYS 44

Evaluating Predictions
Compare predictions with known ratings:
• Root-mean-square error (RMSE): RMSE = √( Σ(x,i) ( r̂xi - rxi )² / N ), where r̂xi is the predicted and rxi the true rating of x on i
• Precision at top 10: % of relevant items among those in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions

Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
Receiver Operating Characteristic (ROC) curve: Y-axis: true positive rate (TPR); X-axis: false positive rate (FPR)
• TPR (aka recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
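The two headline metrics, sketched as small helpers (illustrative; the names are ours):

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are actually relevant."""
    top = recommended[:k]
    return sum(1 for item in top if item in relevant) / len(top)
```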

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N · |C|); |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[2-D map of movies in latent-factor space: one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny". Movies placed on the map: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: ~480,000 users × 17,700 movies, sparsely filled with ratings 1–5.]

RECSYS 52

Utility Matrix R Evaluation

[Matrix R (~480,000 users × 17,700 movies), split into a training data set and a test data set of withheld entries; r̂36 marks the predicted rating of user 3 on item 6.]

SSE = Σ(x,i)∈R ( r̂xi - rxi )², where r̂xi is the predicted and rxi the true rating of user x on item i

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

[Illustration: R (items × users, many entries missing) ≈ Q · PT, with Q of size items × f factors and PT of size f factors × users.]

SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

[R (items × users) ≈ Q · PT]

r̂xi = qi · pxT = Σf qif · pxf

qi = row i of Q, px = column x of PT (f latent factors)
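A toy numeric sketch of the factor product r̂xi = qi · pxT (the matrices here are made-up, with two latent factors):

```python
import numpy as np

Q = np.array([[0.5, -1.0],   # q_0: factor row for item 0
              [1.2,  0.3]])  # q_1: factor row for item 1
P = np.array([[2.0,  1.0],   # p_0: factor row for user 0
              [0.5, -0.4]])  # p_1: factor row for user 1

def rating(i, x):
    # r̂_xi = q_i · p_x^T = sum over f of q_if * p_xf
    return float(Q[i] @ P[x])

R_hat = Q @ P.T  # all predicted ratings at once (items × users)
```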


RECSYS 56

Ratings as Products of Factors
How to estimate the missing rating of user x for item i:

r̂xi = qi · pxT = Σf qif · pxf = 2.4 (the highlighted missing entry)

qi = row i of Q, px = column x of PT

RECSYS 57

Latent Factor Models

[2-D factor map: Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]


RECSYS 59

Recap SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values; A (m × n) ≈ U Σ VT

SVD gives the minimum reconstruction error (SSE):
min over U, Σ, V of Σij ( Aij - (U Σ VT)ij )²

So in our case, "SVD" on the Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r̂xi = qi · pxT

But we are not done yet! The sum goes over all entries, and our R has missing entries.
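On a small complete matrix the recap can be checked directly with NumPy (an illustration of the SSE-optimality of truncated SVD; the matrix is made-up):

```python
import numpy as np

A = np.array([[4.0, 5.0, 1.0],
              [3.0, 5.0, 2.0],
              [1.0, 1.0, 5.0]])

# Full SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-2 approximation in the SSE (Frobenius) sense: keep the top two
# singular values; the SSE then equals the smallest squared singular value.
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
sse = np.sum((A - A2) ** 2)
```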

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
min over P, Q of Σ(x,i)∈R ( rxi - qi · pxT )²

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

1162019

[Illustration: R (items × users) ≈ Q · PT; r̂xi = qi · pxT, with f latent factors.]

RECSYS 61

Dealing with Missing Entries
Want to minimize SSE for unseen test data.
Idea: Minimize SSE on the training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.
Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

1162019

min over P, Q of Σ(x,i)∈training ( rxi - qi · pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

"error" + λ · "length"; λ … regularization parameter
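The regularized objective can be sketched as a helper (an illustration; the names are ours):

```python
import numpy as np

def regularized_sse(ratings, Q, P, lam):
    """SSE over the known (user x, item i, r_xi) entries plus L2 penalties."""
    error = sum((r - float(Q[i] @ P[x])) ** 2 for x, i, r in ratings)
    length = np.sum(P ** 2) + np.sum(Q ** 2)  # the ||p_x||^2 and ||q_i||^2 terms
    return error + lam * length
```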

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor] team)

[2-D factor map (serious ↔ escapist; geared towards females ↔ geared towards males) embedding both movies (The Princess Diaries, The Lion King, Braveheart, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Lethal Weapon, Dumb and Dumber) and users (Gus, Dave) in the same space.]


RECSYS 64

The Effect of Regularization

minfactors "error" + λ · "length":
min over P, Q of Σ(x,i)∈training ( rxi - qi · pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

[2-D factor map (funny ↔ serious; geared towards females ↔ geared towards males) with the same movies as before; in the deck this slide is repeated as an animation showing the embedded points shrinking toward the origin as λ grows.]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q

Gradient descent:
Initialize P and Q (using SVD; pretend the missing ratings are 0)
Do gradient descent:
• P ← P - η · ∇P
• Q ← Q - η · ∇Q
• where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇qif] and ∇qif = Σx,i -2 ( rxi - qi · pxT ) pxf + 2λ qif
• Here qif is entry f of row qi of matrix Q
(How to compute the gradient of a matrix? Compute the gradient of every element independently!)

Objective: min over P, Q of Σ(x,i)∈training ( rxi - qi · pxT )² + λ ( Σx ||px||² + Σi ||qi||² )

Observation: Computing gradients is slow!

RECSYS 69

Digression: to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇qif], where ∇qif = Σx,i -2 ( rxi - qi · pxT ) pxf + 2λ qif = Σx,i ∇Q(rxi)
• Here qif is entry f of row qi of matrix Q

GD: Q ← Q - η · Σrxi ∇Q(rxi)
SGD: Q ← Q - μ · ∇Q(rxi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.
Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Chart: value of the objective function vs. iteration/step, for GD and SGD.]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
Initialize P and Q (using SVD; pretend the missing ratings are 0)
Then iterate over the ratings (multiple times if necessary) and update the factors.
For each rxi:
• εxi = rxi - qi · pxT (the "error")
• qi ← qi + η ( εxi px - λ qi ) (update equation)
• px ← px + η ( εxi qi - λ px ) (update equation)

2 for loops:
For … until convergence:
• For each rxi: compute the gradient, do a "step"

η … learning rate
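The loop above can be sketched end-to-end (a minimal illustration, not the contest code; here P and Q are initialized randomly rather than from SVD, and all names and hyperparameter values are our assumptions):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2,
                  eta=0.05, lam=0.02, epochs=1000, seed=0):
    """Fit R ≈ Q · P^T by SGD over the known ratings (x, i, r_xi)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, f))
    Q = 0.1 * rng.standard_normal((n_items, f))
    for _ in range(epochs):
        for x, i, r in ratings:
            e = r - Q[i] @ P[x]                    # eps_xi, the "error"
            Q[i] += eta * (e * P[x] - lam * Q[i])  # q_i update equation
            P[x] += eta * (e * Q[i] - lam * P[x])  # p_x update equation
    return P, Q

# Tiny example: 2 users × 2 items, all four ratings known.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=2)
```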

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
We want to make good recommendations on items that some user has not yet seen. So let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.
This is a case where we apply optimization methods.

1162019

[Example utility matrix with the known ratings to fit.]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data, 2000–2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
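The binned time-dependent bias can be sketched as follows; the 10-week bin width follows the slide, while the helper names and the numbers in the test are invented for illustration:

```python
def bin_index(day, bin_weeks=10):
    # Map a day offset (days since the start of the data) to a 10-week bin.
    return day // (bin_weeks * 7)

def item_bias(b_i, b_i_bin, day):
    # b_i(t): static item bias plus the offset of the bin containing day t.
    return b_i + b_i_bin[bin_index(day)]
```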

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble (1-50); the RMSE falls steadily from about 0.889 to about 0.871 as predictors are added]

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers the 30-day "last call".

RECSYS 116

The Last 30 Days

Ensemble team formed:
- A group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score over 10%

BellKor:
- Continue to get small improvements in their scores
- Realize that they are in direct competition with Ensemble

Strategy:
- Both teams carefully monitor the leaderboard
- The only sure way to check for improvement is to submit a set of predictions; this alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions were limited to 1 a day, so only 1 final submission could be made in the last 24h.

24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Chart data for "Effect of ensemble size": RMSE falls from 0.8891 with 1 predictor to 0.8712 with 50 predictors, with rapidly diminishing returns after roughly the first 10 predictors]

RECSYS 36

Item-Item CF (|N|=2)

[Utility matrix: movies × users, with the two items most similar to item 1 that user 5 has rated highlighted]

Predict by taking the weighted average:
r1,5 = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define the similarity sij of items i and j.

Select the K nearest neighbors (KNN) N(i; x): the set of items most similar to item i that were rated by x.

Estimate the rating rxi as the weighted average:

$$\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

N(i; x) = set of items similar to item i that were rated by x
sij = similarity of items i and j
rxj = rating of user x on item j
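The estimate above can be sketched in a few lines; this is a minimal illustration using cosine similarity over each item's rating vector, and the toy ratings in the test are invented:

```python
import math

def cosine(a, b):
    # a, b: dicts mapping user -> rating for two items.
    common = set(a) & set(b)
    num = sum(a[u] * b[u] for u in common)
    da = math.sqrt(sum(v * v for v in a.values()))
    db = math.sqrt(sum(v * v for v in b.values()))
    return num / (da * db) if num else 0.0

def predict(ratings, x, i, k=2):
    # ratings: item -> {user: rating}. Estimate r_xi as the
    # similarity-weighted average over the k most similar items rated by x.
    sims = [(cosine(ratings[i], ratings[j]), j)
            for j in ratings if j != i and x in ratings[j]]
    top = sorted(sims, reverse=True)[:k]
    num = sum(s * ratings[j][x] for s, j in top)
    den = sum(s for s, _ in top)
    return num / den if den else None
```

Because the prediction is a weighted average of neighbor ratings, it always lies between the smallest and largest rating among the selected neighbors.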

RECSYS 38

Item-Item vs User-User

[Example: Avatar / LOTR / Matrix / Pirates ratings for Alice, Bob, Carol, David]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods

Implement two or more different recommenders and combine their predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering:
- Item profiles for the new-item problem
- Demographics to deal with the new-user problem
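The linear combination mentioned above can be sketched as a blend of two recommenders whose mixing weight is tuned on held-out ratings; the helper names and the toy numbers in the test are illustrative only:

```python
def rmse(pred, truth):
    # Root-mean-square error between predicted and true ratings.
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

def best_blend_weight(pred_a, pred_b, truth, steps=101):
    # Grid-search w in [0, 1] minimizing RMSE of w*a + (1-w)*b on validation data.
    return min(
        (rmse([w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)], truth), w)
        for w in (i / (steps - 1) for i in range(steps))
    )
```

If one recommender systematically over-predicts and the other under-predicts by the same amount, the blend recovers the truth at w = 0.5.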

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users with the known ratings filled in]

RECSYS 43

Evaluation

[The same movies × users matrix with some of the known ratings withheld as a test data set]

RECSYS 44

Evaluating Predictions: compare predictions with known ratings.

Root-mean-square error (RMSE), where r̂xi is the predicted rating and rxi is the true rating of x on i:
$$\text{RMSE} = \sqrt{\frac{1}{|R|}\sum_{(x,i)\in R}\left(\hat{r}_{xi} - r_{xi}\right)^2}$$

Precision at top 10: fraction of relevant items among the top-10 recommendations.

Rank correlation: Spearman's correlation between the system's and the user's complete rankings.

Another approach: a 0/1 model.
- Coverage: number of items/users for which the system can make predictions
- Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Receiver Operating Characteristic (ROC) curve: Y-axis is the true positive rate (TPR), X-axis is the false positive rate (FPR)
  - TPR (aka Recall) = TP / P = TP / (TP + FN)
  - FPR = FP / N = FP / (FP + TN)
  - See https://en.wikipedia.org/wiki/Precision_and_recall
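The two most common of these metrics are easy to compute directly; a minimal sketch (the toy values in the test are invented):

```python
def rmse(pairs):
    # pairs: iterable of (predicted, true) ratings for the held-out entries.
    se = [(p - t) ** 2 for p, t in pairs]
    return (sum(se) / len(se)) ** 0.5

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k ranked items that are in the relevant set.
    top = ranked[:k]
    return sum(1 for item in top if item in relevant) / len(top)
```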

RECSYS 45

Problems with Error Measures

A narrow focus on accuracy sometimes misses the point:
- Prediction diversity
- Prediction context
- Order of predictions

In practice, we care only about predicting high ratings: RMSE might penalize a method that does well for high ratings and badly for the others.

RECSYS 46

Collaborative Filtering: Complexity

The expensive step is finding the k most similar customers: O(|X|). Recall that X = the set of customers in the system. This is too expensive to do at runtime; we could pre-compute, using clustering as an approximation.

Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points.

We already know how to do this:
- Near-neighbor search in high dimensions (LSH)
- Clustering
- Dimensionality reduction

RECSYS 47

Tip: Add Data

Leverage all the data: don't try to reduce data size in an effort to make fancy algorithms work. Simple methods on large data do best.

Add more data, e.g., add IMDB data on genres.

More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[2-D map of movies: horizontal axis from "geared towards females" to "geared towards males", vertical axis from "serious" to "funny"; movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility are placed on it]

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings 1-5]
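Since R is overwhelmingly empty, in practice only the observed entries are stored. A minimal sparse representation with invented names, just to make the idea concrete:

```python
# Only the known (user, movie) ratings are stored;
# absent keys are the matrix's missing entries.
R = {("Alice", "Avatar"): 4, ("Alice", "Matrix"): 5, ("Bob", "Avatar"): 1}

def get_rating(ratings, user, movie):
    # Returns the stored rating, or None for an unknown entry.
    return ratings.get((user, movie))
```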

RECSYS 52

Utility Matrix R: Evaluation

[Matrix R: 480,000 users × 17,700 movies; the known entries form the training data set, and some withheld entries form the test data set. For a test entry such as r̂3,6, the predicted rating is compared with the true rating of user x on item i]

$$\text{SSE} = \sum_{(x,i)\in R} \left(\hat{r}_{xi} - r_{xi}\right)^2$$

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT. R has missing entries, but let's ignore that for now: basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones.

[Illustration: the items × users matrix R factored into an items × f matrix Q times an f × users matrix PT, with f factors]

SVD: A = U Σ VT
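Each reconstructed entry of R is just the dot product of one row of Q with one column of PT; a one-line sketch with made-up factor vectors:

```python
def reconstruct(q_i, p_x):
    # r_hat_xi = q_i . p_x^T = sum over factors f of q_if * p_xf
    return sum(qf * pf for qf, pf in zip(q_i, p_x))
```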

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Illustration: R ≈ Q · PT; the missing entry rxi is reconstructed from row i of Q and column x of PT]

$$\hat{r}_{xi} = q_i \cdot p_x^T = \sum_f q_{if}\, p_{xf}$$

qi = row i of Q
px = column x of PT


RECSYS 57

Latent Factor Models

[The same 2-D map of movies (geared towards females vs. males, serious vs. funny), now interpreted as a latent space: Factor 1 on the horizontal axis, Factor 2 on the vertical axis]


RECSYS 59

Recap: SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.

[Illustration: A (m × n) ≈ U Σ VT]

SVD gives the minimum reconstruction error (SSE):
$$\min_{U,V,\Sigma} \sum_{ij} \left(A_{ij} - (U\Sigma V^T)_{ij}\right)^2$$

So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and
$$\hat{r}_{xi} = q_i \cdot p_x^T$$

But we are not done yet! The sum goes over all entries, but our R has missing entries.

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
$$\min_{P,Q} \sum_{(x,i)\in R} \left(r_{xi} - q_i \cdot p_x^T\right)^2$$

Note:
- We don't require the columns of P, Q to be orthogonal or unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants

[Illustration: R ≈ Q · PT with f factors; r̂xi = qi · pxT]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: minimize SSE on the training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

$$\min_{P,Q} \sum_{\text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda\left(\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\right)$$

λ … regularization parameter; the first term is the "error", the λ term the "length".

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[2-D latent space (geared towards females vs. males, serious vs. escapist) with the movies placed in it; users Gus and Dave are embedded in the same space, and movies near a user can be recommended]


RECSYS 64

The Effect of Regularization

[2-D latent space (geared towards females vs. males, serious vs. funny) with the movies placed in it; as λ grows, sparsely rated movies shrink toward the origin]

min over factors of "error" + λ · "length":
$$\min_{P,Q} \sum_{\text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda\left(\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\right)$$


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
$$\min_{P,Q} \sum_{\text{training}} \left(r_{xi} - q_i p_x^T\right)^2 + \lambda\left(\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\right)$$

Gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and
$$\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if}$$
  - Here q_if is entry f of row qi of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: computing the gradients is slow!

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

[Restates the objective and the gradient-descent updates from the previous slide]

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD.

Observation: ∇Q = [∇q_if], where
$$\nabla q_{if} = \sum_{x,i} -2\left(r_{xi} - q_i p_x^T\right) p_{xf} + 2\lambda q_{if} = \sum_{x,i} \nabla_Q(r_{xi})$$
Here q_if is entry f of row qi of matrix Q.

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − μ · ∇Q(r_xi)

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence: SGD needs more steps, but each step is computed much faster.

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD:

[Plot: value of the objective function vs. iteration/step for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD; pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors.

For each rxi:
- ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
- q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
- p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

2 for loops:
- For until convergence:
  - For each rxi: compute the gradient, do a "step"

η … learning rate
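The two update equations can be put together into a self-contained SGD factorization sketch. This uses random initialization instead of the SVD initialization mentioned above, and all sizes, hyperparameters, and the toy ratings in the test are illustrative only:

```python
import random

def predict(P, Q, x, i):
    # r_hat_xi = q_i . p_x^T
    return sum(qf * pf for qf, pf in zip(Q[i], P[x]))

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.05, lam=0.02, epochs=200):
    # ratings: list of (user x, item i, r_xi) for the known entries only.
    rnd = random.Random(0)
    P = [[rnd.uniform(0.0, 0.5) for _ in range(f)] for _ in range(n_users)]
    Q = [[rnd.uniform(0.0, 0.5) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            e = r - predict(P, Q, x, i)                 # eps_xi
            for k in range(f):
                q, p = Q[i][k], P[x][k]
                Q[i][k] = q + eta * (e * p - lam * q)   # q_i update
                P[x][k] = p + eta * (e * q - lam * p)   # p_x update
    return P, Q
```

On a tiny fully rank-2 example, the training RMSE should drop well below the scale of the ratings.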

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: make good recommendations. Quantify goodness using SSE, so lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen. So let's set the values of P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Utility matrix with the known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge, 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
- Training data:
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data, 2000-2005
- Test data:
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition:
  - 2700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[Training data and test data tables of (user, movie, score) triples; in the test data the scores are withheld]

Movie rating data

Overall rating distribution:
- A third of the ratings are 4s
- Average rating is 3.68

# ratings per movie: avg. ratings/movie = 5627

# ratings per user: avg. ratings/user = 208

Average movie rating by movie count: more ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

- Size of data:
  - Scalability
  - Keeping data in memory
- Missing data:
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g., movie 16322)

- Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

$$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^T$$

μ = overall mean rating; bx = bias of user x; bi = bias of movie i; qi · pxT = user-movie interaction

User-movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, bx = -1
- Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
- Predicted rating for you on Star Wars: 3.7 - 1 + 0.5 = 3.2

1162019
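The worked example above is just the baseline part of the model (the interaction term is omitted); as a trivially checkable sketch:

```python
def baseline(mu, b_x, b_i):
    # Baseline estimate: overall mean plus user and movie deviations.
    return mu + b_x + b_i
```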

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 37

Common Practice for Item-Item Collaborative Filtering

Define similarity sij of items i and j

Select K nearest neighbors (KNN) N(i x) Set of Items most similar to item i that were rated by x

Estimate rating rxi as the weighted average

1162019

r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )

N(i;x) = set of items similar to item i that were rated by x
s_ij = similarity of items i and j
r_xj = rating of user x on item j
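The weighted average above can be sketched as follows; the item names, similarities s_ij, and ratings below are hypothetical toy values, not from the slides:

```python
# Item-item CF: estimate r_xi as a similarity-weighted average of user x's
# ratings on the K items most similar to item i (toy data).

def predict_item_item(sim_to_i, ratings_by_x, k=2):
    """sim_to_i: {item j: s_ij}; ratings_by_x: {item j: r_xj}."""
    # N(i;x): the k items most similar to i among those x has rated
    rated = [(s, j) for j, s in sim_to_i.items() if j in ratings_by_x]
    neighbors = sorted(rated, reverse=True)[:k]
    num = sum(s * ratings_by_x[j] for s, j in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

sims = {"movie_a": 0.8, "movie_b": 0.4, "movie_c": 0.1}   # s_ij (assumed)
rated = {"movie_a": 5.0, "movie_b": 3.0, "movie_c": 1.0}  # r_xj (assumed)
print(predict_item_item(sims, rated, k=2))  # (0.8*5 + 0.4*3) / 1.2 ≈ 4.33
```

Only the K nearest rated neighbors enter the sum, which is why the denominator normalizes by the selected similarities rather than all of them.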

RECSYS 38

Item-Item vs User-User

04180109

0305081

Avatar LOTR Matrix Pirates

Alice

Bob

Carol

David

1162019

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity: The user/ratings matrix is sparse. Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)
• RMSE = √[ (1/N) Σ_{(x,i)} (r̂_xi − r_xi)² ], where r̂_xi is the predicted and r_xi the true rating of x on i

Precision at top 10
• % of those in top 10

Rank Correlation
• Spearman's correlation between system's and user's complete rankings

Another approach: 0/1 model
Coverage
• Number of items/users for which system can make predictions

Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
Receiver Operating Characteristic (ROC) Curve
• Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
• TPR (aka Recall) = TP / P = TP / (TP + FN)
• FPR = FP / N = FP / (FP + TN)
• See https://en.wikipedia.org/wiki/Precision_and_recall
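A minimal sketch of computing RMSE over held-out ratings; the user/item keys and rating values are toy assumptions:

```python
import math

# RMSE over held-out (user, item) ratings: the square root of the mean
# squared difference between predicted and true ratings.

def rmse(predicted, actual):
    """predicted, actual: {(user, item): rating} with the same keys."""
    se = sum((predicted[k] - actual[k]) ** 2 for k in actual)
    return math.sqrt(se / len(actual))

truth = {("alice", "matrix"): 4.0, ("bob", "avatar"): 2.0}
pred  = {("alice", "matrix"): 3.5, ("bob", "avatar"): 3.0}
print(rmse(pred, truth))  # sqrt((0.25 + 1.0) / 2) ≈ 0.79
```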

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N·|C|); |C| = # of clusters = k in the k-means; N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering via Latent Factor Models (e.g., SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

r̂_{3,6}

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT

R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of P^T

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of P^T

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of P^T
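The estimate is just a dot product of two factor vectors. A toy sketch (the 3-factor values below are assumptions for illustration, not the numbers on the slide):

```python
# Latent-factor estimate: r_hat_xi = q_i · p_x^T = sum over factors f of q_if * p_xf

q_i = [2.0, 0.3, -1.0]   # row i of Q (item factor vector, assumed)
p_x = [1.0, 4.0, 0.5]    # column x of P^T (user factor vector, assumed)

r_hat = sum(qf * pf for qf, pf in zip(q_i, p_x))
print(r_hat)  # 2*1 + 0.3*4 + (-1)*0.5 = 2.7
```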

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{xi} (A_xi − (UΣV^T)_xi)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT
A = R, Q = U, PT = Σ VT

But we are not done yet! R has missing entries

U

The sum goes over all entries. But our R has missing entries!

r̂_xi = q_i · p_x^T

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

r̂_xi = q_i · p_x^T

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2!

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

1162019

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

1 3 43 5 5

4 5 533

2

2 1 3

1

λ … regularization parameter; "error" and "length" terms
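Evaluating this regularized objective is a one-liner per term. A sketch with toy ratings; the factor vectors and λ below are assumptions for illustration:

```python
# Regularized SSE on known ratings only:
# J(P, Q) = sum over training (r_xi - q_i·p_x)^2
#           + lam * (sum_x ||p_x||^2 + sum_i ||q_i||^2)

def objective(ratings, P, Q, lam):
    """ratings: {(x, i): r_xi}; P, Q: {id: factor vector}; lam: λ."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    err = sum((r - dot(Q[i], P[x])) ** 2 for (x, i), r in ratings.items())
    length = sum(dot(v, v) for v in P.values()) + sum(dot(v, v) for v in Q.values())
    return err + lam * length

R = {("u1", "m1"): 5.0, ("u1", "m2"): 1.0}
P = {"u1": [1.0, 0.0]}
Q = {"m1": [4.0, 1.0], "m2": [1.0, -2.0]}
print(objective(R, P, Q, lam=0.1))  # (5-4)^2 + 0 + 0.1*(1 + 17 + 5) = 3.3
```

Missing (x, i) pairs simply never appear in the error sum, which is the whole point of the formulation.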

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2!

Regularization is needed: allow a rich model where there are sufficient data; shrink aggressively where data are scarce

1162019

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

1 3 43 5 5

4 5 533

2

2 1 3

1

λ … regularization parameter; "error" and "length" terms

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

min_{factors} "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

min_{factors} "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

min_{factors} "error" + λ·"length"

The PrincessDiaries

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

min_{factors} "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} [ −2(r_xi − q_i·p_x^T)·p_xf + 2λq_if ] = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

SGD: Q ← Q − η·∇Q(r_xi)
Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
Initialize P and Q (using SVD; pretend missing ratings are 0)
Then iterate over the ratings (multiple times if necessary) and update factors.
For each r_xi:
• ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
• q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
• p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

2 for loops:
For until convergence:
• For each r_xi: compute gradient, do a "step"

η … learning rate
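The per-rating update rules can be sketched as a small training loop. This is a toy illustration: the ratings, learning rate η, regularization λ, and epoch count are assumptions, and random initialization stands in for the SVD-based initialization the slides mention:

```python
import random

# SGD for the regularized latent factor model: for each known rating r_xi,
# compute the error eps = r_xi - q_i·p_x and nudge q_i and p_x toward it,
# shrunk by the regularizer lam.

def sgd(ratings, n_factors=2, eta=0.02, lam=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    P = {x: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for x, _ in ratings}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _, i in ratings}
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            err = r - sum(q * p for q, p in zip(Q[i], P[x]))  # eps_xi
            for f in range(n_factors):
                q_old = Q[i][f]
                Q[i][f] += eta * (err * P[x][f] - lam * Q[i][f])
                P[x][f] += eta * (err * q_old - lam * P[x][f])
    return P, Q

R = {("u1", "m1"): 5.0, ("u2", "m1"): 4.0, ("u1", "m2"): 1.0}
P, Q = sgd(R)
pred = sum(q * p for q, p in zip(Q["m1"], P["u1"]))
print(round(pred, 2))  # approaches the observed rating 5.0 (shrunk a bit by lam)
```

Note the inner loop touches one rating at a time, which is exactly what makes each SGD step cheap compared with a full-gradient GD step.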

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
Quantify goodness using SSE: lower SSE means better recommendations
We want to make good recommendations on items that some user has not yet seen
Let's set values for P and Q such that they work well on known (user, item) ratings
And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (score, movie, user) rows for the training data and the test data]

Movie rating data

Overall rating distribution

• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

# ratings per movie
• Avg ratings/movie: 5627

# ratings per user
• Avg ratings/user: 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly: movie #16322
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i
(user bias + movie bias + user-movie interaction)

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i
– Rating scale of user x
– Values of other ratings user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
Mean rating: μ = 3.7
You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
= 3.7 − 1 + 0.5 = 3.2
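The worked example above, as a check (the numbers are the ones on the slide):

```python
# Baseline predictor: b_xi = mu + b_x + b_i (no interaction term yet)

mu  = 3.7   # overall mean rating
b_x = -1.0  # critical reviewer: rates 1 star below the mean
b_i = +0.5  # Star Wars: rated 0.5 above the average movie

baseline = mu + b_x + b_i
print(baseline)  # 3.7 - 1.0 + 0.5 = 3.2
```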

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find parameters
Note: Both biases b_u, b_i as well as interactions q_i, p_u are treated as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )² + λ [ Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² ]

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Plot: RMSE (≈0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004). Improvements in Netflix: GUI improvements; meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make parameters b_u and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
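A sketch of the binned time-dependent movie bias b_i(t) described above. The 10-week bin width follows the slide; the static bias and per-bin offsets are assumed values for illustration:

```python
# Time-dependent movie bias b_i(t): a static bias plus a per-bin offset,
# where each bin covers 10 consecutive weeks.

WEEKS_PER_BIN = 10

def movie_bias(b_i_static, b_i_bins, week):
    """Return b_i(t) for the time bin containing `week`."""
    return b_i_static + b_i_bins[week // WEEKS_PER_BIN]

b_bins = [0.0, 0.1, 0.3]                 # one offset per 10-week bin (assumed)
print(movie_bias(0.5, b_bins, week=17))  # week 17 falls in bin 1: 0.5 + 0.1 = 0.6
```

The user bias b_x(t) is handled differently (linear trends and per-day terms), but the movie side really is this coarse binning.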

RECSYS 111

Adding Temporal Effects

1162019

[Plot: RMSE (≈0.875-0.92) vs. millions of parameters (1-10000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Effect of ensemble size — error (RMSE) falls from ≈0.889 with 1 predictor to ≈0.871 with 50 predictors; x-axis: Predictors (1-50), y-axis: Error (RMSE)]




RECSYS 38

Item-Item vs. User-User

[Figure: utility matrix over the movies Avatar, LOTR, Matrix, Pirates and the users Alice, Bob, Carol, David, comparing item-item and user-user rating estimates.]

1/16/2019

• In practice, it has been observed that item-item often works better than user-user

• Why? Items are simpler; users have multiple tastes

RECSYS 39

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed

- Cold Start: need enough users in the system to find a match

- Sparsity: the user/ratings matrix is sparse; hard to find users that have rated the same items

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items

RECSYS 40

Hybrid Methods

• Implement two or more different recommenders and combine predictions, perhaps using a linear model

• Add content-based methods to collaborative filtering: item profiles for the new-item problem, demographics to deal with the new-user problem

RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Figure: users-by-movies utility matrix of known ratings (1-5).]

RECSYS 43

Evaluation

[Figure: the same utility matrix with some of the known ratings held out as the Test Data Set.]

RECSYS 44

Evaluating Predictions

• Compare predictions with known ratings

• Root-mean-square error (RMSE): RMSE = sqrt( (1/N) Σ_xi (r̂_xi − r_xi)² ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i

• Precision at top 10: fraction of relevant items among those in the top 10

• Rank Correlation: Spearman's correlation between system's and user's complete rankings

• Another approach: 0/1 model
 Coverage: number of items/users for which the system can make predictions

• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)

• Receiver Operating Characteristic (ROC) Curve:
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
 TPR (aka Recall) = TP / P = TP / (TP + FN)
 FPR = FP / N = FP / (FP + TN)
 See https://en.wikipedia.org/wiki/Precision_and_recall
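The error and classification measures above can be sketched in a few lines of Python; the small ratings dictionary and confusion counts here are made up purely for illustration.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over the (user, item) pairs both dicts share."""
    common = predicted.keys() & actual.keys()
    return math.sqrt(sum((predicted[k] - actual[k]) ** 2 for k in common) / len(common))

# Toy held-out ratings: (user, item) -> stars (values are made up)
actual = {("alice", "matrix"): 5, ("alice", "pirates"): 2, ("bob", "matrix"): 4}
predicted = {("alice", "matrix"): 4.5, ("alice", "pirates"): 2.5, ("bob", "matrix"): 3.5}
print(round(rmse(predicted, actual), 3))  # each error is 0.5, so RMSE = 0.5

# Precision / recall / accuracy from toy confusion counts
tp, fp, fn, tn = 8, 2, 4, 86
precision = tp / (tp + fp)                  # TP / (TP + FP)
recall = tp / (tp + fn)                     # TPR = TP / (TP + FN)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # (TP + TN) / all
print(precision, recall, accuracy)
```
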

RECSYS 45

Problems with Error Measures

• Narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions

• In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others

RECSYS 46

Collaborative Filtering: Complexity

• Expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system

• Too expensive to do at runtime. Could pre-compute, using clustering as an approximation.

• Naïve pre-computation takes time O(N · |C|), where |C| = # of clusters (= k in k-means) and N = # of data points

• We already know how to do this: near-neighbor search in high dimensions (LSH), clustering, dimensionality reduction

RECSYS 47

Tip: Add Data

• Leverage all the data

• Don't try to reduce data size in an effort to make fancy algorithms work

• Simple methods on large data do best

• Add more data, e.g., add IMDB data on genres

• More data beats better algorithms
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a 2-D latent space. Horizontal axis: geared towards females vs. geared towards males; vertical axis: serious vs. funny. Titles include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Figure: the matrix R, 480,000 users by 17,700 movies, sparsely filled with ratings 1-5.]

RECSYS 52

Utility Matrix R: Evaluation

[Figure: matrix R (480,000 users by 17,700 movies) split into a Training Data Set of known ratings and a held-out Test Data Set; one test entry, r₃₆, is highlighted.]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)², where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i

RECSYS 53

Latent Factor Models

• "SVD" on Netflix data: R ≈ Q · P^T

• For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T
 R has missing entries, but let's ignore that for now
 Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values of the missing ones

[Figure: the users-by-items matrix R factored into a thin items-by-f matrix Q times an f-by-users matrix P^T, for f factors.]

SVD: A = U Σ V^T
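As a sketch of the "thin" factorization idea (not the contest code), numpy's SVD can build a rank-f approximation of a fully known toy matrix; handling missing entries is addressed later in the slides. The matrix values below are made up.

```python
import numpy as np

# Fully known toy rating matrix (no missing entries yet); values are made up.
R = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [2., 1., 4., 5.]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

f = 2                                # number of latent factors to keep
Q = U[:, :f]                         # "thin" items-by-f factor
Pt = np.diag(s[:f]) @ Vt[:f, :]      # f-by-users factor (absorbing the singular values)
R_hat = Q @ Pt                       # rank-f reconstruction of R

sse = float(np.sum((R - R_hat) ** 2))
print(round(sse, 4))                 # reconstruction error of the rank-f approximation
```
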

RECSYS 54-56

Ratings as Products of Factors

• How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of P^T

[Figure (three animation steps): the rating matrix R ≈ Q · P^T with f factors; row q_i of Q and column p_x of P^T are highlighted, and their dot product gives the estimated missing entry, e.g. 2.4.]

RECSYS 57-58

Latent Factor Models

[Figure: the same 2-D latent space (Factor 1: geared towards females vs. males; Factor 2: serious vs. funny), with The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility embedded.]

RECSYS 59

Recap: SVD

• Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values

• SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [U Σ V^T]_ij)²

• So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T

• But we are not done yet! R has missing entries
 The sum goes over all entries, but our R has missing entries

r̂_xi = q_i · p_x^T

[Figure: A (m by n) ≈ U Σ V^T.]

RECSYS 60

Latent Factor Models

• SVD isn't defined when entries are missing!

• Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

• Note:
 We don't require the columns of P, Q to be orthogonal/unit length
 P, Q map users/movies to a latent space
 The most popular model among Netflix contestants

r̂_xi = q_i · p_x^T

[Figure: R ≈ Q · P^T, with f factors.]

RECSYS 61

Dealing with Missing Entries

• Want to minimize SSE for unseen test data

• Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

• Regularization is needed to avoid overfitting:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

λ … regularization parameter; the first term is the "error", the second the "length"
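A minimal sketch of evaluating this regularized objective over only the observed ratings. The toy ratings, λ, and the random factor values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, f = 4, 5, 2
P = rng.normal(size=(n_users, f))   # user factor vectors p_x (rows)
Q = rng.normal(size=(n_items, f))   # item factor vectors q_i (rows)
lam = 0.1                           # regularization strength (lambda), chosen arbitrarily

# Observed training ratings only: (user, item) -> rating
ratings = {(0, 1): 5.0, (0, 3): 1.0, (1, 0): 4.0, (2, 2): 2.0, (3, 4): 3.0}

# "error" term: SSE over the known entries
error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
# "length" term: lambda times the squared norms of all factor vectors
length = lam * (float(np.sum(P ** 2)) + float(np.sum(Q ** 2)))

objective = error + length
print(float(objective))
```
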

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: movies and users (e.g. Gus, Dave) embedded in the 2-D latent space (geared towards females vs. males; serious vs. escapist), with recommendations made by proximity.]

RECSYS 63

Dealing with Missing Entries

• Want to minimize SSE for unseen test data

• Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

• Regularization is needed:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

λ … regularization parameter

RECSYS 64-67

The Effect of Regularization

min_factors "error" + λ "length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

[Figure (four animation steps): movies and users in the 2-D latent space (Factor 1: geared towards females vs. males; Factor 2: serious vs. funny). Under regularization, sparsely rated points such as The Princess Diaries are shrunk aggressively toward the origin, while well-supported points move freely.]

RECSYS 68

Use Gradient Descent to search for the optimal settings

• Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

• Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2 λ q_if
  Here q_if is entry f of row q_i of matrix Q

• Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

• Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

• Gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2 λ q_if
  Here q_if is entry f of row q_i of matrix Q

• Observation: Computing gradients is slow!

RECSYS 72

Stochastic Gradient Descent

• Gradient Descent (GD) vs. Stochastic GD

• Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2 λ q_if = Σ_{x,i} ∇Q(r_xi)
 Here q_if is entry f of row q_i of matrix Q

• GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)

• SGD: Q ← Q − μ ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

• Faster convergence!
 Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

• Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step for GD and SGD.]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

• Stochastic gradient descent:
 Initialize P and Q (using SVD; pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors.
 For each r_xi:
  ε_xi = r_xi − q_i · p_x^T (the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

• 2 for loops:
 For until convergence:
  For each r_xi:
   Compute gradient, do a "step"

η … learning rate
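The per-rating update equations above can be sketched as a complete SGD training loop. The toy ratings and the choices of η, λ, f, and epoch count are made up; this is a sketch of the technique, not the contest code.

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.1, epochs=500, seed=0):
    """Fit R ~= Q @ P.T on the observed ratings only, via per-rating SGD updates."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, f))   # user factors p_x
    Q = 0.1 * rng.normal(size=(n_items, f))   # item factors q_i
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            err = r - Q[i] @ P[x]             # eps_xi, the per-rating error
            Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update
            P[x] += eta * (err * Q[i] - lam * P[x])   # p_x update
    return P, Q

# Toy observed ratings: (user, item) -> rating
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
P, Q = sgd_factorize(ratings, n_users=3, n_items=3)

sse = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
print(round(float(sse), 3))  # training SSE shrinks over the epochs
```
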

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)


RECSYS 76

Summary: Recommendations via Optimization

• Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen

• Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

• This is a case where we apply Optimization methods

[Figure: the training utility matrix of known ratings.]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize

• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005

• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514

• Competition
 – 2700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample training data (user, movie, score) and test data (user, movie, score withheld).]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg #ratings/movie: 5627

# ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137812 | 4.593 | The Shawshank Redemption
133597 | 4.545 | Lord of the Rings: The Return of the King
180883 | 4.306 | The Green Mile
150676 | 4.460 | Lord of the Rings: The Two Towers
139050 | 4.415 | Finding Nemo
117456 | 4.504 | Raiders of the Lost Ark
180736 | 4.299 | Forrest Gump
147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149199 | 4.325 | The Sixth Sense
144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory

• Missing data
 – 99 percent missing
 – Very imbalanced

• Avoiding overfitting

• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(user bias + movie bias + user-movie interaction)

• μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

• User-Movie interaction
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

• Baseline predictor
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

• We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

• Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
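A sketch of estimating the baseline predictor from data. Here the biases are taken as simple average deviations from the global mean, which is an assumption for illustration (the competition systems fit them with regularization); the toy ratings dictionary is made up.

```python
from collections import defaultdict

# Toy ratings: (user, movie) -> stars (values made up)
ratings = {("you", "alien"): 3, ("you", "brazil"): 2,
           ("ann", "alien"): 5, ("ann", "starwars"): 5,
           ("bob", "starwars"): 4, ("bob", "brazil"): 3}

mu = sum(ratings.values()) / len(ratings)   # global mean rating

# Collect each user's and movie's deviations from the global mean
user_dev, movie_dev = defaultdict(list), defaultdict(list)
for (u, m), r in ratings.items():
    user_dev[u].append(r - mu)
    movie_dev[m].append(r - mu)

b_user = {u: sum(d) / len(d) for u, d in user_dev.items()}    # b_x
b_movie = {m: sum(d) / len(d) for m, d in movie_dev.items()}  # b_i

def baseline(u, m):
    """Baseline prediction: mu + b_x + b_i (0 bias for unseen users/movies)."""
    return mu + b_user.get(u, 0.0) + b_movie.get(m, 0.0)

print(round(baseline("you", "starwars"), 2))  # 3.33
```
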

RECSYS 105

Fitting the New Model

• Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i p_x^T))² + λ (Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)

(goodness of fit + regularization)

• Stochastic gradient descent to find the parameters
 Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)

• λ is selected via grid-search on a validation set
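Extending the earlier SGD sketch with the bias terms in the objective above. Toy data again; η, λ, f, and epoch count are arbitrary choices, and the per-parameter updates follow the same "error minus shrinkage" pattern as before.

```python
import numpy as np

def sgd_biased(ratings, n_users, n_items, f=2, eta=0.02, lam=0.1, epochs=500, seed=0):
    """SGD for r_xi ~= mu + b_x + b_i + q_i . p_x; biases and factors are all learned."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean(list(ratings.values())))  # global mean (held fixed here)
    bu = np.zeros(n_users)                       # user biases b_x
    bi = np.zeros(n_items)                       # item biases b_i
    P = 0.1 * rng.normal(size=(n_users, f))
    Q = 0.1 * rng.normal(size=(n_items, f))
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            err = r - (mu + bu[x] + bi[i] + Q[i] @ P[x])
            bu[x] += eta * (err - lam * bu[x])
            bi[i] += eta * (err - lam * bi[i])
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * Q[i] - lam * P[x])
    return lambda x, i: mu + bu[x] + bi[i] + Q[i] @ P[x]

ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
predict = sgd_biased(ratings, n_users=3, n_items=3)

sse = sum((r - predict(x, i)) ** 2 for (x, i), r in ratings.items())
print(round(float(sse), 3))
```
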

RECSYS 106

BellKor Recommender System

• The winner of the Netflix Challenge!

• Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global: overall deviations of users/movies
 Factorization: addressing "regional" effects
 Collaborative filtering: extract local patterns

[Figure: global effects, then factorization, then collaborative filtering.]

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885 to 0.920) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE falls as factors and biases are added.]

RECSYS 108

Performance of Various Methods

• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix (Cinematch): 0.9514

• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89

• Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

• Sudden rise in the average movie rating (early 2004)
 Improvements in Netflix GUI
 Meaning of rating changed

• Movie age
 Users prefer new movies without any reason
 Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

• Original model: r̂_xi = μ + b_x + b_i + q_i · p_x^T

• Add time dependence to the biases: r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

• Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
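A tiny sketch of the binned, time-dependent movie bias b_i(t): each bin covers 10 consecutive weeks (70 days) and gets its own bias value. The start date and the per-bin bias values below are hypothetical.

```python
from datetime import date

T0 = date(2000, 1, 1)   # assumed start of the ratings period
BIN_DAYS = 70           # 10 consecutive weeks per bin

def time_bin(t):
    """Map a rating date to its 10-week bin index."""
    return (t - T0).days // BIN_DAYS

# Hypothetical per-bin bias values for one movie
b_i_bins = {0: -0.10, 1: -0.05, 2: 0.00, 3: 0.12}

def b_i(t):
    """Time-dependent movie bias: look up the bin's bias, default 0."""
    return b_i_bins.get(time_bin(t), 0.0)

# 2000-03-15 is 74 days after T0, so it falls in bin 1
print(time_bin(date(2000, 3, 15)), b_i(date(2000, 3, 15)))  # 1 -0.05
```
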

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875 to 0.920) vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix: 0.9514

• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Latent factors + Biases + Time: 0.876

• Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113-114

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors (1 to 50) in the ensemble. RMSE falls from about 0.8891 with 1 predictor to about 0.8712 with 50, with strongly diminishing returns.]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

• Ensemble team formed
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

• BellKor
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

• Strategy
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

• Submissions limited to 1 a day
 Only 1 final submission could be made in the last 24h

• 24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

• Frantic last 24 hours for both teams
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

• Final submissions
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 … and everyone waits …

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading

• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com/

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g., SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading

RECSYS 39

ProsCons of Collaborative Filtering

+ Works for any kind of item No feature selection needed

- Cold Start Need enough users in the system to find a match

- Sparsity The userratings matrix is sparse Hard to find users that have rated the same items

- First rater Cannot recommend an item that has not been

previously rated New items Esoteric items

- Popularity bias Cannot recommend items to someone with

unique taste Tends to recommend popular items

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i

Precision at top 10 bull of those in top 10

Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete

rankings Another approach 01 model

Coveragebull Number of itemsusers for which system can make predictions

Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve

Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies embedded in a two-dimensional latent space — one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny". Example titles: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Figure: the Netflix utility matrix R — 480,000 users × 17,700 movies, sparsely filled with ratings 1–5]

RECSYS 52

Utility Matrix R Evaluation

[Figure: matrix R (480,000 users × 17,700 movies), split into a Training Data Set and a held-out Test Data Set; the task is to predict held-out entries such as r̂_3,6]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)², where r̂_xi is the predicted rating and r_xi the true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

- For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T
- R has missing entries, but let's ignore that for now
  - Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (users × items) ≈ Q (items × f factors) · P^T (f factors × users); compare SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors: how to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of P^T

[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users); the highlighted missing entry is estimated from row q_i and column p_x, e.g. r̂_xi = 2.4]
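The estimate r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf is just a dot product. A tiny sketch with illustrative f = 3 factor vectors (the numbers are not from any real model):

```python
# Predicting a missing rating from latent factors: r̂_xi = q_i · p_x = Σ_f q_if · p_xf.

def predict(q_i, p_x):
    """Dot product of item factors q_i (row i of Q) and user factors p_x (column x of P^T)."""
    return sum(qf * pf for qf, pf in zip(q_i, p_x))

q_i = [1.2, -0.4, 0.5]   # item i's position in the f-dimensional latent space
p_x = [1.0,  0.5, 1.0]   # user x's position in the same space

print(predict(q_i, p_x))  # 1.2 - 0.2 + 0.5 = 1.5
```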


RECSYS 57

Latent Factor Models

[Figure: the same movies plotted in the latent space spanned by Factor 1 (geared towards females ↔ geared towards males) and Factor 2 (serious ↔ funny)]


RECSYS 59

Recap SVD

Remember SVD:
- A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values
- SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_ij (A_ij − [U Σ V^T]_ij)²
- So, in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T
- But we are not done yet! The sum goes over all entries, and our R has missing entries

[Figure: A (m × n) ≈ U Σ V^T]

RECSYS 60

Latent Factor Models

- SVD isn't defined when entries are missing!
- Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²
- Note:
  - We don't require the columns of P, Q to be orthogonal/unit length
  - P, Q map users/movies to a latent space
  - The most popular model among Netflix contestants

[Figure: R (items × users) ≈ Q (items × f factors) · P^T (f factors × users), r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries: we want to minimize SSE for unseen test data

- Idea: Minimize SSE on training data
  - We want a large f (# of factors) to capture all the signals
  - But SSE on test data begins to rise for f > 2
- Regularization is needed to avoid overfitting:
  - Allow a rich model where there are sufficient data
  - Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

The first sum is the "error" term, the second the "length" term; λ is the regularization parameter.
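The regularized objective above can be evaluated directly. A minimal sketch with toy factor matrices (all numbers illustrative):

```python
# Regularized objective for matrix factorization:
#   Σ_{(x,i)∈training} (r_xi - q_i·p_x)²  +  λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

def objective(ratings, Q, P, lam):
    """ratings: dict (user, item) -> r_xi; Q: item -> factor list; P: user -> factor list."""
    err = sum(
        (r - sum(qf * pf for qf, pf in zip(Q[i], P[x]))) ** 2   # "error" term
        for (x, i), r in ratings.items()
    )
    length = (sum(sum(v * v for v in vec) for vec in P.values())  # "length" term
              + sum(sum(v * v for v in vec) for vec in Q.values()))
    return err + lam * length

# Toy example with f = 1 factor: both ratings are fit exactly,
# so only the regularization term contributes.
ratings = {("u1", "m1"): 4.0, ("u1", "m2"): 1.0}
Q = {"m1": [2.0], "m2": [0.5]}
P = {"u1": [2.0]}
print(objective(ratings, Q, P, lam=0.1))  # 0 + 0.1 * (4 + 4 + 0.25) = 0.825
```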

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: movies and users (e.g., Gus, Dave) embedded in the same latent space — serious ↔ escapist vs. geared towards females ↔ geared towards males; recommend to a user the movies nearest to him in this space]


RECSYS 64

The Effect of Regularization

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

i.e., min over factors of "error" + λ·"length"

[Figure: movies in the latent space (Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny); regularization pulls sparsely-rated movies such as The Princess Diaries toward the origin]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  - Here q_if is entry f of row q_i of matrix Q
- How to compute the gradient of a matrix? Compute the gradient of every element independently!
- Observation: Computing gradients is slow

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
- Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
  - Here q_if is entry f of row q_i of matrix Q
- GD: Q ← Q − η · Σ_{(x,i)} ∇Q(r_xi)
- SGD: Q ← Q − μ · ∇Q(r_xi)
- Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
- SGD gives faster convergence!
  - It needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step]

- GD improves the value of the objective function at every step
- SGD improves the value too, but in a "noisy" way
- GD takes fewer steps to converge, but each step takes much longer to compute
- In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  - ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
  - q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  - p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
- 2 for loops:
  - For until convergence:
    - For each r_xi: compute gradient, do a "step"

η … learning rate
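The two nested loops above translate almost line-for-line into code. A minimal sketch: initialization here uses small random values rather than the SVD warm start the slide suggests, and all hyperparameters (η, λ, number of epochs) are illustrative.

```python
import random

# Matrix-factorization SGD following the update equations above:
#   eps_xi = r_xi - q_i · p_x
#   q_i <- q_i + eta * (eps_xi * p_x - lam * q_i)
#   p_x <- p_x + eta * (eps_xi * q_i - lam * p_x)

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:                                   # iterate over known ratings
            eps = r - sum(Q[i][k] * P[x][k] for k in range(f))    # error on this rating
            for k in range(f):                                    # step both factor vectors
                qk, pk = Q[i][k], P[x][k]
                Q[i][k] = qk + eta * (eps * pk - lam * qk)
                P[x][k] = pk + eta * (eps * qk - lam * pk)
    return P, Q

# Toy data: 2 users x 2 items; user 0 loves item 0, user 1 loves item 1.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=2)
pred_00 = sum(Q[0][k] * P[0][k] for k in range(2))
print(round(pred_00, 2))  # approaches the true rating 5.0 (regularization shrinks it slightly)
```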

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

- Goal: Make good recommendations
  - Quantify goodness using SSE: lower SSE means better recommendations
  - We want to make good recommendations on items that some user has not yet seen
- Let's set the values of P and Q such that they work well on the known (user, item) ratings
- And hope these values of P and Q will predict the unknown ratings well
- This is a case where we apply Optimization methods

[Figure: utility matrix with the known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
- Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
- Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples from the training data and the test data]

Movie rating data (plots from TimelyDevelopment.com)

Overall rating distribution
- A third of ratings are 4s
- Average rating is 3.68

#ratings per movie
- Avg #ratings/movie: 5627

#ratings per user
- Avg #ratings/user: 208

Average movie rating by movie count
- More ratings go to better movies

Most loved movies

Count   Avg rating  Title
137812  4.593       The Shawshank Redemption
133597  4.545       Lord of the Rings: The Return of the King
180883  4.306       The Green Mile
150676  4.460       Lord of the Rings: The Two Towers
139050  4.415       Finding Nemo
117456  4.504       Raiders of the Lost Ark
180736  4.299       Forrest Gump
147932  4.433       Lord of the Rings: The Fellowship of the Ring
149199  4.325       The Sixth Sense
144027  4.333       Indiana Jones and the Last Crusade

Challenges

- Size of data
  - Scalability
  - Keeping data in memory
- Missing data
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g., movie #16322)
- Use an ensemble of complementing predictors
- Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction)

- Baseline predictor
  - Separates users and movies
  - Benefits from insights into users' behavior
  - Among the main practical contributions of the competition
- User-movie interaction
  - Characterizes the matching between users and movies
  - Attracts most research in the field
  - Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
- Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
- Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
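The worked example translates directly into code (the numbers come from the slide; the `predict` helper is ours):

```python
# Baseline predictor plus an optional user-movie interaction term:
#   r̂_xi = μ + b_x + b_i (+ q_i·p_x, zero in this example)

def predict(mu, b_x, b_i, interaction=0.0):
    return mu + b_x + b_i + interaction

mu  = 3.7   # overall mean rating
b_x = -1.0  # a critical reviewer: rates 1 star below the mean
b_i = +0.5  # Star Wars: rated 0.5 above the average movie

print(predict(mu, b_x, b_i))  # 3.2
```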

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²        (goodness of fit)
        + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )      (regularization)

- Stochastic gradient descent to find the parameters
- Note: both the biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them)
- λ is selected via grid-search on a validation set
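For this extended model, each SGD step updates the biases with the same error signal as the factors. A sketch of a single step (the bias updates follow the same pattern as the factor updates earlier; the η and λ values are illustrative):

```python
# One SGD step for the extended model r̂_xi = μ + b_x + b_i + q_i·p_x.

def sgd_step(r_xi, mu, b_x, b_i, q_i, p_x, eta=0.01, lam=0.1):
    pred = mu + b_x + b_i + sum(q * p for q, p in zip(q_i, p_x))
    eps = r_xi - pred                                   # error on this rating
    b_x_new = b_x + eta * (eps - lam * b_x)             # user bias update
    b_i_new = b_i + eta * (eps - lam * b_i)             # movie bias update
    q_new = [q + eta * (eps * p - lam * q) for q, p in zip(q_i, p_x)]
    p_new = [p + eta * (eps * q - lam * p) for q, p in zip(q_i, p_x)]
    return b_x_new, b_i_new, q_new, p_new

# A 5-star rating against a near-mean prediction nudges both biases upward.
b_x, b_i, q_i, p_x = sgd_step(5.0, mu=3.7, b_x=0.0, b_i=0.0, q_i=[0.1], p_x=[0.1])
print(b_x, b_i)
```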

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine a top-level, "regional" modeling of the data with a refined, local view:
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

- Sudden rise in the average movie rating (early 2004)
  - Improvements in the Netflix GUI
  - Meaning of rating changed
- Movie age
  - Users prefer new movies without any reasons
  - Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

- Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
- Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
- Make the parameters b_x and b_i depend on time:
  (1) Parameterize the time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks
- Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875–0.92) vs. millions of parameters (1–10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors (1–50); the RMSE falls from ≈0.889 with 1 predictor to ≈0.871 with 50]

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

- Ensemble team formed
  - Group of other teams on the leaderboard forms a new team
  - Relies on combining their models
  - Quickly also gets a qualifying score over 10%
- BellKor
  - Continue to get small improvements in their scores
  - Realize that they are in direct competition with Ensemble
- Strategy
  - Both teams carefully monitor the leaderboard
  - The only sure way to check for improvement is to submit a set of predictions
  - This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

- Submissions limited to 1 a day; only 1 final submission could be made in the last 24h
- 24 hours before the deadline…
  - A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's
- Frantic last 24 hours for both teams
  - Much computer time on final optimization
  - Carefully calibrated to end about an hour before the deadline
- Final submissions
  - BellKor submits a little early (on purpose), 40 mins before the deadline
  - Ensemble submits their final entry 20 mins later
  - …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
- Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com

Page 38: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 40

Hybrid Methods Implement two or more different recommenders and

combine predictions Perhaps using a linear model

Add content-based methods to collaborative filtering Item profiles for new item problem Demographics to deal with new user problem

1162019

RECSYS 41

- Evaluation

- Error metrics

- Complexity Speed

RECSYS 42

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

movies

users

1162019

RECSYS 43

Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

users

movies

1162019

RECSYS 44

Evaluating Predictions Compare predictions with known ratings

Root-mean-square error (RMSE)bull where is predicted is the true rating of x on i

Precision at top 10 bull of those in top 10

Rank Correlation bull Spearmanrsquos correlation between systemrsquos and userrsquos complete

rankings Another approach 01 model

Coveragebull Number of itemsusers for which system can make predictions

Precision = TP (TP + FP) Accuracy = (TP+TN) (TP + FP + TN + FN) Receiver Operating characteristic (ROC) Curve

Y-axis True Positive Rates (TPR) X-axis False Positive Rates (FPR)bull TPR (aka Recall) = TP P = TP (TP+FN) bull FPR = FP N = FP (FP + TN)bull See httpsenwikipediaorgwikiPrecision and recall

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD: A = U Σ V^T, where A is the input data matrix, U the left singular vectors, V the right singular vectors, and Σ the singular values.

SVD gives the minimum reconstruction error (SSE):

    min_{U,Σ,V} Σ_{ij} (A_ij − [U Σ V^T]_ij)^2

So, in our case, "SVD" on the Netflix data R ≈ Q · P^T amounts to A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T.

But we are not done yet: the SSE sum goes over all entries, and our R has missing entries!

RECSYS 60
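The recap above can be tried numerically. A minimal sketch (NumPy assumed; the small fully-observed matrix is hypothetical toy data, since R's missing entries are handled only later):

```python
import numpy as np

# Toy fully-observed matrix (hypothetical data; no missing entries yet)
A = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.],
              [1., 2., 4.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = 2                                          # number of latent factors kept
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]    # best rank-f approximation

# SSE of the truncated reconstruction; minimal over all rank-f matrices
sse = np.sum((A - A_f) ** 2)
```

By the Eckart-Young theorem, no other rank-f matrix achieves a smaller SSE than this truncation.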

Latent Factor Models

SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:

    min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)^2

Note:
• We don't require the columns of P, Q to be orthogonal or unit length
• P, Q map users/movies to a latent space
• This was the most popular model among Netflix contestants

    r̂_xi = q_i · p_x^T   [R (items × users) ≈ Q (items × f factors) · P^T (f factors × users)]

RECSYS 61

Dealing with Missing Entries
Want to minimize SSE for unseen test data.
Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.
Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

    min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)^2 + λ (Σ_x ||p_x||^2 + Σ_i ||q_i||^2)

λ … regularization parameter; the first term is the "error", the second the "length"
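As a sketch (NumPy assumed; the utility matrix and factor values are hypothetical toy data), the regularized objective, "error" on the known ratings plus λ times the "length" of the factors, can be evaluated like this:

```python
import numpy as np

# Toy utility matrix; 0 marks a missing rating (hypothetical data)
R = np.array([[1., 3., 4., 0.],
              [3., 5., 5., 0.],
              [4., 0., 5., 3.]])
known = R > 0                           # mask of observed ratings

rng = np.random.default_rng(0)
f, lam = 2, 0.1
Q = rng.normal(size=(R.shape[0], f))    # item factors, rows q_i
P = rng.normal(size=(R.shape[1], f))    # user factors, rows p_x

error = np.sum(((R - Q @ P.T) ** 2)[known])       # SSE over known ratings only
length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))  # regularization term
objective = error + length
```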

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
[Figure: movies and two users, Gus and Dave, mapped into the same 2-D latent space ("geared towards females" vs. "geared towards males"; "serious" vs. "escapist"): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 63


The Effect of Regularization
[Figure: movies and users in the latent space (Factor 1: "geared towards females" vs. "geared towards males"; Factor 2: "serious" vs. "funny"); as λ grows, points backed by little data are shrunk toward the origin]

    min_{factors} "error" + λ "length" = min_{P,Q} Σ_{training} (r_xi − q_i p_x^T)^2 + λ (Σ_x ||p_x||^2 + Σ_i ||q_i||^2)


RECSYS 68

Use Gradient Descent to search for the optimal settings
Want to find matrices P and Q:

    min_{P,Q} Σ_{training} (r_xi − q_i p_x^T)^2 + λ (Σ_x ||p_x||^2 + Σ_i ||q_i||^2)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], with ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
  (here q_if is entry f of row q_i of matrix Q)
How to compute the gradient of a matrix? Compute the gradient of every element independently!
Observation: Computing gradients is slow!

RECSYS 69
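The batch gradient from the slide can be sketched as follows (NumPy assumed; `grad_Q` and `grad_P` are hypothetical helper names, and the masking convention is an assumption):

```python
import numpy as np

def grad_Q(R, known, Q, P, lam):
    """Batch gradient of the regularized SSE w.r.t. Q:
    grad q_if = sum over known (x,i) of -2 (r_xi - q_i . p_x) p_xf + 2 lam q_if."""
    E = np.where(known, R - Q @ P.T, 0.0)   # residuals, zero where rating is missing
    return -2.0 * E @ P + 2.0 * lam * Q     # one row per item, one column per factor

def grad_P(R, known, Q, P, lam):
    E = np.where(known, R - Q @ P.T, 0.0)
    return -2.0 * E.T @ Q + 2.0 * lam * P

# One gradient-descent step would then be:
#   Q = Q - eta * grad_Q(R, known, Q, P, lam)
#   P = P - eta * grad_P(R, known, Q, P, lam)
```

Note that each call touches every known rating, which is exactly why computing these full gradients is slow.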

Digression to the lecture notes of Regression and Gradient Descent
by Andrew Ng's Machine Learning course from Coursera

RECSYS 70


Stochastic Gradient Descent
Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where

    ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)

(here q_if is entry f of row q_i of matrix Q)

GD:  Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − μ ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.
Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD
Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step; GD decreases smoothly, SGD decreases noisily]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent
Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
• 2 for loops: for until convergence, for each r_xi: compute the gradient, do a "step"
η … learning rate
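The two-loop procedure above can be sketched as a tiny training loop (NumPy assumed; the rating triples, sizes, and hyperparameter values are hypothetical toy choices):

```python
import numpy as np

# (item i, user x, rating r_xi) triples; hypothetical toy data
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_items, n_users, f = 3, 3, 2
rng = np.random.default_rng(0)
Q = 0.1 * rng.normal(size=(n_items, f))   # item factors
P = 0.1 * rng.normal(size=(n_users, f))   # user factors
eta, lam = 0.02, 0.01                     # learning rate, regularization

def sse():
    return sum((r - Q[i] @ P[x]) ** 2 for i, x, r in ratings)

sse_before = sse()
for epoch in range(300):                  # iterate over the ratings multiple times
    for i, x, r in ratings:
        eps = r - Q[i] @ P[x]             # derivative of the "error"
        qi = Q[i].copy()                  # use old q_i in both updates
        Q[i] += eta * (eps * P[x] - lam * Q[i])   # update equations
        P[x] += eta * (eps * qi - lam * P[x])
sse_after = sse()
```

Each inner step touches a single rating, which is what makes an SGD step so much cheaper than a full-gradient step.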

RECSYS 75 | Koren, Bell, Volinsky: IEEE Computer, 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
We want to make good recommendations on items that some user has not yet seen.
Let's set the values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply Optimization methods!

[Utility matrix figure: users × movies with the known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
[Tables: training data and test data as (user, movie, score) triples]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg # ratings/movie: 5,627

# ratings per user
• Avg # ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies
Count    Avg rating  Title
137,812  4.593  The Shawshank Redemption
133,597  4.545  Lord of the Rings: The Return of the King
180,883  4.306  The Green Mile
150,676  4.460  Lord of the Rings: The Two Towers
139,050  4.415  Finding Nemo
117,456  4.504  Raiders of the Lost Ark
180,736  4.299  Forrest Gump
147,932  4.433  Lord of the Rings: The Fellowship of the Ring
149,199  4.325  The Sixth Sense
144,027  4.333  Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

    r̂_xi = μ + b_x + b_i + q_i · p_x^T

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i; q_i · p_x^T is the user-movie interaction

Baseline predictor (μ + b_x + b_i)
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

User-movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor
We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
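The worked example, transcribed directly as code (values from the slide; the variable names are for illustration only):

```python
# Baseline prediction r_hat = mu + b_x + b_i, with the slide's values
mu = 3.7     # overall mean rating
b_x = -1.0   # user bias: a critical reviewer
b_i = 0.5    # movie bias: Star Wars rates above the average movie
prediction = mu + b_x + b_i
```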

1162019

RECSYS 105

Fitting the New Model
Solve:

    min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i p_x^T))^2      ["goodness of fit"]
              + λ (Σ_i ||q_i||^2 + Σ_x ||p_x||^2 + Σ_x b_x^2 + Σ_i b_i^2)   [regularization]

λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
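A hedged sketch of the SGD updates when the biases are parameters too (NumPy assumed; the toy ratings are hypothetical, and sharing one λ across all terms is an assumption, since in practice each term may get its own regularization weight):

```python
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_items, n_users, f = 3, 3, 2
mu = np.mean([r for _, _, r in ratings])   # overall mean rating
rng = np.random.default_rng(1)
Q = 0.1 * rng.normal(size=(n_items, f))
P = 0.1 * rng.normal(size=(n_users, f))
b_i = np.zeros(n_items)                    # movie biases
b_x = np.zeros(n_users)                    # user biases
eta, lam = 0.01, 0.02

def sse():
    return sum((r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])) ** 2
               for i, x, r in ratings)

sse_before = sse()
for epoch in range(300):
    for i, x, r in ratings:
        eps = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])   # residual
        b_x[x] += eta * (eps - lam * b_x[x])             # bias updates
        b_i[i] += eta * (eps - lam * b_i[i])
        qi = Q[i].copy()
        Q[i] += eta * (eps * P[x] - lam * Q[i])          # factor updates
        P[x] += eta * (eps * qi - lam * P[x])
sse_after = sse()
```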

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods
[Plot: RMSE (roughly 0.885-0.92) vs. millions of parameters (1-1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533 · User average: 1.0651 · Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89

1162019

RECSYS 109

Temporal Biases Of Users
• Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed
• Movie age: users prefer new movies without any reasons, while older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors
Original model: r̂_xi = μ + b_x + b_i + q_i · p_x^T
Add time dependence to the biases: r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks
Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
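One way to sketch the binned time-dependence of the movie bias (the 10-week bin width is from the slide; the function and argument names are hypothetical illustrations, not the paper's actual parameterization):

```python
def movie_bias_at(b_i_static, b_i_bins, i, week):
    """Time-dependent movie bias b_i(t): a static part plus a
    per-bin offset, where each bin covers 10 consecutive weeks."""
    return b_i_static[i] + b_i_bins[i][week // 10]
```

All the per-bin offsets become extra parameters, fit by the same SGD as the static biases.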

RECSYS 111

Adding Temporal Effects
[Plot: RMSE (roughly 0.875-0.92) vs. millions of parameters (1-10000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533 · User average: 1.0651 · Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size
[Plot: error (RMSE) vs. number of predictors, 1 to 50; RMSE falls from about 0.889 with one predictor to about 0.871 with 50, with rapidly diminishing returns]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%
BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble
Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions; this alerts the other team of your latest score!

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h
24 hours before deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble posts a score slightly better than BellKor's
Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline
Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com/

1162019


RECSYS 41

- Evaluation

- Error metrics

- Complexity / Speed

RECSYS 42

Evaluation

[Utility matrix: movies × users, sparsely filled with known ratings 1-5]

1162019

RECSYS 43

Evaluation

[The same utility matrix, with some known entries withheld as the Test Data Set]

1162019

RECSYS 44

Evaluating Predictions
Compare predictions with known ratings:
• Root-mean-square error (RMSE): sqrt( Σ_{(x,i)} (r̂_xi − r_xi)^2 / N ), where r̂_xi is the predicted and r_xi the true rating of x on i
• Precision at top 10: % of relevant items among those in the top 10
• Rank correlation: Spearman's correlation between the system's and the user's complete rankings
Another approach: 0/1 model
• Coverage: number of items/users for which the system can make predictions
• Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Receiver Operating Characteristic (ROC) curve: y-axis = true positive rate (TPR), x-axis = false positive rate (FPR)
  – TPR (aka Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
  – See https://en.wikipedia.org/wiki/Precision_and_recall
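The RMSE metric above as a small sketch (NumPy assumed; `rmse` is a hypothetical helper name):

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))
```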

RECSYS 45

Problems with Error Measures Narrow focus on accuracy sometimes

misses the point Prediction Diversity Prediction Context Order of predictions

In practice we care only to predict high ratings RMSE might penalize a method that does well

for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naïve pre-computation takes time O(N · |C|); |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Don't try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)
[Figure: movies placed in a 2-D latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny". Shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]
RECSYS 50 | Koren, Bell, Volinsky: IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R
[Matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings 1-5]

RECSYS 52

Utility Matrix R: Evaluation
[Matrix R: 480,000 users × 17,700 movies; the known ratings form the training data set, and some withheld ratings (e.g. r_36) form the test data set]

    SSE = Σ_{(x,i)∈Test} (r̂_xi − r_xi)^2

where r̂_xi is the predicted rating and r_xi the true rating of user x on item i.

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T
For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T.
R has missing entries, but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones
[R (items × users) ≈ Q (items × f factors) · P^T (f factors × users); cf. SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
• Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
• Observation: computing the gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
• Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
• Observation: computing the gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently!
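The batch update above can be sketched in a few lines of numpy. This is an illustrative toy sketch, not the contestants' code: the helper names (batch_gd_step, objective) and the tiny rating matrix are made up for the example. A zero entry of R stands for a missing rating, and the mask restricts the error to known entries:

```python
import numpy as np

def batch_gd_step(R, mask, P, Q, eta=0.01, lam=0.1):
    """One batch gradient step for min sum_known (r_xi - p_x.q_i)^2 + lam(|P|^2+|Q|^2).
    R: users x items ratings, mask: 1 where known. P: users x f, Q: items x f."""
    E = mask * (R - P @ Q.T)            # errors e_xi on known ratings only
    gradP = -2 * E @ Q + 2 * lam * P    # gradient of every element of P independently
    gradQ = -2 * E.T @ P + 2 * lam * Q  # same for Q
    return P - eta * gradP, Q - eta * gradQ

def objective(R, mask, P, Q, lam=0.1):
    """Regularized SSE ("error" + lam * "length")."""
    return float(np.sum((mask * (R - P @ Q.T)) ** 2)
                 + lam * (np.sum(P ** 2) + np.sum(Q ** 2)))

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [1.0, 1.0, 5.0]])
mask = (R > 0).astype(float)            # 0 entries are treated as missing
P = 0.1 * rng.standard_normal((3, 2))   # f = 2 factors
Q = 0.1 * rng.standard_normal((3, 2))
before = objective(R, mask, P, Q)
for _ in range(200):                    # each step touches every known rating
    P, Q = batch_gd_step(R, mask, P, Q)
after = objective(R, mask, P, Q)
```

Each step here scans all known ratings, which is exactly the "computing gradients is slow" observation that motivates stochastic updates.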

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q
GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
SGD: Q ← Q − η·∇Q(r_xi)
Faster convergence!
• Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD
[Plot: value of the objective function vs. iteration/step, for GD and for SGD]
GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (e.g., using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i·p_x^T (derivative of the "error")
  • q_i ← q_i + η·(ε_xi·p_x − λ·q_i) (update equation)
  • p_x ← p_x + η·(ε_xi·q_i − λ·p_x) (update equation)
• 2 for loops:
  • For until convergence:
    • For each r_xi: compute gradient, do a "step"
(η … learning rate)
RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set the values of P and Q such that they work well on the known (user, item) ratings
• And hope these values of P and Q will predict the unknown ratings well

This is a case where we apply Optimization methods!

[Utility matrix of known ratings (users × movies)]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data:
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data:
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition:
  – 2,700+ teams
  – $1 million grand prize for a 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for an 8.43% improvement

[Tables: training data (user, movie, score) and test data (user, movie, score withheld)]

Movie rating data

Overall rating distribution:
• A third of the ratings are 4s
• The average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie:
• Avg #ratings/movie: 5,627

#ratings per user:
• Avg #ratings/user: 208

Average movie rating by movie count:
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie #16,322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i·p_x^T
(user bias + movie bias + user–movie interaction)
μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

User–movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
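The arithmetic above is just a sum of three terms; as a tiny sketch (the function name baseline is made up for illustration):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor b_xi = mu + b_x + b_i (no interaction term yet)."""
    return mu + b_x + b_i

# The slide's example: overall mean 3.7, critical reviewer, Star Wars
rating = baseline(3.7, -1.0, +0.5)
print(round(rating, 2))  # 3.2
```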


RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )² + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )
(goodness of fit + regularization)

Stochastic gradient descent to find the parameters
• Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
• λ is selected via grid-search on a validation set
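One SGD step for this biased model could be sketched as follows. This is a toy illustration, not the BellKor code: the function name sgd_bias_step, the learning rate, and the single repeated rating are made up, and with zero-initialized factor vectors only the biases move, so the residual converges to a small λ-shrunk value:

```python
import numpy as np

def sgd_bias_step(r, mu, bx, bi, px, qi, eta=0.05, lam=0.1):
    """Single-rating SGD update for r_xi ~ mu + b_x + b_i + q_i . p_x.
    Biases and factors are all parameters, so all of them get updated."""
    err = r - (mu + bx + bi + qi @ px)   # prediction error on this rating
    bx = bx + eta * (err - lam * bx)     # user bias update
    bi = bi + eta * (err - lam * bi)     # movie bias update
    # factor updates (right-hand sides use the old px, qi values)
    px, qi = px + eta * (err * qi - lam * px), qi + eta * (err * px - lam * qi)
    return bx, bi, px, qi

# Repeatedly applying the step to one rating drives the error to a small residual:
bx, bi = 0.0, 0.0
px, qi = np.zeros(2), np.zeros(2)
for _ in range(500):
    bx, bi, px, qi = sgd_bias_step(4.5, 3.7, bx, bi, px, qi)
err = 4.5 - (3.7 + bx + bi + qi @ px)
```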

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine a top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885–0.92) vs. millions of parameters (1–1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Performance of Various Methods
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any particular reason
• Older movies are just inherently better than newer ones

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 110

Temporal Biases & Factors

Original model:
r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to the biases:
r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
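The binned, time-dependent movie bias b_i(t) can be sketched minimally as below, assuming one learned bias per bin of 10 consecutive weeks as on the slide; the function name and the per-bin values are made up for illustration:

```python
def movie_bias_at(bias_bins, week, weeks_per_bin=10):
    """Time-dependent movie bias b_i(t): one bias per bin of
    `weeks_per_bin` consecutive weeks (weeks past the last bin clamp to it)."""
    idx = min(week // weeks_per_bin, len(bias_bins) - 1)
    return bias_bins[idx]

bins = [0.10, 0.30, 0.50]          # hypothetical learned biases for one movie
print(movie_bias_at(bins, 5))      # weeks 0-9   -> 0.1
print(movie_bias_at(bins, 25))     # weeks 20-29 -> 0.5
```

The bin biases themselves would be fit by the same SGD as the other parameters; a linear trend per bias is the alternative parameterization the slide mentions.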

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875–0.92) vs. millions of parameters (1–10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Performance of Various Methods
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate… Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors (1–50) — RMSE falls from ≈0.889 with 1 predictor to ≈0.871 with 50, with diminishing returns]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• … and everyone waits …

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

[Chart data — ensemble size vs. RMSE: 1 predictor → 0.8891; 10 → 0.8742; 25 → 0.8722; 50 → 0.8712]

RECSYS 42

Evaluation

[Utility matrix: users × movies, with the known ratings filled in]

RECSYS 43

Evaluation

[Utility matrix: users × movies — some known ratings are held out as the test data set]

RECSYS 44

Evaluating Predictions

Compare predictions with known ratings:
• Root-mean-square error (RMSE): sqrt( Σ_xi (r̂_xi − r_xi)² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
• Precision at top 10:
  • % of relevant items among the top 10
• Rank correlation:
  • Spearman's correlation between the system's and the user's complete rankings
• Another approach: 0/1 model
  • Coverage: number of items/users for which the system can make predictions
  • Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
  • Receiver Operating Characteristic (ROC) curve:
    • Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
    • TPR (aka Recall) = TP / P = TP / (TP + FN)
    • FPR = FP / N = FP / (FP + TN)
    • See https://en.wikipedia.org/wiki/Precision_and_recall
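RMSE as defined above is easy to compute from (predicted, true) pairs; a minimal sketch with made-up toy numbers:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

pairs = [(3.5, 4.0), (2.0, 1.0), (5.0, 5.0)]   # (predicted, true)
print(round(rmse(pairs), 3))  # sqrt(1.25 / 3) = 0.645
```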

RECSYS 45

Problems with Error Measures

Narrow focus on accuracy sometimes misses the point:
• Prediction diversity
• Prediction context
• Order of predictions

In practice, we care only to predict high ratings:
• RMSE might penalize a method that does well for high ratings and badly for others

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding the k most similar customers: O(|X|)
• Recall that X = set of customers in the system

Too expensive to do at runtime:
• Could pre-compute, using clustering as an approximation
• Naïve pre-computation takes time O(N·|C|)
  • |C| = # of clusters (= k in k-means), N = # of data points

We already know how to do this!
• Near-neighbor search in high dimensions (LSH)
• Clustering
• Dimensionality reduction

RECSYS 47

Tip: Add Data

Leverage all the data:
• Don't try to reduce data size in an effort to make fancy algorithms work
• Simple methods on large data do best

Add more data:
• e.g., add IMDB data on genres

More data beats better algorithms!
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Scatter plot: movies embedded in a 2-D latent space — Geared towards females ↔ Geared towards males; Serious ↔ Funny. Movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings 1–5]

RECSYS 52

Utility Matrix R: Evaluation

[Matrix R: 480,000 users × 17,700 movies, split into a training data set and a test data set; r̂₃₆ marks a predicted rating]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
(r̂_xi: predicted rating; r_xi: true rating of user x on item i)

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT
• For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
• R has missing entries, but let's ignore that for now!
  • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (users × items, with known ratings) ≈ Q (items × f factors) · PT (f factors × users)]
SVD: A = U Σ VT

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
where q_i = row i of Q, p_x = column x of PT

[Figure: the missing entry of R is estimated from the corresponding row of Q and column of PT]


RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf = 2.4 (for the highlighted entry)
where q_i = row i of Q, p_x = column x of PT
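The estimate r̂_xi is just a dot product over the f factors. A minimal sketch with made-up 3-factor vectors (not the values from the slide's matrices):

```python
import numpy as np

q_i = np.array([1.0, 0.3, -2.0])   # hypothetical row i of Q (item factors)
p_x = np.array([0.5, 2.0, -0.5])   # hypothetical column x of PT (user factors)

r_hat = float(q_i @ p_x)           # sum_f q_if * p_xf = 0.5 + 0.6 + 1.0
print(round(r_hat, 1))             # 2.1
```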

RECSYS 57

Latent Factor Models

[Scatter plot: movies embedded in the latent space — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny. Movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]


RECSYS 59

Recap: SVD

Remember SVD:
• A: input data matrix (m × n)
• U: left singular vectors
• V: right singular vectors
• Σ: singular values
A ≈ U Σ VT

SVD gives the minimum reconstruction error (SSE):
min_{U,Σ,V} Σ_ij (A_ij − [U Σ VT]_ij)²

So, in our case, "SVD" on Netflix data: R ≈ Q · PT
• A = R, Q = U, PT = Σ VT
• r̂_xi = q_i · p_x^T
But we are not done yet! R has missing entries.
• The sum goes over all entries — but our R has missing entries.
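numpy's SVD illustrates the recap on a toy matrix (the matrix is made up; the rank-1 truncation is the SSE-optimal rank-1 approximation):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, s, Vt = np.linalg.svd(A)                # A = U diag(s) Vt; s are singular values
A_full = U @ np.diag(s) @ Vt               # exact reconstruction
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])  # best rank-1 approximation (min SSE)
```

For this symmetric A the singular values are 4 and 2, and the rank-1 truncation is [[2, 2], [2, 2]], with SSE equal to the dropped singular value squared.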

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_x^T)²
r̂_xi = q_i·p_x^T
Note:
• We don't require the cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R (users × items) ≈ Q (items × f factors) · PT (f factors × users)]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for the unseen test data
Idea: Minimize SSE on the training data
• Want a large f (# of factors) to capture all the signals
• But, SSE on the test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
(λ … regularization parameter; "error" + λ·"length")

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Scatter plot: movies and users (Gus, Dave) embedded in the latent space — geared towards females ↔ geared towards males; serious ↔ escapist]


RECSYS 64

The Effect of Regularization

[Scatter plot: movies placed by their latent factors — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]

min over factors of "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples; training data with scores, test data with scores withheld]

Movie rating data

Overall rating distribution

• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopmentcom

# ratings per movie
• Avg # ratings/movie: 5,627

# ratings per user
• Avg # ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

Count   Avg rating   Title
137,812   4.593   The Shawshank Redemption
133,597   4.545   Lord of the Rings: The Return of the King
180,883   4.306   The Green Mile
150,676   4.460   Lord of the Rings: The Two Towers
139,050   4.415   Finding Nemo
117,456   4.504   Raiders of the Lost Ark
180,736   4.299   Forrest Gump
147,932   4.433   Lord of the Rings: The Fellowship of the Ring
149,199   4.325   The Sixth Sense
144,027   4.333   Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie #16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T
 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction

User-Movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
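The baseline arithmetic above is trivial to check. A one-line sketch, using the example's assumed numbers (μ = 3.7, b_x = −1, b_i = +0.5):

```python
# Baseline prediction r_xi = mu + b_x + b_i, with the slide's example values.
def baseline_prediction(mu, b_x, b_i):
    return mu + b_x + b_i

pred = baseline_prediction(mu=3.7, b_x=-1.0, b_i=0.5)  # 3.7 - 1 + 0.5 = 3.2
```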

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: Both biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )² + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )
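One SGD sweep for this biased model can be sketched as follows. This is a hedged toy illustration: the data, μ, and all hyperparameters are assumed values, and the bias updates mirror the factor updates.

```python
import numpy as np

# Sketch of SGD for the biased model r_xi = mu + b_x + b_i + q_i . p_x^T;
# toy ratings and hyperparameters are assumptions for illustration.
mu = 3.0                                     # overall mean rating
ratings = [(0, 0, 5.0), (0, 1, 2.0), (1, 0, 4.0)]
n_users, n_items, f = 2, 2, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))
b_user = np.zeros(n_users)
b_item = np.zeros(n_items)
eta, lam = 0.02, 0.05

def predict(x, i):
    return mu + b_user[x] + b_item[i] + Q[i] @ P[x]

sse_before = sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
for _ in range(200):
    for x, i, r in ratings:
        err = r - predict(x, i)
        b_user[x] += eta * (err - lam * b_user[x])   # update user bias
        b_item[i] += eta * (err - lam * b_item[i])   # update item bias
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])
sse_after = sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
```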

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!
Multi-scale modeling of the data: Combine top-level, "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors+Biases: 0.89
Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 • Improvements in Netflix
 • GUI improvements
 • Meaning of rating changed
Movie age:
 • Users prefer new movies without any reasons
 • Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
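The binned time-dependent bias can be sketched as a static bias plus a per-bin offset. A minimal sketch, assuming 10-week (70-day) bins as on the slide; the bias values themselves are made up:

```python
# Binned time-dependent item bias b_i(t): static part plus one offset per
# 10-week bin; bin offsets here are illustrative assumptions.
BIN_DAYS = 70                          # 10 consecutive weeks

def bin_index(day):
    return day // BIN_DAYS

def item_bias(b_i_static, b_i_bins, day):
    return b_i_static + b_i_bins[bin_index(day)]

b_bins = [0.0, 0.1, -0.05]             # one learned offset per bin (toy values)
val = item_bias(0.5, b_bins, day=75)   # day 75 falls in bin 1
```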

RECSYS 111

Adding Temporal Effects

1162019

[Figure: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors+Biases: 0.89
Collaborative filtering++: 0.91
Latent factors+Biases+Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: Effect of ensemble size; error (RMSE) vs. number of predictors (1 to 50). RMSE falls from about 0.889 with one predictor to about 0.871 with fifty]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%
BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble
Strategy:
 Both teams carefully monitor the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h
24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's
Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline
Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 … and everyone waits …

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019


RECSYS 43

Evaluation

[Figure: utility matrix (users × movies); some known ratings are held out as the test data set]

1162019

RECSYS 44

Evaluating Predictions: Compare predictions with known ratings.
 Root-mean-square error (RMSE)
 • RMSE = sqrt( Σ_{(x,i)} ( r̂_xi − r_xi )² / N ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
 Precision at top 10
 • % of relevant items in the top 10
 Rank Correlation
 • Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
 Coverage
 • Number of items/users for which the system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) Curve
 Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
 • TPR (aka Recall) = TP / P = TP / (TP + FN)
 • FPR = FP / N = FP / (FP + TN)
 • See https://en.wikipedia.org/wiki/Precision_and_recall
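The two headline metrics above are easy to compute directly. A small sketch (function names and toy data are assumptions, not from the slides):

```python
import math

# RMSE over (prediction, truth) pairs, and precision-at-k over a ranked list.
def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def precision_at_k(ranked_items, relevant, k=10):
    return sum(1 for it in ranked_items[:k] if it in relevant) / k

r = rmse([3.0, 4.0], [3.0, 5.0])                        # sqrt(1/2)
p = precision_at_k(list(range(10)), {0, 3, 99}, k=10)   # 2 of top 10 relevant
```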

RECSYS 45

Problems with Error Measures: Narrow focus on accuracy sometimes misses the point:
 Prediction Diversity
 Prediction Context
 Order of predictions
In practice, we care only to predict high ratings:
 RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding the k most similar customers: O(|X|). Recall that X = set of customers in the system.
Too expensive to do at runtime:
 Could pre-compute using clustering as an approximation
 Naïve pre-computation takes time O(N · |C|)
 • |C| = # of clusters = k in the k-means; N = # of data points
We already know how to do this:
 Near-neighbor search in high dimensions (LSH)
 Clustering
 Dimensionality reduction

1162019

RECSYS 47

Tip: Add Data. Leverage all the data:
 Don't try to reduce data size in an effort to make fancy algorithms work
 Simple methods on large data do best
Add more data:
 e.g., add IMDB data on genres
More data beats better algorithms:
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender SystemsLatent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)
[Figure: movies embedded in a 2D latent space; horizontal axis: geared towards females vs. geared towards males; vertical axis: serious vs. funny. Movies shown include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50. Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Figure: Matrix R, 480,000 users × 17,700 movies, sparsely filled with 1-5 star ratings]

RECSYS 52

Utility Matrix R Evaluation

[Figure: Matrix R (480,000 users × 17,700 movies) split into a training data set and a test data set; a held-out entry r̂₃₆ is predicted]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²
 r̂_xi … predicted rating; r_xi … true rating of user x on item i

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT.
R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values of the missing ones.

[Figure: the items × users matrix R factored as Q (items × f) times PT (f × users), f = number of factors; cf. SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Figure: R ≈ Q · PT; the missing entry is estimated from row q_i of Q and column p_x of PT]

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 q_i = row i of Q; p_x = column x of PT
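The dot-product estimate above is one line of code. A toy sketch, with assumed factor values (f = 2, two users, two items):

```python
import numpy as np

# Estimate a missing rating from the factor matrices; rows of Q are item
# factors q_i, rows of P are user factors p_x (toy numbers assumed).
Q = np.array([[1.0, 0.5],
              [0.2, 1.2]])       # items x f
P = np.array([[0.8, 0.4],
              [1.0, 0.1]])       # users x f

def predict(x, i):
    return float(Q[i] @ P[x])    # r_hat_xi = sum_f q_if * p_xf

r_hat = predict(x=0, i=1)        # 0.2*0.8 + 1.2*0.4 = 0.64
```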

RECSYS 55

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Figure repeated: R ≈ Q · PT, highlighting row q_i and column p_x]

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 q_i = row i of Q; p_x = column x of PT

RECSYS 56

Ratings as Products of Factors: How to estimate the missing rating of user x for item i?

[Figure repeated: R ≈ Q · PT, with the estimated rating r̂_xi = 2.4 filled into the missing entry]

 r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf
 q_i = row i of Q; p_x = column x of PT

RECSYS 57

Latent Factor Models
[Figure: movies placed in the 2D latent space; Factor 1: geared towards females vs. geared towards males; Factor 2: serious vs. funny]

RECSYS 58

Latent Factor Models
[Figure repeated: movies and users together in the 2D latent space]

RECSYS 59

Recap SVD

Remember SVD:
 A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values
 A ≈ U Σ V^T (A is m × n)
SVD gives the minimum reconstruction error (SSE):
 min_{U,V,Σ} Σ_xi ( A_xi − (U Σ V^T)_xi )²
So in our case, "SVD" on Netflix data: R ≈ Q · PT
 A = R, Q = U, PT = Σ VT
But we are not done yet! R has missing entries.
 The sum goes over all entries, but our R has missing entries.
 r̂_xi = q_i · p_x^T
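The minimum-SSE property can be seen numerically: truncating the SVD to f factors gives the best rank-f approximation, and the remaining SSE equals the sum of the squared discarded singular values. A sketch on an assumed toy matrix:

```python
import numpy as np

# Full SVD reconstructs A exactly; keeping only f factors gives the best
# rank-f approximation in SSE (toy matrix assumed).
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = 2
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]   # rank-f reconstruction
sse = float(((A - A_f) ** 2).sum())           # equals the discarded s_k^2
```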

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(i,x)∈R} ( r_xi − q_i · p_x^T )²
Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · PT, with r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries: Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data:
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
 λ … regularization parameter; "error" + λ · "length"
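The regularized objective above is just the training error plus λ times the squared factor lengths. A hedged sketch with assumed toy values (λ, P, Q are made up):

```python
import numpy as np

# Regularized objective: SSE over known ratings + lam * (||P||^2 + ||Q||^2).
def objective(known, P, Q, lam):
    err = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in known)
    length = lam * (float((P ** 2).sum()) + float((Q ** 2).sum()))
    return err + length

P = np.array([[0.5, 0.5]])
Q = np.array([[1.0, 1.0]])
val = objective([(0, 0, 3.0)], P, Q, lam=0.1)  # (3-1)^2 + 0.1*(0.5+2.0) = 4.25
```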

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
[Figure: users Gus and Dave placed together with the movies in the 2D latent space (geared towards females vs. males × serious vs. escapist)]

RECSYS 63

Dealing with Missing Entries: Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data:
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

Regularization is needed:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
 λ … regularization parameter; "error" + λ · "length"

RECSYS 64

The Effect of Regularization
[Figure: movies in the 2D latent space (geared towards females vs. males × serious vs. funny)]
 min_factors "error" + λ "length"
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 65

The Effect of Regularization
[Figure repeated: as λ takes effect, sparsely-rated points begin to shrink toward the origin]
 min_factors "error" + λ "length"
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 66

The Effect of Regularization
[Figure repeated: factors continue to shrink toward the origin where data are scarce]
 min_factors "error" + λ "length"
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 67

The Effect of Regularization
[Figure repeated: final configuration; richly-rated points keep their positions, sparsely-rated points sit near the origin]
 min_factors "error" + λ "length"
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η · ∇P
 • Q ← Q − η · ∇Q
 • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
 • Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.
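The batch gradient for Q spelled out above can be sketched directly. This is an illustrative helper with assumed names and toy data, not library code:

```python
import numpy as np

# Batch gradient of the regularized objective w.r.t. Q:
# grad q_i = sum over known ratings of -2*(r_xi - q_i.p_x^T)*p_x + 2*lam*q_i.
def grad_Q(known, P, Q, lam):
    G = 2 * lam * Q.copy()                   # derivative of lam * ||Q||^2
    for x, i, r in known:
        err = r - Q[i] @ P[x]
        G[i] += -2 * err * P[x]              # d/dq_i of (r_xi - q_i.p_x^T)^2
    return G

P = np.array([[1.0, 0.0]])
Q = np.array([[0.5, 0.5]])
G = grad_Q([(0, 0, 2.0)], P, Q, lam=0.0)     # err = 1.5, so G = [[-3.0, 0.0]]
```

A full GD step would then be `Q -= eta * G`, recomputed over all known ratings each iteration, which is exactly why SGD is preferred.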

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent, from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
 min_{P,Q} Σ_training ( r_xi − q_i p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )
Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η · ∇P
 • Q ← Q − η · ∇Q
 • Where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
 • Here q_if is entry f of row q_i of matrix Q
Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
 Observation: ∇Q = [∇q_if], where
 ∇q_if = Σ_{x,i} ( −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if ) = Σ_{x,i} ∇Q(r_xi)
 • Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
 SGD: Q ← Q − μ · ∇Q(r_xi)
Faster convergence!
 • Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD
[Figure: value of the objective function vs. iteration/step, for GD and SGD]


1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 42: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 44

Evaluating Predictions
 Compare predictions with known ratings
 Root-mean-square error (RMSE)
   • RMSE = sqrt( (1/|R|) Σ_(x,i)∈R (r̂_xi − r_xi)² ), where r̂_xi is the predicted and r_xi the true rating of user x on item i
 Precision at top 10
   • % of relevant items among the top 10 recommendations
 Rank correlation
   • Spearman's correlation between the system's and the user's complete rankings
 Another approach: 0/1 model
 Coverage
   • Number of items/users for which the system can make predictions
 Precision = TP / (TP + FP); Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Receiver Operating Characteristic (ROC) curve
   • Y-axis: True Positive Rate (TPR); X-axis: False Positive Rate (FPR)
   • TPR (aka Recall) = TP / P = TP / (TP + FN); FPR = FP / N = FP / (FP + TN)
   • See https://en.wikipedia.org/wiki/Precision_and_recall
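The measures above can be sketched in Python; the toy ratings and the helper names (`rmse`, `precision_at_k`) are made up for illustration, not part of the slides:

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over the (user, item) pairs in `truth`."""
    se = [(pred[k] - truth[k]) ** 2 for k in truth]
    return math.sqrt(sum(se) / len(se))

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for i in ranked_items[:k] if i in relevant) / k

# Toy data: predicted vs. true ratings for three (user, item) pairs.
truth = {("u1", "a"): 4.0, ("u1", "b"): 2.0, ("u2", "a"): 5.0}
pred = {("u1", "a"): 3.5, ("u1", "b"): 2.5, ("u2", "a"): 4.0}
print(round(rmse(pred, truth), 4))  # → 0.7071
```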

RECSYS 45

Problems with Error Measures
 Narrow focus on accuracy sometimes misses the point
   • Prediction diversity
   • Prediction context
   • Order of predictions
 In practice, we care only to predict high ratings
   • RMSE might penalize a method that does well for high ratings and badly for others

RECSYS 46

Collaborative Filtering: Complexity
 Expensive step is finding the k most similar customers: O(|X|)
   • Recall that X = set of customers in the system
 Too expensive to do at runtime
   • Could pre-compute, using clustering as an approximation
 Naïve pre-computation takes time O(N · |C|)
   • |C| = # of clusters (the k in k-means), N = # of data points
 We already know how to do this
   • Near-neighbor search in high dimensions (LSH)
   • Clustering
   • Dimensionality reduction

RECSYS 47

Tip: Add Data
 Leverage all the data
   • Don't try to reduce data size in an effort to make fancy algorithms work
   • Simple methods on large data do best
 Add more data
   • e.g., add IMDB data on genres
 More data beats better algorithms
 http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a two-dimensional latent space. Horizontal axis: geared towards females vs. geared towards males; vertical axis: serious vs. funny. Example movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 50; Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Figure: the ratings matrix R, 480,000 users × 17,700 movies; a small excerpt of the sparse matrix is shown, with star ratings 1-5 and most entries missing.]

RECSYS 52

Utility Matrix R: Evaluation
 Split the known entries of R (480,000 users × 17,700 movies) into a training data set and a test data set
 Evaluate on the held-out test entries:

   SSE = Σ_(x,i)∈R (r̂_xi − r_xi)²

   where r̂_xi is the predicted and r_xi the true rating of user x on item i
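A minimal sketch of this evaluation setup, with R stored as a dict keyed by (user, item); the toy ratings and the global-mean `predict` placeholder are assumptions for illustration, not the Netflix data:

```python
# Sparse ratings matrix R as a dict; train/test split of the known entries.
train = {("x1", "i1"): 1.0, ("x1", "i3"): 3.0, ("x2", "i1"): 4.0}
test = {("x3", "i6"): 3.0}  # a held-out rating

def predict(x, i):
    # Placeholder predictor: the global mean of the training ratings
    # (a real model would use x and i).
    return sum(train.values()) / len(train)

def sse(test_set):
    """SSE = sum over held-out (x, i) of (predicted − true)²."""
    return sum((predict(x, i) - r) ** 2 for (x, i), r in test_set.items())

print(round(sse(test), 3))  # → 0.111
```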

RECSYS 53

Latent Factor Models
 "SVD" on Netflix data: R ≈ Q · PT
 For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT
   • R has missing entries, but let's ignore that for now
   • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (items × users) factored as Q (items × f factors) times PT (f factors × users); cf. SVD: A = U Σ VT.]

RECSYS 54

Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?

   r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

   where q_i = row i of Q, p_x = column x of PT

[Figure: the factorization R ≈ Q · PT; the missing entry r̂_xi is the dot product of row q_i of Q and column p_x of PT.]


RECSYS 57

Latent Factor Models

[Figure: the example movies placed in the latent space spanned by Factor 1 (geared towards females vs. geared towards males) and Factor 2 (serious vs. funny).]


RECSYS 59

Recap: SVD
 Remember SVD:
   • A: input data matrix (m × n)
   • U: left singular vectors
   • V: right singular vectors
   • Σ: diagonal matrix of singular values
   A ≈ U Σ VT
 SVD gives the minimum reconstruction error (SSE):

   min_{U,Σ,V} Σ_xi (A_xi − (U Σ VT)_xi)²

 So in our case, "SVD" on Netflix data: R ≈ Q · PT
   • A = R, Q = U, PT = Σ VT
   • r̂_xi = q_i · p_x^T
 But we are not done yet: the sum goes over all entries, and our R has missing entries!

RECSYS 60

Latent Factor Models
 SVD isn't defined when entries are missing!
 Use specialized methods to find P, Q:

   min_{P,Q} Σ_(x,i)∈R (r_xi − q_i · p_x^T)²

 Note:
   • We don't require the columns of P, Q to be orthogonal/unit length
   • P, Q map users/movies to a latent space
   • The most popular model among Netflix contestants

RECSYS 61

Dealing with Missing Entries
 Want to minimize SSE for unseen test data
 Idea: minimize SSE on training data
   • Want a large f (# of factors) to capture all the signals
   • But SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:

   min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]
   ("error" + λ · "length"; λ … regularization parameter)

 Allow a rich model where there are sufficient data; shrink aggressively where data are scarce
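The regularized objective can be evaluated directly; this sketch uses tiny made-up factor vectors (f = 2) purely to illustrate the "error" and "length" terms:

```python
# Toy training ratings and factor vectors (all values illustrative).
train = {("x1", "i1"): 5.0, ("x2", "i1"): 4.0, ("x2", "i2"): 1.0}
P = {"x1": [1.0, 0.5], "x2": [0.8, -0.2]}  # user factor vectors p_x
Q = {"i1": [4.0, 1.0], "i2": [0.5, 2.0]}   # item factor vectors q_i

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(train, P, Q, lam):
    """'error' + λ·'length' from the slide's regularized objective."""
    err = sum((r - dot(Q[i], P[x])) ** 2 for (x, i), r in train.items())
    length = sum(dot(v, v) for v in P.values()) + sum(dot(v, v) for v in Q.values())
    return err + lam * length

print(round(objective(train, P, Q, 0.1), 3))  # → 4.568
```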

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: movies and users (e.g., Gus, Dave) embedded together in the latent space (geared towards females vs. males; serious vs. escapist); a user is recommended the movies nearest to them.]


RECSYS 64

The Effect of Regularization

   min_factors "error" + λ · "length"
   min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

[Figure: users and movies in the latent space (Factor 1: geared towards females vs. males; Factor 2: serious vs. funny); as λ grows, sparsely rated points such as The Princess Diaries are shrunk toward the origin, while well-observed points keep rich factor estimates.]


RECSYS 68

Use Gradient Descent to search for the optimal settings
 Want to find matrices P and Q:

   min_{P,Q} Σ_(x,i)∈training (r_xi − q_i p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

 Gradient descent:
   • Initialize P and Q (using SVD; pretend missing ratings are 0)
   • Do gradient descent:
     P ← P − η · ∇P
     Q ← Q − η · ∇Q
     where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_x,i −2(r_xi − q_i p_x^T) p_xf + 2λ q_if
     Here q_if is entry f of row q_i of matrix Q
   • How to compute the gradient of a matrix? Compute the gradient of every element independently!
 Observation: computing gradients is slow!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent, from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent
 (Recap: same procedure as above — initialize P and Q via SVD with missing ratings as 0, then iterate P ← P − η·∇P, Q ← Q − η·∇Q; computing the full gradients is slow.)

RECSYS 72

Stochastic Gradient Descent
 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where

   ∇q_if = Σ_x,i −2(r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_x,i ∇Q(r_xi)

   Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η · Σ_r_xi ∇Q(r_xi)
 Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − μ · ∇Q(r_xi)
 Faster convergence!
   • Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD
 Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step.]

 GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way
 GD takes fewer steps to converge, but each step takes much longer to compute
 In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent
 Stochastic gradient descent:
   • Initialize P and Q (using SVD; pretend missing ratings are 0)
   • Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
     ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
     q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
     p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
 2 for loops:
   • For until convergence:
     • For each r_xi: compute gradient, do a "step"
 η … learning rate
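The two nested loops above can be sketched in Python; the `sgd_mf` helper, the toy ratings, and the hyperparameter values are made up for illustration, not the contestants' code:

```python
import random

def sgd_mf(ratings, users, items, f=2, eta=0.05, lam=0.01, epochs=500, seed=0):
    """Fit r̂_xi ≈ q_i · p_x with the SGD update equations from the slide."""
    rng = random.Random(seed)
    P = {x: [rng.uniform(-0.1, 0.1) for _ in range(f)] for x in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(f)] for i in items}
    for _ in range(epochs):                 # "for until convergence"
        for (x, i), r in ratings.items():   # "for each r_xi"
            err = r - sum(q * p for q, p in zip(Q[i], P[x]))  # ε_xi
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (err * pxk - lam * qik)  # q_i ← q_i + η(ε_xi p_x − λ q_i)
                P[x][k] += eta * (err * qik - lam * pxk)  # p_x ← p_x + η(ε_xi q_i − λ p_x)
    return P, Q

ratings = {("u1", "m1"): 5.0, ("u1", "m2"): 1.0, ("u2", "m1"): 4.0}
P, Q = sgd_mf(ratings, ["u1", "u2"], ["m1", "m2"])
```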

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization
 Goal: make good recommendations
   • Quantify goodness using SSE: lower SSE means better recommendations
   • We want to make good recommendations on items that some user has not yet seen
 So: let's set the values of P and Q such that they work well on the known (user, item) ratings
   • And hope these values of P and Q will predict the unknown ratings well
 This is a case where we apply optimization methods

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
 • Training data
   – 100 million ratings
   – 480,000 users
   – 17,770 movies
   – 6 years of data: 2000-2005
 • Test data
   – Last few ratings of each user (2.8 million)
   – Evaluation criterion: root mean squared error (RMSE)
   – Netflix Cinematch RMSE: 0.9514
 • Competition
   – 2700+ teams
   – $1 million grand prize for 10% improvement on the Cinematch result
   – $50,000 2007 progress prize for 8.43% improvement

[Table: example excerpts of the training data (user, movie, score) and the test data (user, movie, score = ?).]

Movie rating data

Overall rating distribution
 • A third of the ratings are 4s
 • Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
 • Avg # ratings/movie: 5627

# ratings per user
 • Avg # ratings/user: 208

Average movie rating by movie count
 • More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies
 Count  | Avg rating | Title
 137812 | 4.593 | The Shawshank Redemption
 133597 | 4.545 | Lord of the Rings: The Return of the King
 180883 | 4.306 | The Green Mile
 150676 | 4.460 | Lord of the Rings: The Two Towers
 139050 | 4.415 | Finding Nemo
 117456 | 4.504 | Raiders of the Lost Ark
 180736 | 4.299 | Forrest Gump
 147932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
 149199 | 4.325 | The Sixth Sense
 144027 | 4.333 | Indiana Jones and the Last Crusade

Challenges
 • Size of data
   – Scalability
   – Keeping data in memory
 • Missing data
   – 99 percent missing
   – Very imbalanced
 • Avoiding overfitting
 • Test and training data differ significantly (e.g., movie 16322)
 • Use an ensemble of complementing predictors
 • Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

RECSYS 102

Modeling Biases and Interactions

   r̂_xi = μ + b_x + b_i + q_i · p_x^T

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction

 User-movie interaction
   • Characterizes the matching between users and movies
   • Attracts most research in the field
   • Benefits from algorithmic and mathematical innovations

 Baseline predictor (μ + b_x + b_i)
   • Separates users and movies
   • Benefits from insights into users' behavior
   • Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
   – Rating scale of user x
   – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
   – (Recent) popularity of movie i
   – Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together
 Example:
   • Mean rating: μ = 3.7
   • You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
   • Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
   = 3.7 − 1 + 0.5 = 3.2
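The worked example above, as a few lines of Python (the numbers are the slide's toy values):

```python
# Baseline predictor: r̂_xi = μ + b_x + b_i, with the slide's example numbers.
mu = 3.7    # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars: 0.5 above the average movie
baseline = mu + b_x + b_i
print(round(baseline, 1))  # → 3.2
```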

RECSYS 105

Fitting the New Model
 Solve:

   min_{Q,P} Σ_(x,i)∈R ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²        (goodness of fit)
        + λ [ Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² ]          (regularization)

 Stochastic gradient descent to find the parameters
   • Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set
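A hedged sketch of fitting this extended model with SGD, updating the biases and the factors together; `sgd_biased_mf`, the toy ratings, and all hyperparameter values are illustrative, not the BellKor implementation:

```python
import random

def sgd_biased_mf(ratings, f=2, eta=0.02, lam=0.05, epochs=300, seed=1):
    """SGD for r̂_xi = μ + b_x + b_i + q_i · p_x; biases and factors are parameters."""
    rng = random.Random(seed)
    users = {x for x, _ in ratings}
    items = {i for _, i in ratings}
    mu = sum(ratings.values()) / len(ratings)       # overall mean rating
    bx = {x: 0.0 for x in users}                    # user biases b_x
    bi = {i: 0.0 for i in items}                    # item biases b_i
    P = {x: [rng.uniform(-0.1, 0.1) for _ in range(f)] for x in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(f)] for i in items}
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            pred = mu + bx[x] + bi[i] + sum(q * p for q, p in zip(Q[i], P[x]))
            e = r - pred
            bx[x] += eta * (e - lam * bx[x])        # regularized bias updates
            bi[i] += eta * (e - lam * bi[i])
            for k in range(f):
                qik, pxk = Q[i][k], P[x][k]
                Q[i][k] += eta * (e * pxk - lam * qik)
                P[x][k] += eta * (e * qik - lam * pxk)
    return mu, bx, bi, P, Q

ratings = {("u1", "m1"): 5.0, ("u1", "m2"): 2.0, ("u2", "m1"): 4.0, ("u2", "m2"): 1.0}
mu, bx, bi, P, Q = sgd_biased_mf(ratings)
```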

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!
 Multi-scale modeling of the data: combine top-level "regional" modeling with a refined local view
 • Global effects
   – Overall deviations of users/movies
 • Factorization
   – Addressing "regional" effects
 • Collaborative filtering
   – Extract local patterns

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

RECSYS 108

Performance of Various Methods
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Grand Prize: 0.8563

RECSYS 109

Temporal Biases of Users
 Sudden rise in the average movie rating (early 2004)
   • Improvements in Netflix
   • GUI improvements
   • Meaning of rating changed
 Movie age
   • Users prefer new movies without any reasons
   • Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors
 Original model: r_xi = μ + b_x + b_i + q_i · p_x^T
 Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make the parameters b_x and b_i depend on time:
   (1) Parameterize time-dependence by linear trends
   (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to factors
   • p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
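The binned time-dependence of a bias can be sketched as follows; `time_bin`, `item_bias`, and the offset values are hypothetical names and numbers for illustration (the slide only fixes the 10-week bin width):

```python
DAYS_PER_BIN = 7 * 10  # each bin corresponds to 10 consecutive weeks

def time_bin(t_days):
    """Map a day index (days since the start of the data) to a bias bin."""
    return t_days // DAYS_PER_BIN

def item_bias(b_i, bin_offsets, t_days):
    """b_i(t) = b_i + b_{i,Bin(t)}; missing bins fall back to a 0 offset."""
    return b_i + bin_offsets.get(time_bin(t_days), 0.0)

# Example: an item whose ratings drift upward in later bins.
offsets = {0: -0.10, 1: 0.00, 2: 0.15}
print(round(item_bias(0.3, offsets, t_days=5), 2))    # bin 0 → 0.2
print(round(item_bias(0.3, offsets, t_days=150), 2))  # bin 2 → 0.45
```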

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods
 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Latent factors + Biases + Time: 0.876
 Grand Prize: 0.8563

 Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors blended (1-50); RMSE falls from 0.8891 with a single predictor to 0.8712 with 50 predictors, with rapidly diminishing returns.]
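Why blending helps can be sketched as follows: averaging many predictors with independent errors lowers RMSE. Everything here (the synthetic truth, the noise model, `make_predictor`) is made up for illustration, not the teams' actual ensembling:

```python
import math
import random

def rmse(pred, truth):
    return math.sqrt(sum((pred[k] - truth[k]) ** 2 for k in truth) / len(truth))

def make_predictor(seed, truth):
    """Toy predictor: the true rating plus independent pseudo-random noise."""
    rng = random.Random(seed)
    return {k: r + rng.uniform(-1, 1) for k, r in truth.items()}

def blend(preds):
    """Simple ensemble: average the predictions of all members."""
    keys = preds[0].keys()
    return {k: sum(p[k] for p in preds) / len(preds) for k in keys}

truth = {(x, i): 3.0 for x in range(20) for i in range(20)}
predictors = [make_predictor(s, truth) for s in range(16)]
single = rmse(predictors[0], truth)
blended = rmse(blend(predictors), truth)
assert blended < single  # averaging cancels the independent errors
```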

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days
 Ensemble team formed
   • Group of other teams on the leaderboard forms a new team
   • Relies on combining their models
   • Quickly also gets a qualifying score over 10%
 BellKor
   • Continue to get small improvements in their scores
   • Realize that they are in direct competition with Ensemble
 Strategy
   • Both teams carefully monitoring the leaderboard
   • Only sure way to check for improvement is to submit a set of predictions
     – This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline
 Submissions limited to 1 a day
   • Only 1 final submission could be made in the last 24h
 24 hours before deadline…
   • BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's
 Frantic last 24 hours for both teams
   • Much computer time on final optimization
   • Carefully calibrated to end about an hour before deadline
 Final submissions
   • BellKor submits a little early (on purpose), 40 mins before deadline
   • Ensemble submits their final entry 20 mins later
   • …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading
 • Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 • http://www2.research.att.com/~volinsky/netflix/bpc.html
 • http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs. Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote: TF-IDF
  • User Profiles and Prediction
  • Pros: Content-based Approach
  • Cons: Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs. User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks & Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip: Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g., SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R: Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap: SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent, from Andrew Ng's Machine Learning course on Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs. GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary: Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • # ratings per movie
  • # ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Chart data ("Effect of ensemble size"): number of predictors vs. RMSE

predictors  RMSE
1  0.889067
2  0.880586
3  0.878345
4  0.877715
5  0.876668
6  0.876179
7  0.875557
8  0.875013
9  0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.87283
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.87222
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.87183
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197

RECSYS 45

Problems with Error Measures

 Narrow focus on accuracy sometimes misses the point: Prediction Diversity, Prediction Context, Order of predictions

 In practice, we care only to predict high ratings: RMSE might penalize a method that does well for high ratings and badly for others

1162019

RECSYS 46

Collaborative Filtering Complexity

 Expensive step is finding k most similar customers: O(|X|). Recall that X = set of customers in the system

 Too expensive to do at runtime. Could pre-compute using clustering as approx.

 Naïve pre-computation takes time O(N · |C|); |C| = # of clusters = k in k-means, N = # of data points

 We already know how to do this: Near-neighbor search in high dimensions (LSH), Clustering, Dimensionality reduction
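The expensive O(|X|) scan can be sketched as a brute-force similarity search. This is a minimal illustration, not the course's reference implementation; the sparse-dict user representation and cosine similarity are assumptions:

```python
# Minimal sketch: brute-force k-most-similar-customers scan, O(|X|) per query.
# Each user is a sparse dict {item: rating}; data and names are illustrative.
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two sparse rating vectors.
    num = sum(a[i] * b[i] for i in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def k_most_similar(x, users, k):
    # Scan every other customer: this is the O(|X|) step the slide
    # suggests approximating with clustering or LSH.
    return heapq.nlargest(k, (u for u in users if u != x),
                          key=lambda u: cosine(users[x], users[u]))

users = {"a": {"m1": 5, "m2": 1}, "b": {"m1": 4, "m2": 1}, "c": {"m2": 5}}
print(k_most_similar("a", users, 1))  # ['b']
```

Pre-computation with clustering or LSH replaces the per-query scan with a lookup into a much smaller candidate set.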

RECSYS 47

Tip: Add Data

 Leverage all the data. Don't try to reduce data size in an effort to make fancy algorithms work. Simple methods on large data do best.

 Add more data, e.g., add IMDB data on genres

 More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a two-dimensional latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny"; plotted titles include The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Figure: the ~480,000 users × 17,700 movies matrix R; a small excerpt shows sparse entries (e.g., a row with 1, 3, 4; another with 3, 5, 5), with most cells empty]

RECSYS 52

Utility Matrix R: Evaluation

[Figure: the same 480,000 users × 17,700 movies matrix R split into a training data set (known entries) and a test data set of held-out entries, one of them marked r̂_36]

SSE = Σ_{(i,x)∈R} (r̂_xi − r_xi)^2, where r̂_xi is the predicted rating and r_xi the true rating of user x on item i
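The train/test evaluation above can be sketched in a few lines. This is a minimal sketch with illustrative data; ratings are stored sparsely as `{(user, item): rating}`:

```python
# Minimal sketch: SSE and RMSE over a held-out test set of ratings.
import math

def sse(predicted, test_ratings):
    # Sum of squared errors over the (user, item) pairs in the test set.
    return sum((predicted[ui] - r) ** 2 for ui, r in test_ratings.items())

def rmse(predicted, test_ratings):
    return math.sqrt(sse(predicted, test_ratings) / len(test_ratings))

test_set = {("x", "i"): 4.0, ("y", "j"): 3.0}   # held-out true ratings
pred = {("x", "i"): 3.5, ("y", "j"): 3.0}       # model predictions
print(sse(pred, test_set))  # 0.25
```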

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T. R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (users × items) factored into Q (items × f factors) times P^T (f factors × users), by analogy with SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, where q_i = row i of Q and p_x = column x of P^T

[Figure: the R ≈ Q · P^T factorization with the missing entry of R highlighted]
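The dot-product prediction above is tiny in code. A minimal sketch with toy factor vectors (the numbers are illustrative, not the slide's figure):

```python
# Minimal sketch: a missing rating estimated as the dot product of
# item i's factor row q_i and user x's factor column p_x.
def predict(q_i, p_x):
    return sum(qf * pf for qf, pf in zip(q_i, p_x))

q_i = [1.2, -0.5]  # f = 2 factors for item i (illustrative)
p_x = [0.8, 1.0]   # f = 2 factors for user x (illustrative)
print(round(predict(q_i, p_x), 2))  # 0.46
```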

RECSYS 55

Ratings as Products of Factors

[Build slide: same question, formula r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, and factorization figure as the previous slide, now highlighting row q_i and column p_x]

RECSYS 56

Ratings as Products of Factors

[Build slide: same figure and formula; the missing entry is now filled in with the estimate 2.4 = q_i · p_x^T]

RECSYS 57

Latent Factor Models

[Figure: the movie map again, with the axes now labeled Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny)]

RECSYS 58

Latent Factor Models

[Build slide: same Factor 1 / Factor 2 movie map as the previous slide]

RECSYS 59

Recap: SVD

Remember SVD: A = U Σ V^T, where A is the input data matrix, U the left singular vectors, V the right singular vectors, and Σ the singular values

SVD gives minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} (A_ij − (U Σ V^T)_ij)^2

So, in our case, "SVD" on Netflix data R ≈ Q · P^T means A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T

But we are not done yet! The sum goes over all entries, but our R has missing entries

[Figure: A (m × n) ≈ U Σ V^T]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_i · p_x^T)^2

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · P^T, with r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But, SSE on test data begins to rise for f > 2!

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data, shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)^2 + λ(Σ_x ||p_x||^2 + Σ_i ||q_i||^2)

"error" + λ · "length"; λ … regularization parameter

[Figure: small excerpt of the sparse rating matrix]
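The regularized objective can be sketched directly from its two terms. A minimal sketch with illustrative data; `P` and `Q` are dicts from user/item id to factor vector:

```python
# Minimal sketch: squared error on known ratings plus
# lambda * (sum of squared factor-vector lengths).
def objective(ratings, P, Q, lam):
    err = sum((r - sum(q * p for q, p in zip(Q[i], P[x]))) ** 2
              for (x, i), r in ratings.items())
    length = sum(v * v
                 for vec in list(P.values()) + list(Q.values())
                 for v in vec)
    return err + lam * length

ratings = {(0, 0): 5.0}   # one known rating (illustrative)
P = {0: [1.0, 0.0]}
Q = {0: [4.0, 1.0]}
print(objective(ratings, P, Q, lam=0.1))  # error 1.0 + 0.1 * length 18
```

With λ = 0 this is just the training SSE; raising λ trades fit against shrinking the factor vectors.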

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: the movie map (geared towards females ↔ males, serious ↔ escapist) with users Gus and Dave embedded in the same latent space as the movies]

RECSYS 63

Dealing with Missing Entries

[Build slide: same idea and regularized objective as the earlier "Dealing with Missing Entries" slide (RECSYS 61)]

RECSYS 64

The Effect of Regularization

min_{factors} "error" + λ · "length": min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)^2 + λ(Σ_x ||p_x||^2 + Σ_i ||q_i||^2)

[Figure: the Factor 1 / Factor 2 movie map (serious ↔ funny, geared towards females ↔ males), used to illustrate how regularization moves the embedded points, e.g. The Princess Diaries]

RECSYS 65

The Effect of Regularization

[Build slide: same objective and movie map as the previous slide]

RECSYS 66

The Effect of Regularization

[Build slide: same objective and movie map as the previous slide]

RECSYS 67

The Effect of Regularization

[Build slide: same objective and movie map as the previous slide]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q: min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)^2 + λ(Σ_x ||p_x||^2 + Σ_i ||q_i||^2)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i · p_x^T) · p_xf + 2λ · q_if
  Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow
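The gradient entry ∇q_if above can be sketched directly. A minimal sketch with illustrative data (one factor, one rating) so the numbers are easy to check by hand:

```python
# Minimal sketch: one gradient entry from the slide,
#   grad q_if = sum over known ratings of item i of
#               -2 * (r_xi - q_i . p_x) * p_xf  +  2 * lam * q_if
def grad_q(ratings, P, Q, i, f, lam):
    g = 2 * lam * Q[i][f]                # regularization term
    for (x, j), r in ratings.items():
        if j != i:
            continue                     # only ratings of item i contribute
        err = r - sum(q * p for q, p in zip(Q[i], P[x]))
        g += -2 * err * P[x][f]
    return g

ratings = {(0, 0): 5.0}          # single known rating (illustrative)
P, Q = {0: [1.0]}, {0: [4.0]}    # f = 1 factor
print(grad_q(ratings, P, Q, i=0, f=0, lam=0.0))  # -2.0
```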

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

[Build slide: same setup, objective, and gradient formulas as the "Use Gradient Descent" slide above]

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2(r_xi − q_i · p_x^T) · p_xf + 2λ · q_if = Σ_{x,i} ∇Q(r_xi); here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η · Σ_{r_xi} ∇Q(r_xi)

SGD: Q ← Q − η · ∇Q(r_xi), for one rating r_xi at a time

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

Faster convergence! Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD

[Chart: value of the objective function vs. iteration/step. GD improves the value of the objective function at every step; SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!]

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

2 for loops: for until convergence, for each r_xi: compute gradient, do a "step"

η … learning rate
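The two-loop SGD recipe above can be sketched end to end. A minimal sketch: random initialization stands in for the slide's SVD init, and all hyperparameters and the toy ratings are illustrative:

```python
# Minimal sketch of the SGD loop: iterate over known ratings,
# compute the error eps, and nudge q_i and p_x.
import random

def train_sgd(ratings, n_users, n_items, f=2, eta=0.02, lam=0.01,
              epochs=2000, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            eps = r - sum(q * p for q, p in zip(Q[i], P[x]))
            for k in range(f):
                q, p = Q[i][k], P[x][k]
                Q[i][k] += eta * (eps * p - lam * q)  # update equation
                P[x][k] += eta * (eps * q - lam * p)  # update equation
    return P, Q

ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0}
P, Q = train_sgd(ratings, n_users=2, n_items=2)
pred = sum(q * p for q, p in zip(Q[0], P[0]))  # should approach 5.0
print(round(pred, 1))
```

With a small λ the fitted predictions sit slightly below the raw ratings, which is exactly the shrinkage the regularizer is there to provide.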

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)


RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations

We want to make good recommendations on items that some user has not yet seen. Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict well the unknown ratings

This is a case where we apply optimization methods

[Figure: small excerpt of the sparse rating matrix]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: excerpts of training data (user, movie, score) and test data (user, movie, ?)]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

# ratings per movie
• Avg # ratings/movie: 5627

# ratings per user
• Avg # ratings/user: 208

Average movie rating by movie count
• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count   Avg rating  Title
137812  4.593       The Shawshank Redemption
133597  4.545       Lord of the Rings: The Return of the King
180883  4.306       The Green Mile
150676  4.460       Lord of the Rings: The Two Towers
139050  4.415       Finding Nemo
117456  4.504       Raiders of the Lost Ark
180736  4.299       Forrest Gump
147932  4.433       Lord of the Rings: The Fellowship of the Ring
149199  4.325       The Sixth Sense
144027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T, where μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i (user bias + movie bias + user-movie interaction)

User-Movie interaction: characterizes the matching between users and movies; attracts most research in the field; benefits from algorithmic and mathematical innovations

Baseline predictor: separates users and movies; benefits from insights into users' behavior; among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5. Predicted rating for you on Star Wars:

r̂ = 3.7 − 1 + 0.5 = 3.2
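The baseline arithmetic above is trivial to encode; a minimal sketch using the slide's own numbers:

```python
# Minimal sketch: baseline predictor r_hat = mu + b_x + b_i.
def baseline(mu, b_x, b_i):
    # overall mean + user bias + movie bias
    return mu + b_x + b_i

# mu = 3.7, critical reviewer b_x = -1, Star Wars b_i = +0.5
print(baseline(3.7, -1.0, 0.5))  # 3.2
```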

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))^2  [goodness of fit]
  + λ(Σ_i ||q_i||^2 + Σ_x ||p_x||^2 + Σ_x b_x^2 + Σ_i b_i^2)  [regularization]

Stochastic gradient descent to find parameters. Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

λ is selected via grid-search on a validation set
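Fitting this model by SGD means updating biases and factors from the same per-rating error. A minimal sketch of a single step; the exact update form is an assumption by analogy with the plain SGD slide, not taken verbatim from the deck:

```python
# Minimal sketch: one SGD step for the biased model
#   r_hat = mu + b_x + b_i + q_i . p_x
def sgd_step(r, mu, b_x, b_i, q_i, p_x, eta=0.01, lam=0.1):
    pred = mu + b_x + b_i + sum(q * p for q, p in zip(q_i, p_x))
    e = r - pred                                   # per-rating error
    new_bx = b_x + eta * (e - lam * b_x)           # user-bias update
    new_bi = b_i + eta * (e - lam * b_i)           # movie-bias update
    new_q = [q + eta * (e * p - lam * q) for q, p in zip(q_i, p_x)]
    new_p = [p + eta * (e * q - lam * p) for q, p in zip(q_i, p_x)]
    return new_bx, new_bi, new_q, new_p

b_x, b_i, q, p = sgd_step(4.0, 3.7, 0.0, 0.0, [0.0], [0.0])
print(round(b_x, 6))  # 0.003
```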

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in Netflix GUI; meaning of rating changed

Movie age: users prefer new movies without any reasons; older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
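The binned time-dependent bias can be sketched as a lookup into per-bin offsets. A minimal sketch using the slide's 10-week bins; the offset values and names are illustrative:

```python
# Minimal sketch: time-dependent movie bias b_i(t) with bins of
# 10 consecutive weeks.
WEEKS_PER_BIN = 10

def time_bin(week):
    return week // WEEKS_PER_BIN

def b_i_t(base_bias, bin_offsets, week):
    # static movie bias plus a learned offset for the bin containing t
    return base_bias + bin_offsets[time_bin(week)]

offsets = [0.0, 0.1, 0.3]            # one learned offset per 10-week bin
print(b_i_t(0.5, offsets, week=15))  # week 15 falls in bin 1
```

The linear-trend parameterization mentioned in point (1) would replace the table lookup with `base_bias + slope * (t - t0)`.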

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors (1-50); RMSE falls from ~0.889 with 1 predictor to ~0.871 with 50, per the predictors/RMSE table above]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble

Strategy: both teams carefully monitoring the leaderboard. The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day: only 1 final submission could be made in the last 24h

24 hours before deadline… BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams: much computer time on final optimization; carefully calibrated to end about an hour before deadline

Final submissions: BellKor submits a little early (on purpose), 40 mins before deadline; Ensemble submits their final entry 20 mins later; …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 44: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 46

Collaborative Filtering Complexity

Expensive step is finding k most similar customers O(|X|) Recall that X = set of customers in the system

Too expensive to do at runtime Could pre-compute using clustering as approx

Naiumlve pre-computation takes time O(N middot|C|) |C| = of clusters = k in the k-means N = of data points

We already know how to do this Near-neighbor search in high dimensions (LSH) Clustering Dimensionality reduction

1162019

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55–56

Ratings as Products of Factors (cont.)

(The same slide repeated: in the worked example, the missing entry is estimated as r̂_xi = q_i · p_x^T = 2.4.)

RECSYS 57

Latent Factor Models

[Figure: movies plotted in a 2-factor latent space. Factor 1 runs from "geared towards females" to "geared towards males", Factor 2 from "serious" to "funny". Example movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]


RECSYS 59

Recap: SVD

Remember SVD: A ≈ U Σ VT, where A is the input data matrix (m × n), U the left singular vectors, V the right singular vectors, and Σ the singular values.

SVD gives the minimum reconstruction error (SSE):
min_{U,V,Σ} Σ_ij ( A_ij − [U Σ VT]_ij )²

So in our case, "SVD" on Netflix data R ≈ Q · PT means:
A = R, Q = U, PT = Σ VT, and r̂_xi = q_i · p_x^T

But we are not done yet! The sum above goes over all entries, and our R has missing entries.
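The SSE-optimality of a truncated SVD can be checked numerically; a small numpy sketch on a random matrix, verifying the Eckart-Young identity that the best rank-f approximation's SSE equals the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # a small fully-observed "data matrix"

# Full SVD: A = U diag(s) Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

f = 2  # keep only the top-f singular values/factors
A_f = U[:, :f] @ np.diag(s[:f]) @ Vt[:f, :]   # rank-f approximation

sse = np.sum((A - A_f) ** 2)
# Eckart-Young: the minimum SSE over all rank-f matrices is sum of s[f:]^2.
assert np.isclose(sse, np.sum(s[f:] ** 2))
```

Note this only works because A is fully observed; the next slide is about what breaks when entries are missing.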

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q that minimize the error on the known ratings only:
min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length.
• P, Q map users/movies to a latent space.
• This was the most popular model among Netflix contestants.

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. We want a large f (# of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data.
• Shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )²  +  λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]
              "error"                                        "length"

λ … regularization parameter
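The regularized objective above can be written directly in code; a minimal sketch, assuming the known ratings are stored as a dict keyed by (item, user):

```python
import numpy as np

def objective(ratings, Q, P, lam):
    """Regularized SSE: "error" on known ratings plus lam * "length".

    ratings: dict mapping (item i, user x) -> known rating r_xi
    Q: items x f factors, P: users x f factors
    lam: regularization parameter (lambda)
    """
    error = sum((r - Q[i] @ P[x]) ** 2 for (i, x), r in ratings.items())
    # sum_i ||q_i||^2 + sum_x ||p_x||^2
    length = np.sum(Q ** 2) + np.sum(P ** 2)
    return error + lam * length

# Tiny example: 2 items, 1 user, f = 2.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.array([[1.0, 1.0]])
val = objective({(0, 0): 2.0, (1, 0): 1.0}, Q, P, lam=0.5)  # 1.0 + 0.5 * 4.0
```

The "length" term is what pulls factor vectors towards the origin when λ grows, which is the effect the next slides illustrate.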

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Figure: users (e.g. Gus, Dave) and movies (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility) embedded in the same latent space, with axes "geared towards females/males" and "serious/escapist"; nearby movies are recommended to a user.]


RECSYS 64

The Effect of Regularization

min_{factors} "error" + λ · "length"
= min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

[Figure: movies in the latent space (Factor 1: geared towards females/males; Factor 2: serious/funny); as λ increases, sparsely rated movies such as The Princess Diaries are pulled towards the origin.]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
• Initialize P and Q (using SVD, pretending missing ratings are 0).
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and
  ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q.

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

Gradient descent:
• Initialize P and Q (using SVD, pretending missing ratings are 0).
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if; here q_if is entry f of row q_i of matrix Q.

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD.

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
Here q_if is entry f of row q_i of matrix Q, and ∇Q(r_xi) is the contribution of a single rating to the gradient.

GD:  Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence!
• SGD needs more steps, but each step is computed much faster.

RECSYS 73

SGD vs GD

Convergence of GD vs. SGD:

[Figure: value of the objective function vs. iteration/step. GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!]

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate
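The two update equations above can be wrapped in a small training loop; a minimal numpy sketch (the hyperparameter values η, λ, epoch count, and the random initialization instead of an SVD-based one, are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgd_factorize(ratings, n_items, n_users, f=2, eta=0.02, lam=0.01,
                  epochs=1000, seed=0):
    """Fit R ~ Q @ P.T on known ratings with stochastic gradient descent.

    ratings: list of (item i, user x, rating r_xi) triples.
    """
    rng = np.random.default_rng(seed)
    Q = 0.1 * rng.standard_normal((n_items, f))  # small random init
    P = 0.1 * rng.standard_normal((n_users, f))
    for _ in range(epochs):
        for i, x, r in ratings:
            err = r - Q[i] @ P[x]                     # epsilon_xi
            q_old = Q[i].copy()                       # use old q_i for p_x update
            Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update equation
            P[x] += eta * (err * q_old - lam * P[x])  # p_x update equation
    return Q, P

# Tiny example: the fully-known 2x2 rating matrix [[5, 1], [1, 5]].
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
Q, P = sgd_factorize(ratings, n_items=2, n_users=2)
```

With f = 2 this toy matrix is exactly representable, so the training SSE should shrink to near zero; on real data the regularizer λ deliberately prevents a perfect fit.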

RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations.
• Quantify goodness using SSE: lower SSE means better recommendations.
• We want to make good recommendations on items that some user has not yet seen.
• So: set values for P and Q such that they work well on the known (user, item) ratings, and hope these values of P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data:
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data:
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition:
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples for the training data and the test data; test-set scores are withheld.]

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• The average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i
(user bias + movie bias + user-movie interaction)

User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2

1162019
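The slide's arithmetic as a tiny sketch of the bias-only baseline (no interaction term yet):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor b_xi = mu + b_x + b_i (no user-movie interaction)."""
    return mu + b_x + b_i

# Slide's example: overall mean 3.7, a critical user (b_x = -1),
# Star Wars rated 0.5 above the average movie (b_i = +0.5).
star_wars_for_you = baseline(3.7, -1.0, 0.5)  # the slide's answer: 3.2
```

Every rating gets this baseline as a starting point; the factor model then only has to explain the residual r_xi − b_xi.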

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i p_x^T) )²        ("goodness of fit")
          + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )  ("regularization")

Stochastic gradient descent to find the parameters.
• Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
• λ is selected via grid-search on a validation set.
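One SGD step for this extended model can be sketched as follows: the same error term drives updates to both biases and factors (the η and λ values are illustrative assumptions, and μ is held fixed at the global mean):

```python
import numpy as np

def predict(mu, b_x, b_i, q_i, p_x):
    """r_hat_xi = mu + b_x + b_i + q_i . p_x"""
    return mu + b_x + b_i + q_i @ p_x

def sgd_step(r, mu, b_x, b_i, q_i, p_x, eta=0.005, lam=0.02):
    """One stochastic-gradient update on a single rating r = r_xi.

    Returns updated (b_x, b_i, q_i, p_x); biases and factors are all
    treated as learned parameters, as on the slide.
    """
    err = r - predict(mu, b_x, b_i, q_i, p_x)       # epsilon_xi
    b_x_new = b_x + eta * (err - lam * b_x)          # user-bias update
    b_i_new = b_i + eta * (err - lam * b_i)          # movie-bias update
    q_new = q_i + eta * (err * p_x - lam * q_i)      # factor updates
    p_new = p_x + eta * (err * q_i - lam * p_x)
    return b_x_new, b_i_new, q_new, p_new
```

Each step slightly reduces the error on the rating it touched; iterating over all ratings (for several passes) fits the whole model.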

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view.
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE falls from roughly 0.92 towards 0.885 as factors and then biases are added.]

RECSYS 108

Performance of Various Methods
• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix (Cinematch): 0.9514
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Grand Prize: 0.8563

1162019

RECSYS 109

Temporal Biases Of Users

• Sudden rise in the average movie rating (early 2004):
  – Improvements in the Netflix GUI
  – Meaning of rating changed
• Movie age:
  – Users prefer new movies without any reasons
  – Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model:
r̂_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases:
r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
• p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
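The binned time-dependent item bias b_i(t) = b_i + b_{i,Bin(t)} can be sketched as a simple lookup; the dict-of-lists data layout below is a made-up illustration (only the 10-week binning comes from the slide):

```python
def item_bias_at(i, t_days, b_static, b_bin, bin_weeks=10):
    """Time-dependent item bias b_i(t) = b_i + b_{i,Bin(t)}.

    b_static: dict item -> static bias b_i
    b_bin:    dict item -> list of per-bin offsets b_{i,bin}
    Each bin spans bin_weeks consecutive weeks (10 on the slide);
    t_days is the rating's day counted from the start of the data.
    """
    bin_idx = t_days // (bin_weeks * 7)   # which 10-week bin day t falls in
    return b_static[i] + b_bin[i][bin_idx]

# Movie 0: static bias 0.3, with offsets for three consecutive 10-week bins.
b_static = {0: 0.3}
b_bin = {0: [0.0, 0.1, -0.05]}
```

Day 80 falls in the second bin (80 // 70 = 1), so the movie's bias there is 0.3 + 0.1; user biases b_x(t) get the analogous treatment, with per-day terms on top.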

RECSYS 111

Adding Temporal Effects

[Plot: RMSE vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; adding temporal effects lowers RMSE from roughly 0.92 towards 0.875.]

RECSYS 112

Performance of Various Methods
• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix (Cinematch): 0.9514
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Latent factors + Biases + Time: 0.876
• Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) vs. number of predictors (1 to 50); RMSE falls from about 0.889 with one predictor to about 0.871 with 50, with rapidly diminishing returns.]


RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call".

RECSYS 116

The Last 30 Days

• Ensemble team formed:
  – A group of other teams on the leaderboard forms a new team
  – Relies on combining their models
  – Quickly also gets a qualifying score over 10%
• BellKor:
  – Continue to get small improvements in their scores
  – Realize that they are in direct competition with Ensemble
• Strategy:
  – Both teams carefully monitor the leaderboard
  – The only sure way to check for improvement is to submit a set of predictions; this alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

• Submissions were limited to 1 a day, so only 1 final submission could be made in the last 24h.
• 24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.
• Frantic last 24 hours for both teams:
  – Much computer time on final optimization
  – Carefully calibrated to end about an hour before the deadline
• Final submissions:
  – BellKor submits a little early (on purpose), 40 mins before the deadline
  – Ensemble submits their final entry 20 mins later
  – … and everyone waits …

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data behind the "Effect of ensemble size" chart (RMSE vs. number of predictors): RMSE falls from 0.889067 with 1 predictor to 0.880586 with 2, 0.874209 with 10, 0.872654 with 20, 0.871945 with 30, 0.871482 with 40, and 0.871197 with 50.
Page 45: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 47

Tip Add Data Leverage all the data

Donrsquot try to reduce data size in an effort to make fancy algorithms work

Simple methods on large data do best

Add more data eg add IMDB data on genres

More data beats better algorithmshttpanandtypepadcomdatawocky200803more-data-usualhtml

RECSYS 48

Recommender SystemsLatent Factor Models

CS246 Mining Massive DatasetsJure Leskovec Stanford University

httpcs246stanfordedu

RECSYS 49

Geared towards females

Geared towards males

Serious

Funny

Collaborative Filtering viaLatent Factor Models (eg SVD)

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

RECSYS 50Koren Bell Volinksy IEEE Computer 2009

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_x −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:
 • P ← P − η·∇P
 • Q ← Q − η·∇Q
 • Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_x −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72
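As a concrete illustration of the batch update above, here is a minimal NumPy sketch of gradient descent on the regularized objective. The toy matrix, learning rate, and λ are made-up values for illustration, not from the slides:

```python
import numpy as np

def batch_gd_step(R, known, P, Q, eta=0.005, lam=0.1):
    """One batch GD step: P <- P - eta*grad_P, Q <- Q - eta*grad_Q."""
    E = known * (R - P @ Q.T)            # residuals on known ratings only
    grad_P = -2 * E @ Q + 2 * lam * P    # gradient of SSE + lam*||P||^2
    grad_Q = -2 * E.T @ P + 2 * lam * Q
    return P - eta * grad_P, Q - eta * grad_Q

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0]])          # 0 = missing rating
known = R > 0
P = rng.normal(scale=0.1, size=(2, 2))   # user factors
Q = rng.normal(scale=0.1, size=(3, 2))   # item factors
sse_start = float((known * (R - P @ Q.T) ** 2).sum())
for _ in range(500):
    P, Q = batch_gd_step(R, known, P, Q)
sse_end = float((known * (R - P @ Q.T) ** 2).sum())
```

Note how the mask keeps the missing entries out of the error term, matching the sum over training ratings only.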

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where
 ∇q_if = Σ_x −2 (r_xi − q_i·p_x^T) p_xf + 2λ q_if = Σ_x ∇Q(r_xi)
 • Here q_if is entry f of row q_i of matrix Q
 GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
 SGD: Q ← Q − μ·∇Q(r_xi)
 Faster convergence!
 • Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
 • ε_xi = r_xi − q_i·p_x^T (derivative of the "error")
 • q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
 • p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)
 2 for loops: For until convergence:
 • For each r_xi: Compute gradient, do a "step"

1162019

η … learning rate

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009
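The per-rating update rules above translate directly into code. This is a hedged sketch with toy ratings and invented hyperparameters, using plain NumPy:

```python
import numpy as np

def sgd_epoch(ratings, P, Q, eta=0.02, lam=0.01):
    """One SGD pass: for each known rating r_xi, step its two factor vectors."""
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]                    # eps_xi = r_xi - q_i . p_x
        qi = Q[i].copy()                         # keep old q_i for p_x's update
        Q[i] += eta * (err * P[x] - lam * Q[i])  # q_i update
        P[x] += eta * (err * qi - lam * P[x])    # p_x update

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]  # (user, item, r)
rng = np.random.default_rng(1)
P = rng.normal(scale=0.1, size=(2, 2))   # user factors
Q = rng.normal(scale=0.1, size=(3, 2))   # item factors
for _ in range(1000):
    sgd_epoch(ratings, P, Q)
pred = Q[0] @ P[0]   # reconstructed rating for user 0, item 0
```

Each epoch touches every known rating once, so one epoch costs far less than one full batch gradient, as the slide notes.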

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
 Quantify goodness using SSE: so, lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods!

1162019

[Figure: sparse utility matrix with known star ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data, 2000-2005
• Test data
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition
 – 2700+ teams
 – $1 million grand prize for 10% improvement on Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples for the Training data and Test data]
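The RMSE evaluation criterion mentioned above is simple to compute. A quick sketch (the example ratings are made up):

```python
import math

def rmse(pred, truth):
    """Root mean squared error -- the Netflix Prize evaluation criterion."""
    se = sum((p - t) ** 2 for p, t in zip(pred, truth))
    return math.sqrt(se / len(truth))

# Cinematch scored RMSE 0.9514 on the test set; the grand prize
# required beating that by 10%.
score = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 3.0])
```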

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor
 Separates users and movies
 Benefits from insights into user's behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
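The baseline prediction is just a sum of three numbers; a tiny sketch reproducing the slide's arithmetic:

```python
def baseline(mu, b_user, b_item):
    """Baseline predictor: overall mean plus user bias plus movie bias."""
    return mu + b_user + b_item

# The slide's example: mu = 3.7, critical reviewer b_x = -1, Star Wars b_i = +0.5
prediction = baseline(3.7, -1.0, 0.5)
```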

1162019

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²   [goodness of fit]
  + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   [regularization]

 Stochastic gradient descent to find the parameters
 Note: both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set

1162019

RECSYS 106
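One SGD step for the biased model can be sketched as follows. This is a minimal illustration with invented hyperparameters and toy data, not the BellKor implementation:

```python
import numpy as np

def sgd_bias_step(r, x, i, mu, bx, bi, P, Q, eta=0.02, lam=0.01):
    """SGD update when biases b_x, b_i are parameters alongside q_i, p_x."""
    err = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
    bx[x] += eta * (err - lam * bx[x])
    bi[i] += eta * (err - lam * bi[i])
    qi = Q[i].copy()                         # keep old q_i for p_x's update
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * qi - lam * P[x])

ratings = [(0, 0, 5.0), (1, 0, 4.0), (0, 1, 1.0)]   # (user, item, rating)
mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean rating
bx, bi = np.zeros(2), np.zeros(2)
rng = np.random.default_rng(2)
P = rng.normal(scale=0.1, size=(2, 2))
Q = rng.normal(scale=0.1, size=(2, 2))
for _ in range(1000):
    for x, i, r in ratings:
        sgd_bias_step(r, x, i, mu, bx, bi, P, Q)
pred = mu + bx[0] + bi[0] + Q[0] @ P[0]
```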

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: Combine top-level "regional" modeling of the data with a refined local view:
 Global:
 • Overall deviations of users/movies
 Factorization:
 • Addressing "regional" effects
 Collaborative filtering:
 • Extract local patterns

[Diagram: Global effects → Factorization → Collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in the Netflix GUI
 Meaning of rating changed

Movie age:
 Users prefer new movies without any reason
 Older movies are just inherently better than newer ones

1162019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 111
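The binned bias parameterization above (each bin spanning 10 consecutive weeks) might look like this in code. The bin layout, dict structure, and numbers are all assumptions for illustration:

```python
from datetime import date

BIN_WEEKS = 10  # each bias bin covers 10 consecutive weeks, as on the slide

def time_bin(t, t0):
    """Map a rating date to the index of its 10-week bin."""
    return (t - t0).days // (7 * BIN_WEEKS)

def movie_bias(i, t, b_static, b_bin, t0=date(2000, 1, 1)):
    """b_i(t) as a static movie bias plus a per-bin offset (hypothetical layout)."""
    return b_static[i] + b_bin[i][time_bin(t, t0)]

b_static = {0: 0.5}               # movie 0 is rated 0.5 above average overall
b_bin = {0: {0: 0.1, 1: -0.2}}    # ...but dips in its second 10-week bin
bias = movie_bias(0, date(2000, 3, 15), b_static, b_bin)
```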

Adding Temporal Effects

1162019

[Plot: RMSE (0.875–0.92) vs. millions of parameters (1–10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

1162019

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the ensemble — RMSE falls from ~0.889 with 1 predictor to ~0.871 with 50]

RECSYS 115

Standing on June 26th 2009

1162019 — June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also gets a qualifying score over 10% improvement

BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

Strategy:
 Both teams carefully monitoring the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

1162019

[Embedded chart data for "Effect of ensemble size": RMSE by number of predictors, from 0.8891 with 1 predictor down to 0.8712 with 50]

RECSYS 48

Recommender Systems: Latent Factor Models

CS246: Mining Massive Datasets, Jure Leskovec, Stanford University

http://cs246.stanford.edu

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g. SVD)

[Scatter plot: movies placed along Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 — Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 51

The Netflix Utility Matrix R

[Figure: matrix R, ~480,000 users × 17,700 movies, sparsely filled with star ratings 1–5]

1162019

RECSYS 52

Utility Matrix R: Evaluation

[Figure: matrix R (~480,000 users × 17,700 movies) split into a Training Data Set and a Test Data Set; r̂₃₆ marks a predicted test entry]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²

(r̂_xi … predicted rating; r_xi … true rating of user x on item i)

1162019

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT
 R has missing entries, but let's ignore that for now!
 • Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: rating matrix R (users × items) ≈ Q (items × f factors) · PT (f factors × users); cf. SVD: A = U Σ VT]

RECSYS 54

Ratings as Products of Factors
 How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

(q_i = row i of Q; p_x = column x of PT)

[Figure: R ≈ Q · PT with the missing entry highlighted]

1162019

RECSYS 55

Ratings as Products of Factors

[Repeated build of the previous "Ratings as Products of Factors" slide: r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf]

RECSYS 56

Ratings as Products of Factors

[Repeated build of the previous "Ratings as Products of Factors" slide, with the estimated rating (2.4) filled in]

RECSYS 57

Latent Factor Models

[Scatter plot: movies placed along Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny)]

1162019

RECSYS 58

Latent Factor Models

[Repeated build of the previous "Latent Factor Models" scatter plot]

1162019

RECSYS 59

Recap: SVD

Remember SVD:
 A: Input data matrix
 U: Left singular vecs
 V: Right singular vecs
 Σ: Singular values

SVD gives minimum reconstruction error (SSE):
 min_{U,V,Σ} Σ_ij (A_ij − (U Σ V^T)_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · PT
 A = R, Q = U, PT = Σ VT

 r̂_xi = q_i · p_x^T

But we are not done yet! R has missing entries
 The sum goes over all entries, but our R has missing entries

[Figure: A (m × n) ≈ U Σ VT]

1162019

RECSYS 60
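NumPy's SVD can illustrate the minimum-SSE property on a fully observed matrix. A quick sketch (the matrix is made up); by the Eckart-Young theorem, the rank-k error equals the sum of squared discarded singular values:

```python
import numpy as np

# A fully observed matrix -- no missing entries, so plain SVD applies.
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation in SSE
sse = float(((A - A_k) ** 2).sum())        # equals s[2]**2 (Eckart-Young)
```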

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:
 min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i·p_x^T)²

 r̂_xi = q_i · p_x^T

Note:
 • We don't require cols of P, Q to be orthogonal/unit length
 • P, Q map users/movies to a latent space
 • The most popular model among Netflix contestants

[Figure: R (users × items) ≈ Q (items × f factors) · PT (f factors × users)]

1162019

RECSYS 61

Dealing with Missing Entries
 Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
 Want large f (# of factors) to capture all the signals
 But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
 Allow a rich model where there are sufficient data
 Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

("error" + λ·"length"; λ … regularization parameter)

[Figure: sparse utility matrix]

1162019

RECSYS 62
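The regularized objective above, restricted to known entries, can be written as a small helper. This is a sketch with a hand-checkable toy example (all values invented):

```python
import numpy as np

def regularized_sse(R, known, P, Q, lam):
    """Squared error on known ratings plus lambda * squared factor lengths."""
    err = float((known * (R - P @ Q.T) ** 2).sum())
    length = float((P ** 2).sum() + (Q ** 2).sum())
    return err + lam * length

R = np.array([[3.0, 0.0],
              [0.0, 1.0]])
known = np.array([[True, False],
                  [False, True]])
P = np.ones((2, 1))      # user factors (f = 1)
Q = np.ones((2, 1))      # item factors
value = regularized_sse(R, known, P, Q, lam=0.5)   # (3-1)^2 + 0 + 0.5*(2+2)
```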

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])

[Scatter plot: movies and users (Gus, Dave) placed along "geared towards females ↔ males" and "serious ↔ escapist" axes]

1162019

RECSYS 63

Dealing with Missing Entries

[Repeat of the earlier "Dealing with Missing Entries" slide: same regularized objective]

1162019

RECSYS 64

The Effect of Regularization

[Scatter plot: movies placed along Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny)]

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

1162019

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 - 1 + 0.5 = 3.2
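The arithmetic above is trivial to check; `baseline` is a helper name chosen here, not something from the slides:

```python
# Baseline predictor from the slide: r_hat = mu + b_x + b_i.
# The numbers are the slide's example (critical reviewer rating Star Wars).
def baseline(mu, b_x, b_i):
    return mu + b_x + b_i

rating = baseline(mu=3.7, b_x=-1.0, b_i=0.5)  # 3.7 - 1 + 0.5 = 3.2
```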


RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i p_x^T) )² + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )

(the first, squared-error term is the goodness of fit; the λ term is the regularization)

λ is selected via grid-search on a validation set
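A minimal SGD loop for this biased model might look as follows. The data, dimensions, learning rate η, and λ are all made up, and each update simply moves a parameter along its error gradient with shrinkage, mirroring the objective above:

```python
import numpy as np

# Synthetic (user, item, rating) triples -- all values are illustrative.
rng = np.random.default_rng(1)
n_users, n_items, f = 30, 20, 4
ratings = [(int(rng.integers(n_users)), int(rng.integers(n_items)),
            float(rng.uniform(1, 5))) for _ in range(400)]

mu = sum(r for _, _, r in ratings) / len(ratings)  # overall mean rating
b_u = np.zeros(n_users)                            # user biases b_x
b_m = np.zeros(n_items)                            # movie biases b_i
P = rng.normal(0, 0.1, (n_users, f))               # user factors p_x
Q = rng.normal(0, 0.1, (n_items, f))               # movie factors q_i
eta, lam = 0.01, 0.1                               # learning rate, lambda

def sse():
    return sum((r - (mu + b_u[u] + b_m[i] + Q[i] @ P[u])) ** 2
               for u, i, r in ratings)

before = sse()
for _ in range(20):                                # SGD epochs
    for u, i, r in ratings:
        e = r - (mu + b_u[u] + b_m[i] + Q[i] @ P[u])  # prediction error
        q_old = Q[i].copy()                           # use old q for p's update
        b_u[u] += eta * (e - lam * b_u[u])
        b_m[i] += eta * (e - lam * b_m[i])
        Q[i] += eta * (e * P[u] - lam * Q[i])
        P[u] += eta * (e * q_old - lam * P[u])
after = sse()
```

Each pass reduces the training SSE; in practice one would monitor a validation set to pick λ and decide when to stop.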

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004): improvements in the Netflix GUI; the meaning of a rating changed.

Movie age: It is not that users prefer new movies; older movies are just inherently better than newer ones.

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … user preference vector on day t
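The "bins of 10 consecutive weeks" for b_i(t) amount to a date-to-bin mapping plus a per-bin bias; the start date and bias values below are invented for illustration:

```python
from datetime import date

# Map a rating date to its 10-week bin, then add that bin's learned bias.
T0 = date(2000, 1, 1)          # assumed start of the rating period
BIN_DAYS = 70                  # 10 consecutive weeks per bin

def time_bin(t: date) -> int:
    return (t - T0).days // BIN_DAYS

def item_bias(b_i: float, b_i_bin: dict, t: date) -> float:
    # b_i(t) = static movie bias + bias of the bin containing t
    return b_i + b_i_bin[time_bin(t)]

b = item_bias(0.5, {0: 0.1, 1: -0.2}, date(2000, 1, 15))  # bin 0
```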

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533; User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) vs. number of predictors (1 to 50); RMSE falls from about 0.889 with 1 predictor to about 0.871 with 50]


RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call".

RECSYS 116

The Last 30 Days

Ensemble team formed: A group of other teams on the leaderboard forms a new team. It relies on combining their models, and quickly also gets a qualifying score over 10%.

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy: Both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197

RECSYS 49

Collaborative Filtering via Latent Factor Models (e.g., SVD)

[Figure: movies placed in a 2-D latent space (x-axis: geared towards females vs. geared towards males; y-axis: serious vs. funny), with titles such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R

[Figure: the sparse utility matrix R, 480,000 users × 17,700 movies; a few entries hold observed ratings 1-5 and the rest are missing]

RECSYS 52

Utility Matrix R Evaluation

[Figure: the same 480,000 × 17,700 matrix R split into a training set of observed ratings and a held-out test set whose true ratings are hidden; r̂_36 marks one predicted entry]

SSE = Σ_{(x,i)∈R} ( r̂_xi − r_xi )²

(r̂_xi: predicted rating; r_xi: true rating of user x on item i)
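The SSE (and the RMSE used as the Netflix evaluation criterion) can be computed directly over the held-out cells; the tiny `truth`/`preds` dictionaries below are invented stand-ins for the test set:

```python
import math

# Invented held-out test entries: {(user, item): rating}.
truth = {(0, 0): 1.0, (0, 1): 3.0, (1, 2): 5.0, (2, 0): 4.0}
preds = {(0, 0): 1.5, (0, 1): 2.5, (1, 2): 4.0, (2, 0): 4.5}

# SSE = sum over test entries of (r_hat_xi - r_xi)^2; RMSE is its scaled root.
sse = sum((preds[k] - truth[k]) ** 2 for k in truth)
rmse = math.sqrt(sse / len(truth))
```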

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · P^T.

R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[Figure: the users × items matrix R factored as Q (items × f factors) times P^T (f factors × users), by analogy with SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

(q_i = row i of Q; p_x = column x of P^T)

[Figure: R ≈ Q · P^T, with row q_i of Q and column p_x of P^T highlighted; Q is items × f factors, P^T is f factors × users]
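A toy check of the estimate, with invented factor values for f = 2:

```python
import numpy as np

# Made-up factor matrices: Q holds one row per item, P one row per user.
Q = np.array([[1.0, 0.2],
              [0.5, -1.0]])          # item factors q_i
P = np.array([[2.0, 0.5],
              [-0.3, 1.5]])          # user factors p_x

def predict(item: int, user: int) -> float:
    # r_hat_xi = q_i . p_x = sum_f q_if * p_xf
    return float(Q[item] @ P[user])

r_hat = predict(item=0, user=0)      # 1.0*2.0 + 0.2*0.5 = 2.1
```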

RECSYS 55-56

Ratings as Products of Factors (continued): the same estimate r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, shown step by step; in the example the highlighted missing entry is estimated as 2.4.

RECSYS 57

Latent Factor Models

[Figure: the same movies plotted against the learned factors; Factor 1 runs from geared towards females to geared towards males, Factor 2 from funny to serious]

RECSYS 58


RECSYS 59

Recap SVD

Remember SVD: A = U Σ V^T
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_{ij} ( A_ij − [U Σ V^T]_ij )²

So in our case, "SVD" on Netflix data R ≈ Q · P^T: A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T

But we are not done yet! The SSE sum goes over all entries, while our R has missing entries.

[Figure: A (m × n) ≈ U Σ V^T]

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · P^T; Q is items × f factors, P^T is f factors × users]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on the training data. We want a large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2!

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

λ … regularization parameter; the first term is the "error", the λ term the "length"
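Evaluating (not minimizing) this regularized objective on made-up factors shows how the "error" and "length" terms combine:

```python
import numpy as np

# Illustrative observed ratings {(user, item): r} and toy factor matrices.
ratings = {(0, 0): 4.0, (1, 1): 2.0, (1, 0): 5.0}
P = np.array([[1.0, 0.5], [0.8, -0.2]])   # user factors p_x
Q = np.array([[2.0, 1.0], [0.5, 0.3]])    # item factors q_i
lam = 0.1                                 # regularization parameter lambda

# "error": squared residuals on observed entries only.
error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
# "length": lambda times the squared norms of all factor vectors.
length = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
objective = error + length
```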

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: users (e.g., Gus, Dave) embedded in the same latent space as the movies; axes run from geared towards females to geared towards males and from serious to escapist]

RECSYS 63


RECSYS 64

The Effect of Regularization

min_factors "error" + λ · "length":

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

[Figure: movies in the latent space (Factor 1 vs. Factor 2); as λ grows, sparsely rated titles such as The Princess Diaries are pulled toward the origin]


RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x^T )² + λ ( Σ_x ||p_x||² + Σ_i ||q_i||² )

Gradient descent:
• Initialize P and Q (using SVD, pretending the missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if
  (q_if is entry f of row q_i of matrix Q)
• Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression: see the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2 ( r_xi − q_i p_x^T ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)

(q_if is entry f of row q_i of matrix Q)

GD: Q ← Q − η · Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − η · ∇Q(r_xi)

Faster convergence!
• Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step; GD decreases smoothly, SGD decreases noisily]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretending the missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T (the "error")
  q_i ← q_i + η ( ε_xi p_x − λ q_i ) (update equation)
  p_x ← p_x + η ( ε_xi q_i − λ p_x ) (update equation)

2 for loops:
• For … until convergence:
  • For each r_xi: compute the gradient, do a "step"

η … learning rate
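The two-loop scheme above, for the plain (bias-free) latent factor model, can be sketched on synthetic data; all constants are illustrative:

```python
import numpy as np

# Synthetic (user, item, rating) triples and small random factors.
rng = np.random.default_rng(7)
n_users, n_items, f = 15, 10, 3
ratings = [(int(rng.integers(n_users)), int(rng.integers(n_items)),
            float(rng.uniform(1, 5))) for _ in range(150)]

P = rng.normal(0, 0.1, (n_users, f))   # user factors p_x
Q = rng.normal(0, 0.1, (n_items, f))   # item factors q_i
eta, lam = 0.02, 0.05                  # learning rate eta, lambda

def sse():
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

before = sse()
for _ in range(30):                    # outer loop: until "convergence"
    for x, i, r in ratings:            # inner loop: one step per rating
        e = r - Q[i] @ P[x]            # eps_xi, the "error"
        q_old = Q[i].copy()
        Q[i] += eta * (e * P[x] - lam * Q[i])
        P[x] += eta * (e * q_old - lam * P[x])
after = sse()
```

Each individual step uses only one rating's gradient, which is exactly the SGD trade-off described above: many cheap steps instead of a few expensive ones.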

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)


RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen.

So: set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • # ratings per movie
  • # ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197

RECSYS 50 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 51

The Netflix Utility Matrix R
(~480,000 users × 17,700 movies)

[Figure: a sparse excerpt of the matrix R; sample ratings per row: 1 3 4 | 3 5 5 | 4 5 5 | 3 | 3 | 2 2 2 | 5 | 2 1 1 | 3 3 | 1 — most cells are empty (missing ratings)]

RECSYS 52

Utility Matrix R: Evaluation

[Figure: the same excerpt of R (~480,000 users × 17,700 movies), split into a Training Data Set of known ratings and a Test Data Set of held-out ratings such as r_36]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)², taken over the test set

where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i.
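The SSE above is easy to state in code; a minimal sketch on made-up toy data (not the Netflix set), with a trivially constant predictor:

```python
# Minimal SSE sketch over toy held-out ratings (all values made up).
def sse(test_ratings, predict):
    """test_ratings: iterable of (user x, item i, true rating r_xi);
    predict: function (x, i) -> predicted rating r_hat_xi."""
    return sum((predict(x, i) - r) ** 2 for x, i, r in test_ratings)

test = [(0, 0, 1.0), (0, 2, 3.0), (1, 1, 5.0)]   # held-out (x, i, r_xi)
print(sse(test, lambda x, i: 3.0))                # (3-1)^2 + 0 + (3-5)^2 = 8.0
```

Any real predictor (latent factors, biases, …) just plugs in as `predict`.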

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT

For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT.

R has missing entries, but let's ignore that for now:
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones.

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); compare SVD: A = U Σ VT]
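The approximation R ≈ Q · PT can be made concrete with NumPy; the factor values below are made up purely for illustration:

```python
import numpy as np

# Toy "thin" factors: 2 items, 3 users, f = 2 latent factors (values made up).
Q = np.array([[1.0, 0.5],          # items x f
              [0.2, 1.0]])
PT = np.array([[1.0, 2.0, 0.0],    # f x users
               [0.0, 1.0, 2.0]])

R_hat = Q @ PT                     # reconstructed rating matrix, items x users
r_hat_01 = Q[0] @ PT[:, 1]         # single entry: item i=0, user x=1
print(r_hat_01)                    # 1.0*2.0 + 0.5*1.0 = 2.5
```

The single-entry line is exactly the dot product r̂_xi = q_i · p_x used on the next slides.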

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

qi = row i of Q; px = column x of PT

[Figure: R ≈ Q · PT with the missing entry (x, i) highlighted]

RECSYS 55

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

qi = row i of Q; px = column x of PT

[Figure: R ≈ Q · PT, highlighting row q_i of Q and column p_x of PT]

RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

qi = row i of Q; px = column x of PT

[Figure: R ≈ Q · PT; the estimated value 2.4 is filled into the missing cell]

RECSYS 57

Latent Factor Models

[Figure: movies placed in a two-factor space. Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 58

Latent Factor Models

[Figure: the same two-factor space (Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny) with the same movies]

RECSYS 59

Recap: SVD

Remember SVD: A = U Σ VT
• A: input data matrix
• U: left singular vectors
• V: right singular vectors
• Σ: singular values

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_ij (A_ij − [U Σ VT]_ij)²

So in our case, "SVD" on Netflix data R ≈ Q · PT means: A = R, Q = U, PT = Σ VT, and r̂_xi = q_i · p_x^T.

But we are not done yet! The sum goes over all entries, and our R has missing entries.

[Figure: A (m × n) ≈ U Σ VT]
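For a fully observed matrix, the decomposition and the mapping to the slide's notation can be checked directly with NumPy (the matrix below is a made-up toy):

```python
import numpy as np

# SVD of a small fully observed matrix: A = U Sigma V^T.
A = np.array([[4.0, 5.0, 1.0],
              [3.0, 1.0, 2.0],
              [5.0, 4.0, 3.0]])
U, s, VT = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ VT, A)   # exact reconstruction

# Mapping to the slide's notation: Q = U and P^T = Sigma V^T, so A = Q P^T.
Q, PT = U, np.diag(s) @ VT
assert np.allclose(Q @ PT, A)
```

With missing entries this no longer applies directly, which is exactly the problem the next slide addresses.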

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)², with r̂_xi = q_i · p_x^T

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users)]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. We want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)²  +  λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
          ("error")                                  ("length")

λ … regularization parameter
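The regularized objective is just "error" plus λ times "length", and can be evaluated directly; a sketch with made-up toy ratings, factors, and λ:

```python
import numpy as np

# Evaluate  sum_(x,i) (r_xi - q_i . p_x)^2 + lam * (sum_x ||p_x||^2 + sum_i ||q_i||^2)
# on toy data (ratings, factors, and lam are all made up).
ratings = [(0, 0, 5.0), (1, 1, 1.0)]          # (user x, item i, r_xi)
P = np.array([[2.0, 0.0], [0.0, 1.0]])        # user factors p_x (users x f)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])        # item factors q_i (items x f)
lam = 0.1

error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)   # "error" term
length = (P ** 2).sum() + (Q ** 2).sum()                     # "length" term
print(error + lam * length)                   # 1.0 + 0.1 * 10.5 = 2.05
```

Raising λ penalizes large factor vectors, which is what shrinks the sparsely rated points in the figures that follow.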

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Figure: movies and two users, Gus and Dave, placed in the two-factor space (serious ↔ escapist, geared towards females ↔ geared towards males). Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 63

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. We want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)²  +  λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

λ … regularization parameter

RECSYS 64

The Effect of Regularization

[Figure: movies and users in the two-factor space (serious ↔ funny, geared towards females ↔ geared towards males); successive slides animate how regularization moves the factor estimates]

min_factors "error" + λ · "length", i.e. min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

RECSYS 65

The Effect of Regularization

[Figure: the two-factor space at a later step of the regularization animation]

min_factors "error" + λ · "length"

RECSYS 66

The Effect of Regularization

[Figure: the two-factor space at a later step of the regularization animation]

min_factors "error" + λ · "length"

RECSYS 67

The Effect of Regularization

[Figure: the two-factor space at the final step of the regularization animation]

min_factors "error" + λ · "length"

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  – P ← P − η · ∇P
  – Q ← Q − η · ∇Q
  – where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
  – Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow.

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  – P ← P − η · ∇P
  – Q ← Q − η · ∇Q
  – where ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
  – Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently.

Observation: Computing gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} [−2 (r_xi − q_i p_x^T) p_xf + 2λ q_if] = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η · ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

Faster convergence!
• Needs more steps, but each step is computed much faster.

RECSYS 73

SGD vs GD

Convergence of GD vs. SGD:

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  – ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
  – q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
  – p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)

2 for loops:
• For until convergence:
  – For each r_xi: compute gradient, do a "step"

η … learning rate
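The two SGD loops above can be sketched as follows; the toy ratings, factor sizes, and hyper-parameters (η, λ, number of epochs, init scale) are made-up choices for illustration, not the ones used in the competition:

```python
import numpy as np

# SGD for regularized matrix factorization on toy ratings (all values made up).
rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (x, i, r_xi)
n_users, n_items, f = 3, 2, 2
eta, lam = 0.05, 0.1                          # learning rate, regularization

P = 0.1 * rng.standard_normal((n_users, f))   # user factors p_x
Q = 0.1 * rng.standard_normal((n_items, f))   # item factors q_i

def train_sse():
    return sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)

init_sse = train_sse()
for epoch in range(500):                      # outer loop: until "convergence"
    for x, i, r in ratings:                   # inner loop: one step per rating
        err = r - Q[i] @ P[x]                 # eps_xi
        Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update
        P[x] += eta * (err * Q[i] - lam * P[x])   # p_x update

print(init_sse, "->", train_sse())            # training SSE drops
```

Each inner-loop step touches only one rating, which is why SGD steps are so much cheaper than full-gradient steps.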

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen.

Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Figure: the utility-matrix excerpt with known and unknown ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: excerpts of the (user, movie, score) training and test data]

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

# ratings per movie
• Avg # ratings/movie: 5627

# ratings per user
• Avg # ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count   Avg rating  Title
137812  4.593       The Shawshank Redemption
133597  4.545       Lord of the Rings: The Return of the King
180883  4.306       The Green Mile
150676  4.460       Lord of the Rings: The Two Towers
139050  4.415       Finding Nemo
117456  4.504       Raiders of the Lost Ark
180736  4.299       Forrest Gump
147932  4.433       Lord of the Rings: The Fellowship of the Ring
149199  4.325       The Sixth Sense
144027  4.333       Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction)

Baseline predictor:
• Separates users and movies
• Benefits from insights into the user's behavior
• Among the main practical contributions of the competition

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1
• Star Wars gets a mean rating of 0.5 higher than the average movie, so b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2

1162019
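The worked example reduces to one line of arithmetic; a minimal sketch of the baseline predictor (function name is ours, numbers from the example):

```python
# Baseline prediction b_xi = mu + b_x + b_i.
def baseline(mu, b_x, b_i):
    return mu + b_x + b_i

# mu = 3.7, critical user b_x = -1, Star Wars b_i = +0.5
print(baseline(3.7, -1.0, 0.5))   # 3.2
```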

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))²   (goodness of fit)
        + λ (Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)     (regularization)

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).

λ is selected via grid-search on a validation set.

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view.
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (≈0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
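The 10-week bias bins can be sketched in a few lines; the bin offsets and the exact parameterization here are made up for illustration (the paper also combines binned terms with linear trends):

```python
# Time-dependent item bias b_i(t) via 10-week (70-day) bins (toy offsets).
BIN_DAYS = 70

def time_bin(day):
    return day // BIN_DAYS

def item_bias(b_i, bin_offsets, day):
    # b_i(t) = b_i + b_{i, Bin(t)}: a static bias plus a per-bin offset
    return b_i + bin_offsets.get(time_bin(day), 0.0)

bin_offsets = {0: 0.1, 1: -0.2}            # "learned" per-bin offsets (made up)
print(item_bias(0.5, bin_offsets, 75))     # day 75 -> bin 1 -> 0.5 + (-0.2)
```

Unseen bins fall back to the static bias alone, a simple way to handle future dates.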

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (≈0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: error (RMSE) vs. number of predictors in the ensemble, falling from ≈0.889 with 1 predictor to ≈0.871 with 50 predictors]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09.
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 49: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 51

The Netflix Utility Matrix R

1 3 4

3 5 5

4 5 5

3

3

2 2 2

5

2 1 1

3 3

1

480000 users

17700 movies

1162019

Matrix R

RECSYS 52

Utility Matrix R Evaluation

1 3 4

3 5 5

4 5 5

3

3

2

2 1

3

1

Test Data Set

SSE = sum(119909119909119909119909)isin119877119877 119903119909119909119909119909 minus 119903119903119909119909119909119909 2

1162019

480000 users

17700 movies

Predicted rating

True rating of user x on item i

119955119955120785120785120788120788

Matrix R

Training Data Set

RECSYS 53

Latent Factor Models

ldquoSVDrdquo on Netflix data R asymp Q middot PT

For now letrsquos assume we can approximate the rating matrix R as a product of ldquothinrdquo Q middot PT

R has missing entries but letrsquos ignore that for nowbull Basically we will want the reconstruction error to be small on

known ratings and we donrsquot care about the values on the missing ones

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

users

item

s

PT

Q

item

s

users

R

SVD A = U Σ VT

f factors

ffactors

RECSYS 54

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

ffac

tors

Qf factors

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!
Use specialized methods to find P, Q:
min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
- We don't require cols of P, Q to be orthogonal/unit length
- P, Q map users/movies to a latent space
- The most popular model among Netflix contestants

r̂_xi = q_i · p_x^T

[Figure: R (items × users, with missing entries) ≈ Q (items × f factors) · PT (f factors × users)]
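The prediction and the reconstruction error over known entries can be sketched numerically. This is a minimal sketch with made-up toy matrices and ratings (none of the values come from the slides):

```python
import numpy as np

# Toy factor matrices (assumed values for illustration):
# Q holds one f-dimensional row per item, P one per user.
Q = np.array([[0.5, 1.0],
              [1.2, -0.3],
              [0.1, 0.8]])              # 3 items, f = 2 factors
P = np.array([[1.0, 0.5],
              [0.2, 1.5]])              # 2 users, f = 2 factors

def predict(i, x):
    """Estimated rating of user x for item i: q_i . p_x = sum_f q_if * p_xf."""
    return Q[i] @ P[x]

# Reconstruction error is summed only over the known ratings;
# missing entries of R are simply skipped.
known = [(0, 0, 1.0), (1, 1, 0.5), (2, 0, 1.0)]   # (item, user, rating)
sse = sum((r - predict(i, x)) ** 2 for i, x, r in known)
```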

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.
Idea: Minimize SSE on training data
- Want large f (# of factors) to capture all the signals
- But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
- Allow rich model where there are sufficient data
- Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ (Σ_x ||p_x||² + Σ_i ||q_i||²)
          "error"                                     "length"

λ … regularization parameter

[Figure: toy training utility matrix with observed ratings]
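The regularized objective above can be written out directly. A minimal sketch with assumed toy values (`lam` plays the role of the slide's λ):

```python
import numpy as np

# Toy factor matrices and training ratings (illustrative values only).
Q = np.array([[0.5, 1.0], [1.2, -0.3]])   # items x factors
P = np.array([[1.0, 0.5], [0.2, 1.5]])    # users x factors
training = [(0, 0, 1.0), (1, 1, 0.5)]     # (item, user, rating)
lam = 0.1                                  # regularization strength

def objective(Q, P, training, lam):
    # "error": SSE over the observed training ratings
    error = sum((r - Q[i] @ P[x]) ** 2 for i, x, r in training)
    # "length": squared norms of all factor vectors
    length = (P ** 2).sum() + (Q ** 2).sum()
    return error + lam * length
```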

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: users (Gus, Dave) and movies embedded in the same 2-D latent space — geared towards females ↔ geared towards males vs. serious ↔ escapist; titles: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility. A user is matched to nearby movies]


RECSYS 64

The Effect of Regularization

min_{factors} "error" + λ · "length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

[Figure: movies in the 2-D latent space — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]

RECSYS 65–67

The Effect of Regularization (continued)

[Slides 65–67 repeat the same latent-space figure as animation frames: under the regularized objective min_{factors} "error" + λ·"length", the factors of sparsely-rated movies such as The Princess Diaries shrink toward the origin, while movies with plenty of ratings keep rich factor values]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)² + λ (Σ_x ||p_x||² + Σ_i ||q_i||²)

Gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Do gradient descent:
  - P ← P − η·∇P
  - Q ← Q − η·∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x:(x,i)∈training} −2 (r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  - Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
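One full batch step can be sketched as follows; all matrices, ratings, and hyperparameters (`eta`, `lam`) are assumed toy values:

```python
import numpy as np

# Sketch of one batch gradient-descent step for
# min_{P,Q} SSE + lam * (||P||^2 + ||Q||^2).
Q = np.array([[0.5, 1.0], [1.2, -0.3]])   # items x factors
P = np.array([[1.0, 0.5], [0.2, 1.5]])    # users x factors
training = [(0, 0, 1.0), (1, 1, 0.5)]     # (item, user, rating)
eta, lam = 0.01, 0.1

def gradients(Q, P, training, lam):
    # Regularization contributes 2*lam*q_if (resp. 2*lam*p_xf) everywhere.
    gQ, gP = 2 * lam * Q, 2 * lam * P
    for i, x, r in training:              # error term, summed over all ratings
        err = r - Q[i] @ P[x]
        gQ[i] += -2 * err * P[x]
        gP[x] += -2 * err * Q[i]
    return gQ, gP

gQ, gP = gradients(Q, P, training, lam)
Q_new, P_new = Q - eta * gQ, P - eta * gP   # Q <- Q - eta*grad, P <- P - eta*grad
```

Each step touches every training rating, which is what makes full-batch gradients slow on 100 million ratings.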

RECSYS 69

Digression to the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Same setup as the previous slide: initialize P and Q (using SVD, pretending missing ratings are 0), then repeat
- P ← P − η·∇P
- Q ← Q − η·∇Q
until convergence. Each step evaluates the gradient over all training ratings, so computing gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x:(x,i)∈training} −2 (r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
- Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η · Σ_{r_xi} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
SGD: Q ← Q − η · ∇Q(r_xi)

Faster convergence!
- Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step — GD decreases smoothly, SGD decreases noisily]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
  - ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
  - q_i ← q_i + η (ε_xi · p_x − λ · q_i)   (update equation)
  - p_x ← p_x + η (ε_xi · q_i − λ · p_x)   (update equation)
- 2 for loops:
  - For until convergence:
    - For each r_xi: compute gradient, do a "step"

η … learning rate
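The two-loop procedure above can be sketched on a made-up toy dataset; `eta`, `lam`, the factor sizes, and the ratings are all assumed values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_users, f = 3, 2, 2
Q = 0.1 * rng.standard_normal((n_items, f))   # item factors, small random init
P = 0.1 * rng.standard_normal((n_users, f))   # user factors
ratings = [(0, 0, 4.0), (1, 0, 3.0), (1, 1, 5.0), (2, 1, 2.0)]  # (i, x, r_xi)
eta, lam = 0.05, 0.01

for sweep in range(2000):                 # outer loop: "for until convergence"
    for i, x, r in ratings:               # inner loop: "for each r_xi"
        err = r - Q[i] @ P[x]             # eps_xi, the prediction error
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])

sse = sum((r - Q[i] @ P[x]) ** 2 for i, x, r in ratings)
```

Each observed rating triggers one cheap update, instead of one expensive pass over all ratings per step.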

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations
- Quantify goodness using SSE: lower SSE means better recommendations
- We want to make good recommendations on items that some user has not yet seen
- So: set values for P and Q such that they work well on known (user, item) ratings
- And hope these values for P and Q will predict the unknown ratings well

This is a case where we apply optimization methods.

[Figure: toy utility matrix with observed ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
- Training data:
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
- Test data:
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
- Competition:
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on the Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

Movie rating data

[Tables: training data — (user, movie, score) triples with known scores; test data — (user, movie, ?) triples whose scores are withheld]

Overall rating distribution
- A third of ratings are 4s
- Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
- Avg #ratings/movie: 5,627

#ratings per user
- Avg #ratings/user: 208

Average movie rating by movie count
- More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies (count, avg rating, title):
- 137,812  4.593  The Shawshank Redemption
- 133,597  4.545  Lord of the Rings: The Return of the King
- 180,883  4.306  The Green Mile
- 150,676  4.460  Lord of the Rings: The Two Towers
- 139,050  4.415  Finding Nemo
- 117,456  4.504  Raiders of the Lost Ark
- 180,736  4.299  Forrest Gump
- 147,932  4.433  Lord of the Rings: The Fellowship of the Ring
- 149,199  4.325  The Sixth Sense
- 144,027  4.333  Indiana Jones and the Last Crusade

Challenges
- Size of data
  - Scalability
  - Keeping data in memory
- Missing data
  - 99 percent missing
  - Very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g., movie 16322)
- Use an ensemble of complementing predictors
- Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

User-Movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
- Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5
- Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
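The slide's worked example is just the baseline formula evaluated once; here it is as a tiny sketch using the stated values:

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # critical reviewer: rates 1 star below the mean
b_i = 0.5   # Star Wars: rated 0.5 above the average movie

def baseline(mu, b_x, b_i):
    """Baseline prediction: overall mean + user bias + movie bias."""
    return mu + b_x + b_i

r_hat = baseline(mu, b_x, b_i)
```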

1162019

RECSYS 105

Fitting the New Model

Solve:
min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²   ["goodness of fit"]
          + λ (Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²)   ["regularization"]

Stochastic gradient descent to find the parameters
- Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

λ is selected via grid-search on a validation set
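Fitting the biased model with SGD looks like the earlier factor-only loop, with two extra bias updates per rating. A sketch on made-up toy data (`eta`, `lam`, and all sizes are assumed values):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, f = 2, 3, 2
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 1, 5.0), (1, 2, 2.0)]  # (x, i, r_xi)
mu = np.mean([r for _, _, r in ratings])        # overall mean rating
b_user, b_item = np.zeros(n_users), np.zeros(n_items)   # biases: learned too
P = 0.1 * rng.standard_normal((n_users, f))
Q = 0.1 * rng.standard_normal((n_items, f))
eta, lam = 0.05, 0.02

for sweep in range(2000):
    for x, i, r in ratings:
        # error against the full model: mu + b_x + b_i + q_i . p_x
        err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
        b_user[x] += eta * (err - lam * b_user[x])
        b_item[i] += eta * (err - lam * b_item[i])
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])

sse = sum((r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])) ** 2
          for x, i, r in ratings)
```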

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
- Global: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Plot: RMSE (≈0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

1162019

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
- Basic Collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + Biases: 0.89

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
- Improvements in Netflix GUI
- Meaning of rating changed

Movie age:
- Users prefer new movies, without any particular reason
- Older movies are just inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
- Make parameters b_x and b_i depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
- p_x(t) … user preference vector on day t
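The two time-dependence options named above can be sketched as small helper functions; every parameter value here is made up for illustration (dates measured in days):

```python
def user_bias(b_x, alpha_x, t, t_mean):
    """Option (1): b_x(t) = b_x + alpha_x * (t - t_mean), a linear trend
    around the user's mean rating date."""
    return b_x + alpha_x * (t - t_mean)

def item_bias(b_i, b_i_bins, t, t0, bin_days=70):
    """Option (2): b_i(t) = b_i + b_i,Bin(t), one extra bias per bin of
    10 consecutive weeks (70 days)."""
    return b_i + b_i_bins[(t - t0) // bin_days]

bx_t = user_bias(-1.0, 0.01, t=120, t_mean=100)           # drifts upward
bi_t = item_bias(0.5, [0.0, 0.1, -0.2], t=100, t0=0)      # falls in bin 1
```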

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (≈0.875–0.92) vs. millions of parameters (1–10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
- Basic Collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + Biases: 0.89
- Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Plot: Error (RMSE, ≈0.870–0.8925) vs. number of predictors (1–50): RMSE falls steeply for the first few predictors, then flattens]

RECSYS 115

Standing on June 26th, 2009

The June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
- Group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score over 10%

BellKor:
- Continues to get small improvements in their scores
- Realizes they are in direct competition with Ensemble

Strategy:
- Both teams carefully monitor the leaderboard
- The only sure way to check for improvement is to submit a set of predictions
  - This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day
- Only 1 final submission could be made in the last 24h

24 hours before the deadline…
- A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams
- Much computer time spent on final optimization
- Carefully calibrated to end about an hour before the deadline

Final submissions
- BellKor submits a little early (on purpose), 40 mins before the deadline
- Ensemble submits their final entry 20 mins later
- …and everyone waits…

1162019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading
- Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com

1162019

[Embedded chart data for "Effect of ensemble size": RMSE drops from 0.8891 with 1 predictor to 0.8742 (10 predictors), 0.8722 (25 predictors), and 0.8712 (50 predictors)]

RECSYS 52

Utility Matrix R: Evaluation

[Figure: utility matrix R (480,000 users × 17,700 movies) split into a training data set of known ratings and a held-out test data set, e.g., entry r_3,6]

SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
- r̂_xi … predicted rating
- r_xi … true rating of user x on item i
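The evaluation above is easy to sketch: SSE over the held-out test ratings, plus RMSE, the criterion the Netflix Prize used. All ratings below are made-up toy values:

```python
import math

# (user, item) -> true rating on the held-out test set
test_set = {(0, 5): 4.0, (1, 2): 3.0, (2, 6): 5.0}
# (user, item) -> predicted rating r_hat from some model
predicted = {(0, 5): 3.5, (1, 2): 3.0, (2, 6): 4.0}

sse = sum((predicted[k] - r) ** 2 for k, r in test_set.items())
rmse = math.sqrt(sse / len(test_set))   # root mean squared error
```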

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · PT
- For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q · PT
- R has missing entries, but let's ignore that for now!
  - Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: R (items × users) ≈ Q (items × f factors) · PT (f factors × users); compare SVD: A = U Σ VT]

RECSYS 54–55

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Figure, shown in two animation frames: R ≈ Q · PT; the missing entry is read off as r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf, with q_i = row i of Q and p_x = column x of PT]


scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g., SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Effect of ensemble size — chart data (predictors → RMSE): 1 → 0.8891, 5 → 0.8767, 10 → 0.8742, 20 → 0.8727, 30 → 0.8719, 40 → 0.8715, 50 → 0.8712. The gain from each added predictor diminishes rapidly after the first ~15.

RECSYS 53

Latent Factor Models

"SVD" on Netflix data: R ≈ Q · P^T

For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T. R has missing entries, but let's ignore that for now
• Basically, we will want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones

[Figure: the users × items rating matrix R (entries 1–5, mostly missing) factored as R ≈ Q · P^T, with Q an items × f matrix and P^T an f × users matrix; compare SVD: A = U Σ V^T]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

where q_i = row i of Q, p_x = column x of P^T

[Figure: the users × items rating matrix R ≈ Q · P^T, with the missing entry r_xi highlighted; Q is items × f factors, P^T is f factors × users]


RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf (= 2.4 for the highlighted entry in the figure)

where q_i = row i of Q, p_x = column x of P^T

[Figure: the same factorization R ≈ Q · P^T, now with the estimated value 2.4 filled into the missing entry]
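The estimate above is just a dot product over the f factors. A tiny sketch with made-up factor vectors (the numbers are illustrative, not the ones in the figure):

```python
import numpy as np

q_i = np.array([2.0, -0.5])   # row i of Q: factor vector of item i (illustrative)
p_x = np.array([1.5, 1.0])    # column x of P^T: factor vector of user x (illustrative)

r_hat = q_i @ p_x             # r̂_xi = sum over f of q_if * p_xf
print(r_hat)                  # -> 2.5
```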

RECSYS 57

Latent Factor Models

[Figure: movies placed in a 2-D latent factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 from "serious" to "funny". Example placements: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap: SVD

Remember SVD: A = U Σ V^T
• A: input data matrix; U: left singular vecs; V: right singular vecs; Σ: singular values

SVD gives minimum reconstruction error (SSE): min_{U,Σ,V} Σ_xi (A_xi − (U Σ V^T)_xi)²

So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with r̂_xi = q_i · p_x^T
• A = R, Q = U, P^T = Σ V^T

But we are not done yet! The sum goes over all entries — and our R has missing entries.

[Figure: A (m × n) ≈ U Σ V^T]
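On a fully observed matrix, truncated SVD is exactly this minimum-SSE reconstruction (Eckart–Young). A small numpy sketch on an illustrative matrix, keeping f = 2 factors and splitting the result as Q = U, P^T = Σ·V^T:

```python
import numpy as np

A = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])

U, s, Vt = np.linalg.svd(A)      # A = U Sigma V^T
f = 2                            # number of factors to keep
Q = U[:, :f]                     # "Q = U"
Pt = np.diag(s[:f]) @ Vt[:f]     # "P^T = Sigma V^T"
A_hat = Q @ Pt                   # best rank-f approximation of A in SSE

sse = np.sum((A - A_hat) ** 2)   # equals the sum of the dropped singular values squared
```

No other rank-2 factorization achieves a lower SSE on A; the trouble with Netflix data is only that most entries of R are missing, so this sum cannot even be evaluated.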

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)², with r̂_xi = q_i · p_x^T

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · P^T, Q items × f factors, P^T f factors × users]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
          "error"                                 "length"; λ … regularization parameter

[Figure: small utility matrix of known ratings]
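Written out, the objective is just the squared error over known ratings plus λ times the total squared length of the factor vectors. A minimal sketch (the toy ratings, matrix sizes, and λ are illustrative):

```python
import numpy as np

def regularized_sse(ratings, P, Q, lam):
    """Objective: sum over known (x, i) of (r_xi - q_i . p_x)^2
    + lam * (sum_x ||p_x||^2 + sum_i ||q_i||^2)."""
    error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
    length = np.sum(P ** 2) + np.sum(Q ** 2)
    return error + lam * length

ratings = {(0, 0): 1.0, (0, 1): 3.0, (1, 1): 5.0}  # known entries only
P = np.zeros((2, 2))   # users x f
Q = np.zeros((2, 2))   # items x f
print(regularized_sse(ratings, P, Q, lam=0.1))     # -> 35.0 (all-zero factors: pure error)
```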

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: users (Gus, Dave) and movies embedded in the same 2-D latent factor space; axes run geared towards females ↔ males and serious ↔ escapist. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 63

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed:
• Allow rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)
          "error"                                 "length"; λ … regularization parameter

RECSYS 64

The Effect of Regularization

min_factors "error" + λ·"length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

[Figure: movies in the 2-D factor space (serious ↔ funny, geared towards females ↔ males): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

RECSYS 65–67

The Effect of Regularization (continued)

[Animation builds of the same factor-space plot under min_factors "error" + λ·"length": as regularization takes hold, sparsely-rated movies such as The Princess Diaries appear to be pulled in toward the origin]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i · p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent
by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ(Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i · p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow!
How to compute the gradient of a matrix? Compute the gradient of every element independently!
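The batch gradient described here — every element of ∇Q computed independently, then one step Q ← Q − η·∇Q — can be sketched in a few lines. A minimal, unoptimized sketch; the toy data, names, and hyper-parameters are illustrative (∇P is computed symmetrically):

```python
import numpy as np

def grad_Q(ratings, P, Q, lam):
    """Full-batch gradient of the regularized SSE w.r.t. Q:
    grad q_if = sum over known (x, i): -2 (r_xi - q_i . p_x) p_xf + 2 lam q_if."""
    G = 2.0 * lam * Q                           # regularization part, every entry
    for (x, i), r in ratings.items():           # error part, one known rating at a time
        G[i] += -2.0 * (r - Q[i] @ P[x]) * P[x]
    return G

def objective(ratings, P, Q, lam):
    sse = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
    return sse + lam * (np.sum(P ** 2) + np.sum(Q ** 2))

# one batch step Q <- Q - eta * grad_Q lowers the objective for a small enough eta
rng = np.random.default_rng(1)
P = rng.normal(size=(3, 2))
Q = rng.normal(size=(3, 2))
ratings = {(0, 0): 5.0, (1, 1): 3.0, (2, 0): 4.0}
Q_new = Q - 0.01 * grad_Q(ratings, P, Q, lam=0.1)
```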

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2(r_xi − q_i · p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − η·∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster

1162019

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD; pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
  • ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
  • q_i ← q_i + η·(ε_xi · p_x − λ · q_i) (update equation)
  • p_x ← p_x + η·(ε_xi · q_i − λ · p_x) (update equation)

2 for loops:
• For until convergence:
  • For each r_xi:
    • Compute gradient, do a "step"

η … learning rate
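The two loops above translate almost line-for-line into code. A minimal sketch of SGD matrix factorization — not the Netflix-winning implementation; matrix shapes, hyper-parameters, and the toy ratings are illustrative:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.01, lam=0.1, epochs=200, seed=0):
    """Learn P (users x f) and Q (items x f) by SGD over known ratings {(x, i): r_xi}."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, f))   # stand-in for the SVD-based init
    Q = rng.normal(scale=0.1, size=(n_items, f))
    for _ in range(epochs):                        # "for until convergence"
        for (x, i), r in ratings.items():          # "for each r_xi"
            err = r - Q[i] @ P[x]                  # eps_xi = r_xi - q_i . p_x
            q_old = Q[i].copy()                    # use the pre-update q_i in both steps
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * q_old - lam * P[x])
    return P, Q

def sse(ratings, P, Q):
    return sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())

ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 1): 4.0,
           (2, 1): 1.0, (2, 2): 2.0, (0, 2): 1.0}
P, Q = sgd_factorize(ratings, n_users=3, n_items=3)
```

After a few hundred passes the known ratings are approximated well; the regularization term keeps the factors from fitting them exactly.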

RECSYS 75
Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set values for P and Q such that they work well on known (user, item) ratings
• And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods!

[Figure: small utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample training data (user, movie, score) and test data (user, movie, score = ?)]
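The evaluation criterion is easy to state in code. A sketch (the toy prediction/actual lists are illustrative — Cinematch's 0.9514 is measured on the real held-out ratings):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over paired (predicted, actual) ratings."""
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(actual))

print(rmse([3.5, 4.0, 2.0], [4, 4, 3]))   # about 0.6455
```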

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count   Avg rating  Title
137812  4.593       The Shawshank Redemption
133597  4.545       Lord of the Rings: The Return of the King
180883  4.306       The Green Mile
150676  4.460       Lord of the Rings: The Two Towers
139050  4.415       Finding Nemo
117456  4.504       Raiders of the Lost Ark
180736  4.299       Forrest Gump
147932  4.433       Lord of the Rings: The Fellowship of the Ring
149199  4.325       The Sixth Sense
144027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T
       (user bias) (movie bias) (user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor
• Separates users and movies
• Benefits from insights into user's behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2

1162019
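The arithmetic above is the whole baseline predictor; a trivial sketch (the function name is mine):

```python
def baseline_rating(mu, b_x, b_i):
    """Baseline predictor: overall mean + user bias + movie bias."""
    return mu + b_x + b_i

# critical reviewer (b_x = -1) rating Star Wars (b_i = +0.5), mean mu = 3.7
print(round(baseline_rating(3.7, -1.0, 0.5), 1))   # -> 3.2
```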

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))²          (goodness of fit)
          + λ(Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)           (regularization)

λ is selected via grid-search on a validation set

Stochastic gradient descent to find parameters
• Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
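The same SGD recipe from earlier extends directly to the biased model — each parameter moves along its own error gradient. A sketch with illustrative hyper-parameters and a one-rating toy fit (the actual winning system was considerably more refined):

```python
import numpy as np

def sgd_step(r, x, i, mu, b_user, b_item, P, Q, eta=0.01, lam=0.1):
    """One SGD update for r_xi ~ mu + b_x + b_i + q_i . p_x; updates arrays in place."""
    err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
    b_user[x] += eta * (err - lam * b_user[x])
    b_item[i] += eta * (err - lam * b_item[i])
    q_old = Q[i].copy()                       # use the pre-update q_i in both factor steps
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * q_old - lam * P[x])

# fit a single rating r = 5 with global mean mu = 3.7: the biases absorb most of the gap
mu = 3.7
b_user, b_item = np.zeros(1), np.zeros(1)
P, Q = np.full((1, 2), 0.1), np.full((1, 2), 0.1)
for _ in range(500):
    sgd_step(5.0, 0, 0, mu, b_user, b_item, P, Q)
pred = mu + b_user[0] + b_item[0] + Q[0] @ P[0]
```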

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Diagram: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885–0.92) vs. millions of parameters (1–1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed

Movie age:
• Do users prefer new movies without any reason?
• Or are older movies just inherently better than newer ones?

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
• Make parameters b_x and b_i depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
• p_x(t) … user preference vector on day t

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875–0.92) vs. millions of parameters (1–10000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: Error (RMSE), 0.87–0.8925, vs. number of predictors in the ensemble, 1–50: RMSE drops steeply for the first few predictors, then flattens]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team that relies on combining their models. They quickly also get a qualifying score over 10%.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, but this alerts the other team of your latest score.

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day, so only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. Ensemble submits their final entry 20 mins later. …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com/


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Data for the ensemble-size chart: RMSE for ensembles of 1 to 50 predictors, falling monotonically from 0.889067 (1 predictor) to 0.871197 (50 predictors)]

RECSYS 54

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Figure: the utility matrix R (users × items, with missing entries) factored as R ≈ Q · P^T, where Q is items × f factors and P^T is f factors × users]

r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf

q_i = row i of Q
p_x = column x of P^T


RECSYS 56

Ratings as Products of Factors

How to estimate the missing rating of user x for item i?

[Figure: the same factorization R ≈ Q · P^T, with the missing entry estimated as r̂_xi = q_i · p_x^T = Σ_f q_if · p_xf = 2.4]

q_i = row i of Q
p_x = column x of P^T

RECSYS 57

Latent Factor Models

[Figure: movies placed in a two-factor space. Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap: SVD

Remember SVD: A = input data matrix, U = left singular vectors, V = right singular vectors, Σ = singular values.

[Figure: A (m × n) ≈ U Σ V^T]

SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [UΣV^T]_ij)²

So, in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and r̂_xi = q_i · p_x^T.

The sum goes over all entries, but our R has missing entries. So we are not done yet!
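The recap can be checked numerically with NumPy's SVD on a small, fully observed matrix (the values are arbitrary):

```python
import numpy as np

A = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    # best rank-k approximation in the SSE sense: keep the k largest singular values
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

def sse(B):
    return float(((A - B) ** 2).sum())
```

Truncating to fewer factors only ever increases the reconstruction error, and keeping all singular values reproduces A exactly. The catch for Netflix data is precisely that this SSE sums over all entries of A, while R has missing ones.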

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require the columns of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R (users × items) ≈ Q (items × f factors) · P^T (f factors × users), with r̂_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

"error" + "length"; λ … regularization parameter

[Figure: small utility matrix with missing entries]

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])

[Figure: movies and users (Gus, Dave) embedded in the same two-factor space. Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ escapist. Movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 64

The Effect of Regularization

min_factors "error" + λ "length":
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

[Figure: movies plotted in the two-factor space (Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny): The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q.

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: computing gradients is slow.

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:
min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [Σ_x ||p_x||² + Σ_i ||q_i||²]

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if

Observation: computing gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD.

Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} [−2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if] = Σ_{x,i} ∇Q(r_xi). Here q_if is entry f of row q_i of matrix Q.

Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step.

GD: Q ← Q − η Σ_{r_xi} ∇Q(r_xi)
SGD: Q ← Q − η · ∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster.

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD:

[Chart: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)

2 for loops: For until convergence: for each r_xi, compute the gradient and do a "step".

η … learning rate
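The two update equations above can be sketched directly in code. A minimal illustration (not the BellKor implementation), assuming a small dense toy matrix where 0 marks a missing rating; the matrix, rank f, and hyperparameters are arbitrary choices:

```python
import random
import numpy as np

R = np.array([[1, 3, 0, 4],
              [3, 5, 5, 0],
              [4, 5, 5, 3],
              [0, 2, 1, 3]], dtype=float)   # 0 = missing rating
f, eta, lam = 2, 0.02, 0.1                  # factors, learning rate, regularization
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((R.shape[0], f))   # user factors p_x
Q = 0.1 * rng.standard_normal((R.shape[1], f))   # item factors q_i

obs = [(x, i) for x in range(R.shape[0]) for i in range(R.shape[1]) if R[x, i] > 0]

def sse():
    return sum((R[x, i] - Q[i] @ P[x]) ** 2 for x, i in obs)

sse_start = sse()
random.seed(0)
for epoch in range(200):
    random.shuffle(obs)                      # visit ratings in random order
    for x, i in obs:
        err = R[x, i] - Q[i] @ P[x]          # eps_xi = r_xi - q_i · p_x
        Q[i] += eta * (err * P[x] - lam * Q[i])   # q_i update
        P[x] += eta * (err * Q[i] - lam * P[x])   # p_x update
sse_end = sse()
```

With these settings the observed-entry SSE drops sharply; in practice one would also hold out some ratings to choose f, η, and λ.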

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations. We want to make good recommendations on items that some user has not yet seen.

Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply optimization methods.

[Figure: small utility matrix with missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data (user, movie, score) and test data (user, movie, ?)]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

Ratings per movie
• Avg ratings/movie: 5627

Ratings per user
• Avg ratings/user: 208

Average movie rating by movie count
• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count / Avg rating / Title
137812 / 4.593 / The Shawshank Redemption
133597 / 4.545 / Lord of the Rings: The Return of the King
180883 / 4.306 / The Green Mile
150676 / 4.460 / Lord of the Rings: The Two Towers
139050 / 4.415 / Finding Nemo
117456 / 4.504 / Raiders of the Lost Ark
180736 / 4.299 / Forrest Gump
147932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
149199 / 4.325 / The Sixth Sense
144027 / 4.333 / Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

RECSYS 102

Modeling Biases and Interactions

r̂_xi = μ + b_x + b_i + q_i · p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars = 3.7 - 1 + 0.5 = 3.2.
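The arithmetic on this slide, written out as a baseline predictor (values straight from the example):

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars is rated 0.5 above the average movie
baseline = mu + b_x + b_i   # predicted rating: 3.2 stars
```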

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))² + λ [Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i²]

goodness of fit + regularization

Use stochastic gradient descent to find the parameters. Note: both biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).

λ is selected via grid-search on a validation set.
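A sketch of SGD for this extended objective, extending the earlier factor-only toy example with the two bias terms; all names, data, and hyperparameters are illustrative, not the competition code:

```python
import numpy as np

R = np.array([[1, 3, 0, 4],
              [3, 5, 5, 0],
              [4, 5, 5, 3],
              [0, 2, 1, 3]], dtype=float)   # 0 = missing rating
obs = [(x, i) for x in range(4) for i in range(4) if R[x, i] > 0]
mu = float(np.mean([R[x, i] for x, i in obs]))   # overall mean rating

f, eta, lam = 2, 0.02, 0.1
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((4, f))   # user factors p_x
Q = 0.1 * rng.standard_normal((4, f))   # item factors q_i
bx = np.zeros(4)                        # user biases
bi = np.zeros(4)                        # item biases

def pred(x, i):
    return mu + bx[x] + bi[i] + Q[i] @ P[x]

def sse():
    return sum((R[x, i] - pred(x, i)) ** 2 for x, i in obs)

sse_start = sse()
for epoch in range(200):
    for x, i in obs:
        err = R[x, i] - pred(x, i)
        bx[x] += eta * (err - lam * bx[x])        # user-bias update
        bi[i] += eta * (err - lam * bi[i])        # item-bias update
        Q[i] += eta * (err * P[x] - lam * Q[i])   # factor updates as before
        P[x] += eta * (err * Q[i] - lam * P[x])
sse_end = sse()
```

Centering on μ means the biases and factors only have to model deviations from the global mean, which is what makes this model noticeably better than the factor-only one in the tables above.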

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view.
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

[Figure: global effects → factorization → collaborative filtering]

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000) for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

Page 53: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 55

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

PT

ffac

tors

Qf factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD: A ≈ U Σ VT, where A: input data matrix, U: left singular vectors, V: right singular vectors, Σ: singular values

[Figure: A (m × n) ≈ U Σ VT]

SVD gives minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [U Σ VT]_ij)²

So, in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT, and r_xi = q_i · p_x^T

But we are not done yet! The sum goes over all entries, and our R has missing entries.
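As a concrete check of the reconstruction claim, the following sketch (assuming NumPy is available; not part of the slides) computes a full SVD on a small fully-observed matrix, rebuilds it exactly, and then renames the pieces the way the slide does:

```python
import numpy as np

# A toy fully-observed "ratings" matrix (made-up numbers).
A = np.array([[5.0, 3.0, 1.0],
              [4.0, 2.0, 2.0],
              [1.0, 5.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rec = U @ np.diag(s) @ Vt      # U Sigma V^T reconstructs A exactly

# In the Netflix setting we would identify Q = U and PT = Sigma V^T:
Q = U
PT = np.diag(s) @ Vt
```

With no missing entries the reconstruction is exact; the next slide explains why this breaks down on the real R.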

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: R ≈ Q · PT, with Q: items × f factors and PT: f factors × users]
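The modified objective, summing squared error only over the observed entries, can be sketched as follows (toy data and factor values, not from the deck):

```python
# R is a dict {(user, item): rating}; missing entries are simply absent.
def sse(R, P, Q):
    """Sum of squared errors over the observed ratings in R."""
    total = 0.0
    for (x, i), r in R.items():
        pred = sum(q * p for q, p in zip(Q[i], P[x]))
        total += (r - pred) ** 2
    return total

R = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0}   # rating (1, 1) is missing
P = [[1.0, 0.0], [0.8, 0.1]]                   # user factors
Q = [[4.0, 1.0], [1.0, 2.0]]                   # item factors
err = sse(R, P, Q)
```

Representing R as a sparse dict makes "the sum goes over observed entries only" automatic, which is precisely the difference from plain SVD.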

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2!

Regularization is needed to avoid overfitting:
• Allow rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

λ … regularization parameter; the first term is the "error", the second the "length"
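A sketch of the regularized objective ("error" plus λ times "length"), with illustrative numbers rather than anything from the deck:

```python
def objective(R, P, Q, lam):
    """Training SSE plus lam * (sum of squared factor lengths)."""
    fit = 0.0
    for (x, i), r in R.items():
        pred = sum(q * p for q, p in zip(Q[i], P[x]))
        fit += (r - pred) ** 2
    length = sum(v * v for row in P for v in row) + \
             sum(v * v for row in Q for v in row)
    return fit + lam * length

R = {(0, 0): 5.0}
P = [[1.0, 2.0]]
Q = [[2.0, 1.0]]
J = objective(R, P, Q, lam=0.1)   # error (5-4)^2 = 1, length 1+4+4+1 = 10
```

Raising `lam` shifts weight from fitting the observed ratings to keeping the factors short, which is the shrinkage the slide describes.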

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [BellKor Team])

[Figure: users (Gus, Dave) and movies embedded in the same latent space. Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ escapist. Movies: The Princess Diaries, The Lion King, Braveheart, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Lethal Weapon, Dumb and Dumber]


RECSYS 64

The Effect of Regularization

[Figure: movies plotted in the latent space. Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny]

min_{factors} "error" + λ · "length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow!
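One batch gradient step for Q, following the slide's elementwise gradient (P is handled symmetrically; the data and step size are made up for illustration):

```python
def grad_Q(R, P, Q, lam):
    """Gradient of the regularized objective w.r.t. every entry of Q."""
    g = [[2 * lam * v for v in row] for row in Q]        # regularization part
    for (x, i), r in R.items():                           # error part
        err = r - sum(q * p for q, p in zip(Q[i], P[x]))
        for f in range(len(Q[i])):
            g[i][f] += -2 * err * P[x][f]
    return g

R = {(0, 0): 5.0}
P = [[1.0, 0.0]]
Q = [[4.0, 1.0]]
eta, lam = 0.1, 0.0
g = grad_Q(R, P, Q, lam)
# One full batch step: Q <- Q - eta * grad_Q
Q = [[v - eta * gv for v, gv in zip(row, grow)]
     for row, grow in zip(Q, g)]
```

Note that every observed rating is touched before a single step is taken, which is the "computing gradients is slow" observation.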

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i p_x^T)² + λ (Σ_x ‖p_x‖² + Σ_i ‖q_i‖²)

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if
• Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2 (r_xi − q_i p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η ∇Q = Q − η Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

GD: Q ← Q − η Σ_{r_xi} ∇Q(r_xi)
SGD: Q ← Q − η ∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T (the prediction "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)
• 2 for loops:
  For until convergence:
    For each r_xi: compute the gradient, do a "step"

η … learning rate
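The two update equations and the two nested loops translate almost line-for-line into Python. This is a minimal sketch with made-up toy ratings and hyperparameters; the slide's SVD-based initialization is replaced here by small random values:

```python
import random

def train_sgd(R, n_users, n_items, f=2, eta=0.02, lam=0.02, epochs=500, seed=0):
    """SGD matrix factorization: per-rating updates of q_i and p_x."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):                    # "for until convergence"
        for (x, i), r in R.items():            # "for each r_xi"
            e = r - sum(q * p for q, p in zip(Q[i], P[x]))   # error eps_xi
            qi, px = Q[i], P[x]
            for k in range(f):                 # the two update equations
                qi[k], px[k] = (qi[k] + eta * (e * px[k] - lam * qi[k]),
                                px[k] + eta * (e * qi[k] - lam * px[k]))
    return P, Q

R = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 1.0}
P, Q = train_sgd(R, n_users=2, n_items=2)
pred = sum(q * p for q, p in zip(Q[0], P[0]))   # should approach r_00 = 5
```

Each rating triggers one cheap step, so the model improves long before a full pass over the data would have produced a single batch gradient.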

RECSYS 75
Koren, Bell, Volinsky: IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set values for P and Q such that they work well on the known (user, item) ratings
• And hope these values for P and Q will predict the unknown ratings well

This is a case where we apply Optimization methods

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples from the training data and the test data]
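The evaluation criterion is simple enough to state in code; a short sketch with hypothetical predicted/actual pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, actual) rating pairs."""
    se = [(p - a) ** 2 for p, a in pairs]
    return math.sqrt(sum(se) / len(se))

score = rmse([(3.0, 4.0), (5.0, 5.0), (2.0, 1.0)])   # sqrt(2/3)
```

Under this metric, Cinematch scored 0.9514 and the grand prize required a 10% improvement, i.e. roughly 0.8563.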

Movie rating data

Overall rating distribution
• Third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies (Count / Avg rating / Title):
137812 / 4.593 / The Shawshank Redemption
133597 / 4.545 / Lord of the Rings: The Return of the King
180883 / 4.306 / The Green Mile
150676 / 4.460 / Lord of the Rings: The Two Towers
139050 / 4.415 / Finding Nemo
117456 / 4.504 / Raiders of the Lost Ark
180736 / 4.299 / Forrest Gump
147932 / 4.433 / Lord of the Rings: The Fellowship of the Ring
149199 / 4.325 / The Sixth Sense
144027 / 4.333 / Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
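The slide's arithmetic as a trivial sketch of the baseline predictor (no interaction term yet):

```python
def baseline(mu, b_x, b_i):
    """Baseline prediction: overall mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# The Star Wars example from the slide:
pred = baseline(mu=3.7, b_x=-1.0, b_i=0.5)   # 3.2
```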

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i p_x^T))² + λ (Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)

(the first term is the goodness of fit, the second the regularization)

Stochastic gradient descent to find the parameters
• Note: Both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
• λ is selected via grid-search on a validation set
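A single-rating SGD update with the biases included. The bias update rules below are the natural extension of the earlier q_i / p_x rules; this is my sketch under that assumption, not the slides' exact code:

```python
def sgd_step(r, mu, bx, bi, q, p, eta, lam):
    """One SGD update for the biased model r_xi = mu + b_x + b_i + q_i.p_x."""
    e = r - (mu + bx + bi + sum(qf * pf for qf, pf in zip(q, p)))
    bx_new = bx + eta * (e - lam * bx)                      # user bias
    bi_new = bi + eta * (e - lam * bi)                      # movie bias
    q_new = [qf + eta * (e * pf - lam * qf) for qf, pf in zip(q, p)]
    p_new = [pf + eta * (e * qf - lam * pf) for qf, pf in zip(q, p)]
    return bx_new, bi_new, q_new, p_new

bx, bi, q, p = 0.0, 0.0, [0.1], [0.1]
bx, bi, q, p = sgd_step(r=4.0, mu=3.7, bx=bx, bi=bi, q=q, p=p,
                        eta=0.1, lam=0.0)
```

μ is held fixed (the global mean), while b_x, b_i, q_i, p_x all move a small step against the current error, exactly as the objective above prescribes.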

RECSYS 106

BellKor Recommender System: The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: Overall deviations of users/movies
• Factorization: Addressing "regional" effects
• Collaborative filtering: Extract local patterns

[Diagram: Global effects → Factorization → Collaborative filtering]

RECSYS 107

Performance of Various Methods

[Figure: RMSE vs. millions of parameters (1 to 1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reason
• Older movies are just inherently better than newer ones

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
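The 10-week binning of a time-dependent movie bias b_i(t) can be sketched as follows; the bin width comes from the slide, while the function name and the offset values are made up for illustration:

```python
WEEKS_PER_BIN = 10   # each bin covers 10 consecutive weeks (per the slide)

def movie_bias(b_i, b_i_bins, week):
    """b_i(t) = b_i + b_{i,Bin(t)}: a base bias plus a per-bin offset."""
    return b_i + b_i_bins[week // WEEKS_PER_BIN]

b_bins = [0.0, 0.2, -0.1]              # offsets for weeks 0-9, 10-19, 20-29
b = movie_bias(0.5, b_bins, week=15)   # week 15 falls in bin 1 -> 0.5 + 0.2
```

The per-bin offsets become extra parameters fitted by the same SGD procedure as the static biases.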

RECSYS 111

Adding Temporal Effects

[Figure: RMSE vs. millions of parameters (1 to 10,000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
• Basic Collaborative filtering: 0.94
• Latent factors: 0.90
• Latent factors + Biases: 0.89
• Collaborative filtering++: 0.91
• Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors (1 to 50) in the ensemble; RMSE falls from about 0.889 with one predictor to about 0.871 with 50]

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers the 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

BellKor
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score!

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day
• Only 1 final submission could be made in the last 24h

24 hours before the deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

Data behind the ensemble-size chart (predictors: RMSE):
1: 0.889067, 2: 0.880586, 3: 0.878345, 4: 0.877715, 5: 0.876668,
6: 0.876179, 7: 0.875557, 8: 0.875013, 9: 0.874477, 10: 0.874209,
11: 0.873997, 12: 0.873691, 13: 0.873569, 14: 0.873352, 15: 0.873212,
16: 0.873035, 17: 0.872942, 18: 0.87283, 19: 0.872748, 20: 0.872654,
21: 0.872563, 22: 0.872468, 23: 0.872382, 24: 0.872275, 25: 0.87222,
26: 0.872201, 27: 0.872135, 28: 0.872086, 29: 0.872021, 30: 0.871945,
31: 0.871898, 32: 0.87183, 33: 0.871756, 34: 0.871722, 35: 0.871665,
36: 0.871632, 37: 0.871585, 38: 0.871558, 39: 0.871524, 40: 0.871482,
41: 0.871452, 42: 0.871428, 43: 0.871398, 44: 0.871346, 45: 0.871312,
46: 0.871287, 47: 0.871262, 48: 0.871242, 49: 0.871236, 50: 0.871197
Page 54: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 56

Ratings as Products of Factors How to estimate the missing rating of

user x for item i

1162019

45531

312445

53432142

24542

522434

42331

item

s

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421

asymp

item

s

users

users

QPT

24

ffac

tors

f factors

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

= 119943119943

119954119954119946119946119943119943 sdot 119953119953119961119961119943119943

qi = row i of Qpx = column x of PT

RECSYS 57

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 58

Geared towards females

Geared towards males

Serious

Funny

Latent Factor Models

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

[Tables: sample (user, movie, score) triples for the training data and the test data; test-set scores are withheld]

Movie rating data

Overall rating distribution
- A third of ratings are 4s; average rating is 3.68 (from TimelyDevelopment.com)

Number of ratings per movie
- Avg. ratings/movie: 5,627

Number of ratings per user
- Avg. ratings/user: 208

Average movie rating by movie count
- More ratings go to better movies (from TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

- Size of data: scalability; keeping data in memory
- Missing data: 99 percent missing; very imbalanced
- Avoiding overfitting
- Test and training data differ significantly (e.g. movie 16322)
- Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

r_xi = μ + b_x + b_i + q_i · p_x^T   (user bias + movie bias + user-movie interaction)

User-movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 − 1 + 0.5 = 3.2
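The arithmetic of the baseline predictor b_xi = μ + b_x + b_i can be sketched directly (the function name is illustrative):

```python
def baseline_prediction(mu, user_bias, item_bias):
    """Baseline b_xi = mu + b_x + b_i (no interaction term yet)."""
    return mu + user_bias + item_bias

# Slide example: global mean 3.7, critical reviewer (b_x = -1),
# Star Wars rated 0.5 above the average movie (b_i = +0.5)
r = baseline_prediction(3.7, -1.0, 0.5)  # 3.2
```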


RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interaction factors q_i, p_x are treated

as parameters (we estimate them)


regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b}  Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²  +  λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )
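A sketch of one SGD step for this extended model, with the biases updated alongside the factors (the setup and hyperparameters are illustrative, not the competition values):

```python
import numpy as np

def sgd_biased_step(mu, bx, bi, P, Q, x, i, r, eta=0.01, lam=0.1):
    """One SGD step for r_xi ~ mu + b_x + b_i + q_i . p_x.

    Biases and factors are all treated as parameters, as on the slide.
    """
    err = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
    bx[x] += eta * (err - lam * bx[x])
    bi[i] += eta * (err - lam * bi[i])
    qi = Q[i].copy()
    Q[i] += eta * (err * P[x] - lam * Q[i])
    P[x] += eta * (err * qi - lam * P[x])
    return err

# One user, one movie, rating 5, global mean 3.7; repeat the step to converge
mu, bx, bi = 3.7, np.zeros(1), np.zeros(1)
P, Q = np.full((1, 2), 0.1), np.full((1, 2), 0.1)
for _ in range(500):
    err = sgd_biased_step(mu, bx, bi, P, Q, 0, 0, 5.0)
```

With the regularization term, the error does not go to zero; the biases shrink toward the prior rather than absorbing the full residual.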

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
- Global effects: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE vs. millions of parameters for three methods: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases; RMSE axis 0.885-0.92]

RECSYS 108

Grand Prize: 0.8563 | Netflix (Cinematch): 0.9514 | Movie average: 1.0533 | User average: 1.0651 | Global average: 1.1296

Performance of Various Methods
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004). Likely causes: improvements to the Netflix GUI, and a change in the meaning of a rating.

Movie age: average ratings grow with movie age, without any obvious reason; older movies seem just inherently better than newer ones.


Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) bin the data, with each bin corresponding to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) … the user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09
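A sketch of the binned time-dependent item bias, b_i(t) = b_i + b_{i,Bin(t)}, described above. The helper name and bin values are illustrative; the 10-consecutive-week bin width is from the slide:

```python
def binned_item_bias(bi_static, bi_bins, day, bin_weeks=10):
    """b_i(t) = b_i + b_{i,Bin(t)}, with bins of 10 consecutive weeks.

    day: days since the start of the data; bin_weeks * 7 days per bin.
    Days past the last bin fall into the final bin.
    """
    bin_idx = min(day // (bin_weeks * 7), len(bi_bins) - 1)
    return bi_static + bi_bins[bin_idx]

# e.g. a movie whose bias drifts upward over three bins
b = binned_item_bias(0.2, [0.0, 0.05, 0.12], day=100)  # day 100 falls in bin 1
```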

RECSYS 111

Adding Temporal Effects

[Chart: RMSE vs. millions of parameters as refinements are added: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + linear time factors, + per-day user biases, + CF; RMSE axis 0.875-0.92]

RECSYS 112

Grand Prize: 0.8563 | Netflix (Cinematch): 0.9514 | Movie average: 1.0533 | User average: 1.0651 | Global average: 1.1296

Performance of Various Methods
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Latent factors + biases + time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble, 1-50; RMSE falls from about 0.889 with one predictor to about 0.871 with fifty]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
- A group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score over 10%

BellKor:
- Continues to get small improvements in their scores
- Realizes they are in direct competition with Ensemble

Strategy:
- Both teams carefully monitor the leaderboard
- The only sure way to check for improvement is to submit a set of predictions; this alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only one final submission could be made in the last 24 hours.

24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time spent on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors  RMSE
1   0.889067
2   0.880586
3   0.878345
4   0.877715
5   0.876668
6   0.876179
7   0.875557
8   0.875013
9   0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.872830
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.872220
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.871830
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197

RECSYS 57

Latent Factor Models

[Figure: movies placed in a two-dimensional latent factor space. Factor 1: geared towards females vs. geared towards males; Factor 2: serious vs. funny. Example movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]


RECSYS 59

Recap: SVD

Remember SVD: A ≈ U Σ V^T, where A is the input data matrix, U the left singular vectors, V the right singular vectors, and Σ the singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,Σ,V} Σ_ij ( A_ij − [U Σ V^T]_ij )²

So in our case, "SVD" on the Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and r_xi = q_i · p_x^T.

But we are not done yet! The sum goes over all entries, and our R has missing entries.
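The reconstruction property can be checked numerically with NumPy on a small, fully observed toy matrix. By the Eckart-Young theorem, the rank-k SSE equals the sum of the squared dropped singular values:

```python
import numpy as np

# Toy fully observed matrix (no missing entries, unlike the real R)
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # best rank-2 approximation in SSE
sse = np.sum((A - A_k) ** 2)              # equals s[2] ** 2, the dropped value squared
```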

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q: min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²

Note:
- We don't require the columns of P, Q to be orthogonal or unit length
- P, Q map users/movies to a latent space
- This was the most popular model among Netflix contestants

[Figure: the rating matrix R (users × items, with missing entries) approximated as Q · P^T, where Q is items × f factors and P^T is f factors × users; r_xi = q_i · p_x^T]

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for the unseen test data.

Idea: minimize SSE on the training data. We want a large f (number of factors) to capture all the signals, but SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data, shrink aggressively where data are scarce.

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

λ … regularization parameter; the first term is the "error", the second the "length"
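A sketch of evaluating this regularized objective over only the observed entries, using a boolean mask to mark known ratings (function and variable names are illustrative):

```python
import numpy as np

def reg_objective(R, mask, P, Q, lam):
    """Regularized SSE over observed entries only.

    R: users x items rating matrix; mask[x, i] is True where r_xi is known.
    P: users x f, Q: items x f, so predictions are P @ Q.T.
    """
    E = (R - P @ Q.T) * mask  # errors only where ratings exist
    return np.sum(E ** 2) + lam * (np.sum(P ** 2) + np.sum(Q ** 2))

# Tiny example: two observed ratings, all parameters still at zero
R = np.array([[2.0, 0.0], [0.0, 3.0]])           # 0s stand in for missing entries
mask = np.array([[True, False], [False, True]])
obj = reg_objective(R, mask, np.zeros((2, 1)), np.zeros((2, 1)), lam=0.1)
```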

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Figure: users (e.g. Gus, Dave) and movies embedded in the same two-factor space; axes: geared towards females vs. males, and serious vs. escapist]

RECSYS 63


RECSYS 64

The Effect of Regularization

min_factors "error" + λ "length":
min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

[Figure: the two-factor movie map (females vs. males, serious vs. funny) redrawn as λ increases; estimates for movies and users with scarce data shrink toward the origin]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q: min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

Gradient descent:
- Initialize P and Q (e.g. using SVD, pretending the missing ratings are 0)
- Do gradient descent:
  - P ← P − η · ∇P
  - Q ← Q − η · ∇Q
  - where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x^T ) p_xf + 2 λ q_if
  - here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: computing gradients is slow.

RECSYS 69

Digression: to the lecture notes on Regression and Gradient Descent
from Andrew Ng's Machine Learning course on Coursera


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:

Observation: ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x^T ) p_xf + 2 λ q_if = Σ_{x,i} ∇Q(r_xi)

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)

SGD (idea): instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step: Q ← Q − μ ∇Q(r_xi)

Faster convergence: SGD needs more steps, but each step is computed much faster.


RECSYS 73

SGD vs. GD: convergence of GD vs. SGD

[Chart: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data behind the "Effect of ensemble size" chart (number of predictors → RMSE):
1 → 0.889067, 2 → 0.880586, 3 → 0.878345, 5 → 0.876668, 10 → 0.874209,
20 → 0.872654, 30 → 0.871945, 40 → 0.871482, 50 → 0.871197.
Rapid gains from the first few predictors, then diminishing returns.
RECSYS 58

Latent Factor Models

[Figure: movies placed in a two-factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 from "serious" to "funny". Examples: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 59

Recap: SVD

Remember SVD: A = U Σ V^T, where A is the input data matrix (m × n), U the left singular vectors, V the right singular vectors, and Σ the diagonal matrix of singular values.

SVD gives the minimum reconstruction error (SSE): min_{U,V,Σ} Σ_ij (A_ij − [U Σ V^T]_ij)²

So in our case, "SVD" on Netflix data: R ≈ Q · P^T, with A = R, Q = U, P^T = Σ V^T, and prediction r̂_xi = q_i · p_x^T.

But we are not done yet! The sum goes over all entries, but our R has missing entries.

RECSYS 60

Latent Factor Models

SVD isn't defined when entries are missing!

Use specialized methods to find P, Q:

min_{P,Q} Σ_{(x,i)∈R} (r_xi − q_i · p_x^T)²

Note:
• We don't require cols of P, Q to be orthogonal/unit length
• P, Q map users/movies to a latent space
• The most popular model among Netflix contestants

[Figure: the items × users rating matrix R approximated as Q · P^T, where Q is items × f and P^T is f × users (f latent factors); r̂_xi = q_i · p_x^T.]
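The prediction rule r̂_xi = q_i · p_x^T translates directly into NumPy. A minimal sketch — the sizes and random values here are made up for illustration:

```python
import numpy as np

# Hypothetical sizes: 4 items, 5 users, f = 2 latent factors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 2))   # item factor matrix, rows q_i
P = rng.normal(size=(5, 2))   # user factor matrix, rows p_x

# Full predicted rating matrix (items x users): R_hat = Q . P^T
R_hat = Q @ P.T

# A single prediction r_hat_xi = q_i . p_x for user x = 2, item i = 1
r_hat_xi = Q[1] @ P[2]
```

Note that no column of P or Q needs to be orthogonal or unit length — they are just coordinates in the latent space.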

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed to avoid overfitting: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

λ … regularization parameter; the first term is the "error", the second the "length".
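The regularized objective — SSE over observed ratings plus λ times the squared factor lengths — can be transcribed as a short function. A sketch with toy values; `ratings` mapping (user, item) pairs to observed ratings is my own layout:

```python
import numpy as np

def regularized_sse(ratings, Q, P, lam):
    # "error": SSE over the observed (training) ratings only
    error = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in ratings.items())
    # "length": lambda * (sum_x ||p_x||^2 + sum_i ||q_i||^2)
    length = lam * ((P ** 2).sum() + (Q ** 2).sum())
    return error + length

# Toy example: 2 users, 2 items, 1 latent factor
Q = np.array([[1.0], [2.0]])          # item factors q_i
P = np.array([[1.0], [0.5]])          # user factors p_x
ratings = {(0, 0): 1.0, (1, 1): 1.0}  # (user, item) -> rating
obj = regularized_sse(ratings, Q, P, lam=0.1)
```

Here both observed ratings are fit exactly, so the objective reduces to the λ "length" term alone.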

RECSYS 62

Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])

[Figure: the same two-factor movie space ("geared towards females" ↔ "geared towards males"; "serious" ↔ "escapist"), now with users Gus and Dave placed among the movies: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

RECSYS 63

Dealing with Missing Entries

Want to minimize SSE for unseen test data.

Idea: Minimize SSE on training data. Want large f (# of factors) to capture all the signals. But SSE on test data begins to rise for f > 2.

Regularization is needed: allow a rich model where there are sufficient data; shrink aggressively where data are scarce.

min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

λ … regularization parameter; the first term is the "error", the second the "length".

RECSYS 64–67

The Effect of Regularization

min_factors "error" + λ "length":

min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

[Figure, built up over four slides: movies plotted on Factor 1 ("geared towards females" ↔ "geared towards males") and Factor 2 ("serious" ↔ "funny") — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility — showing how the factor estimates move as the regularized objective is minimized.]

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q: min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow.

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q: min_{P,Q} Σ_{training} (r_xi − q_i · p_x^T)² + λ [Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  P ← P − η · ∇P
  Q ← Q − η · ∇Q
  where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
  Here q_if is entry f of row q_i of matrix Q

How to compute the gradient of a matrix? Compute the gradient of every element independently!

Observation: Computing gradients is slow.

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD. Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)

Here q_if is entry f of row q_i of matrix Q.

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η ∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster.

RECSYS 73

SGD vs. GD

[Chart: value of the objective function vs. iteration/step, for GD and SGD.]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors. For each r_xi:
  ε_xi = r_xi − q_i · p_x^T  (derivative of the "error")
  q_i ← q_i + η (ε_xi p_x − λ q_i)  (update equation)
  p_x ← p_x + η (ε_xi q_i − λ p_x)  (update equation)
• 2 for loops:
  For until convergence:
    For each r_xi: compute gradient, do a "step"

η … learning rate
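The two loops — sweep over the ratings until convergence, nudging q_i and p_x along the per-rating error ε_xi with regularization — can be sketched as a small training routine. A toy illustration, not contestant code; the hyperparameter values are arbitrary:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.02, lam=0.02,
                  epochs=1000, seed=0):
    """SGD for the latent factor model r_xi ~ q_i . p_x^T."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, f))  # user factors p_x
    Q = 0.1 * rng.normal(size=(n_items, f))  # item factors q_i
    for _ in range(epochs):                  # "for until convergence"
        for (x, i), r in ratings.items():    # "for each r_xi"
            eps = r - Q[i] @ P[x]                     # error term eps_xi
            q_old = Q[i].copy()                       # use old q_i in both updates
            Q[i] += eta * (eps * P[x] - lam * Q[i])   # q_i update
            P[x] += eta * (eps * q_old - lam * P[x])  # p_x update
    return P, Q

# Toy data: 3 users x 3 items, 6 observed ratings
ratings = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 2): 2, (2, 1): 5, (2, 2): 1}
P, Q = sgd_factorize(ratings, n_users=3, n_items=3)
```

After training, Q[i] @ P[x] reproduces the observed ratings closely and also fills in the missing cells of the utility matrix.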

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen. Let's set values for P and Q such that they work well on known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.

This is a case where we apply Optimization methods.

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge: 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample training data (user, movie, score) and test data (user, movie, score withheld).]
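The evaluation criterion is simple to compute; a minimal sketch of RMSE over a batch of predictions:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error, the Netflix Prize evaluation criterion."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# e.g. three predictions against held-out 1-5 star ratings
score = rmse([3.5, 4.0, 2.0], [4, 4, 1])
```

A perfect predictor scores 0; beating Cinematch meant pushing this number below 0.9514 on the hidden test set.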

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings to better movies
(From TimelyDevelopment.com)

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges
• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T   (user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into user's behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
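The baseline estimate is just the sum of the three terms; a trivial sketch using the values from the Star Wars example:

```python
def baseline(mu, b_x, b_i):
    # b_xi = mu + b_x + b_i : overall mean + user bias + movie bias
    return mu + b_x + b_i

# mean rating 3.7, critical reviewer (b_x = -1), Star Wars (b_i = +0.5)
pred = baseline(3.7, -1.0, 0.5)
```

The full model then adds the interaction term q_i · p_x^T on top of this baseline.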

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P,b} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x^T))²   (goodness of fit)
          + λ [Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²]   (regularization)

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them). λ is selected via grid-search on a validation set.
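One stochastic gradient step for this biased model mirrors the earlier plain-factor updates, with two extra bias updates driven by the same error term. A sketch under my own naming (`b_user`, `b_item` for the bias arrays):

```python
import numpy as np

def sgd_step(r_xi, x, i, mu, b_user, b_item, P, Q, eta=0.01, lam=0.1):
    """One SGD step for r_xi ~ mu + b_x + b_i + q_i . p_x^T (in-place)."""
    eps = r_xi - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])  # error
    b_user[x] += eta * (eps - lam * b_user[x])               # user bias b_x
    b_item[i] += eta * (eps - lam * b_item[i])               # item bias b_i
    q_old = Q[i].copy()
    Q[i] += eta * (eps * P[x] - lam * Q[i])                  # factors q_i
    P[x] += eta * (eps * q_old - lam * P[x])                 # factors p_x

# Tiny demo: repeated steps on one rating shrink its error
mu = 3.7
b_user, b_item = np.zeros(1), np.zeros(1)
P, Q = np.zeros((1, 2)), np.zeros((1, 2))
for _ in range(50):
    sgd_step(5.0, 0, 0, mu, b_user, b_item, P, Q)
err = abs(5.0 - (mu + b_user[0] + b_item[0] + Q[0] @ P[0]))
```

In a real fit these steps would sweep over all observed ratings, with λ chosen by grid search on a validation set as the slide notes.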

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885–0.92) vs. millions of parameters (1–1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
Make the parameters b_x and b_i depend on time: (1) parameterize time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
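The binned time-dependent bias from Koren's KDD '09 paper, b_i(t) = b_i + b_{i,Bin(t)}, can be sketched as follows; the dict layout and toy values are my own illustration, with 10-week (70-day) bins as on the slide:

```python
def time_bin(t_days, bin_size=70):
    # 10 consecutive weeks = 70 days per bin
    return t_days // bin_size

def item_bias(i, t_days, b_i, b_i_bin):
    # b_i(t) = b_i + b_{i,Bin(t)} : static bias plus a per-bin offset
    return b_i[i] + b_i_bin[i][time_bin(t_days)]

# Toy values: one movie, static bias 0.2, three 70-day bins
b_i = {0: 0.2}
b_i_bin = {0: [0.0, 0.1, -0.05]}
bias = item_bias(0, 75, b_i, b_i_bin)   # day 75 falls in bin 1
```

The per-bin offsets are learned jointly with the other parameters; linear trends handle the user-side drift.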

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875–0.92) vs. millions of parameters (1–10000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble (1–50); RMSE falls from about 0.889 to about 0.871.]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions — this alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

Page 57: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 59

Recap SVD

Remember SVD A Input data matrix U Left singular vecs V Right singular vecs Σ Singular values

SVD gives minimum reconstruction error (SSE) min119880119880VΣ

sum119909119909119894119894 119860119860119909119909119894119894 minus 119880119880Σ119881119881T 1199091199091198941198942

So in our case ldquoSVDrdquo on Netflix data R asymp Q middot PT

A = R Q = U PT = Σ VT

But we are not done yet R has missing entries1162019

U

The sum goes over all entriesBut our R has missing entries

119955119955119961119961119946119946 = 119954119954119946119946 sdot 119953119953119961119961119931119931

Am

n

Σm

n

VT

asymp

U

RECSYS 60

Latent Factor Models

SVD isnrsquot defined when entries are missing

Use specialized methods to find P Q min

119875119875119876119876sum 119909119909119909119909 isinR 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 2

Notebull We donrsquot require cols of P Q to be orthogonalunit lengthbull P Q map usersmovies to a latent spacebull The most popular model among Netflix contestants

1162019

45531

312445

53432142

24542

522434

42331

2-41

56-5

53-2

32111

-221-7

37-1

-924143-48-5-253-211

13-112-72914-131457-8

1-6784-3924176-421asymp

PTQ

users

item

s

119903119909119909119909119909 = 119902119902119909119909 sdot 119901119901119909119909119879119879

f factors

ffactorsitem

s

users

RECSYS 61

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed to avoid Overfitting Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 Initialize P and Q (e.g. using SVD, pretending missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update the factors.
 For each r_xi:
   • ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
   • q_i ← q_i + η ( ε_xi p_x − λ q_i )   (update equation)
   • p_x ← p_x + η ( ε_xi q_i − λ p_x )   (update equation)

 2 for loops:
 For until convergence:
   • For each r_xi: compute gradient, do a "step"

1/16/2019

η … learning rate
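The update equations above are short enough to sketch directly. This is a minimal illustration, not the BellKor implementation: the function name `sgd_factorize`, the toy ratings, and all hyperparameter values are invented for the example.

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.01, lam=0.1,
                  epochs=200, seed=0):
    """SGD for r_xi ~ q_i . p_x using the slide's update equations.
    `ratings` is a list of (user x, item i, rating r_xi) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - Q[i] @ P[x]          # eps_xi = r_xi - q_i . p_x
            qi = Q[i].copy()               # keep old q_i for p_x's update
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * qi - lam * P[x])
    return P, Q

# toy check: 3 users, 3 items, 6 known ratings
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = sgd_factorize(triples, n_users=3, n_items=3)
train_rmse = float(np.sqrt(np.mean([(r - Q[i] @ P[x]) ** 2
                                    for x, i, r in triples])))
```

On this tiny example the training RMSE drops well below the spread of the raw ratings after a couple hundred passes, illustrating the "many cheap steps" behavior described above.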

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

1162019

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE: lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

 This is a case where we apply optimization methods!

1/16/2019

[Figure: sparse utility matrix of known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  - 100 million ratings
  - 480,000 users
  - 17,770 movies
  - 6 years of data: 2000-2005
• Test data
  - Last few ratings of each user (2.8 million)
  - Evaluation criterion: root mean squared error (RMSE)
  - Netflix Cinematch RMSE: 0.9514
• Competition
  - 2,700+ teams
  - $1 million grand prize for 10% improvement on Cinematch result
  - $50,000 2007 progress prize for 8.43% improvement

[Tables: samples of the training data and test data, each with columns user, movie, score; the test-set scores are withheld]

Training data | Test data
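RMSE, the evaluation criterion above, is simple to compute; a minimal sketch (the example numbers are made up):

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error, the Netflix Prize evaluation criterion."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# e.g. a constant predictor that always guesses 3.7, against four true ratings
baseline = rmse([3.7, 3.7, 3.7, 3.7], [4, 3, 5, 1])  # about 1.546
```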

Movie rating data

Overall rating distribution
• A third of the ratings are 4s
• Average rating is 3.68
(From TimelyDevelopment.com)

#ratings per movie
• Avg #ratings/movie: 5,627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count
• More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count   | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
  - Scalability
  - Keeping data in memory
• Missing data
  - 99 percent missing
  - Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (e.g. movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

 r_xi = μ + b_x + b_i + q_i · p_x^T
 (user bias + movie bias + user-movie interaction)

 μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

 User-movie interaction:
 Characterizes the matching between users and movies
 Attracts most research in the field
 Benefits from algorithmic and mathematical innovations

 Baseline predictor:
 Separates users and movies
 Benefits from insights into users' behavior
 Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
  - Rating scale of user x
  - Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
  - (Recent) popularity of movie i
  - Selection bias related to the number of ratings the user gave on the same day ("frequency")

1162019

RECSYS 104

Putting It All Together

 r_xi = μ + b_x + b_i + q_i · p_x^T

 Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
   = 3.7 − 1 + 0.5 = 3.2
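The Star Wars arithmetic above is just the baseline predictor with the interaction term dropped; a one-line sketch:

```python
def baseline_rating(mu, b_x, b_i):
    """Baseline predictor: b_xi = mu + b_x + b_i (interaction term omitted)."""
    return mu + b_x + b_i

# the example from the slide: mu = 3.7, critical user b_x = -1, Star Wars b_i = +0.5
pred = baseline_rating(3.7, -1.0, 0.5)   # 3.7 - 1 + 0.5 = 3.2
```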

1162019

RECSYS 105

Fitting the New Model

 Solve:

   min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²
             + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )

   (first term: goodness of fit; second term: regularization)

 Stochastic gradient descent to find the parameters
 Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
 λ is selected via grid-search on a validation set

1/16/2019
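The grid-search for λ can be illustrated on the simplest regularized estimator of this family, a shrunk item bias b_i = Σ(r − μ) / (λ + n_i). Everything below (the tiny data, the grid, the helper names) is invented for the sketch:

```python
import numpy as np

def fit_item_bias(train, mu, lam):
    """Shrunk item bias: b_i = sum over item i's ratings of (r - mu) / (lam + n_i)."""
    return {i: sum(r - mu for r in rs) / (lam + len(rs)) for i, rs in train.items()}

def val_rmse(val, mu, biases):
    """RMSE of the baseline mu + b_i on held-out (item, rating) pairs."""
    errs = [(r - (mu + biases.get(i, 0.0))) ** 2 for i, r in val]
    return float(np.sqrt(np.mean(errs)))

mu = 3.7
train = {0: [5.0, 4.0], 1: [1.0]}   # ratings observed per item
val = [(0, 4.5), (1, 2.0)]          # held-out (item, rating) pairs
grid = [0.0, 1.0, 5.0, 25.0]
best_lam = min(grid, key=lambda lam: val_rmse(val, mu, fit_item_bias(train, mu, lam)))
```

Here λ = 0 overfits the single rating of item 1 and λ = 25 over-shrinks both biases, so an intermediate λ wins on the validation set.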

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global effects
   • Overall deviations of users/movies
 Factorization
   • Addressing "regional" effects
 Collaborative filtering
   • Extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1 to 1000, log scale) for three models: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

1/16/2019

RECSYS 108

Performance of Various Methods

 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514

 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89

 Grand Prize: 0.8563

1/16/2019

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
   Improvements in Netflix
   GUI improvements
   Meaning of rating changed

 Movie age:
   Users prefer new movies without any reason
   Older movies are just inherently better than newer ones

1/16/2019

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

 Original model:
   r_xi = μ + b_x + b_i + q_i · p_x^T
 Add time dependence to the biases:
   r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T
 Make parameters b_x and b_i depend on time:
   (1) Parameterize the time-dependence by linear trends
   (2) Each bin corresponds to 10 consecutive weeks
 Add temporal dependence to the factors:
   p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
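A sketch of the binned time-dependent bias b_i(t): the bin width of 70 days matches the "10 consecutive weeks" above, while the concrete bias numbers are invented for illustration.

```python
def binned_item_bias(day, base_bias, bin_offsets, bin_size=70):
    """b_i(t) = b_i + b_{i,Bin(t)}, each bin covering 70 days (10 weeks)."""
    return base_bias + bin_offsets[day // bin_size]

# hypothetical movie: overall bias +0.2, drifting upward across three bins
offsets = [-0.1, 0.0, 0.1]
early = binned_item_bias(3, 0.2, offsets)    # day 3   -> bin 0
late = binned_item_bias(150, 0.2, offsets)   # day 150 -> bin 2
```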

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1 to 10,000, log scale) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

1/16/2019

RECSYS 112

Performance of Various Methods

 Global average: 1.1296
 User average: 1.0651
 Movie average: 1.0533
 Netflix (Cinematch): 0.9514

 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Latent factors + Biases + Time: 0.876

 Grand Prize: 0.8563

 Still no prize! Getting desperate. Try a "kitchen sink" approach!

1/16/2019

RECSYS 113

Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble (1 to 50). RMSE drops quickly from 0.8891 for a single predictor to about 0.8742 at 10 predictors, then flattens out, reaching about 0.8712 at 50 predictors; the full data table appears at the end of this transcript]
RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

1/16/2019

RECSYS 116

The Last 30 Days

 Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also gets a qualifying score of over 10% improvement

 BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

 Strategy:
 Both teams carefully monitor the leaderboard
 The only sure way to check for improvement is to submit a set of predictions
   • This alerts the other team of your latest score

1/16/2019

RECSYS 117

24 Hours from the Deadline

 Submissions were limited to 1 a day
 Only 1 final submission could be made in the last 24h

 24 hours before the deadline…
 A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

 Frantic last 24 hours for both teams
 Much computer time spent on final optimization
 Carefully calibrated to end about an hour before the deadline

 Final submissions:
 BellKor submits a little early (on purpose), 40 mins before the deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…

1/16/2019

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading

 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com/

1/16/2019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 60

Latent Factor Models

 "SVD" isn't defined when entries are missing!
 Use specialized methods to find P, Q:
   min_{P,Q} Σ_{(x,i)∈R} ( r_xi − q_i · p_x^T )²
 Note:
   • We don't require the columns of P, Q to be orthogonal/unit length
   • P, Q map users/movies to a latent space
   • This was the most popular model among Netflix contestants

1162019

[Figure: the ratings matrix (users x items) is approximated by the product of an item-factor matrix Q and a user-factor matrix P, so that each rating is estimated as r̂_xi = q_i · p_x^T, with f latent factors per user and per item]
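In code, the factored form is just a matrix product: with Q holding one f-dimensional row per item and P one per user, every prediction r̂_xi = q_i · p_x comes out of a single multiply. The numbers below are made up for the sketch:

```python
import numpy as np

# tiny made-up example: 4 users, 3 items, f = 2 latent factors
Q = np.array([[1.0, 0.2],    # items x f (one row q_i per item)
              [0.5, -0.4],
              [0.3, 1.1]])
P = np.array([[1.2, 0.1],    # users x f (one row p_x per user)
              [0.4, 0.9],
              [1.0, -0.2],
              [0.7, 0.5]])
R_hat = P @ Q.T              # R_hat[x, i] = q_i . p_x, all pairs at once

r_00 = R_hat[0, 0]           # predicted rating of user 0 for item 0
```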

RECSYS 61

Dealing with Missing Entries

 Want to minimize SSE for the unseen test data
 Idea: Minimize SSE on the training data
   Want large f (# of factors) to capture all the signals
   But, SSE on test data begins to rise for f > 2
 Regularization is needed to avoid overfitting:
   Allow a rich model where there are sufficient data
   Shrink aggressively where data are scarce

   min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

   ("error" + λ · "length"; λ … regularization parameter)

1/16/2019

[Figure: sparse utility matrix of known ratings]
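The regularized objective is easy to write down directly; a sketch (toy shapes and λ chosen arbitrarily for the example):

```python
import numpy as np

def regularized_sse(ratings, P, Q, lam):
    """Sum of squared errors over known ratings plus lam * (||P||^2 + ||Q||^2)."""
    sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    penalty = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return float(sse + penalty)

# toy: one known rating r_00 = 2, all factor entries set to 1, f = 1
P = np.ones((2, 1))
Q = np.ones((2, 1))
obj = regularized_sse([(0, 0, 2.0)], P, Q, lam=0.5)  # (2-1)^2 + 0.5*(2+2) = 3.0
```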

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])

[Figure: movies and users plotted in a 2-D latent space. One axis runs from "geared towards females" to "geared towards males", the other from "serious" to "escapist". Movies shown: The Princess Diaries, The Lion King, Braveheart, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Lethal Weapon, Dumb and Dumber; example users Gus and Dave are placed among them]

1/16/2019


RECSYS 64

The Effect of Regularization

   min_factors "error" + λ · "length"
   min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

[Figure: movies plotted in the latent space (Factor 1 vs. Factor 2), spanning "geared towards females"/"geared towards males" and "serious"/"funny": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

1/16/2019


RECSYS 68

Use Gradient Descent to search for the optimal settings

 Want to find matrices P and Q:
   min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
 Initialize P and Q (e.g. using SVD, pretending missing ratings are 0)
 Do gradient descent:
   • P ← P − η · ∇P
   • Q ← Q − η · ∇Q
   • where ∇Q is the gradient/derivative of matrix Q:
     ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x^T ) p_xf + 2λ q_if
   • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
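The full-batch gradient above can be sketched as follows (shapes and toy data invented for the example): with P held fixed, repeatedly stepping Q ← Q − η∇Q drives the error down.

```python
import numpy as np

def grad_Q(ratings, P, Q, lam):
    """grad q_i = sum over ratings of item i of -2 (r_xi - q_i . p_x) p_x + 2 lam q_i."""
    G = 2.0 * lam * Q                       # regularization part, all rows at once
    for x, i, r in ratings:
        G[i] += -2.0 * (r - Q[i] @ P[x]) * P[x]
    return G

def batch_gd_Q(ratings, P, Q, lam=0.0, eta=0.1, steps=100):
    for _ in range(steps):
        Q = Q - eta * grad_Q(ratings, P, Q, lam)   # Q <- Q - eta * grad Q
    return Q

# toy: one user with p_x = [1.0], one item rated 3.0; q_i converges to 3
P = np.array([[1.0]])
Q0 = np.zeros((1, 1))
Qf = batch_gd_Q([(0, 0, 3.0)], P, Q0, lam=0.0, eta=0.1, steps=100)
```

Note that every step touches every rating, which is exactly why the slides call full-batch gradients slow and switch to SGD.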

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:
   min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i · p_x^T )² + λ ( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

 Gradient descent:
 Initialize P and Q (e.g. using SVD, pretending missing ratings are 0)
 Do gradient descent:
   • P ← P − η · ∇P
   • Q ← Q − η · ∇Q
   • where ∇Q is the gradient/derivative of matrix Q:
     ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x^T ) p_xf + 2λ q_if
   • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 61

Dealing with Missing Entries

Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed to avoid overfitting:
• Allow rich model where there are sufficient data
• Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ·[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]
                  "error"                                      "length"

λ … regularization parameter

[Figure: toy utility matrix with missing entries, as used throughout these slides.]
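The regularized objective above can be evaluated directly; a minimal numpy sketch (the function name and toy matrices are my own, not from the slides):

```python
import numpy as np

def regularized_sse(R, P, Q, lam):
    """ "error" + lambda * "length": SSE over observed ratings plus the
    squared lengths of all factor vectors. R holds np.nan at missing
    entries; P is users x f, Q is items x f."""
    mask = ~np.isnan(R)
    pred = P @ Q.T                                    # q_i . p_x^T for every pair
    error = ((np.where(mask, R, 0.0) - np.where(mask, pred, 0.0)) ** 2).sum()
    length = (P ** 2).sum() + (Q ** 2).sum()
    return error + lam * length

# A perfect rank-1 fit has zero "error", so only the "length" term remains:
R = np.array([[2.0, 4.0], [np.nan, 2.0]])
P = np.array([[2.0], [1.0]])
Q = np.array([[1.0], [2.0]])
print(regularized_sse(R, P, Q, lam=0.1))   # 0 + 0.1 * (5 + 5) = 1.0
```

Note how the missing entry contributes nothing to the "error" term — exactly the point of summing over training ratings only.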

RECSYS 62

Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])

[Figure: movies plotted in a 2-D latent factor space, one axis running "geared towards females" ↔ "geared towards males", the other "serious" ↔ "escapist". Titles shown: The Princess Diaries, The Lion King, Braveheart, Independence Day, Amadeus, The Color Purple, Ocean's 11, Sense and Sensibility, Lethal Weapon, Dumb and Dumber; two users, Gus and Dave, are placed in the same space.]


RECSYS 64

The Effect of Regularization

min_factors "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ·[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

[Figure (animated over slides 64–67): the same movies plotted on Factor 1 ("geared towards females" ↔ "geared towards males") and Factor 2 ("serious" ↔ "funny") — The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; as λ grows, the factors of sparsely rated movies shrink toward the origin.]


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_x^T)²  +  λ·[Σ_x ‖p_x‖² + Σ_i ‖q_i‖²]

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q
• Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently.
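A concrete sketch of one full-batch step — `batch_gd_step` and the toy matrix are my own illustration, with η and λ playing the slide's learning rate and regularizer:

```python
import numpy as np

def batch_gd_step(R, P, Q, lam, eta):
    """One full-batch gradient step. `err` holds r_xi - q_i . p_x^T at
    observed entries and 0 at missing ones, so err @ Q and err.T @ P sum
    the per-rating gradient terms from the slide in one matrix product."""
    mask = ~np.isnan(R)
    err = np.where(mask, np.nan_to_num(R) - P @ Q.T, 0.0)
    grad_P = -2.0 * (err @ Q) + 2.0 * lam * P      # [grad of each p_xf]
    grad_Q = -2.0 * (err.T @ P) + 2.0 * lam * Q    # [grad of each q_if]
    return P - eta * grad_P, Q - eta * grad_Q

# Descend on a toy matrix with missing entries (np.nan = unrated)
rng = np.random.default_rng(0)
R = np.array([[1.0, 3.0, np.nan],
              [np.nan, 5.0, 4.0],
              [2.0, np.nan, 3.0]])
P = 0.1 * rng.standard_normal((3, 2))
Q = 0.1 * rng.standard_normal((3, 2))
for _ in range(500):
    P, Q = batch_gd_step(R, P, Q, lam=0.05, eta=0.01)
```

Every step touches all observed ratings — the slowness the slide points out, and the motivation for stochastic gradient descent.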

RECSYS 69

Digression: the lecture notes on Regression and Gradient Descent

from Andrew Ng's Machine Learning course on Coursera


RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:

Observation: ∇Q = [∇q_if] where
∇q_if = Σ_{x,i} −2·(r_xi − q_i·p_x^T)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

SGD: Q ← Q − η·∇Q(r_xi)

Faster convergence!
• Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD:

[Figure: value of the objective function vs. iteration/step — GD decreases smoothly, SGD decreases in a jagged, noisy curve.]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i·p_x^T   (derivative of the "error")
  • q_i ← q_i + η·(ε_xi·p_x − λ·q_i)   (update equation)
  • p_x ← p_x + η·(ε_xi·q_i − λ·p_x)   (update equation)
• 2 for loops:
  • For until convergence:
    • For each r_xi: compute gradient, do a "step"

η … learning rate
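The two update equations above map line-for-line onto code; a minimal sketch (the function name, hyperparameter values, and toy ratings are my own choices):

```python
import numpy as np

def sgd_train(ratings, n_users, n_items, f=2, eta=0.02, lam=0.01, epochs=1000, seed=0):
    """SGD as on the slide: for each observed r_xi compute
    eps = r_xi - q_i . p_x, then nudge q_i and p_x along eps with
    shrinkage lam. `ratings` is a list of (x, i, r_xi) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, f))   # rows p_x
    Q = 0.1 * rng.standard_normal((n_items, f))   # rows q_i
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            eps = r - Q[i] @ P[x]                  # derivative of the "error"
            q_old = Q[i].copy()                    # use pre-update q_i for p_x
            Q[i] += eta * (eps * P[x] - lam * Q[i])
            P[x] += eta * (eps * q_old - lam * P[x])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
P, Q = sgd_train(ratings, n_users=3, n_items=3)
```

Note the two nested loops from the slide: the outer one runs until convergence, the inner one makes a small step per individual rating.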

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let's set values for P and Q such that they work well on known (user, item) ratings
• And hope these values for P and Q will predict the unknown ratings well

This is a case where we apply Optimization methods


RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) rows for the Training data, and (user, movie) rows with scores withheld for the Test data.]
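The evaluation criterion is plain RMSE over the held-out test ratings; for concreteness (the helper function is mine, not Netflix's):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error, the Netflix Prize evaluation criterion."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Being off by one star on one of three ratings costs sqrt(1/3) ~ 0.577
print(rmse([3.0, 4.0, 5.0], [3.0, 4.0, 4.0]))
```

Cinematch's 0.9514 and the 10%-better grand-prize target of 0.8563 are both values of this quantity on the withheld test set.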

Movie rating data

Overall rating distribution

• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating   Title
137812   4.593        The Shawshank Redemption
133597   4.545        Lord of the Rings: The Return of the King
180883   4.306        The Green Mile
150676   4.460        Lord of the Rings: The Two Towers
139050   4.415        Finding Nemo
117456   4.504        Raiders of the Lost Ark
180736   4.299        Forrest Gump
147932   4.433        Lord of the Rings: The Fellowship of the Ring
149199   4.325        The Sixth Sense
144027   4.333        Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model
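One common way to combine an ensemble of complementing predictors is a linear blend fit on held-out data — a sketch (the least-squares weighting is my illustration of the idea, not BellKor's exact blend):

```python
import numpy as np

def blend_weights(preds, actual):
    """Stack each predictor's held-out predictions as a column and solve
    least squares for the weights minimizing the blended squared error."""
    X = np.column_stack(preds)
    w, *_ = np.linalg.lstsq(X, actual, rcond=None)
    return w

actual = np.array([4.0, 3.0, 5.0, 2.0])
p1 = actual + np.array([0.3, -0.2, 0.1, 0.2])    # two predictors whose
p2 = actual + np.array([-0.3, 0.2, -0.1, -0.2])  # errors are complementary
w = blend_weights([p1, p2], actual)
blended = w[0] * p1 + w[1] * p2                  # errors cancel in the blend
```

Here the two predictors' errors are exact opposites, so the fitted weights come out to 0.5 each and the blend recovers the true ratings — the extreme case of "complementing predictors".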

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i·p_x^T = user-movie interaction

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into user's behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
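The slide's worked example, in (trivial) code — the function name is mine:

```python
def baseline_prediction(mu, b_x, b_i):
    """Baseline predictor: overall mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# Critical reviewer (b_x = -1) rating Star Wars (b_i = +0.5), mean 3.7
print(round(baseline_prediction(3.7, -1.0, 0.5), 1))   # 3.2
```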

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters
• Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i·p_x^T))²            ← goodness of fit
          + λ·(Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i²)          ← regularization

λ is selected via grid-search on a validation set
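Fitting this model with SGD simply adds bias updates next to the factor updates; a sketch under the same notation (the helper and toy data are my own — BellKor's actual system was far more elaborate):

```python
import numpy as np

def sgd_biased_step(r, x, i, mu, bx, bi, P, Q, eta=0.01, lam=0.1):
    """One SGD update for r_xi ~ mu + b_x + b_i + q_i . p_x: biases and
    factors are all parameters, so each takes a gradient step with
    shrinkage lam (mu is held fixed at the overall mean)."""
    eps = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])
    bx[x] += eta * (eps - lam * bx[x])
    bi[i] += eta * (eps - lam * bi[i])
    q_old = Q[i].copy()
    Q[i] += eta * (eps * P[x] - lam * Q[i])
    P[x] += eta * (eps * q_old - lam * P[x])
    return eps

# Tiny demo: biases and factors fitted jointly
rng = np.random.default_rng(1)
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 2.0), (1, 1, 1.0)]
mu = float(np.mean([r for _, _, r in ratings]))
bx, bi = np.zeros(2), np.zeros(2)
P = 0.1 * rng.standard_normal((2, 2))
Q = 0.1 * rng.standard_normal((2, 2))
for _ in range(500):
    for x, i, r in ratings:
        sgd_biased_step(r, x, i, mu, bx, bi, P, Q, eta=0.02, lam=0.05)
```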

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Figure: RMSE (0.885–0.92) vs. millions of parameters (1–1000, log scale) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reason
• Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_x^T

Add time dependence to biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T
• Make parameters b_x and b_i depend on time:
  (1) Parameterize time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
• p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
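The binned parameterization in (2) is easy to sketch: a movie's bias gets a static term plus a per-bin offset, with each bin spanning 10 consecutive weeks (the names `time_bin`, `movie_bias_at`, and `b_i_bin` are mine, for illustration):

```python
BIN_DAYS = 7 * 10   # each bin covers 10 consecutive weeks

def time_bin(day, first_day=0):
    """Index of the 10-week bin containing `day` (days since dataset start)."""
    return (day - first_day) // BIN_DAYS

def movie_bias_at(i, day, b_i, b_i_bin):
    """b_i(t) = b_i + b_{i,Bin(t)}: static movie bias plus the offset
    learned for the bin this rating's day falls into (b_i_bin: movies x bins)."""
    return b_i[i] + b_i_bin[i][time_bin(day)]
```

The offsets in `b_i_bin` would be fitted by the same SGD as the other bias parameters, with their own regularization.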

RECSYS 111

Adding Temporal Effects

[Figure: RMSE (0.875–0.92) vs. millions of parameters (1–10000, log scale) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: error (RMSE) vs. number of predictors in the ensemble (1–50) — RMSE falls from ≈0.889 with one predictor to ≈0.871 with fifty, with sharply diminishing returns.]

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

• Submissions limited to 1 a day
  • Only 1 final submission could be made in the last 24h
• 24 hours before deadline…
  • BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's
• Frantic last 24 hours for both teams
  • Much computer time on final optimization
  • Carefully calibrated to end about an hour before deadline
• Final submissions
  • BellKor submits a little early (on purpose), 40 mins before deadline
  • Ensemble submits their final entry 20 mins later
  • …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 60: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 62

Geared towards females

Geared towards males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Independence Day

AmadeusThe Color Purple

Oceanrsquos 11

Sense and Sensibility

Gus

Dave

Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])

1162019

Lethal Weapon

Dumb and Dumber

RECSYS 63

Dealing with Missing Entries Want to minimize SSE for unseen test data

Idea Minimize SSE on training data Want large f ( of factors) to capture all the signals But SSE on test data begins to rise for f gt 2

Regularization is needed Allow rich model where there are sufficient data Shrink aggressively where data are scarce

1162019

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

1 3 43 5 5

4 5 533

2

2 1 3

1

λhellip regularization parameterldquoerrorrdquo ldquolengthrdquo

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
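The arithmetic above is just a sum of three terms; a minimal sketch (the variable names are illustrative):

```python
def baseline(mu, b_x, b_i):
    """Baseline prediction: overall mean + user bias + movie bias."""
    return mu + b_x + b_i

# The Star Wars example from the slide:
round(baseline(mu=3.7, b_x=-1.0, b_i=0.5), 1)  # 3.2
```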


RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters
Note: Both biases bx, bi as well as interactions qi, px are treated as parameters (we estimate them)


regularization

goodness of fit

λ is selected via grid-search on a validation set

$$\min_{Q,P}\ \sum_{(x,i)\in R}\bigl(r_{xi}-(\mu+b_x+b_i+q_i\,p_x^{T})\bigr)^2+\lambda\Bigl(\sum_i\lVert q_i\rVert^2+\sum_x\lVert p_x\rVert^2+\sum_x b_x^2+\sum_i b_i^2\Bigr)$$
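A minimal sketch of fitting this objective with stochastic gradient descent, in pure Python (illustrative hyperparameters and variable names; not the BellKor implementation):

```python
import random

def fit_biased_model(ratings, n_users, n_items, f=2, lr=0.01, lam=0.1,
                     epochs=200, seed=0):
    """SGD on: sum (r_xi - (mu + b_x + b_i + q_i.p_x))^2 + regularization."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_x, b_i = [0.0] * n_users, [0.0] * n_items
    p = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for x, i, r in ratings:
            e = r - (mu + b_x[x] + b_i[i]
                     + sum(a * b for a, b in zip(q[i], p[x])))
            b_x[x] += lr * (e - lam * b_x[x])           # bias updates
            b_i[i] += lr * (e - lam * b_i[i])
            for k in range(f):                          # factor updates
                qk, pk = q[i][k], p[x][k]
                q[i][k] += lr * (e * pk - lam * qk)
                p[x][k] += lr * (e * qk - lam * pk)
    return mu, b_x, b_i, p, q
```

Predictions are then mu + b_x[x] + b_i[i] + qi·px; as the slide notes, λ would be chosen by grid search on a validation set.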

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level “regional” modeling of the data with a refined local view

• Global: overall deviations of users/movies
• Factorization: addressing “regional” effects
• Collaborative filtering: extract local patterns


RECSYS 107

Performance of Various Methods

[Chart: RMSE vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE axis from 0.885 to 0.92]


RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reason?
• Older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09

RECSYS 110

Temporal Biases amp Factors

Original model:
rxi = μ + bx + bi + qi·pxT

Add time dependence to biases:
rxi = μ + bx(t) + bi(t) + qi·pxT

Make parameters bx and bi depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
px(t)… user preference vector on day t
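A sketch of the binned, time-dependent movie bias bi(t), with the 10-consecutive-week bins from the slide (the bias values here are made up for illustration):

```python
def movie_bias_at(b_i, b_i_bin, t_days, bin_weeks=10):
    """b_i(t) = static movie bias + bias of the 10-week bin containing day t."""
    return b_i + b_i_bin[t_days // (7 * bin_weeks)]

# Hypothetical per-bin offsets for one movie over three 70-day bins:
movie_bias_at(0.5, [0.0, 0.1, 0.4], t_days=150)  # day 150 -> bin 2 -> 0.5 + 0.4
```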

Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09

RECSYS 111

Adding Temporal Effects


[Chart: RMSE vs. millions of parameters (1–10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; RMSE axis from 0.875 to 0.92]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a “kitchen sink” approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors (1–50); RMSE axis from 0.87 to 0.8925]
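The effect the chart shows — more predictors, lower error — comes from averaging models whose errors partly cancel; a toy illustration with made-up predictions:

```python
def blend(predictions):
    """Average an ensemble of predictors (each a list of predicted ratings)."""
    return [sum(vals) / len(predictions) for vals in zip(*predictions)]

def rmse(pred, truth):
    return (sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(truth)) ** 0.5

truth = [4.0, 3.0, 5.0, 2.0]
p1 = [4.4, 2.6, 5.4, 1.6]   # two predictors with opposite, cancelling errors
p2 = [3.6, 3.4, 4.6, 2.4]
# rmse of each predictor alone is 0.4; the blend is nearly exact
```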


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day “last call”

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor’s

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197

RECSYS 63

Dealing with Missing Entries: Want to minimize SSE for unseen test data

Idea: Minimize SSE on training data
• Want large f (# of factors) to capture all the signals
• But SSE on test data begins to rise for f > 2

Regularization is needed:
• Allow a rich model where there are sufficient data
• Shrink aggressively where data are scarce


$$\min_{P,Q}\ \sum_{(x,i)\in\text{training}}\bigl(r_{xi}-q_i\,p_x^{T}\bigr)^2+\lambda\Bigl(\sum_x\lVert p_x\rVert^2+\sum_i\lVert q_i\rVert^2\Bigr)$$

[Figure: sparse utility matrix with a few known ratings]

λ … regularization parameter; the first term is the “error”, the second the “length”
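Why shrinking helps where data are scarce can be seen in a one-parameter version of this objective: minimizing (r − w)² + λw² gives w = r/(1 + λ), so the fewer observations support a parameter, the more λ pulls it toward 0. A toy illustration (the numbers are hypothetical):

```python
def shrunk_estimate(r, lam):
    """Minimizer of (r - w)**2 + lam * w**2: the estimate shrinks toward 0."""
    return r / (1 + lam)

# A parameter supported by a single 5-star rating gets shrunk aggressively:
[round(shrunk_estimate(5, lam), 2) for lam in (0, 0.5, 2)]  # [5.0, 3.33, 1.67]
```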

RECSYS 64

The Effect of Regularization

[Figure: movies plotted on two latent factors — Factor 1: geared towards females ↔ geared towards males; Factor 2: serious ↔ funny. Titles shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean’s 11, Sense and Sensibility]

minfactors “error” + λ·“length”

$$\min_{P,Q}\ \sum_{(x,i)\in\text{training}}\bigl(r_{xi}-q_i\,p_x^{T}\bigr)^2+\lambda\Bigl(\sum_x\lVert p_x\rVert^2+\sum_i\lVert q_i\rVert^2\Bigr)$$


RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

$$\min_{P,Q}\ \sum_{(x,i)\in\text{training}}\bigl(r_{xi}-q_i\,p_x^{T}\bigr)^2+\lambda\Bigl(\sum_x\lVert p_x\rVert^2+\sum_i\lVert q_i\rVert^2\Bigr)$$

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·pxT)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently!
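A sketch of one batch step in pure Python: every gradient entry is accumulated over all ratings before any parameter moves, which is exactly why each step is slow (names and hyperparameters are illustrative):

```python
def loss(ratings, p, q, lam=0.1):
    """Regularized SSE: squared errors plus lam times squared lengths."""
    sse = sum((r - sum(a * b for a, b in zip(q[i], p[x]))) ** 2
              for x, i, r in ratings)
    reg = sum(v * v for row in p + q for v in row)
    return sse + lam * reg

def gd_step(ratings, p, q, lr=0.01, lam=0.1):
    """One batch step: P <- P - lr*grad_P, Q <- Q - lr*grad_Q."""
    f = len(q[0])
    gq = [[2 * lam * v for v in row] for row in q]  # gradient of lam*||q||^2
    gp = [[2 * lam * v for v in row] for row in p]
    for x, i, r in ratings:                         # accumulate over ALL ratings
        e = r - sum(a * b for a, b in zip(q[i], p[x]))
        for k in range(f):
            gq[i][k] += -2 * e * p[x][k]
            gp[x][k] += -2 * e * q[i][k]
    for rows, grads in ((q, gq), (p, gp)):          # then update once
        for row, grow in zip(rows, grads):
            for k in range(f):
                row[k] -= lr * grow[k]
```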

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent

by Andrew Ng’s Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

$$\min_{P,Q}\ \sum_{(x,i)\in\text{training}}\bigl(r_{xi}-q_i\,p_x^{T}\bigr)^2+\lambda\Bigl(\sum_x\lVert p_x\rVert^2+\sum_i\lVert q_i\rVert^2\Bigr)$$

Gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2(r_xi − q_i·pxT)·p_xf + 2λ·q_if
  • Here q_if is entry f of row q_i of matrix Q

Observation: Computing gradients is slow

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2(r_xi − q_i·pxT)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q

GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − μ·∇Q(r_xi)

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

Faster convergence!
• Need more steps, but each step is computed much faster


RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a “noisy” way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
• Initialize P and Q (using SVD, pretend missing ratings are 0)
• Then iterate over the ratings (multiple times if necessary) and update factors

For each rxi:
• ε_xi = r_xi − q_i·pxT (derivative of the “error”)
• q_i ← q_i + η(ε_xi·p_x − λ·q_i) (update equation)
• p_x ← p_x + η(ε_xi·q_i − λ·p_x) (update equation)

2 for loops:
• For until convergence:
  • For each rxi:
    • Compute gradient, do a “step”

η … learning rate
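The two loops above, sketched in pure Python with the slide's update equations (sizes and hyperparameters are illustrative; initialization here is random rather than SVD-based):

```python
import random

def sgd_factorize(ratings, n_users, n_items, f=2, lr=0.02, lam=0.02,
                  epochs=500, seed=1):
    """For each rating: e = r_xi - q_i.p_x, then nudge q_i and p_x."""
    rng = random.Random(seed)
    p = [[rng.uniform(0.0, 0.5) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.uniform(0.0, 0.5) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):                      # "for until convergence"
        for x, i, r in ratings:                  # "for each r_xi"
            e = r - sum(a * b for a, b in zip(q[i], p[x]))
            for k in range(f):
                qk, pk = q[i][k], p[x][k]
                q[i][k] = qk + lr * (e * pk - lam * qk)   # update equation
                p[x][k] = pk + lr * (e * qk - lam * pk)   # update equation
    return p, q

def rmse(ratings, p, q):
    se = sum((r - sum(a * b for a, b in zip(q[i], p[x]))) ** 2
             for x, i, r in ratings)
    return (se / len(ratings)) ** 0.5
```

On a toy utility matrix this drives the training RMSE well below that of a constant predictor; on real data one would monitor a held-out validation set instead.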

RECSYS 75

Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations
• Quantify goodness using SSE: lower SSE means better recommendations
• We want to make good recommendations on items that some user has not yet seen
• Let’s set values for P and Q such that they work well on known (user, item) ratings
• And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods

[Figure: sparse utility matrix with a few known ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

“We Know What You Ought To Be Watching This Summer”

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables: training data (user, movie, score) and test data (user, movie, ?)]

Movie rating data

Overall rating distribution

• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

# ratings per movie

• Avg ratings/movie: 5627

# ratings per user

• Avg ratings/user: 208

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 62: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 64

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 65

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 66

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

1162019

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

The PrincessDiaries

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 67

Geared towards females

Geared towards males

serious

funny

The Effect of Regularization

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Oceanrsquos 11

Sense and Sensibility

Factor 1

Fact

or 2

The PrincessDiaries

minfactors ldquoerrorrdquo + λ ldquolengthrdquo

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD

Observation: ∇Q = [∇q_if], where
∇q_if = Σ_{x,i} −2(r_xi − q_i·p_xᵀ)·p_xf + 2λ·q_if = Σ_{x,i} ∇Q(r_xi)
• Here q_if is entry f of row q_i of matrix Q.

Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:

GD:  Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
SGD: Q ← Q − η·∇Q(r_xi)

Faster convergence!
• Needs more steps, but each step is computed much faster.

11/6/2019

RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Chart: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:
• ε_xi = r_xi − q_i·p_xᵀ   (derivative of the "error")
• q_i ← q_i + η (ε_xi·p_x − λ·q_i)   (update equation)
• p_x ← p_x + η (ε_xi·q_i − λ·p_x)   (update equation)

2 for loops: For until convergence:
• For each r_xi:
  • Compute gradient, do a "step"

η … learning rate
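A sketch of one SGD pass implementing the per-rating updates above (the function name and the default η and λ values are illustrative assumptions):

```python
import numpy as np

def sgd_epoch(ratings, Q, P, eta=0.01, lam=0.1):
    """One stochastic-gradient pass over ratings = [(x, i, r_xi), ...].

    Per-rating updates from the slide:
      eps_xi = r_xi - q_i . p_x          (the "error")
      q_i   <- q_i + eta * (eps_xi * p_x - lam * q_i)
      p_x   <- p_x + eta * (eps_xi * q_i - lam * p_x)
    Q (items x k) and P (users x k) are updated in place.
    """
    for x, i, r in ratings:
        eps = r - Q[i] @ P[x]
        q_old = Q[i].copy()              # use the pre-update q_i in p_x's step
        Q[i] += eta * (eps * P[x] - lam * Q[i])
        P[x] += eta * (eps * q_old - lam * P[x])
```

Running this over the ratings several times ("multiple times if necessary") plays the role of the outer "until convergence" loop.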

RECSYS 75 — Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary: Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.
We want to make good recommendations on items that some user has not yet seen.
So: let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will predict the unknown ratings well.
This is a case where we apply Optimization methods.


[Example utility matrix: a few known ratings (1–5), many missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
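The evaluation criterion, root mean squared error, is simple to compute; a minimal sketch:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error -- the Netflix Prize evaluation criterion."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```

The $1M target was to push this number from Cinematch's 0.9514 below 0.8563 on the withheld test ratings.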

Training data / Test data: [two tables of (user, movie, score) triples; test-set scores are withheld]

Movie rating data

Overall rating distribution
• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie
• Avg #ratings/movie: 5627

#ratings per user
• Avg #ratings/user: 208

Average movie rating by movie count

• More ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count     Avg rating   Title
137,812   4.593        The Shawshank Redemption
133,597   4.545        Lord of the Rings: The Return of the King
180,883   4.306        The Green Mile
150,676   4.460        Lord of the Rings: The Two Towers
139,050   4.415        Finding Nemo
117,456   4.504        Raiders of the Lost Ark
180,736   4.299        Forrest Gump
147,932   4.433        Lord of the Rings: The Fellowship of the Ring
149,199   4.325        The Sixth Sense
144,027   4.333        Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully-tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_xᵀ
(user bias + movie bias + user-movie interaction)

μ = overall mean rating
b_x = bias of user x
b_i = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
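The worked example can be checked with a one-line baseline predictor (the function name is illustrative):

```python
def baseline_estimate(mu, b_x, b_i):
    """Baseline predictor: b_xi = mu + b_x + b_i (no user-movie interaction term)."""
    return mu + b_x + b_i

# The slide's example: mean 3.7, critical reviewer (b_x = -1),
# Star Wars rated 0.5 above the average movie (b_i = +0.5)
prediction = baseline_estimate(3.7, -1.0, 0.5)  # -> 3.2 (up to float rounding)
```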


RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_xᵀ) )²   ← goodness of fit
          + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   ← regularization

λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).
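A single SGD step for this biased model might look like the following sketch; the function name and the default η and λ values are assumptions, and the update form simply mirrors the earlier SGD slide with extra bias terms:

```python
import numpy as np

def sgd_step_biased(r, x, i, mu, b_u, b_m, P, Q, eta=0.005, lam=0.02):
    """One SGD update for the biased model  r_xi ~ mu + b_x + b_i + q_i . p_x.

    b_u (per-user biases), b_m (per-movie biases), and the factor rows of
    P and Q are all treated as parameters and updated in place from the
    prediction error eps.
    """
    eps = r - (mu + b_u[x] + b_m[i] + Q[i] @ P[x])
    b_u[x] += eta * (eps - lam * b_u[x])
    b_m[i] += eta * (eps - lam * b_m[i])
    q_old = Q[i].copy()
    Q[i] += eta * (eps * P[x] - lam * Q[i])
    P[x] += eta * (eps * q_old - lam * P[x])
```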

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge!

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885–0.92) vs. millions of parameters (1–1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]


RECSYS 108

Performance of Various Methods

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in the Netflix GUI
• Meaning of rating changed

Movie age:
• Users prefer new movies without any reason
• Older movies are just inherently better than newer ones


Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i·p_xᵀ

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ

Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
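A minimal sketch of a binned, time-dependent movie bias b_i(t) = b_i + b_i,Bin(t); the function name and bin indexing are assumptions beyond the 10-consecutive-week bins stated on the slide:

```python
def movie_bias_at(t_days, b_i, b_i_bin, bin_weeks=10):
    """Time-dependent movie bias  b_i(t) = b_i + b_i,Bin(t).

    Each bin covers `bin_weeks` consecutive weeks (10 on the slide), so
    ratings from nearby dates share one additive offset b_i,Bin(t).
    """
    bin_idx = int(t_days // (bin_weeks * 7))   # which 10-week bin day t falls in
    return b_i + b_i_bin[bin_idx]
```

The bin offsets b_i,Bin(t) are extra parameters, fitted by the same SGD as the static biases.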

RECSYS 111

Adding Temporal Effects


[Chart: RMSE (0.875–0.92) vs. millions of parameters (1–10,000) for: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors + Biases: 0.89
Collaborative filtering++: 0.91
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE, from about 0.8925 down toward 0.87) vs. number of predictors (1–50): RMSE decreases steadily as more predictors are blended into the ensemble]
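Blending an ensemble can be as simple as a (weighted) average of the individual predictors' outputs; a sketch, with the function name and uniform default weights as illustrative assumptions:

```python
import numpy as np

def blend(predictions, weights=None):
    """Combine an ensemble of predictors by (weighted) averaging their outputs.

    predictions : list of same-length arrays of predicted ratings, one per predictor
    weights     : optional per-predictor weights (defaults to a uniform average)
    """
    preds = np.stack([np.asarray(p, dtype=float) for p in predictions])
    if weights is None:
        weights = np.full(len(predictions), 1.0 / len(predictions))
    return np.average(preds, axis=0, weights=weights)
```

In practice, the winning teams fitted the blending weights on a validation set rather than averaging uniformly.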


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call".

RECSYS 116

The Last 30 Days

Ensemble team formed:
• A group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score over 10%

BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs. Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote: TF-IDF
  • User Profiles and Prediction
  • Pros: Content-based Approach
  • Cons: Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs. User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks & Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering: Complexity
  • Tip: Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R: Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap: SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs. GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary: Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • #ratings per movie
  • #ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
[Embedded chart data: ensemble RMSE by number of predictors — 1: 0.8891, 10: 0.8742, 25: 0.8722, 50: 0.8712; RMSE decreases monotonically as predictors are added]

RECSYS 65

The Effect of Regularization

[Plot: movies in the two-factor space (Factor 1: "geared towards females" ↔ "geared towards males"; Factor 2: "serious" ↔ "funny"): The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, The Princess Diaries]

min_factors "error" + λ·"length"

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

RECSYS 66

The Effect of Regularization
(Animation step: same figure and regularized objective as the previous slide.)

RECSYS 67

The Effect of Regularization
(Animation step: same figure and regularized objective as the previous slide.)
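The objective min_factors "error" + λ·"length" from these regularization slides can be evaluated directly for given P and Q; a sketch, assuming a boolean mask marking the known training ratings (the function name is illustrative):

```python
import numpy as np

def regularized_objective(R, mask, Q, P, lam):
    """Regularized matrix-factorization objective, "error" + lam * "length":

      sum over known (x,i) of (r_xi - q_i . p_x)^2
      + lam * ( sum_x ||p_x||^2 + sum_i ||q_i||^2 )
    """
    error = (((R - Q @ P.T) * mask) ** 2).sum()      # fit on known ratings only
    length = (P ** 2).sum() + (Q ** 2).sum()         # total squared factor length
    return float(error + lam * length)
```

Larger λ penalizes long factor vectors, pulling the movie/user points toward the origin as the slides' animation shows.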

RECSYS 68

Use Gradient Descent to search forthe optimal settings

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Effect of ensemble size (data behind the chart):

Predictors   RMSE
 1   0.889067
 2   0.880586
 3   0.878345
 4   0.877715
 5   0.876668
 6   0.876179
 7   0.875557
 8   0.875013
 9   0.874477
10   0.874209
11   0.873997
12   0.873691
13   0.873569
14   0.873352
15   0.873212
16   0.873035
17   0.872942
18   0.87283
19   0.872748
20   0.872654
21   0.872563
22   0.872468
23   0.872382
24   0.872275
25   0.87222
26   0.872201
27   0.872135
28   0.872086
29   0.872021
30   0.871945
31   0.871898
32   0.87183
33   0.871756
34   0.871722
35   0.871665
36   0.871632
37   0.871585
38   0.871558
39   0.871524
40   0.871482
41   0.871452
42   0.871428
43   0.871398
44   0.871346
45   0.871312
46   0.871287
47   0.871262
48   0.871242
49   0.871236
50   0.871197
RECSYS 66-67

The Effect of Regularization

[Scatter plot: movies embedded in the latent factor space. Factor 1 runs from "geared towards females" to "geared towards males"; Factor 2 runs from "serious" to "funny". Plotted titles: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]

min over factors of "error" + λ·"length":

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]
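Written out as code, the regularized objective above can be evaluated directly for given factor matrices. A minimal sketch, assuming ratings come as (user, item, rating) triples; the names `ratings`, `P`, `Q`, `lam` are illustrative, not from the slides:

```python
import numpy as np

def regularized_sse(ratings, P, Q, lam):
    """min_factors "error" + lambda * "length": squared error on the
    known ratings plus a penalty on the squared lengths of all
    user-factor and item-factor vectors.

    ratings: list of (x, i, r_xi) triples for known ratings
    P: (num_users, k) user-factor matrix, row p_x
    Q: (num_items, k) item-factor matrix, row q_i
    """
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    length = (P ** 2).sum() + (Q ** 2).sum()
    return error + lam * length
```

For example, with one user p_0 = (1, 0), one movie q_0 = (0.5, 0.5), a single known rating r_00 = 1, and λ = 0.1, the objective is 0.25 + 0.1·1.5 = 0.4.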

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

Gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q is the gradient/derivative of the objective with respect to Q:
    ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- How to compute the gradient of a matrix? Compute the gradient of every element independently.
- Observation: computing gradients is slow

RECSYS 69

Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i·p_xᵀ)² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

Gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Do gradient descent:
  • P ← P − η·∇P
  • Q ← Q − η·∇Q
  • where ∇Q = [∇q_if], and ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if
  • Here q_if is entry f of row q_i of matrix Q
- How to compute the gradient of a matrix? Compute the gradient of every element independently.
- Observation: computing gradients is slow
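One batch step of the scheme above can be sketched in a few lines. This is an illustrative implementation, not the original course code; `lr` and `lam` stand for η and λ, and the gradients follow the ∇P, ∇Q formulas on the slide:

```python
import numpy as np

def batch_gd_step(ratings, P, Q, lr=0.01, lam=0.1):
    """One batch gradient-descent step on the regularized SSE.

    ratings: list of (x, i, r_xi) known ratings
    P, Q: user- and item-factor matrices (rows p_x, q_i)
    Returns the updated (P, Q); the inputs are not modified.
    """
    gP = 2 * lam * P            # gradient of the length penalty
    gQ = 2 * lam * Q
    for x, i, r in ratings:     # gradient of the error term
        e = r - Q[i] @ P[x]     # residual r_xi - q_i . p_x
        gP[x] += -2 * e * Q[i]
        gQ[i] += -2 * e * P[x]
    return P - lr * gP, Q - lr * gQ
```

Each step touches every known rating once, which is why the slide notes that computing gradients is slow on 100 million ratings.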

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD:
- Observation: ∇Q = [∇q_if], where
  ∇q_if = Σ_{x,i} −2 (r_xi − q_i·p_xᵀ) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)
  Here q_if is entry f of row q_i of matrix Q.
- GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
- SGD: Q ← Q − η·∇Q(r_xi)
- Idea: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step
- Faster convergence!
  • Needs more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD

Convergence of GD vs. SGD:

[Plot: value of the objective function vs. iteration/step, for GD and SGD.]

- GD improves the value of the objective function at every step.
- SGD improves the value, but in a "noisy" way.
- GD takes fewer steps to converge, but each step takes much longer to compute.
- In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent:
- Initialize P and Q (using SVD, pretend missing ratings are 0)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:
  • ε_xi = r_xi − q_i·p_xᵀ   (derivative of the "error")
  • q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
  • p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)
- 2 for loops:
  • For until convergence:
    – For each r_xi: compute gradient, do a "step"

η … learning rate
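The two nested loops above can be sketched as follows. This is a minimal illustrative implementation (function name and defaults are assumptions, not from the slides); the constant factor from the derivative is absorbed into the learning rate, as on the slide:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, k=2,
                  lr=0.01, lam=0.1, epochs=50, seed=0):
    """SGD for r_xi ~ q_i . p_x: sweep over the known ratings and,
    for each one, nudge the two factor vectors involved."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors q_i
    for _ in range(epochs):                  # "for until convergence"
        for x, i, r in ratings:              # for each r_xi
            qi, px = Q[i].copy(), P[x].copy()
            e = r - qi @ px                  # eps_xi, the "error"
            Q[i] += lr * (e * px - lam * qi) # q_i update equation
            P[x] += lr * (e * qi - lam * px) # p_x update equation
    return P, Q
```

Note the copies of q_i and p_x, so both updates use the pre-step values as in the slide's equations.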

RECSYS 75

[Figure from Koren, Bell, Volinsky, IEEE Computer 2009]

RECSYS 76

Summary: Recommendations via Optimization

- Goal: make good recommendations
  • Quantify goodness using SSE: lower SSE means better recommendations
  • We want to make good recommendations on items that some user has not yet seen
- Let's set the values for P and Q such that they work well on the known (user, item) ratings
- And hope these values for P and Q will predict the unknown ratings well
- This is a case where we apply optimization methods

[Toy utility matrix with a handful of known ratings.]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
- Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
- Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
- Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Tables of (user, movie, score) triples for the training data and the test data; test-set scores are withheld.]
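The evaluation criterion, RMSE, is straightforward to compute; a minimal sketch:

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between predicted and true ratings."""
    assert len(predictions) == len(actuals) and predictions
    se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(se / len(predictions))
```

For instance, predicting 3 and 4 against true ratings 4 and 4 gives sqrt(0.5) ≈ 0.707.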

Movie rating data

Overall rating distribution
- A third of ratings are 4s
- Average rating is 3.68
(From TimelyDevelopment.com)

Ratings per movie
- Avg ratings/movie: 5,627

Ratings per user
- Avg ratings/user: 208

Average movie rating by movie count
- More ratings go to better movies
(From TimelyDevelopment.com)

Most loved movies

Count     Avg rating  Title
137,812   4.593   The Shawshank Redemption
133,597   4.545   Lord of the Rings: The Return of the King
180,883   4.306   The Green Mile
150,676   4.460   Lord of the Rings: The Two Towers
139,050   4.415   Finding Nemo
117,456   4.504   Raiders of the Lost Ark
180,736   4.299   Forrest Gump
147,932   4.433   Lord of the Rings: The Fellowship of the Ring
149,199   4.325   The Sixth Sense
144,027   4.333   Indiana Jones and the Last Crusade

Challenges
- Size of data
  – Scalability
  – Keeping data in memory
- Missing data
  – 99 percent missing
  – Very imbalanced
- Avoiding overfitting
  – Test and training data differ significantly (e.g. movie 16322)
  – Use an ensemble of complementing predictors
  – Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_xᵀ

where μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i, and q_i·p_xᵀ is the user-movie interaction.

User-Movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
- Mean rating: μ = 3.7
- You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
- Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5
- Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2
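The baseline prediction is just those three terms added together; a minimal sketch (the function name is illustrative):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: global mean + user bias + movie bias."""
    return mu + b_x + b_i

# The slide's example: mu = 3.7, a critical reviewer with b_x = -1,
# and Star Wars rated 0.5 above the average movie (b_i = +0.5)
# gives a prediction of about 3.2.
prediction = baseline(3.7, -1.0, 0.5)
```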


RECSYS 105

Fitting the New Model

Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_xᵀ) )²          (goodness of fit)
        + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )          (regularization)

- Stochastic gradient descent to find the parameters
  • Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
- λ is selected via grid-search on a validation set
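One SGD sweep for this biased model extends the earlier factor updates with bias updates. The slide does not spell out the update equations, so the form below is an assumption following the standard presentation in Koren, Bell, Volinsky (IEEE Computer 2009); names and defaults are illustrative:

```python
import numpy as np

def sgd_biased_step(ratings, mu, bx, bi, P, Q, lr=0.005, lam=0.02):
    """One SGD sweep for r_xi ~ mu + b_x + b_i + q_i . p_x.
    Biases (bx, bi) and factors (P, Q) are all treated as
    parameters and updated in place."""
    for x, i, r in ratings:
        e = r - (mu + bx[x] + bi[i] + Q[i] @ P[x])  # residual
        bx[x] += lr * (e - lam * bx[x])             # user bias
        bi[i] += lr * (e - lam * bi[i])             # movie bias
        qi = Q[i].copy()
        Q[i] += lr * (e * P[x] - lam * Q[i])        # item factors
        P[x] += lr * (e * qi - lam * P[x])          # user factors
```

Repeated sweeps over the training ratings lower the regularized objective; in practice λ would then be tuned by grid search on a validation set, as the slide says.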

RECSYS 106

BellKor Recommender System

- The winner of the Netflix Challenge!
- Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
  • Global effects: overall deviations of users/movies
  • Factorization: addressing "regional" effects
  • Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

- Basic Collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + Biases: 0.89

Grand Prize: 0.8563

RECSYS 109

Temporal Biases of Users

- Sudden rise in the average movie rating (early 2004)
  • Improvements in Netflix GUI
  • Meaning of rating changed
- Movie age
  • Users prefer new movies without any reason
  • Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

- Original model: r_xi = μ + b_x + b_i + q_i·p_xᵀ
- Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i·p_xᵀ
- Make the parameters b_x and b_i depend on time:
  (1) Parameterize the time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks
- Add temporal dependence to the factors:
  • p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
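The binned movie bias b_i(t) can be sketched as a static bias plus a per-bin offset, with each bin covering 10 consecutive weeks as on the slide. The function and array names below are illustrative, not from the paper:

```python
def movie_bias_at(i, day, bi, bi_bin, bin_days=70):
    """Time-dependent movie bias b_i(t): static part bi[i] plus the
    offset for the 10-week (70-day) bin that `day` falls into."""
    return bi[i] + bi_bin[i][day // bin_days]
```

A user bias b_x(t) would be handled analogously, e.g. with the linear-trend parameterization the slide mentions.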

RECSYS 111

Adding Temporal Effects

[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514

- Basic Collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + Biases: 0.89
- Latent factors + Biases + Time: 0.876

Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart of the ensemble-size data tabulated at the top of this section: error (RMSE) vs. number of predictors, falling steadily from 0.889 with 1 predictor to 0.871 with 50.]
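The ensemble effect charted above comes from combining many predictors' outputs into one prediction. A minimal linear-blend sketch (illustrative only; BellKor's actual blending fitted the weights on held-out data rather than using equal weights):

```python
def blend(predictions, weights=None):
    """Linear blend of several predictors' outputs for a single
    (user, item) pair; equal weights by default."""
    if weights is None:
        weights = [1 / len(predictions)] * len(predictions)
    return sum(w * p for w, p in zip(weights, predictions))
```

Blending complements the predictors' errors, which is why "two half-tuned models are worth more than a single fully tuned model".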

RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers the 30-day "last call".

RECSYS 116

The Last 30 Days

- Ensemble team formed
  • Group of other teams on the leaderboard forms a new team
  • Relies on combining their models
  • Quickly also gets a qualifying score over 10%
- BellKor
  • Continue to get small improvements in their scores
  • Realize that they are in direct competition with Ensemble
- Strategy
  • Both teams carefully monitor the leaderboard
  • The only sure way to check for improvement is to submit a set of predictions
    – This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

- Submissions limited to 1 a day
  • Only 1 final submission could be made in the last 24h
- 24 hours before the deadline…
  • A BellKor team member in Austria notices (by chance) that Ensemble posts a score slightly better than BellKor's
- Frantic last 24 hours for both teams
  • Much computer time on final optimization
  • Carefully calibrated to end about an hour before the deadline
- Final submissions
  • BellKor submits a little early (on purpose), 40 mins before the deadline
  • Ensemble submits their final entry 20 mins later
  • …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
- Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com

  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 66: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 68

Use Gradient Descent to search for the optimal settings

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )²  +  λ·( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·( r_xi − q_i·p_x^T )·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

 Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently.

RECSYS 69

Digression: to the lecture notes of Regression and Gradient Descent
by Andrew Ng's Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q:

min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i·p_x^T )²  +  λ·( Σ_x ‖p_x‖² + Σ_i ‖q_i‖² )

Gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Do gradient descent:

• P ← P − η·∇P
• Q ← Q − η·∇Q
• Where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2·( r_xi − q_i·p_x^T )·p_xf + 2λ·q_if
• Here q_if is entry f of row q_i of matrix Q

 Observation: Computing gradients is slow

How to compute the gradient of a matrix? Compute the gradient of every element independently.
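As a concrete sketch of the loop above — a hypothetical illustration with invented names (not the course's or contest's code), assuming the ratings sit in a small dense matrix R with a 0/1 mask marking the known entries:

```python
import numpy as np

def batch_gd(R, mask, k=2, eta=0.01, lam=0.1, steps=200, seed=0):
    """Batch GD for min sum (r_xi - q_i . p_x)^2 + lam*(|P|^2 + |Q|^2)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors p_x (rows)
    Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors q_i (rows)
    for _ in range(steps):
        E = mask * (R - P @ Q.T)                     # errors on known ratings only
        grad_P = -2 * E @ Q + 2 * lam * P            # full gradient over all ratings
        grad_Q = -2 * E.T @ P + 2 * lam * Q
        P, Q = P - eta * grad_P, Q - eta * grad_Q    # P <- P - eta*gradP, etc.
    return P, Q
```

For Netflix-scale data a dense matrix is unrealistic; this only illustrates the batch updates P ← P − η·∇P and Q ← Q − η·∇Q.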

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if], where

∇q_if = Σ_{x,i} [ −2·( r_xi − q_i·p_x^T )·p_xf + 2λ·q_if ] = Σ_{x,i} ∇Q(r_xi)

• Here q_if is entry f of row q_i of matrix Q

 GD: Q ← Q − η·Σ_{x,i} ∇Q(r_xi)
 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 SGD: Q ← Q − η·∇Q(r_xi)
 Faster convergence!

• Need more steps, but each step is computed much faster

RECSYS 73

SGD vs. GD
 Convergence of GD vs. SGD

[Chart: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 Initialize P and Q (using SVD, pretend missing ratings are 0)
 Then iterate over the ratings (multiple times if necessary) and update factors:

For each r_xi:
• ε_xi = r_xi − q_i·p_x^T  (derivative of the "error")
• q_i ← q_i + η·( ε_xi·p_x − λ·q_i )  (update equation)
• p_x ← p_x + η·( ε_xi·q_i − λ·p_x )  (update equation)

 2 for loops:
 For until convergence:
• For each r_xi:
• Compute gradient, do a "step"

η … learning rate
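The per-rating updates above can be sketched directly — a hypothetical illustration (invented names; ratings passed as (user, item, rating) triples):

```python
import numpy as np

def sgd(ratings, n_users, n_items, k=2, eta=0.02, lam=0.1, epochs=60, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))   # p_x rows
    Q = rng.normal(scale=0.1, size=(n_items, k))   # q_i rows
    for _ in range(epochs):                        # "for until convergence"
        for x, i, r in ratings:                    # "for each r_xi"
            eps = r - Q[i] @ P[x]                  # error on this single rating
            q_old = Q[i].copy()                    # use the old q_i in p_x's step
            Q[i] += eta * (eps * P[x] - lam * Q[i])    # q_i update equation
            P[x] += eta * (eps * q_old - lam * P[x])   # p_x update equation
    return P, Q
```

Note the contrast with the batch version: each rating triggers an immediate small step rather than one big step per full pass.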

RECSYS 75 (Koren, Bell, Volinsky, IEEE Computer, 2009)

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 Quantify goodness using SSE: so lower SSE means better recommendations
 We want to make good recommendations on items that some user has not yet seen
 Let's set values for P and Q such that they work well on known (user, item) ratings
 And hope these values for P and Q will predict well the unknown ratings

This is a case where we apply Optimization methods.

[Example: a sparse utility matrix of known ratings, with blanks to be predicted]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data:
 – 100 million ratings
 – 480,000 users
 – 17,770 movies
 – 6 years of data: 2000-2005
• Test data:
 – Last few ratings of each user (2.8 million)
 – Evaluation criterion: root mean squared error (RMSE)
 – Netflix Cinematch RMSE: 0.9514
• Competition:
 – 2,700+ teams
 – $1 million grand prize for 10% improvement on the Cinematch result
 – $50,000 2007 progress prize for 8.43% improvement

[Tables: sample (user, movie, score) triples for the training data and test data]
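The evaluation criterion is plain RMSE between predicted and true ratings; a minimal sketch:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))
```

On the test set, Cinematch's predictions score 0.9514 under this measure; the grand-prize target was 10% below that.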

Movie rating data

Overall rating distribution

• Third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

#ratings per movie

• Avg. #ratings/movie: 5627

#ratings per user

• Avg. #ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count | Avg rating | Title
137,812 | 4.593 | The Shawshank Redemption
133,597 | 4.545 | Lord of the Rings: The Return of the King
180,883 | 4.306 | The Green Mile
150,676 | 4.460 | Lord of the Rings: The Two Towers
139,050 | 4.415 | Finding Nemo
117,456 | 4.504 | Raiders of the Lost Ark
180,736 | 4.299 | Forrest Gump
147,932 | 4.433 | Lord of the Rings: The Fellowship of the Ring
149,199 | 4.325 | The Sixth Sense
144,027 | 4.333 | Indiana Jones and the Last Crusade

Challenges

• Size of data
 – Scalability
 – Keeping data in memory
• Missing data
 – 99 percent missing
 – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie #16322)
• Use an ensemble of complementing predictors
• Two half-tuned models worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i·p_x^T
       (user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor
 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

r_xi = μ + b_x + b_i + q_i·p_x^T

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
 = 3.7 − 1 + 0.5 = 3.2
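The slide's arithmetic, as a quick sanity check (values taken straight from the example; the interaction term q_i·p_x^T is zero here):

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars: 0.5 above the average movie
baseline = mu + b_x + b_i  # predicted rating, about 3.2
```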

RECSYS 105

Fitting the New Model

 Solve:

min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x^T) )²   (goodness of fit)
          + λ·( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   (regularization)

 λ is selected via grid-search on a validation set

 Stochastic gradient descent to find parameters
 Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)
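A hypothetical sketch of evaluating that objective for candidate parameters (names invented; ratings given as (user, item, rating) triples), which is what the grid-search over λ compares on the validation set:

```python
import numpy as np

def objective(ratings, mu, b_user, b_item, P, Q, lam):
    """Regularized SSE: fit term over known ratings plus lam * norms."""
    fit = sum((r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])) ** 2
              for x, i, r in ratings)                          # goodness of fit
    reg = lam * (np.sum(Q ** 2) + np.sum(P ** 2)
                 + np.sum(b_user ** 2) + np.sum(b_item ** 2))  # regularization
    return fit + reg
```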

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 Global:
• Overall deviations of users/movies
 Factorization:
• Addressing "regional" effects
 Collaborative filtering:
• Extract local patterns

[Diagram: Global effects → Factorization → Collaborative filtering]

RECSYS 107

Performance of Various Methods

[Chart: RMSE (from 0.885 to 0.92) vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

 Grand Prize: 0.8563
 Netflix: 0.9514
 Movie average: 1.0533
 User average: 1.0651
 Global average: 1.1296

 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91

RECSYS 109

Temporal Biases Of Users

 Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed

 Movie age:
• Users prefer new movies without any reasons
• Older movies are just inherently better than newer ones

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases & Factors

 Original model:
 r_xi = μ + b_x + b_i + q_i·p_x^T

 Add time dependence to biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i·p_x^T

 Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to factors:
 p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
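One simple way to realize the binned time-dependence sketched above — a hypothetical illustration (the full model also uses linear trends and per-day terms, which are omitted here):

```python
def biased_rating(mu, b_x, b_i_static, b_i_bins, q_i, p_x, day):
    """r_xi = mu + b_x + b_i(t) + q_i . p_x, with b_i(t) binned over time."""
    bin_idx = (day // 7) // 10            # each bin = 10 consecutive weeks
    b_i_t = b_i_static + b_i_bins[bin_idx]
    return mu + b_x + b_i_t + sum(q * p for q, p in zip(q_i, p_x))
```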

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (from 0.875 to 0.92) vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

 Grand Prize: 0.8563
 Netflix: 0.9514
 Movie average: 1.0533
 User average: 1.0651
 Global average: 1.1296

 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate… Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors (1 to 50); the error falls from about 0.889 with one predictor to about 0.871 with fifty — see the predictors/RMSE table at the end of this transcript]
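Blending many predictors, as the chart measures, can be sketched as a least-squares combination fit on held-out ratings — a hypothetical illustration, not the teams' actual blending procedure:

```python
import numpy as np

def blend_weights(preds, truth):
    """preds: (n_ratings, n_predictors) matrix of individual predictions.
    Returns least-squares weights minimizing ||preds @ w - truth||^2."""
    w, *_ = np.linalg.lstsq(preds, truth, rcond=None)
    return w
```

The blended prediction is then `preds @ w`; adding a genuinely complementary predictor as a new column can only lower the achievable validation error, which is why ensemble RMSE keeps creeping down with size.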

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers a 30-day "last call"

RECSYS 116

The Last 30 Days

 Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also get a qualifying score over 10%

 BellKor:
• Continue to get small improvements in their scores
• Realize that they are in direct competition with Ensemble

 Strategy:
• Both teams carefully monitoring the leaderboard
• Only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

 Submissions limited to 1 a day
• Only 1 final submission could be made in the last 24h

 24 hours before deadline…
• BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

 Frantic last 24 hours for both teams
• Much computer time on final optimization
• Carefully calibrated to end about an hour before the deadline

 Final submissions:
• BellKor submits a little early (on purpose), 40 mins before deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com

predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 67: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 69

Degression to the lecture notes ofRegression and Gradient Descent

by Andrew Ngrsquos Machine Learning course from Coursera

RECSYS 70

(Batch) Gradient Descent

Want to find matrices P and Q

Gradient descent Initialize P and Q (using SVD pretend missing ratings are

0) Do gradient descent

bull P larr P - η middotnablaPbull Q larr Q - η middotnablaQbull Where nablaQ is gradientderivative of matrix Q120571120571119876119876 = [120571120571119902119902119909119909119894119894] and 120571120571119902119902119909119909119894119894 = sum119909119909119909119909 minus2 119903119903119909119909119909119909 minus 119902119902119909119909119901119901119909119909119879119879 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q Observation Computing gradients is slow

++minus sumsumsum

ii

xx

training

Txixi

QPqppqr 222

)(min λ

How to compute gradient of a matrixCompute gradient of every element independently

RECSYS 72

Stochastic Gradient Descent

Gradient Descent (GD) vs Stochastic GD Observation 120571120571119876119876 = [120571120571119902119902119909119909119894119894] where

120571120571119902119902119909119909119894119894 = 119909119909119909119909

minus2 119903119903119909119909119909119909 minus 119902119902119909119909119894119894119901119901119909119909119894119894 119901119901119909119909119894119894 + 2120582120582119902119902119909119909119894119894 = 119961119961119946119946

nabla119928119928 119955119955119961119961119946119946

bull Here 119954119954119946119946119943119943 is entry f of row qi of matrix Q 119928119928 = 119928119928 minus 119928119928 = 119928119928 minus η sum119961119961119946119946nabla119928119928 (119955119955119961119961119946119946) Idea Instead of evaluating gradient over all ratings evaluate it for

each individual rating and make a step

GD 119928119928larr119928119928 minus η sum119955119955119961119961119946119946 nabla119928119928(119955119955119961119961119946119946)

SGD 119928119928larr119928119928 minus 119928119928(119955119955119961119961119946119946) Faster convergence

bull Need more steps but each step is computed much faster

1162019

RECSYS 73

SGD vs GD Convergence of GD vs SGD

1162019

Iterationstep

Valu

e of

the

obje

ctiv

e fu

nctio

n

GD improves the value of the objective function at every step SGD improves the value but in a ldquonoisyrdquo wayGD takes fewer steps to converge but each steptakes much longer to compute In practice SGD is much faster

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

 Ensemble team formed:
 • Group of other teams on the leaderboard forms a new team
 • Relies on combining their models
 • Quickly also gets a qualifying score over 10%

 BellKor:
 • Continue to get small improvements in their scores
 • Realize that they are in direct competition with Ensemble

 Strategy:
 • Both teams carefully monitor the leaderboard
 • Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team to your latest score

RECSYS 117

24 Hours from the Deadline

 Submissions limited to one a day:
 • Only one final submission could be made in the last 24 hours

 24 hours before the deadline…
 • A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

 Frantic last 24 hours for both teams:
 • Much computer time spent on final optimization
 • Carefully calibrated to end about an hour before the deadline

 Final submissions:
 • BellKor submits a little early (on purpose), 40 mins before the deadline
 • Ensemble submits their final entry 20 mins later
 • …and everyone waits…

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs. Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote: TF-IDF
  • User Profiles and Prediction
  • Pros: Content-based Approach
  • Cons: Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs. User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks & Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering: Complexity
  • Tip: Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R: Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap: SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs. GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary: Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • # ratings per movie
  • # ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th, 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded, Sept 21st, 2009
  • Further reading
[Chart data for "Effect of ensemble size": RMSE falls from 0.8891 with 1 predictor to 0.8806 with 2, 0.8777 with 4, 0.8750 with 8, 0.8730 with 16, 0.8718 with 32, and 0.8712 with 50 predictors]

RECSYS 70

(Batch) Gradient Descent

 Want to find matrices P and Q:

 min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x^T)² + λ [ Σ_x ||p_x||² + Σ_i ||q_i||² ]

 Gradient descent:
 • Initialize P and Q (using SVD, pretend missing ratings are 0)
 • Do gradient descent:
   P ← P − η · ∇P
   Q ← Q − η · ∇Q
   where ∇Q is the gradient/derivative of matrix Q: ∇Q = [∇q_if] and ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if
 • Here q_if is entry f of row q_i of matrix Q
 Observation: Computing gradients is slow!

How to compute the gradient of a matrix? Compute the gradient of every element independently!
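The batch procedure above can be sketched in a few lines of NumPy. This is a toy illustration on a small dense matrix, with made-up data and hyperparameters, not a Netflix-scale implementation:

```python
import numpy as np

# Minimal batch gradient descent for R ~ P Q^T on a toy dense matrix.
# Data, rank k, learning rate eta, and regularization lam are all made up.
rng = np.random.default_rng(1)
R = rng.integers(1, 6, size=(20, 15)).astype(float)   # toy ratings, no missing entries
k, eta, lam = 3, 0.002, 0.1

P = rng.normal(scale=0.1, size=(20, k))   # user factors p_x (rows)
Q = rng.normal(scale=0.1, size=(15, k))   # item factors q_i (rows)

for _ in range(2000):
    E = R - P @ Q.T                        # residuals r_xi - q_i . p_x^T
    gP = -2 * E @ Q + 2 * lam * P          # full-batch gradient of the objective w.r.t. P
    gQ = -2 * E.T @ P + 2 * lam * Q        # ... and w.r.t. Q
    P -= eta * gP                          # P <- P - eta * grad_P
    Q -= eta * gQ                          # Q <- Q - eta * grad_Q

sse = float(np.sum((R - P @ Q.T) ** 2))    # should be far below the starting error
```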

RECSYS 72

Stochastic Gradient Descent

 Gradient Descent (GD) vs. Stochastic GD
 Observation: ∇Q = [∇q_if] where

 ∇q_if = Σ_{x,i} −2 (r_xi − q_i · p_x^T) p_xf + 2λ q_if = Σ_{x,i} ∇Q(r_xi)

 • Here q_if is entry f of row q_i of matrix Q

 GD: Q ← Q − η Σ_{x,i} ∇Q(r_xi)

 Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 SGD: Q ← Q − η ∇Q(r_xi)

 Faster convergence!
 • Need more steps, but each step is computed much faster


RECSYS 73

SGD vs. GD: Convergence of GD vs. SGD

[Plot: value of the objective function (y-axis) vs. iteration/step (x-axis)]

 GD improves the value of the objective function at every step.
 SGD improves the value, but in a "noisy" way.
 GD takes fewer steps to converge, but each step takes much longer to compute.
 In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

 Stochastic gradient descent:
 • Initialize P and Q (using SVD, pretend missing ratings are 0)
 • Then iterate over the ratings (multiple times if necessary) and update the factors.
   For each r_xi:
   • ε_xi = r_xi − q_i · p_x^T   (derivative of the "error")
   • q_i ← q_i + η (ε_xi p_x − λ q_i)   (update equation)
   • p_x ← p_x + η (ε_xi q_i − λ p_x)   (update equation)

 2 for loops:
 • For until convergence:
   • For each r_xi:
     • Compute gradient, do a "step"

 η … learning rate
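The update rules above translate directly into code. A minimal sketch on synthetic (user, item, rating) triples, with made-up sizes and hyperparameters:

```python
import numpy as np

# Minimal SGD for latent factors, following the slide's update equations.
# Ratings are (user, item, score) triples; all sizes/values are toy choices.
rng = np.random.default_rng(2)
ratings = [(rng.integers(0, 30), rng.integers(0, 20), float(rng.integers(1, 6)))
           for _ in range(400)]

k, eta, lam = 4, 0.01, 0.05
P = rng.normal(scale=0.1, size=(30, k))   # user factors p_x
Q = rng.normal(scale=0.1, size=(20, k))   # item factors q_i

for epoch in range(50):                   # "for until convergence"
    for x, i, r in ratings:               # "for each r_xi"
        err = r - Q[i] @ P[x]             # eps_xi = r_xi - q_i . p_x^T
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])

sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
```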

RECSYS 75 | Koren, Bell, Volinsky, IEEE Computer, 2009

RECSYS 76

Summary: Recommendations via Optimization

 Goal: Make good recommendations
 • Quantify goodness using SSE: lower SSE means better recommendations
 • We want to make good recommendations on items that some user has not yet seen
 • Let's set values for P and Q such that they work well on known (user, item) ratings
 • And hope these values for P and Q will predict well the unknown ratings

 This is a case where we apply Optimization methods

[Example utility matrix with known entries (1-5 stars) and missing ratings]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
 • Training data
   – 100 million ratings
   – 480,000 users
   – 17,770 movies
   – 6 years of data: 2000-2005
 • Test data
   – Last few ratings of each user (2.8 million)
   – Evaluation criterion: root mean squared error (RMSE)
   – Netflix Cinematch RMSE: 0.9514
 • Competition
   – 2700+ teams
   – $1 million grand prize for a 10% improvement on the Cinematch result
   – $50,000 2007 progress prize for an 8.43% improvement

[Sample (user, movie, score) records from the training and test sets]
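The evaluation criterion is easy to state in code; a minimal sketch with toy numbers:

```python
import math

# Root mean squared error, the Netflix Prize evaluation criterion.
def rmse(predicted, actual):
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Toy example: off by one star on one of four ratings.
print(rmse([3, 4, 2, 5], [3, 4, 3, 5]))   # 0.5
```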

Movie rating data

Overall rating distribution
 • A third of ratings are 4s
 • Average rating is 3.68
 (From TimelyDevelopment.com)

# ratings per movie
 • Avg #ratings/movie: 5627

# ratings per user
 • Avg #ratings/user: 208

Average movie rating by movie count
 • More ratings go to better movies
 (From TimelyDevelopment.com)

Most loved movies

 Count   Avg rating  Title
 137812  4.593   The Shawshank Redemption
 133597  4.545   Lord of the Rings: The Return of the King
 180883  4.306   The Green Mile
 150676  4.460   Lord of the Rings: The Two Towers
 139050  4.415   Finding Nemo
 117456  4.504   Raiders of the Lost Ark
 180736  4.299   Forrest Gump
 147932  4.433   Lord of the Rings: The Fellowship of the Ring
 149199  4.325   The Sixth Sense
 144027  4.333   Indiana Jones and the Last Crusade

Challenges

 • Size of data
   – Scalability
   – Keeping data in memory
 • Missing data
   – 99 percent missing
   – Very imbalanced
 • Avoiding overfitting
   – Test and training data differ significantly (e.g., movie 16322)
 • Use an ensemble of complementing predictors
   – Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

 r_xi = μ + b_x + b_i + q_i · p_x^T
 (μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i; q_i · p_x^T = user-movie interaction)

 User-Movie interaction:
 • Characterizes the matching between users and movies
 • Attracts most research in the field
 • Benefits from algorithmic and mathematical innovations

 Baseline predictor:
 • Separates users and movies
 • Benefits from insights into user's behavior
 • Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
 – Rating scale of user x
 – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
 – (Recent) popularity of movie i
 – Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

 Example:
 • Mean rating: μ = 3.7
 • You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
 • Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
 • Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
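A minimal sketch of a baseline predictor: estimate μ, b_x, and b_i as simple deviations from the global mean on a made-up ratings table (real systems regularize these estimates), and reproduce the slide's arithmetic:

```python
import numpy as np

# Baseline predictor b_xi = mu + b_x + b_i, with biases estimated naively
# as deviations from the global mean. Toy data; np.nan marks missing entries.
ratings = np.array([
    [4.0, 3.0, np.nan],
    [5.0, np.nan, 4.0],
    [2.0, 1.0, 3.0],
])

mu = np.nanmean(ratings)                       # overall mean rating
b_x = np.nanmean(ratings, axis=1) - mu         # per-user deviation from the mean
b_i = np.nanmean(ratings, axis=0) - mu         # per-movie deviation from the mean

def baseline(x, i):
    return mu + b_x[x] + b_i[i]

# The slide's Star Wars example as a direct computation:
star_wars = 3.7 + (-1) + 0.5                   # = 3.2
```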

RECSYS 105

Fitting the New Model

 Solve:

 min_{Q,P} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²        ← goodness of fit
         + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )      ← regularization

 Stochastic gradient descent to find the parameters
 • Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them)

 λ is selected via grid-search on a validation set
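Adding the bias terms to the earlier SGD sketch gives updates of the same shape. A toy illustration with made-up sizes and hyperparameters:

```python
import numpy as np

# SGD for the biased model r_xi ~ mu + b_x + b_i + q_i . p_x^T (toy sketch).
rng = np.random.default_rng(3)
n_users, n_items, k = 30, 20, 4
ratings = [(rng.integers(0, n_users), rng.integers(0, n_items),
            float(rng.integers(1, 6))) for _ in range(400)]

mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
b_u = np.zeros(n_users)                             # user biases b_x
b_i = np.zeros(n_items)                             # item biases b_i
P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_items, k))
eta, lam = 0.01, 0.05

for epoch in range(30):
    for x, i, r in ratings:
        err = r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])
        b_u[x] += eta * (err - lam * b_u[x])        # bias updates mirror factor updates
        b_i[i] += eta * (err - lam * b_i[i])
        Q[i] += eta * (err * P[x] - lam * Q[i])
        P[x] += eta * (err * Q[i] - lam * P[x])

sse = sum((r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])) ** 2 for x, i, r in ratings)
```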

RECSYS 106

BellKor Recommender System
 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
 • Global effects: overall deviations of users/movies
 • Factorization: addressing "regional" effects
 • Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (~0.885-0.92) vs. millions of parameters (1-1000, log scale) for: CF (no time bias), Basic Latent Factors, Latent Factors w/ Biases]

RECSYS 108

Grand Prize: 0.8563

Netflix (Cinematch): 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods
 Basic Collaborative filtering: 0.94
 Collaborative filtering++: 0.91
 Latent factors: 0.90
 Latent factors + Biases: 0.89

  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 70: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 73

SGD vs GD: Convergence of GD vs SGD

1162019

[Chart: value of the objective function vs. iteration/step, for GD and SGD]

GD improves the value of the objective function at every step. SGD improves the value, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient descent: Initialize P and Q (using SVD; pretend missing ratings are 0). Then iterate over the ratings (multiple times if necessary) and update the factors.

For each r_xi:

• ε_xi = r_xi − q_i · p_x^T (derivative of the "error")
• q_i ← q_i + η (ε_xi p_x − λ q_i) (update equation)
• p_x ← p_x + η (ε_xi q_i − λ p_x) (update equation)

2 for loops: For until convergence:

• For each r_xi: compute gradient, do a "step"


η … learning rate
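The update rules above can be sketched in plain NumPy. A minimal toy illustration (the data, seed, and hyperparameters are invented for the example, not the slides' implementation):

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, k=2, eta=0.01, lam=0.05, epochs=500):
    """Minimal SGD for R ~ P Q^T over (user, item, rating) triples."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors q_i
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - Q[i] @ P[x]        # eps_xi = r_xi - q_i . p_x^T
            q_old = Q[i].copy()          # use the old q_i in both updates
            Q[i] += eta * (err * P[x] - lam * Q[i])
            P[x] += eta * (err * q_old - lam * P[x])
    return P, Q

# Toy 2x2 example: both users rate item 0 high and item 1 low
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=2)
pred = P @ Q.T  # reconstructed utility matrix
```

Even this tiny run should reproduce the qualitative behavior: the reconstruction P Q^T ranks item 0 above item 1 for both users.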

RECSYS 75. Koren, Bell, Volinsky, IEEE Computer, 2009


RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: lower SSE means better recommendations.

We want to make good recommendations on items that some user has not yet seen. So: let's set values for P and Q such that they work well on the known (user, item) ratings, and hope these values for P and Q will also predict the unknown ratings well.

This is a case where we apply optimization methods.


[Example utility matrix with known ratings and missing entries]

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000–2005

• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514

• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
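The RMSE criterion is easy to make concrete (toy ratings invented for illustration, not Netflix data):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between paired rating lists."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Naive baseline: predict the global mean for every rating
actual = [5, 3, 4, 1, 2]
mean = sum(actual) / len(actual)
print(rmse([mean] * len(actual), actual))  # sqrt(2) ~ 1.414 on this toy data
```

On the real data, always predicting the global average scores 1.1296, the number the better methods below are measured against.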

[Sample tables: training data and test data as (user, movie, score) triples]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

ratings per movie

• Avg. ratings/movie: 5,627

ratings per user

• Avg. ratings/user: 208

Average movie rating by movie count

• More ratings to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x^T
(user bias + movie bias + user-movie interaction)

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

User-movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = −1
• Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 − 1 + 0.5 = 3.2

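The arithmetic above can be written as a one-line baseline predictor (a trivial sketch; the interaction term q_i · p_x is omitted, as in the slide's example):

```python
def baseline(mu, b_x, b_i):
    """Baseline predictor: global mean plus user and movie biases."""
    return mu + b_x + b_i

# The slide's example: mu = 3.7, critical reviewer b_x = -1, Star Wars b_i = +0.5
print(baseline(3.7, -1.0, 0.5))  # 3.2 (up to float rounding)
```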

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).


min over Q, P, b of:

Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x^T) )²   [goodness of fit]
 + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )   [regularization]

λ is selected via grid-search on a validation set.
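A hedged sketch of one SGD step for this objective (names and hyperparameters are my own choices; μ is held fixed at the global mean, and the biases are updated as parameters just like the factors):

```python
import numpy as np

def sgd_biased_step(mu, b_x, b_i, p_x, q_i, r_xi, eta=0.01, lam=0.1):
    """One stochastic gradient step for r_xi ~ mu + b_x + b_i + q_i . p_x."""
    err = r_xi - (mu + b_x + b_i + q_i @ p_x)   # residual for this rating
    # Biases and factors are all parameters, updated with the same recipe
    new_b_x = b_x + eta * (err - lam * b_x)
    new_b_i = b_i + eta * (err - lam * b_i)
    new_q_i = q_i + eta * (err * p_x - lam * q_i)
    new_p_x = p_x + eta * (err * q_i - lam * p_x)
    return new_b_x, new_b_i, new_p_x, new_q_i

# One step on a single rating should shrink the residual
p = np.array([0.1, 0.1]); q = np.array([0.1, 0.1])
b_x2, b_i2, p2, q2 = sgd_biased_step(3.7, 0.0, 0.0, p, q, r_xi=5.0)
```

Looping this step over all known ratings until convergence is the "fit the new model" recipe described above.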

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods (RMSE)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed

Movie age:
• Do users prefer new movies without any reason?
• Or are older movies just inherently better than newer ones?


Y. Koren, Collaborative filtering with temporal dynamics. KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics. KDD '09
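The linear-trend parameterization of b_x(t) can be sketched as follows, assuming the dev(t) = sign(t − t_u) · |t − t_u|^β form from Koren's KDD '09 paper; β and all numbers here are illustrative:

```python
import math

def dev(t, t_mean, beta=0.4):
    """Deviation of day t from the user's mean rating date: sign(x) * |x|^beta."""
    x = t - t_mean
    return math.copysign(abs(x) ** beta, x)

def user_bias(t, b_x, alpha_x, t_mean):
    """Time-dependent user bias b_x(t) = b_x + alpha_x * dev(t)."""
    return b_x + alpha_x * dev(t, t_mean)

# A user whose ratings drift upward over time (alpha_x > 0)
print(user_bias(t=300, b_x=-0.5, alpha_x=0.1, t_mean=100))
```

At t = t_mean the bias reduces to the static b_x; before and after, it drifts down or up along the learned trend.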

RECSYS 111

Adding Temporal Effects


[Chart: RMSE vs. millions of parameters, for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods (RMSE)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!
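One way to read the "kitchen sink" idea: fit linear blending weights over many base predictors on a held-out set. A toy least-squares sketch with invented numbers (not the actual prize-winning ensemble):

```python
import numpy as np

def blend_weights(preds, truth):
    """Least-squares blending weights; one column of preds per base predictor."""
    w, *_ = np.linalg.lstsq(preds, truth, rcond=None)
    return w

# Toy held-out set: two noisy predictors whose errors partly cancel
truth = np.array([4.0, 3.0, 5.0, 2.0])
preds = np.column_stack([
    truth + np.array([0.4, -0.4, 0.4, -0.4]),   # base predictor 1
    truth + np.array([-0.3, 0.3, -0.3, 0.3]),   # base predictor 2
])
w = blend_weights(preds, truth)
blended = preds @ w  # the blend can beat every individual predictor
```

When base predictors make complementary errors, the blend outperforms each of them, which is why adding even weak models kept lowering the ensemble RMSE.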

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the ensemble]


RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

The Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with The Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions:
• This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day. Only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that The Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline. The Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics. KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent from Andrew Ng's Machine Learning course on Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197
Page 71: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 74

Stochastic Gradient Descent

Stochastic gradient decent Initialize P and Q (using SVD pretend missing ratings are

0) Then iterate over the ratings (multiple times if necessary) and update

factorsFor each rxi

bull 120576120576119909119909119909119909 = 119903119903119909119909119909119909 minus 119902119902119909119909 sdot 119901119901119909119909119879119879 (derivative of the ldquoerrorrdquo)bull 119902119902119909119909 larr 119902119902119909119909 + 120578120578 120576120576119909119909119909119909 119901119901119909119909 minus 120582120582 119902119902119909119909 (update equation)bull 119901119901119909119909 larr 119901119901119909119909 + 120578120578 120576120576119909119909119909119909 119902119902119909119909 minus 120582120582 119901119901119909119909 (update equation)

2 for loops For until convergence

bull For each rxibull Compute gradient do a ldquosteprdquo

1162019

η hellip learning rate

RECSYS 75Koren Bell Volinksy IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal Make good recommendations Quantify goodness using SSE

So Lower SSE means better recommendations We want to make good recommendations on items

that some user has not yet seen Letrsquos set values for P and Q such that they work

well on known (user item) ratings And hope these values for P and Q will predict well

the unknown ratings

This is the a case where we apply Optimization methods

1162019

1 3 43 5 5

4 5 533

2

2 1 3

1

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 72: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 75 | Koren, Bell, Volinsky, IEEE Computer 2009

1162019

RECSYS 76

Summary Recommendations via Optimization

Goal: Make good recommendations. Quantify goodness using SSE: the lower the SSE, the better the recommendations.

We want to make good recommendations on items that some user has not yet seen. Let's set values for P and Q so that they work well on the known (user, item) ratings, and hope these values for P and Q will also predict the unknown ratings well.

This is a case where we apply optimization methods.


[Figure: example utility matrix with a few known ratings (1-5) and many missing entries]
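In code, the SSE objective over the known ratings can be sketched like this (toy data; P and Q are small hypothetical latent-factor matrices, not the actual Netflix factors):

```python
import numpy as np

def sse(ratings, P, Q):
    """Sum of squared errors over the known (user, item) ratings.

    ratings: iterable of (user_index, item_index, rating) triples
    P: user-factor matrix (num_users x k); Q: item-factor matrix (num_items x k)
    """
    return sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)

# Toy example: 2 users, 2 items, k = 2 latent factors
P = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[3.0, 1.0], [1.0, 4.0]])
known = [(0, 0, 3.0), (1, 1, 4.0), (0, 1, 2.0)]
error = sse(known, P, Q)  # only the third rating is mispredicted
```

Lowering this quantity on the known entries is exactly the optimization problem the slide describes.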

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

"We Know What You Ought To Be Watching This Summer"

Netflix Prize
• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005
• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514
• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on the Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement

[Figure: sample of the data. Training data: (user, movie, score) triples; test data: (user, movie, ?) pairs with the scores withheld]
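The evaluation criterion is easy to state in code (a minimal sketch):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between equal-length lists of ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

For scale: Cinematch scored an RMSE of 0.9514 on the test set, and the grand prize required roughly 0.8563.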

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

ratings per movie

• Avg ratings/movie: 5,627

ratings per user

• Avg ratings/user: 208

Average movie rating by movie count

• More ratings go to better movies

From TimelyDevelopment.com

Most loved movies

Count    Avg rating  Title
137,812  4.593       The Shawshank Redemption
133,597  4.545       Lord of the Rings: The Return of the King
180,883  4.306       The Green Mile
150,676  4.460       Lord of the Rings: The Two Towers
139,050  4.415       Finding Nemo
117,456  4.504       Raiders of the Lost Ark
180,736  4.299       Forrest Gump
147,932  4.433       Lord of the Rings: The Fellowship of the Ring
149,199  4.325       The Sixth Sense
144,027  4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly: movie 16322
• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x   (user bias b_x, movie bias b_i, user-movie interaction q_i · p_x)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction:
– Characterizes the matching between users and movies
– Attracts most research in the field
– Benefits from algorithmic and mathematical innovations

Baseline predictor:
– Separates users and movies
– Benefits from insights into users' behavior
– Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor: We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:
= 3.7 - 1 + 0.5 = 3.2
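The worked example maps directly onto a one-line baseline predictor (a sketch; the function name is ours):

```python
def baseline(mu, b_user, b_movie):
    """Baseline estimate: overall mean plus user and movie biases."""
    return mu + b_user + b_movie

# The slide's example: critical reviewer (b_x = -1), Star Wars (b_i = +0.5)
r_hat = baseline(3.7, -1.0, 0.5)  # about 3.2
```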


RECSYS 105

Fitting the New Model

Solve

Stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).


regularization

goodness of fit

λ is selected via grid-search on a validation set

min_{Q,P,b}  Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x) )²  +  λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )
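One SGD epoch for this objective might look like the following sketch (the learning rate and λ values are illustrative defaults, not the BellKor settings):

```python
import numpy as np

def sgd_epoch(ratings, P, Q, b_user, b_item, mu, lr=0.02, lam=0.1):
    """One stochastic-gradient pass over the known ratings.

    Updates biases and factors in place, following the gradient of
    (r - (mu + b_x + b_i + q_i . p_x))^2 plus the L2 penalties.
    """
    for u, i, r in ratings:
        err = r - (mu + b_user[u] + b_item[i] + P[u] @ Q[i])
        b_user[u] += lr * (err - lam * b_user[u])
        b_item[i] += lr * (err - lam * b_item[i])
        pu_old = P[u].copy()                      # keep old p_x for q_i's update
        P[u] += lr * (err * Q[i] - lam * P[u])
        Q[i] += lr * (err * pu_old - lam * Q[i])
```

In practice one runs many such epochs, monitoring RMSE on a validation set, with λ chosen by grid search as the slide notes.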

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563


RECSYS 109

Temporal Biases Of Users

• Sudden rise in the average movie rating (early 2004): improvements in Netflix; GUI improvements; meaning of rating changed
• Movie age: users prefer new movies without any particular reasons, yet older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x

Make the parameters b_x and b_i depend on time: (1) parameterize the time-dependence by linear trends; (2) each bin corresponds to 10 consecutive weeks.

Add temporal dependence to the factors: p_x(t) is the user preference vector on day t.

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
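The 10-week bins for the time-dependent movie bias b_i(t) can be sketched as follows (function names and the bin-size handling are ours):

```python
def time_bin(day: int, days_per_bin: int = 70) -> int:
    """Map a day offset (days since the start of the data) to a 10-week bin (70 days)."""
    return day // days_per_bin

def movie_bias_at(b_i: float, b_i_bins: list, day: int) -> float:
    """Time-dependent movie bias: b_i(t) = b_i + b_i,Bin(t)."""
    return b_i + b_i_bins[time_bin(day)]
```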

RECSYS 111

Adding Temporal Effects


[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach.

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the ensemble (1-50); RMSE falls steadily from 0.889 to about 0.871 as predictors are added]


RECSYS 115

Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call".

RECSYS 116

The Last 30 Days

• Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%
• BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble
• Strategy: both teams carefully monitor the leaderboard; the only sure way to check for improvement is to submit a set of predictions
  – This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

• Submissions limited to 1 a day: only 1 final submission could be made in the last 24h
• 24 hours before the deadline... a BellKor team member in Austria notices (by chance) that Ensemble posts a score slightly better than BellKor's
• Frantic last 24 hours for both teams: much computer time spent on final optimization, carefully calibrated to end about an hour before the deadline
• Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later; ...and everyone waits...


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 74: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 77

Backup Slides

RECSYS 78

The Netflix Challenge 2006-09

We Know What You OughtTo Be Watching This

Summer

Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.

24 hours before the deadline… a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • # ratings per movie
  • # ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data for the "Effect of ensemble size" chart: number of predictors vs. RMSE.

predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197
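The diminishing returns in this table are what averaging many noisy predictors should produce. A small simulation illustrates the shape; the setup is hypothetical (independent-noise predictors, which real ensemble members are not), so only the qualitative trend carries over:

```python
import math
import random

random.seed(0)

# Hypothetical ground truth: 10,000 "true" ratings on a 1-5 scale.
true_ratings = [random.uniform(1, 5) for _ in range(10000)]

def predictor(noise=0.9):
    """One simulated predictor: truth plus independent Gaussian noise."""
    return [r + random.gauss(0, noise) for r in true_ratings]

def rmse(pred):
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, true_ratings))
                     / len(pred))

def blend(preds):
    """Average the predictions of an ensemble, item by item."""
    return [sum(vals) / len(vals) for vals in zip(*preds)]

# With independent noise, RMSE shrinks roughly like 1/sqrt(k): the first few
# predictors help a lot, the fiftieth barely moves the needle.
for k in (1, 4, 16):
    print(k, round(rmse(blend([predictor() for _ in range(k)])), 3))
```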
Page 75: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 78

The Netflix Challenge 2006-09

We Know What You Ought To Be Watching This Summer

Netflix Prize

• Training data
  – 100 million ratings
  – 480,000 users
  – 17,770 movies
  – 6 years of data: 2000-2005

• Test data
  – Last few ratings of each user (2.8 million)
  – Evaluation criterion: root mean squared error (RMSE)
  – Netflix Cinematch RMSE: 0.9514

• Competition
  – 2,700+ teams
  – $1 million grand prize for 10% improvement on Cinematch result
  – $50,000 2007 progress prize for 8.43% improvement
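The evaluation criterion above, RMSE, is simple to state precisely; a minimal sketch:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over paired rating lists."""
    assert len(predicted) == len(actual)
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(predicted))

# Toy check: every prediction off by exactly 1 star gives RMSE = 1.0
print(rmse([3.0, 4.0, 2.0], [4.0, 3.0, 1.0]))  # → 1.0
```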

[Example tables: training data as (user, movie, score) triples with known scores; test data as (user, movie, ?) triples with the scores withheld.]

Movie rating data

Overall rating distribution

• A third of ratings are 4s
• Average rating is 3.68

From TimelyDevelopment.com

# ratings per movie

• Avg. # ratings/movie: 5627

# ratings per user

• Avg. # ratings/user: 208

Average movie rating by movie count

• More ratings go to the better movies

From TimelyDevelopment.com

Most loved movies

Count   Avg rating  Title
137812  4.593  The Shawshank Redemption
133597  4.545  Lord of the Rings: The Return of the King
180883  4.306  The Green Mile
150676  4.460  Lord of the Rings: The Two Towers
139050  4.415  Finding Nemo
117456  4.504  Raiders of the Lost Ark
180736  4.299  Forrest Gump
147932  4.433  Lord of the Rings: The Fellowship of the Ring
149199  4.325  The Sixth Sense
144027  4.333  Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie #16322)

• Use an ensemble of complementing predictors
• Two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases


RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i^T·p_x
(user bias + movie bias + user-movie interaction)

μ = overall mean rating; b_x = bias of user x; b_i = bias of movie i

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1
• Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
• Predicted rating for you on Star Wars:
  = 3.7 - 1 + 0.5 = 3.2
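The slide's example as arithmetic (all numbers taken from the slide):

```python
# Baseline predictor b_xi = mu + b_x + b_i, with the slide's numbers.
mu  = 3.7   # overall mean rating
b_x = -1.0  # critical reviewer: 1 star below the mean
b_i = 0.5   # Star Wars: 0.5 above the average movie

baseline = mu + b_x + b_i
print(round(baseline, 1))  # → 3.2
```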


RECSYS 105

Fitting the New Model

Solve:

Stochastic gradient descent to find the parameters.
Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).


Minimize goodness of fit plus regularization; λ is selected via grid-search on a validation set:

\min_{Q,P,b}\ \sum_{(x,i)\in R}\big(r_{xi}-(\mu+b_x+b_i+q_i^{T}p_x)\big)^2\ +\ \lambda\Big(\sum_i\|q_i\|^2+\sum_x\|p_x\|^2+\sum_x b_x^2+\sum_i b_i^2\Big)
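A minimal SGD loop for this objective on toy data. The five-rating dataset, dimensions, and step sizes are made up for illustration; the real system used vastly more data and many refinements:

```python
import random

random.seed(1)

# Toy ratings: (user x, item i, rating r_xi). Hypothetical data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k, lam, lr = 3, 3, 2, 0.1, 0.01

mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
b_x = [0.0] * n_users                               # user biases
b_i = [0.0] * n_items                               # item biases
p = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(x, i):
    return mu + b_x[x] + b_i[i] + sum(q[i][f] * p[x][f] for f in range(k))

for epoch in range(500):
    random.shuffle(ratings)
    for x, i, r in ratings:
        e = r - predict(x, i)                  # residual for this rating
        b_x[x] += lr * (e - lam * b_x[x])      # SGD step on user bias
        b_i[i] += lr * (e - lam * b_i[i])      # SGD step on item bias
        for f in range(k):                     # SGD step on latent factors
            p[x][f], q[i][f] = (p[x][f] + lr * (e * q[i][f] - lam * p[x][f]),
                                q[i][f] + lr * (e * p[x][f] - lam * q[i][f]))

train_rmse = (sum((r - predict(x, i)) ** 2 for x, i, r in ratings)
              / len(ratings)) ** 0.5
print(round(train_rmse, 3))  # well below the mean-only baseline (~1.41)
```

Each rating updates only the parameters it touches, which is what makes SGD scale to 100 million ratings.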

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

Global effects → Factorization → Collaborative filtering

RECSYS 107

Performance of Various Methods

[Figure: RMSE vs. millions of parameters for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE falls from about 0.92 toward 0.885 as the number of parameters grows.]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + biases: 0.89
Grand Prize: 0.8563


RECSYS 109

Temporal Biases Of Users

• Sudden rise in the average movie rating (early 2004): improvements in Netflix, GUI improvements, the meaning of a rating changed
• Movie age: users prefer new movies without any reason, or older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model:
r_xi = μ + b_x + b_i + q_i^T·p_x

Add time dependence to the biases:
r_xi = μ + b_x(t) + b_i(t) + q_i^T·p_x

Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Bin the biases: each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t)… user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
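The two parameterizations above can be sketched as follows; the helper names, bin offsets, and drift rate are hypothetical, and the linear drift is a simplification of Koren's deviation function:

```python
BIN_WEEKS = 10  # each bias bin covers 10 consecutive weeks

def b_i_binned(b_i_static, bin_offsets, week):
    """Item bias with a per-bin additive offset: b_i(t) = b_i + b_{i,Bin(t)}."""
    return b_i_static + bin_offsets[week // BIN_WEEKS]

def b_x_linear(b_x_static, alpha_x, t, t_mean):
    """User bias with a linear drift around the user's mean rating date."""
    return b_x_static + alpha_x * (t - t_mean)

# Week 25 falls in bin 2, so the third offset (-0.2) applies:
print(round(b_i_binned(0.5, [0.0, 0.1, -0.2], 25), 2))   # → 0.3
# A drift of 0.002/day over 100 days adds 0.2 to the static bias:
print(round(b_x_linear(-1.0, 0.002, 300, 200), 2))       # → -0.8
```

Binning suits items (many ratings per bin), while the per-user drift needs only one extra parameter per user.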



Netflix Prizebull Training data

ndash 100 million ratingsndash 480000 usersndash 17770 moviesndash 6 years of data 2000-2005

bull Test datandash Last few ratings of each user (28 million)ndash Evaluation criterion root mean squared error (RMSE) ndash Netflix Cinematch RMSE 09514

bull Competitionndash 2700+ teamsndash $1 million grand prize for 10 improvement on Cinematch resultndash $50000 2007 progress prize for 843 improvement

scoremovieuser121152131434524123237682576344541568523425

22345

57664566

scoremovieuser6219617232473153414284935

745

696836

Training data Test data

Movie rating data

Overall rating distribution

bull Third of ratings are 4sbull Average rating is 368

From TimelyDevelopmentcom

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example:
 Mean rating: μ = 3.7
 You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = -1
 Star Wars gets a mean rating of 0.5 higher than the average movie: b_i = +0.5
 Predicted rating for you on Star Wars:
  = 3.7 - 1 + 0.5 = 3.2
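The baseline arithmetic above can be sketched directly; the values below are the slide's Star Wars example, and the function name is illustrative:

```python
def baseline_prediction(mu, b_x, b_i):
    """Baseline predictor: overall mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# Slide's example: mu = 3.7, critical reviewer b_x = -1,
# Star Wars rated 0.5 above the average movie, b_i = +0.5.
rating = baseline_prediction(3.7, -1.0, 0.5)
print(round(rating, 1))  # 3.2
```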


RECSYS 105

Fitting the New Model

Solve:

$\min_{Q,P}\;\sum_{(x,i)\in R}\big(r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x^T)\big)^2 \;+\; \lambda\Big(\sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2\Big)$

The first sum is the goodness of fit; the second is the regularization term. λ is selected via grid-search on a validation set.

Stochastic gradient descent to find parameters.
Note: Both biases b_x, b_i as well as interactions q_i, p_x are treated as parameters (we estimate them).
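The objective above can be minimized with a few lines of stochastic gradient descent. This is a minimal pure-Python sketch (hyperparameters, initialization scale, and function names are illustrative assumptions, not the BellKor implementation):

```python
import random

def sgd_fit(ratings, n_users, n_items, k=2, lam=0.1, lr=0.01, epochs=200, seed=0):
    """Fit r_xi ~ mu + b_x + b_i + q_i . p_x by SGD on (user, item, rating)
    triples, with L2 regularization weight lam on biases and factors."""
    rng = random.Random(seed)
    data = list(ratings)                      # copy: shuffling must not mutate caller's list
    mu = sum(r for _, _, r in data) / len(data)
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        rng.shuffle(data)
        for x, i, r in data:
            err = r - (mu + b_u[x] + b_i[i]
                       + sum(P[x][f] * Q[i][f] for f in range(k)))
            # Gradient steps: move against the error, shrink toward 0 (regularization)
            b_u[x] += lr * (err - lam * b_u[x])
            b_i[i] += lr * (err - lam * b_i[i])
            for f in range(k):
                p, q = P[x][f], Q[i][f]
                P[x][f] += lr * (err * q - lam * p)
                Q[i][f] += lr * (err * p - lam * q)
    return mu, b_u, b_i, P, Q

def predict(model, x, i):
    mu, b_u, b_i, P, Q = model
    return mu + b_u[x] + b_i[i] + sum(p * q for p, q in zip(P[x], Q[i]))

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0), (2, 1, 1.0)]
model = sgd_fit(ratings, n_users=3, n_items=2)
```

In practice λ would be chosen by the grid search mentioned above: re-fit on a training split for each candidate λ and keep the one with the lowest RMSE on a held-out validation split.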

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
 Global
  • Overall deviations of users/movies
 Factorization
  • Addressing "regional" effects
 Collaborative filtering
  • Extract local patterns

RECSYS 107

Performance of Various Methods

[Figure: RMSE vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE between roughly 0.885 and 0.92]

RECSYS 108

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
 Improvements in Netflix
 GUI improvements
 Meaning of rating changed

Movie age:
 Users prefer new movies without any reason
 Older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model:
 r_xi = μ + b_x + b_i + q_i · p_x^T

Add time dependence to biases:
 r_xi = μ + b_x(t) + b_i(t) + q_i · p_x^T

Make parameters b_x and b_i depend on time:
 (1) Parameterize time-dependence by linear trends
 (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors:
 p_x(t)… user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
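The binned bias b_i(t) from point (2) can be sketched as follows; the day indexing and the bias values are illustrative assumptions, not learned parameters:

```python
BIN_DAYS = 70  # 10 consecutive weeks per bin, as on the slide

def time_bin(day):
    """Map a day index (days since the start of the data) to its bin."""
    return day // BIN_DAYS

def movie_bias(b_i, bin_offsets, day):
    """Time-dependent movie bias: b_i(t) = b_i + b_{i,Bin(t)}."""
    return b_i + bin_offsets.get(time_bin(day), 0.0)

# Illustrative values: a movie whose ratings drifted up during its
# third 10-week bin (bin index 2).
offsets = {2: 0.3}
print(movie_bias(0.5, offsets, day=150))  # day 150 falls in bin 2 -> 0.8
```

Each movie thus carries one static bias plus a small offset per time bin, all fit jointly with the rest of the model.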

RECSYS 111

Adding Temporal Effects

[Figure: RMSE vs. millions of parameters (1 to 10,000) as temporal effects are added: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; RMSE between roughly 0.875 and 0.92]

RECSYS 112

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Performance of Various Methods
 Basic Collaborative filtering: 0.94
 Latent factors: 0.90
 Latent factors + Biases: 0.89
 Collaborative filtering++: 0.91
 Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Figure: RMSE vs. number of predictors in the ensemble (1 to 50); error falls from about 0.889 with one predictor to about 0.871 with fifty]
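A minimal illustration of why ensembles of complementary predictors help: least-squares blending of two predictors on a validation set can never score worse, on that same set, than the better of the two alone. This is a toy sketch with made-up numbers, not the actual BellKor/Ensemble blend:

```python
def blend_weights(preds_a, preds_b, truth):
    """Least-squares weights (w_a, w_b) minimizing
    sum((w_a*a + w_b*b - r)^2), via the 2x2 normal equations."""
    aa = sum(a * a for a in preds_a)
    bb = sum(b * b for b in preds_b)
    ab = sum(a * b for a, b in zip(preds_a, preds_b))
    ar = sum(a * r for a, r in zip(preds_a, truth))
    br = sum(b * r for b, r in zip(preds_b, truth))
    det = aa * bb - ab * ab
    return (ar * bb - br * ab) / det, (aa * br - ab * ar) / det

def rmse(preds, truth):
    return (sum((p - r) ** 2 for p, r in zip(preds, truth)) / len(truth)) ** 0.5

truth = [4, 3, 5, 2, 4, 1]
a = [4.4, 2.6, 5.3, 2.2, 3.7, 1.2]  # predictor A with its own error pattern
b = [3.7, 3.3, 4.8, 1.9, 4.2, 0.8]  # predictor B, errors partly opposite
wa, wb = blend_weights(a, b, truth)
blended = [wa * x + wb * y for x, y in zip(a, b)]
```

Because the weight pairs (1, 0) and (0, 1) are themselves candidate blends, the fitted blend's squared error is bounded by each individual predictor's, which is the intuition behind "two half-tuned models are worth more than a single fully tuned model".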

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
 Group of other teams on the leaderboard forms a new team
 Relies on combining their models
 Quickly also get a qualifying score over 10%

BellKor:
 Continue to get small improvements in their scores
 Realize that they are in direct competition with Ensemble

Strategy:
 Both teams carefully monitor the leaderboard
 Only sure way to check for improvement is to submit a set of predictions
 • This alerts the other team of your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
 Only 1 final submission could be made in the last 24h

24 hours before deadline…
 BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's

Frantic last 24 hours for both teams:
 Much computer time on final optimization
 Carefully calibrated to end about an hour before deadline

Final submissions:
 BellKor submits a little early (on purpose), 40 mins before deadline
 Ensemble submits their final entry 20 mins later
 …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
 Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
 http://www2.research.att.com/~volinsky/netflix/bpc.html
 http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors  RMSE
1   0.889067
2   0.880586
3   0.878345
4   0.877715
5   0.876668
6   0.876179
7   0.875557
8   0.875013
9   0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.87283
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.87222
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.87183
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197
Page 80: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

ratings per movie

bull Avg ratingsmovie 5627

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
• Test and training data differ significantly (movie 16322)
• Use an ensemble of complementing predictors: two half-tuned models are worth more than a single fully tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i

r_xi = μ + b_x + b_i + q_i·p_x   (user bias + movie bias + user-movie interaction)

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


RECSYS 104

Putting It All Together

Example:
• Mean rating μ = 3.7
• You are a critical reviewer: your ratings are 1 star lower than the mean, b_x = -1
• Star Wars gets a mean rating 0.5 higher than the average movie, b_i = +0.5
• Predicted rating for you on Star Wars:
  3.7 - 1 + 0.5 = 3.2
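The arithmetic above amounts to a one-line baseline predictor; a minimal sketch (function and variable names are ours, not from the slides):

```python
# Baseline predictor b_xi = mu + b_x + b_i (no user-movie interaction term yet).
def baseline(mu, b_x, b_i):
    """Expected rating from the overall mean and the two biases alone."""
    return mu + b_x + b_i

# Slide example: mu = 3.7, critical reviewer b_x = -1, Star Wars b_i = +0.5
print(baseline(3.7, -1.0, 0.5))  # 3.2
```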


RECSYS 105

Fitting the New Model

Solve

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them).


min over Q, P, b:
  Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i·p_x) )²         ← goodness of fit
  + λ ( Σ_i ‖q_i‖² + Σ_x ‖p_x‖² + Σ_x b_x² + Σ_i b_i² )     ← regularization

λ is selected via grid-search on a validation set.
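This objective can be minimized with a few lines of stochastic gradient descent. Below is a toy sketch on synthetic ratings; the dataset, the hyperparameters (lam, eta), and all names are illustrative assumptions, not the BellKor configuration, and in practice λ would come from the grid-search mentioned above:

```python
import numpy as np

# Toy SGD for the biased latent-factor objective (illustrative sketch).
rng = np.random.default_rng(0)
n_users, n_movies, k = 30, 20, 3
ratings = [(int(rng.integers(n_users)), int(rng.integers(n_movies)),
            float(rng.integers(1, 6))) for _ in range(200)]

mu = sum(r for _, _, r in ratings) / len(ratings)   # overall mean rating
b_x = np.zeros(n_users)                             # user biases
b_i = np.zeros(n_movies)                            # movie biases
P = 0.1 * rng.standard_normal((n_users, k))         # user factors p_x
Q = 0.1 * rng.standard_normal((n_movies, k))        # movie factors q_i

lam, eta = 0.1, 0.01                                # regularization, step size
for epoch in range(20):
    for x, i, r in ratings:
        err = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])
        b_x[x] += eta * (err - lam * b_x[x])        # one gradient step per
        b_i[i] += eta * (err - lam * b_i[i])        # parameter group
        P[x], Q[i] = (P[x] + eta * (err * Q[i] - lam * P[x]),
                      Q[i] + eta * (err * P[x] - lam * Q[i]))

train_rmse = float(np.sqrt(np.mean(
    [(r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])) ** 2 for x, i, r in ratings])))
print(round(train_rmse, 3))   # training RMSE after fitting
```

Note that P[x] and Q[i] are updated from a tuple so each update uses the other's pre-step value.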

RECSYS 106

BellKor Recommender System: the winner of the Netflix Challenge

Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885 to 0.92) vs. millions of parameters (1 to 1000), for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases.]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91
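All of these scores are RMSE values on held-out ratings; as a reminder, a minimal computation (the function name and numbers are ours, not Netflix data):

```python
import math

# Root-mean-square error between predicted and true ratings.
def rmse(preds, truths):
    """RMSE, the Netflix Prize's error measure."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(preds))

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 3.0]))  # sqrt((0.25 + 0 + 1)/3) ≈ 0.645
```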


RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004):
• Improvements in Netflix
• GUI improvements
• Meaning of rating changed

Movie age:
• Users prefer new movies, without any reason
• Older movies are just inherently better than newer ones


Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model:
r_xi = μ + b_x + b_i + q_i·p_x

Add time dependence to the biases:
r_xi = μ + b_x(t) + b_i(t) + q_i·p_x

Make the parameters b_x and b_i depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t) … user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
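The two time-dependent bias forms can be sketched in a few lines. Everything here is our reading of the slide: the function names, the bin width in days, and the example numbers are illustrative assumptions:

```python
# Sketch of time-dependent biases: the movie bias gets a per-bin offset
# (bins of 10 consecutive weeks); the user bias gets a linear trend.
BIN_DAYS = 70  # 10 consecutive weeks per bin

def movie_bias_t(b_i, bin_offsets, t_day):
    """b_i(t): static movie bias plus the offset of the bin containing day t."""
    return b_i + bin_offsets[t_day // BIN_DAYS]

def user_bias_t(b_x, alpha_x, t_day, t_mean_day):
    """b_x(t): static user bias plus a linear drift around the user's mean rating day."""
    return b_x + alpha_x * (t_day - t_mean_day)

offsets = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]      # hypothetical upward drift
print(movie_bias_t(0.3, offsets, t_day=150))      # day 150 falls in bin 2: 0.3 + 0.1
print(user_bias_t(-1.0, 0.002, t_day=300, t_mean_day=100.0))
```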

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10,000), for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651

Global average: 1.1296

Performance of Various Methods

Basic Collaborative filtering: 0.94

Latent factors: 0.90

Latent factors + Biases: 0.89

Collaborative filtering++: 0.91

Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach.

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE, 0.870 to 0.8925) vs. number of predictors (1 to 50); blending lowers RMSE from 0.8891 with one predictor to 0.8712 with 50.]
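The chart's trend can be reproduced in miniature: a standard way to blend predictors is to fit linear weights on a validation set by least squares. A sketch on synthetic data follows; the predictors, noise levels, and names are illustrative stand-ins, not the actual Netflix ensemble:

```python
import numpy as np

# Linear blending sketch: learn weights for several predictors' outputs
# on a validation set by least squares (synthetic stand-in data).
rng = np.random.default_rng(1)
truth = rng.uniform(1, 5, size=100)                    # held-out true ratings
preds = np.stack([truth + rng.normal(0, s, size=100)   # three noisy predictors
                  for s in (0.6, 0.8, 1.0)])

weights, *_ = np.linalg.lstsq(preds.T, truth, rcond=None)
blend = weights @ preds                                # blended predictions

def rmse_of(p):
    return float(np.sqrt(np.mean((p - truth) ** 2)))

print([round(rmse_of(p), 3) for p in preds], round(rmse_of(blend), 3))
```

On the set the weights are fit on, the blend's RMSE can never exceed that of the best single predictor, since each single predictor is itself one choice of weights.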

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
• Group of other teams on the leaderboard forms a new team
• Relies on combining their models
• Quickly also gets a qualifying score of over 10% improvement

BellKor:
• Continues to get small improvements in their scores
• Realizes that they are in direct competition with Ensemble

Strategy:
• Both teams carefully monitor the leaderboard
• The only sure way to check for improvement is to submit a set of predictions
  – This alerts the other team to your latest score


RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day:
• Only 1 final submission could be made in the last 24h

24 hours before deadline…
• A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

Frantic last 24 hours for both teams:
• Much computer time spent on final optimization
• Carefully calibrated to end about an hour before the deadline

Final submissions:
• BellKor submits a little early (on purpose), 40 mins before the deadline
• Ensemble submits their final entry 20 mins later
• …and everyone waits…


RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
• Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another Type of Memory-based Collaborative Filtering: Item-Item Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g., SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g., SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 81: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

ratings per user

bull Avg ratingsuser 208

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 82: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Average movie rating by movie count

bull More ratings to better movies

From TimelyDevelopmentcom

Most loved movies

CountAvg ratingTitle137812 4593 The Shawshank Redemption 133597 4545 Lord of the Rings The Return of the King1808834306 The Green Mile 150676 4460Lord of the Rings The Two Towers 139050 4415 Finding Nemo 117456 4504 Raiders of the Lost Ark 180736 4299 Forrest Gump 147932 4433 Lord of the Rings The Fellowship of the ring149199 4325 The Sixth Sense 144027 4333 Indiana Jones and the Last Crusade

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Effect of ensemble size (chart data):

predictors  RMSE
1   0.889067
2   0.880586
3   0.878345
4   0.877715
5   0.876668
6   0.876179
7   0.875557
8   0.875013
9   0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.87283
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.87222
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.87183
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197

Most loved movies

Count    Avg rating  Title
137812   4.593       The Shawshank Redemption
133597   4.545       Lord of the Rings: The Return of the King
180883   4.306       The Green Mile
150676   4.460       Lord of the Rings: The Two Towers
139050   4.415       Finding Nemo
117456   4.504       Raiders of the Lost Ark
180736   4.299       Forrest Gump
147932   4.433       Lord of the Rings: The Fellowship of the Ring
149199   4.325       The Sixth Sense
144027   4.333       Indiana Jones and the Last Crusade

Challenges

• Size of data
  – Scalability
  – Keeping data in memory
• Missing data
  – 99 percent missing
  – Very imbalanced
• Avoiding overfitting
  – Test and training data differ significantly (e.g., movie 16322)
• Use an ensemble of complementing predictors
  – Two half-tuned models are worth more than a single fully tuned model
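The "two half-tuned models" point can be made concrete with a small blending sketch: fit linear combination weights for several predictors against held-out ratings. Everything below (the synthetic data, the noise levels, the plain least-squares blend) is illustrative and not the actual competition procedure.

```python
import numpy as np

def blend_weights(preds, y):
    """Fit linear blending weights for an ensemble of predictors by
    least squares on validation ratings (a sketch, not BellKor's blend)."""
    # preds: shape (n_ratings, n_predictors); y: true ratings
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    return w

def rmse(pred, y):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(y)) ** 2)))

# Two "half-tuned" but complementary predictors can beat either one alone:
rng = np.random.default_rng(0)
y = rng.uniform(1, 5, size=1000)                 # true ratings
p1 = y + rng.normal(0, 0.6, size=1000)           # predictor 1: noisy
p2 = y + rng.normal(0, 0.6, size=1000)           # predictor 2: independently noisy
preds = np.column_stack([p1, p2])
w = blend_weights(preds, y)
blended = preds @ w
assert rmse(blended, y) < min(rmse(p1, y), rmse(p2, y))
```

Because the two error sources are independent, the blend averages them away; this is the same logic that made large ensembles pay off in the competition.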

The BellKor recommender system


Extending Latent Factor Model to Include Biases


Modeling Biases and Interactions

μ = overall mean rating; bx = bias of user x; bi = bias of movie i

user bias + movie bias + user-movie interaction

User-Movie interaction:
• Characterizes the matching between users and movies
• Attracts most research in the field
• Benefits from algorithmic and mathematical innovations

Baseline predictor:
• Separates users and movies
• Benefits from insights into users' behavior
• Among the main practical contributions of the competition


Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– Rating scale of user x
– Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– (Recent) popularity of movie i
– Selection bias related to the number of ratings the user gave on the same day ("frequency")


Putting It All Together

Example:
• Mean rating: μ = 3.7
• You are a critical reviewer; your ratings are 1 star lower than the mean, so bx = -1
• Star Wars gets a mean rating 0.5 higher than the average movie, so bi = +0.5
• Predicted rating for you on Star Wars: 3.7 - 1 + 0.5 = 3.2
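The arithmetic above is just the baseline predictor μ + bx + bi applied to one (user, movie) pair; as a minimal sketch:

```python
def baseline(mu, b_x, b_i):
    """Baseline estimate: overall mean plus user bias plus movie bias."""
    return mu + b_x + b_i

# The Star Wars example from the slide: a critical reviewer (b_x = -1)
# rating a well-liked movie (b_i = +0.5) when the global mean is 3.7.
r = baseline(3.7, -1.0, 0.5)
assert abs(r - 3.2) < 1e-9
```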


Fitting the New Model

Solve

Use stochastic gradient descent to find the parameters. Note: both the biases bx, bi and the interaction factors qi, px are treated as parameters (we estimate them).


λ is selected via grid-search on a validation set.

\min_{Q,P,b} \underbrace{\sum_{(x,i)\in R} \big( r_{xi} - (\mu + b_x + b_i + q_i \cdot p_x) \big)^2}_{\text{goodness of fit}} + \underbrace{\lambda \Big( \sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 + \sum_x b_x^2 + \sum_i b_i^2 \Big)}_{\text{regularization}}
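One SGD epoch for this objective can be sketched as below. The learning rate, λ, factor dimension, and toy data are illustrative choices, not the settings used in the competition.

```python
import numpy as np

def sgd_epoch(ratings, mu, b_x, b_i, P, Q, lr=0.01, lam=0.1):
    """One stochastic-gradient-descent epoch over the observed ratings.
    ratings: list of (user, movie, rating) triples; P[x], Q[i]: factor vectors."""
    for x, i, r in ratings:
        err = r - (mu + b_x[x] + b_i[i] + Q[i] @ P[x])
        # Gradient steps on both biases and both factor vectors,
        # each shrunk by the regularization term lam.
        b_x[x] += lr * (err - lam * b_x[x])
        b_i[i] += lr * (err - lam * b_i[i])
        P[x], Q[i] = (P[x] + lr * (err * Q[i] - lam * P[x]),
                      Q[i] + lr * (err * P[x] - lam * Q[i]))

# Toy usage: 2 users, 2 movies, 2 latent factors.
rng = np.random.default_rng(1)
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
mu = float(np.mean([r for _, _, r in ratings]))
b_x, b_i = np.zeros(2), np.zeros(2)
P = 0.1 * rng.standard_normal((2, 2))
Q = 0.1 * rng.standard_normal((2, 2))
for _ in range(200):
    sgd_epoch(ratings, mu, b_x, b_i, P, Q)
```

Each observed rating updates only the parameters it touches, which is what makes SGD practical on 100M ratings.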


BellKor Recommender System
The winner of the Netflix Challenge.

Multi-scale modeling of the data: combine top-level "regional" modeling of the data with a refined local view:
• Global effects: overall deviations of users/movies
• Factorization: addressing "regional" effects
• Collaborative filtering: extract local patterns


Performance of Various Methods

[Chart: RMSE vs. millions of parameters (1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE improves from about 0.92 toward 0.885 as the models grow richer]


Performance of Various Methods

Reference points: Grand Prize: 0.8563; Netflix: 0.9514; Movie average: 1.0533; User average: 1.0651; Global average: 1.1296

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
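All of these scores are RMSE, the Netflix Prize metric; the metric itself is one line:

```python
import math

def rmse(preds, actuals):
    """Root-mean-square error, the Netflix Prize evaluation metric."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))

# Tiny example just exercising the formula: errors of 0 and 2 give sqrt(2).
assert abs(rmse([3, 4], [3, 2]) - math.sqrt(2)) < 1e-12
```

Note how small the gaps between methods are: going from 0.94 to 0.8563 on this metric took three years of work.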


Temporal Biases Of Users

• Sudden rise in the average movie rating (early 2004); possible explanations:
  – Improvements in the Netflix GUI
  – Meaning of a rating changed
• Movie age effect; possible explanations:
  – Users prefer new movies, without any other reason
  – Older movies are just inherently better than newer ones


Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09


Temporal Biases amp Factors

Original model:
r_xi = μ + b_x + b_i + q_i^T p_x

Add time dependence to the biases:
r_xi = μ + b_x(t) + b_i(t) + q_i^T p_x
• Make the parameters b_x and b_i depend on time:
  (1) Parameterize the time-dependence by linear trends
  (2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors:
p_x(t): user preference vector on day t
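A sketch of the binned time-dependent movie bias, b_i(t) = b_i + b_{i,Bin(t)}, with 70-day (10-week) bins as on the slide. The epoch date t0 and the dict-based parameter layout are assumptions for illustration only.

```python
from datetime import date

BIN_DAYS = 70  # 10 consecutive weeks per bin, as described above

def time_bin(t, t0=date(1998, 1, 1)):
    """Map a rating date to its bin index (t0 is an illustrative epoch)."""
    return (t - t0).days // BIN_DAYS

def item_bias_at(b_i, b_i_bin, i, t):
    """Binned time-dependent movie bias: b_i(t) = b_i + b_{i,Bin(t)}."""
    return b_i[i] + b_i_bin[i][time_bin(t)]

# Usage: movie 0 has a static bias of 0.3 plus a per-bin offset.
b_i = {0: 0.3}
b_i_bin = {0: {0: 0.1, 1: -0.2}}
assert abs(item_bias_at(b_i, b_i_bin, 0, date(1998, 2, 1)) - 0.4) < 1e-12
```

Per-bin offsets let the model track drifts like the early-2004 rating jump without giving every day its own free parameter for every movie.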

Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09


Adding Temporal Effects

[Chart: RMSE vs. millions of parameters (1 to 10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; RMSE improves from about 0.92 toward 0.875]


Performance of Various Methods

Reference points: Grand Prize: 0.8563; Netflix: 0.9514; Movie average: 1.0533; User average: 1.0651; Global average: 1.1296

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach.


Effect of ensemble size

[Chart: error (RMSE) vs. number of predictors in the ensemble, from 1 to 50; RMSE drops from 0.8891 with one predictor to 0.8712 with fifty; the underlying data appears in the table above]


Standing on June 26th 2009

June 26th submission triggers 30-day "last call"


The Last 30 Days

• Ensemble team formed
  – Group of other teams on the leaderboard forms a new team
  – Relies on combining their models
  – Quickly also gets a qualifying score (over 10% improvement)
• BellKor
  – Continue to get small improvements in their scores
  – Realize that they are in direct competition with Ensemble
• Strategy
  – Both teams carefully monitor the leaderboard
  – Only sure way to check for improvement is to submit a set of predictions
    • This alerts the other team of your latest score


24 Hours from the Deadline

• Submissions limited to 1 a day
  – Only 1 final submission could be made in the last 24h
• 24 hours before deadline...
  – BellKor team member in Austria notices (by chance) that Ensemble posts a score that is slightly better than BellKor's
• Frantic last 24 hours for both teams
  – Much computer time on final optimization
  – Carefully calibrated to end about an hour before the deadline
• Final submissions
  – BellKor submits a little early (on purpose), 40 mins before the deadline
  – Ensemble submits their final entry 20 mins later
  – ...and everyone waits...


Million $ Awarded Sept 21st 2009


Further reading:
• Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com


  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 84: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Challenges

bull Size of datandash Scalabilityndash Keeping data in memory

bull Missing datandash 99 percent missingndash Very imbalanced

bull Avoiding overfittingbull Test and training data differ significantly movie 16322

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 85: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

bull Use an ensemble of complementing predictorsbull Two half tuned models worth more than a single fully

tuned model

The BellKor recommender system

RECSYS 101

Extending Latent Factor Model to Include Biases

1162019

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data behind the "Effect of ensemble size" chart — predictors vs. RMSE:

predictors RMSE
1 0.889067
2 0.880586
3 0.878345
4 0.877715
5 0.876668
6 0.876179
7 0.875557
8 0.875013
9 0.874477
10 0.874209
11 0.873997
12 0.873691
13 0.873569
14 0.873352
15 0.873212
16 0.873035
17 0.872942
18 0.87283
19 0.872748
20 0.872654
21 0.872563
22 0.872468
23 0.872382
24 0.872275
25 0.87222
26 0.872201
27 0.872135
28 0.872086
29 0.872021
30 0.871945
31 0.871898
32 0.87183
33 0.871756
34 0.871722
35 0.871665
36 0.871632
37 0.871585
38 0.871558
39 0.871524
40 0.871482
41 0.871452
42 0.871428
43 0.871398
44 0.871346
45 0.871312
46 0.871287
47 0.871262
48 0.871242
49 0.871236
50 0.871197
RECSYS 101

Extending Latent Factor Model to Include Biases

11/6/2019

RECSYS 102

Modeling Biases and Interactions

r_xi = μ + b_x + b_i + q_i · p_x
(μ = overall mean rating, b_x = bias of user x, b_i = bias of movie i, q_i · p_x = user-movie interaction)

User-Movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field
- Benefits from algorithmic and mathematical innovations

Baseline predictor:
- Separates users and movies
- Benefits from insights into users' behavior
- Among the main practical contributions of the competition

RECSYS 103

Baseline Predictor

We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:
- Rating scale of user x
- Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
- (Recent) popularity of movie i
- Selection bias related to the number of ratings the user gave on the same day ("frequency")

RECSYS 104

Putting It All Together

Example: Mean rating μ = 3.7. You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = -1. Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5. Predicted rating for you on Star Wars:

= 3.7 - 1 + 0.5 = 3.2
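The arithmetic above can be sketched as a tiny baseline predictor (toy values taken from the example; the dictionary and function names are illustrative, not from the slides):

```python
# Toy baseline predictor for the worked example above.
# r_hat(x, i) = mu + b_x + b_i  (no interaction term yet)
mu = 3.7                        # overall mean rating
b_user = {"critic": -1.0}       # critical reviewer: 1 star below the mean
b_movie = {"star_wars": 0.5}    # Star Wars: 0.5 above the average movie

def baseline(user, movie):
    """Predict a rating from the global mean plus user and movie biases.

    Unknown users/movies fall back to a zero bias."""
    return mu + b_user.get(user, 0.0) + b_movie.get(movie, 0.0)

print(baseline("critic", "star_wars"))  # ~3.2
```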

RECSYS 105

Fitting the New Model

Solve:

min_{Q,P}  Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x) )²        [goodness of fit]
         + λ ( Σ_i ||q_i||² + Σ_x ||p_x||² + Σ_x b_x² + Σ_i b_i² )    [regularization]

λ is selected via grid-search on a validation set.

Use stochastic gradient descent to find the parameters. Note: both the biases b_x, b_i as well as the interactions q_i, p_x are treated as parameters (we estimate them).
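A single stochastic-gradient-descent step for this objective might look as follows (a minimal sketch, not the BellKor implementation; the learning rate `lr`, regularizer `lam`, and the toy data are assumptions):

```python
import numpy as np

def sgd_step(r, mu, b_x, b_i, p_x, q_i, lr=0.01, lam=0.1):
    """One SGD step on a single observed rating r = r_xi.

    Gradients follow the regularized squared-error objective above;
    returns updated copies of the biases and factor vectors."""
    err = r - (mu + b_x + b_i + q_i @ p_x)      # prediction error
    b_x_new = b_x + lr * (err - lam * b_x)
    b_i_new = b_i + lr * (err - lam * b_i)
    p_x_new = p_x + lr * (err * q_i - lam * p_x)
    q_i_new = q_i + lr * (err * p_x - lam * q_i)
    return b_x_new, b_i_new, p_x_new, q_i_new

# Toy usage: repeated steps on one rating drive the error down
rng = np.random.default_rng(0)
p, q = rng.normal(size=2), rng.normal(size=2)
bx, bi = 0.0, 0.0
for _ in range(200):
    bx, bi, p, q = sgd_step(4.0, mu=3.7, b_x=bx, b_i=bi, p_x=p, q_i=q)
```

In practice one such step is applied per observed rating, sweeping over the training set several times.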

RECSYS 106

BellKor Recommender System

The winner of the Netflix Challenge. Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined local view.
- Global effects: overall deviations of users/movies
- Factorization: addressing "regional" effects
- Collaborative filtering: extract local patterns

RECSYS 107

Performance of Various Methods

[Chart: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]

RECSYS 108

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004). Possible explanations:
- Netflix GUI improvements
- Meaning of rating changed

Movie age effect — possible explanations:
- Users prefer new movies without any reason
- Older movies are just inherently better than newer ones

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases & Factors

Original model: r_xi = μ + b_x + b_i + q_i · p_x

Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x

Make the parameters b_x and b_i depend on time:
(1) Parameterize time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: p_x(t) is the user preference vector on day t.

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
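The binned and linear time dependence described above can be sketched as follows (assumptions: 10-week bins as stated on the slide, a plain linear trend for the user bias, and illustrative parameter names):

```python
BIN_DAYS = 7 * 10   # each bin = 10 consecutive weeks

def movie_bias(t, b_i, b_i_bin):
    """Time-dependent movie bias: static part plus a per-bin offset.

    t is the day index since the start of the data; b_i_bin is a list
    of per-bin deviations."""
    return b_i + b_i_bin[t // BIN_DAYS]

def user_bias(t, b_x, alpha_x, t_mean):
    """Time-dependent user bias: static part plus a linear trend with
    slope alpha_x around the user's mean rating day t_mean."""
    return b_x + alpha_x * (t - t_mean)

# Toy usage: day 80 falls into bin 1 (days 70-139)
print(movie_bias(80, 0.5, [0.0, 0.1, -0.05]))   # ~0.6
```

These time-dependent biases slot directly into r_xi = μ + b_x(t) + b_i(t) + q_i · p_x in place of the static ones.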

RECSYS 111

Adding Temporal Effects

[Chart: RMSE (0.875-0.92) vs. millions of parameters (1-10,000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

RECSYS 112

Performance of Various Methods

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE, axis 0.87-0.8925) vs. number of predictors (1-50); per the data table above, RMSE drops steeply for the first few predictors and then flattens, from 0.889 with 1 predictor to about 0.8712 with 50]
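Combining many predictors as in the ensemble experiment can be sketched with a least-squares linear blend (toy synthetic data; the actual teams used far more models and more elaborate blending):

```python
import numpy as np

# Least-squares linear blend of several predictors on held-out data.
rng = np.random.default_rng(1)
truth = rng.uniform(1, 5, size=1000)                   # held-out true ratings
preds = np.stack([truth + rng.normal(0, s, size=1000)  # 3 noisy predictors
                  for s in (0.9, 1.0, 1.1)])

w, *_ = np.linalg.lstsq(preds.T, truth, rcond=None)    # blend weights
blend = w @ preds

def rmse(p):
    return float(np.sqrt(np.mean((p - truth) ** 2)))

# The blend is at least as accurate as the best single predictor
print(rmse(blend), [rmse(p) for p in preds])
```

Because the predictors' errors are partly independent, the blend's RMSE falls below any individual predictor's, which is the effect the chart above shows.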

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed:
- Group of other teams on the leaderboard forms a new team
- Relies on combining their models
- Quickly also gets a qualifying score of over 10% improvement

BellKor:
- Continue to get small improvements in their scores
- Realize that they are in direct competition with Ensemble

Strategy:
- Both teams carefully monitor the leaderboard
- Only sure way to check for improvement is to submit a set of predictions
  - This alerts the other team of your latest score

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day: only 1 final submission could be made in the last 24h.

24 hours before the deadline... a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams:
- Much computer time on final optimization
- Carefully calibrated to end about an hour before the deadline

Final submissions:
- BellKor submits a little early (on purpose), 40 mins before the deadline
- Ensemble submits their final entry 20 mins later
- ...and everyone waits...

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading:
- Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
- http://www2.research.att.com/~volinsky/netflix/bpc.html
- http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 87: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 102

Modeling Biases and Interactions

μ = overall mean rating bx = bias of user x bi = bias of movie i

user-movie interactionmovie biasuser bias

User-Movie interaction Characterizes the matching between

users and movies Attracts most research in the field Benefits from algorithmic and

mathematical innovations

Baseline predictor Separates users and movies Benefits from insights into userrsquos

behavior Among the main practical

contributions of the competition

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 88: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 103

Baseline Predictor We have expectations on the rating by

user x of movie i even without estimating xrsquos attitude towards movies like i

ndash Rating scale of user xndash Values of other ratings user

gave recently (day-specific mood anchoring multi-user accounts)

ndash (Recent) popularity of movie indash Selection bias related to

number of ratings user gave on the same day (ldquofrequencyrdquo)

1162019

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 89: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 104

Putting It All Together

Example Mean rating micro = 37 You are a critical reviewer your ratings are 1 star

lower than the mean bx = -1 Star Wars gets a mean rating of 05 higher than

average movie bi = + 05 Predicted rating for you on Star Wars

= 37 - 1 + 05 = 32

1162019

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

[Chart: RMSE vs. millions of parameters (log scale, 1 to 1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases; RMSE spans roughly 0.885 to 0.92.]

RECSYS 108

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating (early 2004). Improvements in Netflix: GUI improvements; the meaning of a rating changed.

Movie age effect, with two candidate explanations: users prefer new movies without any reason, or older movies are just inherently better than newer ones.

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

RECSYS 110

Temporal Biases amp Factors

Original model: rxi = µ + bx + bi + qi · px

Add time dependence to the biases: rxi = µ + bx(t) + bi(t) + qi · px

Make the parameters bx and bi depend on time:
(1) Parameterize the time-dependence by linear trends
(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to the factors: px(t) ... user preference vector on day t

Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
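One simple way to realize the time-dependent bias bi(t) sketched above is a static bias plus a per-bin offset, with each bin covering 10 consecutive weeks as on the slide. A minimal sketch; the day numbering and the example offsets are illustrative assumptions, not values from the paper.

```python
DAYS_PER_BIN = 70  # 10 consecutive weeks, as on the slide

def time_bin(day):
    # Map an absolute day number (day 0 = start of the dataset) to a bin index.
    return day // DAYS_PER_BIN

def item_bias(b_i_static, b_i_bins, day):
    # b_i(t) = b_i + b_{i,Bin(t)}: static movie bias plus a per-bin offset.
    return b_i_static + b_i_bins[time_bin(day)]

# Example: a movie whose bias drifts upward over time (offsets are made up).
bins = [0.0, 0.05, 0.12]              # offsets for bins 0, 1, 2
print(item_bias(0.5, bins, day=75))   # day 75 falls in bin 1
```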

RECSYS 111

Adding Temporal Effects

[Chart: RMSE vs. millions of parameters (log scale, 1 to 10,000) for successive models: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF; RMSE spans roughly 0.875 to 0.92.]

RECSYS 112

Grand Prize: 0.8563

Netflix: 0.9514

Movie average: 1.0533
User average: 1.0651
Global average: 1.1296

Basic Collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876

Still no prize! Getting desperate... Try a "kitchen sink" approach!

RECSYS 113

Effect of ensemble size

[Chart: Error (RMSE) vs. number of predictors in the blended ensemble, from 1 to 50; RMSE drops from about 0.889 with one predictor to about 0.871 with fifty, with sharply diminishing returns.]
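The diminishing-returns curve comes from a simple effect: averaging many diverse predictors cancels part of their independent errors. A toy illustration on synthetic data; this is a uniform average, not the teams' actual blend, which used learned combination weights.

```python
import random

rng = random.Random(42)
truth = [rng.uniform(1, 5) for _ in range(2000)]   # 2000 "true ratings"

def rmse(pred):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

# 5 predictors, each equal to the truth plus independent Gaussian noise.
predictors = [[t + rng.gauss(0, 0.5) for t in truth] for _ in range(5)]

# Uniform average of the 5 predictors.
ensemble = [sum(ps) / len(ps) for ps in zip(*predictors)]

print(min(rmse(p) for p in predictors))  # individual RMSE near 0.5
print(rmse(ensemble))                    # ensemble RMSE near 0.5 / sqrt(5)
```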

RECSYS 115

Standing on June 26th 2009

June 26th submission triggers 30-day "last call"

RECSYS 116

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team that relies on combining their models. They quickly also get a qualifying score (over 10% improvement).

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions.
bull This alerts the other team of your latest score.

RECSYS 117

24 Hours from the Deadline

Submissions were limited to 1 a day, so only 1 final submission could be made in the last 24h.

24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time spent on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. ...and everyone waits...

RECSYS 118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading: Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems & Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote: TF-IDF
  • User Profiles and Prediction
  • Pros: Content-based Approach
  • Cons: Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks & Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip: Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R: Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap: SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary: Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge, 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • Ratings per movie
  • Ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases & Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Data for the "Effect of ensemble size" chart: test RMSE by number of predictors.

predictors  RMSE
1   0.889067
2   0.880586
3   0.878345
4   0.877715
5   0.876668
6   0.876179
7   0.875557
8   0.875013
9   0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.87283
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.87222
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.87183
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197
Page 90: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 105

Fitting the New Model

Solve

Stochastic gradient decent to find parameters Note Both biases bu bi as well as interactions qi pu are treated

as parameters (we estimate them)

1162019

regularization

goodness of fit

λ is selected via grid-search on a validation set

( )

++++

+++minus

sumsumsumsum

sumisin

ii

xx

xx

ii

Rix

Txiixxi

PQ

bbpq

pqbbr

2222

2

)(

)(min

λ

micro

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 91: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 106

BellKor Recommender System The winner of the Netflix Challenge

Multi-scale modeling of the dataCombine top level ldquoregionalrdquomodeling of the data with a refined local view Global

bull Overall deviations of usersmovies Factorization

bull Addressing ldquoregionalrdquo effects Collaborative filtering

bull Extract local patterns

Global effects

Factorization

Collaborative filtering

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 92: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 107

Performance of Various Methods

0885

089

0895

09

0905

091

0915

092

1 10 100 1000

RMSE

Millions of parameters

CF (no time bias)

Basic Latent Factors

Latent Factors w Biases

1162019

RECSYS 108

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Number of predictors vs. blended RMSE (the data behind the "Effect of ensemble size" chart):

predictors  RMSE
1   0.889067
2   0.880586
3   0.878345
4   0.877715
5   0.876668
6   0.876179
7   0.875557
8   0.875013
9   0.874477
10  0.874209
11  0.873997
12  0.873691
13  0.873569
14  0.873352
15  0.873212
16  0.873035
17  0.872942
18  0.872830
19  0.872748
20  0.872654
21  0.872563
22  0.872468
23  0.872382
24  0.872275
25  0.872220
26  0.872201
27  0.872135
28  0.872086
29  0.872021
30  0.871945
31  0.871898
32  0.871830
33  0.871756
34  0.871722
35  0.871665
36  0.871632
37  0.871585
38  0.871558
39  0.871524
40  0.871482
41  0.871452
42  0.871428
43  0.871398
44  0.871346
45  0.871312
46  0.871287
47  0.871262
48  0.871242
49  0.871236
50  0.871197

Performance of Various Methods (RECSYS 108)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Grand Prize: 0.8563
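All of the scores above are root-mean-squared error (RMSE) between predicted and actual ratings on the held-out test set. A minimal sketch (the ratings sample here is made up for illustration):

```python
import math

def rmse(predictions, ratings):
    """Root-mean-squared error between predicted and true ratings."""
    n = len(ratings)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predictions, ratings)) / n)

# A constant "global average" predictor evaluated on a toy sample of true ratings:
truth = [4, 3, 5, 2, 4]
mean = sum(truth) / len(truth)
print(round(rmse([mean] * len(truth), truth), 4))  # 1.0198
```

The real "global average" row above (1.1296) is exactly this predictor applied to the Netflix test set.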

Temporal Biases of Users (RECSYS 109)

• Sudden rise in the average movie rating (early 2004):
  - Improvements in Netflix
  - GUI improvements
  - Meaning of rating changed
• Movie age:
  - Users prefer new movies without any reasons
  - Older movies are just inherently better than newer ones

[Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09]

Temporal Biases & Factors (RECSYS 110)

• Original model:
    r_xi = μ + b_x + b_i + q_i^T · p_x
• Add time dependence to the biases:
    r_xi = μ + b_x(t) + b_i(t) + q_i^T · p_x
• Make the parameters b_x and b_i depend on time:
  (1) Parameterize the time-dependence of the user bias b_x(t) by linear trends
  (2) Bin the item bias b_i(t); each bin corresponds to 10 consecutive weeks
• Add temporal dependence to the factors: p_x(t) … user preference vector on day t

[Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09]
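The time-dependent prediction rule above can be sketched as follows. All parameter values here are invented toy numbers (the real model learns them from data); the item bias is looked up in 10-week bins and the user bias follows a linear trend, matching the two parameterizations on the slide:

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict(mu, b_user, user_trend, b_item_bins, q_i, p_x, t_days, bin_weeks=10):
    """r_xi = mu + b_x(t) + b_i(t) + q_i^T p_x with simple temporal parameterizations."""
    b_x_t = b_user + user_trend * t_days                        # linear time trend for the user bias
    bin_idx = min(t_days // (7 * bin_weeks), len(b_item_bins) - 1)
    b_i_t = b_item_bins[bin_idx]                                # item bias binned by 10-week periods
    return mu + b_x_t + b_i_t + dot(q_i, p_x)

# Toy numbers: global mean 3.6, user bias 0.2 drifting upward, two item-bias bins
r = predict(mu=3.6, b_user=0.2, user_trend=0.001,
            b_item_bins=[0.1, 0.05], q_i=[0.3, -0.1], p_x=[1.0, 0.5], t_days=80)
print(round(r, 3))  # 4.18
```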

Adding Temporal Effects (RECSYS 111)

[Chart: RMSE (y-axis, 0.875–0.92) vs. millions of parameters (x-axis, 1–10,000, log scale). Series: CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF.]

Performance of Various Methods (RECSYS 112)

Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix (Cinematch): 0.9514
Basic collaborative filtering: 0.94
Collaborative filtering++: 0.91
Latent factors: 0.90
Latent factors + Biases: 0.89
Latent factors + Biases + Time: 0.876
Grand Prize: 0.8563

Still no prize! Getting desperate. Try a "kitchen sink" approach!

Effect of Ensemble Size (RECSYS 113)

[Chart: Error (RMSE), y-axis 0.87–0.8925, vs. number of predictors, x-axis 1–50; the underlying data is the predictors-vs-RMSE table above.]
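Why does adding predictors to the ensemble keep lowering RMSE? When two predictors make roughly independent errors, averaging them cancels part of the noise. A toy illustration (both predictors' outputs are invented; this is not the teams' actual blending scheme, which was more elaborate):

```python
import math

def rmse(pred, truth):
    """Root-mean-squared error between a prediction vector and the truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

truth  = [4.0, 3.0, 5.0, 2.0, 4.0]
pred_a = [4.4, 2.8, 4.6, 2.5, 3.7]   # two hypothetical predictors whose
pred_b = [3.7, 3.3, 5.2, 1.6, 4.2]   # errors point in different directions
blend  = [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

print(round(rmse(pred_a, truth), 3),  # 0.374
      round(rmse(pred_b, truth), 3),  # 0.29
      round(rmse(blend,  truth), 3))  # 0.063 -- the blend beats both inputs
```

In practice the winning teams fit weights for each predictor (e.g. by least squares on a probe set) rather than averaging uniformly, but the cancellation effect is the same.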

Standing on June 26th, 2009 (RECSYS 115)

The June 26th submission triggers the 30-day "last call."

The Last 30 Days (RECSYS 116)

• Ensemble team formed: a group of other teams on the leaderboard forms a new team; relies on combining their models; quickly also gets a qualifying score over 10%.
• BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.
• Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions — but this alerts the other team of your latest score.

24 Hours from the Deadline (RECSYS 117)

• Submissions limited to 1 a day; only 1 final submission could be made in the last 24h.
• 24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.
• Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.
• Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…

Million $ Awarded Sept 21st, 2009 (RECSYS 119)

Further Reading (RECSYS 120)

• Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09
• http://www2.research.att.com/~volinsky/netflix/bpc.html
• http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 94: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 109

Temporal Biases Of Users

Sudden rise in the average movie rating(early 2004) Improvements in Netflix GUI improvements Meaning of rating changed

Movie age Users prefer new movies

without any reasons Older movies are just

inherently better than newer ones

1162019

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 95: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 110

Temporal Biases amp Factors

Original modelrxi = micro +bx + bi + qi middotpx

T

Add time dependence to biasesrxi = micro +bx(t)+ bi(t) +qi middot px

T

Make parameters bu and bi to depend on time (1) Parameterize time-dependence by linear trends

(2) Each bin corresponds to 10 consecutive weeks

Add temporal dependence to factors px(t)hellip user preference vector on day t

Y Koren Collaborative filtering with temporal dynamics KDD rsquo09

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 96: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 111

Adding Temporal Effects

1162019

0875

088

0885

089

0895

09

0905

091

0915

092

1 10 100 1000 10000

RMSE

Millions of parameters

CF (no time bias)Basic Latent FactorsCF (time bias)Latent Factors w Biases+ Linear time factors+ Per-day user biases+ CF

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 97: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 112

Grand Prize 08563

Netflix 09514

Movie average 10533User average 10651

Global average 11296

Performance of Various Methods

Basic Collaborative filtering 094

Latent factors 090

Latent factors+Biases 089

Collaborative filtering++ 091

1162019 112

Latent factors+Biases+Time 0876

Still no prize Getting desperateTry a ldquokitchen sinkrdquo approach

RECSYS 113

[Chart: Effect of ensemble size — x-axis: Predictors (1 to 50), y-axis: Error (RMSE), falling from about 0.889 toward 0.871]

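The trend in the table and chart — each added predictor lowers the blended RMSE, with diminishing returns — can be reproduced in spirit with a toy least-squares blend of synthetic predictors. This is only an illustrative sketch with made-up data, not the BellKor ensemble code:

```python
import numpy as np

# Synthetic setup: a "true" rating vector and k_max noisy predictors of it.
rng = np.random.default_rng(0)
n, k_max = 5000, 20
truth = rng.normal(3.6, 1.0, n)                          # hypothetical true ratings
preds = truth[:, None] + rng.normal(0, 0.5, (n, k_max))  # independent noisy predictors

def blend_rmse(k: int) -> float:
    """RMSE of the least-squares blend of the first k predictors."""
    X = np.column_stack([np.ones(n), preds[:, :k]])  # intercept + k predictor columns
    w, *_ = np.linalg.lstsq(X, truth, rcond=None)    # fit blending weights
    return float(np.sqrt(np.mean((X @ w - truth) ** 2)))

rmses = [blend_rmse(k) for k in range(1, k_max + 1)]
# RMSE shrinks as the ensemble grows, but each extra predictor helps less.
```

Because the models are nested, the fitted RMSE can only decrease as predictors are added, which matches the shape of the curve: big gains from the first few predictors, tiny gains near 50.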
Standing on June 26th 2009

The June 26th submission triggers a 30-day "last call".

The Last 30 Days

Ensemble team formed: a group of other teams on the leaderboard forms a new team. It relies on combining their models, and quickly also gets a qualifying score over 10%.

BellKor: continue to get small improvements in their scores; realize that they are in direct competition with Ensemble.

Strategy: both teams carefully monitor the leaderboard. The only sure way to check for improvement is to submit a set of predictions, but this alerts the other team of your latest score.

24 Hours from the Deadline

Submissions were limited to 1 a day, so only 1 final submission could be made in the last 24h.

24 hours before the deadline, a BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's.

Frantic last 24 hours for both teams: much computer time on final optimization, carefully calibrated to end about an hour before the deadline.

Final submissions: BellKor submits a little early (on purpose), 40 mins before the deadline; Ensemble submits their final entry 20 mins later. …and everyone waits…

Million $ Awarded Sept 21st 2009

Further reading: Y. Koren, "Collaborative filtering with temporal dynamics," KDD '09.
http://www2.research.att.com/~volinsky/netflix/bpc.html
http://www.the-ensemble.com

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 98: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 113

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 99: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Effect of ensemble size

087

08725

0875

08775

088

08825

0885

08875

089

08925

1 8 15 22 29 36 43 50

Predictors

Erro

r - R

MSE

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 100: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Chart1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
0889067
0880586
0878345
0877715
0876668
0876179
0875557
0875013
0874477
0874209
0873997
0873691
0873569
0873352
0873212
0873035
0872942
087283
0872748
0872654
0872563
0872468
0872382
0872275
087222
0872201
0872135
0872086
0872021
0871945
0871898
087183
0871756
0871722
0871665
0871632
0871585
0871558
0871524
0871482
0871452
0871428
0871398
0871346
0871312
0871287
0871262
0871242
0871236
0871197
Page 101: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Sheet1

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
predictors RMSE
1 0889067
2 0880586
3 0878345
4 0877715
5 0876668
6 0876179
7 0875557
8 0875013
9 0874477
10 0874209
11 0873997
12 0873691
13 0873569
14 0873352
15 0873212
16 0873035
17 0872942
18 087283
19 0872748
20 0872654
21 0872563
22 0872468
23 0872382
24 0872275
25 087222
26 0872201
27 0872135
28 0872086
29 0872021
30 0871945
31 0871898
32 087183
33 0871756
34 0871722
35 0871665
36 0871632
37 0871585
38 0871558
39 0871524
40 0871482
41 0871452
42 0871428
43 0871398
44 0871346
45 0871312
46 0871287
47 0871262
48 0871242
49 0871236
50 0871197
Page 102: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Sheet1

RMSE
Predictors
Error - RMSE
Effect of ensemble size

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering: User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering: Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • Pros/Cons of Collaborative Filtering
  • Hybrid Methods
  • Remarks & Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender Systems: Latent Factor Models
  • Collaborative Filtering via Latent Factor Models (e.g. SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R: Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap: SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (e.g. SVD++ by the [BellKor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search for the optimal settings
  • Digression to the lecture notes of Regression and Gradient Descent by Andrew Ng's Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary: Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • # ratings per movie
  • # ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 103: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Sheet2

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 104: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

Sheet3

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 105: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 115

Standing on June 26th 2009

1162019June 26th submission triggers 30-day ldquolast callrdquo

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 106: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 116

The Last 30 Days

Ensemble team formed Group of other teams on leaderboard forms a new team Relies on combining their models Quickly also get a qualifying score over 10

BellKor Continue to get small improvements in their scores Realize that they are in direct competition with Ensemble

Strategy Both teams carefully monitoring the leaderboard Only sure way to check for improvement is to submit a set of

predictionsbull This alerts the other team of your latest score

1162019

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 107: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 117

24 Hours from the Deadline

Submissions limited to 1 a day Only 1 final submission could be made in the last 24h

24 hours before deadlinehellip BellKor team member in Austria notices (by chance) that

Ensemble posts a score that is slightly better than BellKorrsquos

Frantic last 24 hours for both teams Much computer time on final optimization Carefully calibrated to end about an hour before deadline

Final submissions BellKor submits a little early (on purpose) 40 mins before deadline Ensemble submits their final entry 20 mins later hellipand everyone waitshellip

1162019

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 108: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 118118

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

  • Slide Number 1
  • Acknowledgements
  • Content-based Systems amp Collaborative Filtering
  • High Dimensional Data
  • Example Recommender Systems
  • Recommendations
  • From Scarcity to Abundance
  • The Long Tail
  • Physical vs Online
  • Types of Recommendations
  • Formal Model
  • Utility Matrix
  • Key Problems
  • (1) Gathering Ratings
  • (2) Extrapolating Utilities
  • Content-based Recommender Systems
  • Content-based Recommendations
  • Plan of Action
  • Item Profiles
  • Sidenote TF-IDF
  • User Profiles and Prediction
  • Pros Content-based Approach
  • Cons Content-based Approach
  • Collaborative Filtering
  • Collaborative filtering
  • Collaborative Filtering (CF)
  • Example of Memory-based Collaborative Filtering User-User Collaborative Filtering
  • Similar Users
  • Similarity Metric
  • Rating Predictions
  • Another type of Memory-based Collaborative Filtering Item-Item based Collaborative Filtering
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Item-Item CF (|N|=2)
  • Common Practice for Item-Item Collaborative Filtering
  • Item-Item vs User-User
  • ProsCons of Collaborative Filtering
  • Hybrid Methods
  • Remarks amp Practical Tips
  • Evaluation
  • Evaluation
  • Evaluating Predictions
  • Problems with Error Measures
  • Collaborative Filtering Complexity
  • Tip Add Data
  • Recommender SystemsLatent Factor Models
  • Collaborative Filtering viaLatent Factor Models (eg SVD)
  • Slide Number 50
  • The Netflix Utility Matrix R
  • Utility Matrix R Evaluation
  • Latent Factor Models
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Ratings as Products of Factors
  • Latent Factor Models
  • Latent Factor Models
  • Recap SVD
  • Latent Factor Models
  • Dealing with Missing Entries
  • Recommendations via Latent Factor Models (eg SVD++ by the [Bellkor Team])
  • Dealing with Missing Entries
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • The Effect of Regularization
  • Use Gradient Descent to search forthe optimal settings
  • Degression to the lecture notes ofRegression and Gradient Descentby Andrew Ngrsquos Machine Learning course from Coursera
  • (Batch) Gradient Descent
  • Stochastic Gradient Descent
  • SGD vs GD
  • Stochastic Gradient Descent
  • Slide Number 75
  • Summary Recommendations via Optimization
  • Backup Slides
  • The Netflix Challenge 2006-09
  • Slide Number 79
  • Slide Number 80
  • Netflix Prize
  • Movie rating data
  • Overall rating distribution
  • ratings per movie
  • ratings per user
  • Average movie rating by movie count
  • Most loved movies
  • Challenges
  • The BellKor recommender system
  • Extending Latent Factor Model to Include Biases
  • Modeling Biases and Interactions
  • Baseline Predictor
  • Putting It All Together
  • Fitting the New Model
  • BellKor Recommender System
  • Performance of Various Methods
  • Performance of Various Methods
  • Temporal Biases Of Users
  • Temporal Biases amp Factors
  • Adding Temporal Effects
  • Performance of Various Methods
  • Slide Number 113
  • Slide Number 114
  • Standing on June 26th 2009
  • The Last 30 Days
  • 24 Hours from the Deadline
  • Slide Number 118
  • Million $ Awarded Sept 21st 2009
  • Further reading
Page 109: xuhappy.github.io · 2 days ago · RECSYS 2 Acknowledgements The slides used in this chapter are adapted from: CS246 Mining Massive Datasets, - by Jure Leskovec, Stanford University

RECSYS 119

Million $ Awarded Sept 21st 2009

RECSYS 120

Further reading Y Koren Collaborative filtering with temporal

dynamics KDD rsquo09 httpwww2researchattcom~volinskynetflixbpchtml httpwwwthe-ensemblecom

1162019

