movie recommendations using collaborative filtering


DESCRIPTION

This is a short and easy tutorial explaining how Collaborative Filtering works, using movie recommendations as the example.

TRANSCRIPT

Page 1: Movie Recommendations using Collaborative Filtering


The world of media today can be characterized by us being exposed to incredible amounts of content - new movies, songs, books, and all other media are released every day. Is this a problem? No! This is great, since it gives us a variety we never had before. What is not so good is that it's hard for us to keep up with all the new material. So the problem with media today is how to find the good stuff, and, given the huge amounts of media, we need that to be done automatically.

This is where the recommendation systems come into the picture...

In the next two weeks we will explore how recommendation systems produce recommendations. Even though we'll be talking about movie recommendations, keep in mind that the process itself is the same for music, books, cars, clothes, and anything else that can possibly be recommended.

The formulas are the same; I just thought it may be easier for us to relate to the results if they happen to involve popular movies. Besides, these two weeks have to be fun, right?...

You'll learn the math behind the most popular recommendation approach - Collaborative Filtering. You'll also try out several ways to calculate the similarity between people based on their movie taste, as well as several ways to combine multiple user ratings into one prediction. First we will do a small exercise (just 8 people and 4 movies) by hand, so that you can really see what is happening. Later we'll try a bigger example (60+ users and 50 movies) using the popular programming language Python. In the end you'll have to write a report (in groups of 5-6 people), and I'll also have the pleasure of meeting several of you during an oral exam at the end of the semester :)

So without further delay, let's start...

[Cover: 02525 - lecture material for a DTU course. Movie RECOMMENDATIONS using COLLABORATIVE FILTERING, by Andrius Butkus, DTU Informatics.]

Page 2: Movie Recommendations using Collaborative Filtering


As I already pointed out, the main challenge is how to AUTOMATICALLY find movies that a user hasnʼt seen yet but would like to see. How do we know that the user would like to see them? Well, we donʼt. We can just estimate. We make our estimations and predictions based on several assumptions. One of those assumptions is that:

“People will like movies that are similar to the ones they have liked before”

This seems obvious enough, but what do we mean by SIMILAR movies? What makes two movies similar? If they are similar, they must share something, right? In general we could say that two movies are similar if they share a number of features. The features here could be any type of information about the movie: genre, director, year of production, actors, keywords, tags, etc...

As you can see, some features are more meaningful than others. What makes two movies more similar - sharing a year of production or sharing a genre? Moreover, movie genres are vague categories with very unclear boundaries. User-generated keywords and tags normally give us a good idea of what a movie is like, but even here one has to be careful - certain keywords (features) carry more information and meaning than others. We as humans can see this quite easily, but if we want the process to be done automatically (and we do want that), then we need to teach machines to pick the “meaningful” features from the whole pile and then compare two movies based on those features. This is what content-based filtering is all about, and its main problem remains extracting “meaningful” features of a movie and then interpreting them.

Another approach to predict what movies a person will like is called Collaborative Filtering. First of all it starts with a different assumption:

“People will like movies that other similar people have liked before”

Much like the first one, this assumption is quite logical, and we have all seen it work in our own lives. If we need to find out what movie to see next, we quite often end up asking our friends. You have probably also noticed that some of your friends have “better” movie taste than others - “better” here meaning closer to your own. So far we have learned two things about collaborative filtering: first, we need a number of people who have seen something we haven't yet, so that we can use their opinions about the movie (ratings) to predict how much we'll love or hate it. Secondly, these people need to be similar to us in their movie taste. OK, now how can we see whether two people have similar taste in movies?

If you and your friend usually end up rating the same movies in a similar way, that is a good indication that you do indeed share a movie taste and can therefore be good advisors for each other when it comes to recommending new movies. So the third assumption is this:

“People are similar if they rated the same movies in the same way”

So what do we need to do now? Two things: first to find people who are similar to us based on the way they rate movies, and second to somehow combine their ratings for the movies that we havenʼt seen yet and by doing that weʼll get our recommendations. This is the essence of Collaborative Filtering.

Collaborative Filtering has one major advantage over content-based filtering: its ability to work with very limited input. It doesn't require complicated features of a movie to be extracted; instead it needs knowledge of who bought what and (if available) who rated what. That's all it really needs. Of course, this means that we cannot say why a certain movie was recommended to us - is it because it is a drama, or because Naomi Watts is in it? All we get is “people who bought this also bought this” type of recommendations. You are all familiar with the Amazon.com way of recommending things (see the picture below).

RECOMMENDATION APPROACHES: features or the users?

Page 3: Movie Recommendations using Collaborative Filtering


Now that you know the main idea behind Collaborative Filtering, let's get more technical and see what kind of math is hiding behind it.

As you already know, all that collaborative filtering needs as an input is a matrix containing information of which customer bought/rated which movie. This is called the interaction matrix. As you can imagine, this matrix is mostly empty. Even if you think youʼve seen many movies, itʼs very few compared to the total number of movies available today.

An individual entry in the matrix is the rating R(u,i) of user u for movie i. There are a number of rating schemes using different numbers of possible values - love/hate (Last.fm), 5 stars (Amazon.com), 10 stars (IMDb), etc... To keep things simple we'll use a common 5-star rating, where 3 is the neutral value surrounded by two positive and two negative values.

The overall goal of the algorithm is to predict the values in the currently empty cells of the interaction matrix - that is, to predict the rating R(u,i) for an item i and a user u. If we can do that, then it's easy to simply pick the top N movies and present them to the user as a sorted list.

As already mentioned, it's a two-step process:
STEP 1: form the neighborhood
STEP 2: predict the ratings

The neighborhood in this case refers to a group of people who have a similar taste for movies like you. Since the neighborhood is formed among users (calculated from the user-user similarity matrix) such collaborative filtering is called “user-based”.

Due to the dynamic nature of user profiles, both steps need to be performed “on-line”. This can be computationally very expensive and causes the scalability issue of such a system: while the method is good for small databases, it's too slow when we have millions of users and movies.

As a solution to this problem Amazon.com introduced “item-based” collaborative filtering. The only difference is that instead of calculating user-user similarities, we instead calculate item-item similarities, and then present the user with the “most similar” unseen movies to the ones that user has rated high.

The main advantage here is that item-item relationships are much less dynamic than user-user ones. The first step (forming the neighborhood) may therefore be calculated offline, and the scalability problem is solved. The difference in dynamics can be explained by the fact that every movie is normally rated by far more users than the number of movies seen by any single person. Thus adding a new user or a new movie into the system has different effects on the user-user and item-item similarity matrices.
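To make the contrast concrete, here is a minimal Python sketch of the item-item step (a hypothetical illustration of the idea, not Amazon.com's actual algorithm; the data is the small example matrix used later in these notes, and all names are my own):

```python
from math import sqrt

# Rows = users, columns = Dark Knight, The Fall, Saw, Love Actually
interaction = [
    [5, 4, 1, 4], [5, 3, 4, 2], [2, 4, 1, 5], [3, 5, 3, 5],
    [4, 3, 5, 1], [5, 2, 3, 3], [2, 1, 1, 2], [3, 3, 1, 5],
]

def euclidean_similarity(p, q):
    """Normalized Euclidean similarity: 1 / (1 + distance)."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

# Transpose the matrix: each column becomes one item's rating vector.
columns = list(zip(*interaction))

# Item-item similarities can be precomputed offline, since they change slowly.
item_sim = [[euclidean_similarity(c1, c2) for c2 in columns] for c1 in columns]
print(round(item_sim[0][1], 2))  # Dark Knight vs. The Fall → 0.17
```

The same similarity function works for both variants; only the axis of comparison changes (rows for user-based, columns for item-based).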

COLLABORATIVE FILTERING: how does it work?

Page 4: Movie Recommendations using Collaborative Filtering


The Euclidean Distance
The Euclidean distance is probably the simplest way to calculate similarities between users. It takes the movies that the selected set of users have all rated and uses them as the axes of a space. Users are mapped into the space based on their ratings for the selected movies. Similarity is then expressed as the Euclidean distance between people in this space. As you can guess, such a space will have as many dimensions as the number of movies we take.

If we use only one movie then the space will be one-dimensional and will have only one axis x. The distance between two people in such space P=(px), Q=(qx) can be calculated using this equation:

The formula is basically just one rating minus another, with the square and square root ensuring that the result is positive, since a distance can never be negative.

If we have a two dimensional space P=(px, py), Q=(qx, qy) then the formula turns into this:

You can see where this is going, right? So if we have an N-dimensional space P = (p1, p2, ..., pN), Q = (q1, q2, ..., qN), the formula looks like this:

You get the idea...

STEP ONE: selecting the neighbors
When it comes down to calculating similarities between a set of items (be they users or movies), there are a number of different methods. We'll look at two of them - the Euclidean distance and the Pearson correlation.

Imagine a user called Andrius. He has rated 4 movies in the following way:

There is also another movie that Andrius hasnʼt seen yet - “District 9”. Now the question is:

How much will Andrius like “DISTRICT 9”?

The answer has to be a rating in the range of 1 to 5 since thatʼs the rating scheme we are using here.

To answer that question we need several other users who have rated both the four movies Andrius has rated and “District 9”. This is all the input we need...

Let's start the practical part of the lecture by calculating the Euclidean distance using the example below. This is the collaborative-filtering example we'll stick to today. It is oversimplified, since we only have 8 users and 4 movies shared between them, but it will show you the basic principles and math behind collaborative filtering, which can later be applied to much bigger data sets.

The distance formulas referenced above, reconstructed:

1-D: $d(P, Q) = \sqrt{(p_x - q_x)^2} = |p_x - q_x|$

2-D: $d(P, Q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$

N-D: $d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_N - q_N)^2}$

            Dark Knight  The Fall  Saw  Love Actually  District 9
Michael          5           4      1         4             3
Christian        5           3      4         2             4
Gitte            2           4      1         5             5
Andrius          3           5      3         5             ?
Emilie           4           3      5         1             4
Sofie            5           2      3         3             4
Isabel           2           1      1         2             2
Wivi             3           3      1         5             2

Andrius's own ratings:

            Dark Knight  The Fall  Saw  Love Actually
Andrius          3           5      3         5

...ready?....

EXERCISE ONE: should Andrius see “District 9”?

Page 5: Movie Recommendations using Collaborative Filtering


Now if we want to calculate the similarity between two persons in this space we can use a formula given in the previous page.

Let's take Andrius and Gitte as an example. We get that the distance between them is 1.41 (using only the first two movies). See the picture above on the right-hand side.

This distance can be taken literally - you can imagine it as meters, for instance. The smaller the number, the closer two people are, so the range of values runs from 0 to infinity. This is great if we want to visualize the distance, but if we want to use it as a weighting function in our calculations we have to normalize it - make it fit between 0 and 1 (since it's a metric space there are no negative values).

How do we do that?...

A simple way to do this is to take the existing distance, add 1 to it (so that we don't get division by zero), and invert it: sim(P, Q) = 1 / (1 + d(P, Q)).

We get that the similarity between Andrius and Gitte is 0.41. In this case, the closer the value gets to 1 the more similar the users are. Now you can try to use all 4 movies to calculate the similarity between Andrius and Christian.
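The distance-to-similarity calculation can be sketched in Python (a minimal illustration; the function name and data layout are my own, not part of the course code):

```python
from math import sqrt

def euclidean_similarity(p, q):
    """Normalized Euclidean similarity: 1 / (1 + distance), in (0, 1]."""
    distance = sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    return 1 / (1 + distance)

# Ratings for "Dark Knight" and "The Fall" only:
andrius = [3, 5]
gitte = [2, 4]
print(round(euclidean_similarity(andrius, gitte), 2))  # → 0.41
```

Passing the full four-movie rating lists for Andrius ([3, 5, 3, 5]) and Christian ([5, 3, 4, 2]) lets you check the exercise above against the table.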

If we use this method to calculate similarities between every pair of users in the interaction matrix (using all four movies) we get the user-user similarity matrix that looks like this:

            Michael  Christian  Gitte  Andrius  Emilie  Sofie  Isabel  Wivi
Michael      1.00     0.21      0.24    0.24     0.16    0.25   0.18   0.29
Christian    0.21     1.00      0.16    0.19     0.37    0.37   0.18   0.18
Gitte        0.24     0.16      1.00    0.29     0.14    0.18   0.19   0.41
Andrius      0.24     0.19      0.29    1.00     0.17    0.20   0.15   0.26
Emilie       0.16     0.37      0.14    0.17     1.00    0.24   0.17   0.15
Sofie        0.25     0.37      0.18    0.20     0.24    1.00   0.21   0.22
Isabel       0.18     0.18      0.19    0.15     0.17    0.21   1.00   0.21
Wivi         0.29     0.18      0.41    0.26     0.15    0.22   0.21   1.00
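Building the whole user-user table can be sketched like this (assuming each user's ratings are stored as a list in a fixed movie order; the variable names are my own):

```python
from math import sqrt

# Four shared movies: Dark Knight, The Fall, Saw, Love Actually
ratings = {
    "Michael":   [5, 4, 1, 4],
    "Christian": [5, 3, 4, 2],
    "Gitte":     [2, 4, 1, 5],
    "Andrius":   [3, 5, 3, 5],
    "Emilie":    [4, 3, 5, 1],
    "Sofie":     [5, 2, 3, 3],
    "Isabel":    [2, 1, 1, 2],
    "Wivi":      [3, 3, 1, 5],
}

def euclidean_similarity(p, q):
    """Normalized Euclidean similarity: 1 / (1 + distance)."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

# Every pair of users, rounded to two decimals like the table above.
similarity = {
    u: {v: round(euclidean_similarity(ru, rv), 2) for v, rv in ratings.items()}
    for u, ru in ratings.items()
}
print(similarity["Andrius"]["Gitte"])  # → 0.29
```

The diagonal is always 1.0 (distance 0 to yourself), and the matrix is symmetric, so half of the work could be skipped in a real implementation.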

            Dark Knight  The Fall  Saw  Love Actually
Michael          5           4      1         4
Christian        5           3      4         2
Gitte            2           4      1         5
Andrius          3           5      3         5
Emilie           4           3      5         1
Sofie            5           2      3         3
Isabel           2           1      1         2
Wivi             3           3      1         5

[Figure: the 8 users mapped into the 2-D space with “Dark Knight” and “The Fall” as axes, both running from 0 to 5; the distance between Andrius and Gitte is 1.41.]

The normalized similarity formula:

$sim(P, Q) = \frac{1}{1 + \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}}$

As you saw on the previous page, the Euclidean distance can be calculated between two points in an N-dimensional space. In our case N = 4, since we have ratings for four movies. Since it is problematic to effectively visualize a space with more than 3 dimensions, let's take only two movies (“Dark Knight” and “The Fall”) and map all 8 users into that 2-D space based on their ratings, to make it easier to see.

Page 6: Movie Recommendations using Collaborative Filtering


What we get at the end of STEP 1 is a user-user similarity matrix. This matrix shows us which users are similar to Andrius. The similarity measure is based on the Euclidean distance and takes values from 0 to 1.

The next thing to do is to rank users in terms of their similarity to Andrius and select the most similar ones. This could be done in two ways:

ONE way is to select the N most similar users. This ensures that we have a sufficient number of “similar” users to make predictions, but we have no control over how similar they actually are to Andrius.

ANOTHER way is to set a certain similarity threshold and take everybody above that mark. This ensures that the selected users are sufficiently similar to Andrius, but we can't ensure that we have enough users to make quality predictions.

In our example, let's take the top 3 users - or, equivalently here, everybody above a similarity of 0.2.
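Both selection strategies can be sketched in Python (using the rounded similarities from Andrius's row of the table on the previous page; the function names are my own):

```python
sims_to_andrius = {"Michael": 0.24, "Christian": 0.19, "Gitte": 0.29,
                   "Emilie": 0.17, "Sofie": 0.20, "Isabel": 0.15, "Wivi": 0.26}

def top_n(sims, n):
    """Take the N most similar users, however similar they are."""
    return sorted(sims, key=sims.get, reverse=True)[:n]

def above_threshold(sims, t):
    """Take everybody strictly above a similarity threshold."""
    return [u for u, s in sims.items() if s > t]

print(top_n(sims_to_andrius, 3))  # → ['Gitte', 'Wivi', 'Michael']
```

With this particular data, a threshold of 0.2 selects the same three people, which is why either rule works in the example.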

The main problem with the Euclidean distance is that it doesn't take into consideration that two users may simply have different rating habits. Look at Isabel: her ratings are 2, 1, 1, 2. It seems she never gives higher values than that, so a rating of 2 may be quite high for her while being quite low for other users. Notice that the similarity between Isabel and Michael is only 0.18 - not much, even though we can see that in principle they both rank movies in a very similar way; they just use different ratings for them.

The Euclidean distance takes movie ratings one by one and “compares” them separately. It doesn't see the big picture. Sometimes people's individual ratings differ but still share general tendencies and patterns. One way to capture such patterns is to measure how well two sets of data fit on a straight line. This is known as...

...the Pearson Correlation

[Figure: bar chart of similarities to Andrius, from the table above - Gitte, Wivi, Michael, Sofie, Emilie, Isabel - with the top 3 marked.]

$sim(a, m) = \frac{\sum_{i \in I} (R_{a,i} - \bar{R}_a)(R_{m,i} - \bar{R}_m)}{\sqrt{\sum_{i \in I} (R_{a,i} - \bar{R}_a)^2} \sqrt{\sum_{i \in I} (R_{m,i} - \bar{R}_m)^2}}$

A slightly more sophisticated way to determine the similarity between people's preferences is to use the Pearson correlation coefficient. The formula is a bit more complicated than the Euclidean distance, but it tends to give better results in situations where the data isn't well normalized, as in the example of Michael vs. Isabel.

Take a look at the formula on the right-hand side. Notice that all we need as input are the individual ratings of each user for each movie, plus the average rating of each user. Since the formula is quite easy to calculate by hand, we'll do that for two users (let's stick with Andrius and Gitte).

The similarity score we get ranges from -1 to 1, and we can use it directly (it doesn't need to be normalized). A correlation of 1 shows that two users are as similar as possible; a correlation of -1 shows that they are complete opposites (which is a useful thing to know as well).

where
  sim(a, m) - similarity between users a and m
  R_{a,i} - rating of user a for item i
  R_{m,i} - rating of user m for item i
  \bar{R}_a - average rating of user a
  \bar{R}_m - average rating of user m
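The formula translates almost directly into Python (a sketch; the function name is my own, and a zero denominator - which occurs when a user gives every movie the same rating - is guarded against by returning 0):

```python
from math import sqrt

def pearson(a, m):
    """Pearson correlation between two equally long lists of ratings."""
    mean_a, mean_m = sum(a) / len(a), sum(m) / len(m)
    dev_a = [r - mean_a for r in a]
    dev_m = [r - mean_m for r in m]
    num = sum(x * y for x, y in zip(dev_a, dev_m))
    den = sqrt(sum(x * x for x in dev_a)) * sqrt(sum(y * y for y in dev_m))
    return num / den if den else 0.0  # constant raters have no correlation

andrius, gitte = [3, 5, 3, 5], [2, 4, 1, 5]
print(round(pearson(andrius, gitte), 2))  # → 0.95
```

Running the same function on Michael ([5, 4, 1, 4]) and Isabel ([2, 1, 1, 2]) gives 0.67, despite their very different rating levels.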

Page 7: Movie Recommendations using Collaborative Filtering


            Michael  Christian  Gitte  Andrius  Emilie  Sofie  Isabel  Wivi
Michael      1.00     0.00      0.53    0.33    -0.51    0.38   0.67   0.71
Christian    0.00     1.00     -0.85   -0.89     0.83    0.72   0.00  -0.63
Gitte        0.53    -0.85      1.00    0.95    -0.96   -0.44   0.32   0.89
Andrius      0.33    -0.89      0.95    1.00    -0.85   -0.69   0.00   0.71
Emilie      -0.51     0.83     -0.96   -0.85     1.00    0.27  -0.51  -0.96
Sofie        0.38     0.72     -0.44   -0.69     0.27    1.00   0.69   0.00
Isabel       0.67     0.00      0.32    0.00    -0.51    0.69   1.00   0.71
Wivi         0.71    -0.63      0.89    0.71    -0.96    0.00   0.71   1.00

[Figure: two scatter plots with best-fit lines - Andrius vs. Gitte and Andrius vs. Emilie - each axis running from 0 to 5.]

If we want to visualize the similarity between Andrius and Gitte using the Pearson correlation score, we have to plot both users' ratings on two axes (one axis per person) and then see how well the points are fitted by a straight line.

Here you see two examples of the Pearson correlation. Both graphs have 4 points because we have 4 movies to rate. The mapping itself is pretty straightforward - simply plot each movie in the space using Andriusʼs rating for it as Y coordinate and Gitteʼs as X. The red line that you see is called the best-fit line because it comes as close to all the movies on the chart as possible.

For the two graphs above, sim(a,m) = 0.95 (Andrius vs. Gitte) and sim(a,m) = -0.85 (Andrius vs. Emilie). The correlation depends on two things: first, how well the dataset fits on a straight line, and second, the slope of that line (positive or negative). The first contributes to the strength of the correlation, while the second determines whether it is positive or negative. How would the red line look if the ratings were identical?

The interesting aspect of the Pearson correlation is that it takes into consideration the general rating tendencies of two users and can produce a high correlation even if not a single rating is actually the same. Notice how Michael and Isabel are now similar to each other, with a correlation of 0.67.

The correlation between two datasets is a number, but how do we interpret it? What is “good” and “bad” here? The truth is that there is no universal answer - it all depends on the application. Sometimes even a correlation of 0.9 is not good enough. If we are talking about social networks with many users who rate things, then almost any positive correlation is already an indication that two users are similar enough for the information to be useful.

Now try to manually calculate the Pearson correlation score for Andrius and Sofie. In the table above you see the rest of the correlations already calculated.
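Once you have done the hand calculation, a small self-contained Python check is handy (the function and names are my own; the value matches the printed table):

```python
from math import sqrt

def pearson(a, m):
    """Pearson correlation between two equally long lists of ratings."""
    mean_a, mean_m = sum(a) / len(a), sum(m) / len(m)
    num = sum((x - mean_a) * (y - mean_m) for x, y in zip(a, m))
    den = (sqrt(sum((x - mean_a) ** 2 for x in a))
           * sqrt(sum((y - mean_m) ** 2 for y in m)))
    return num / den if den else 0.0

andrius, sofie = [3, 5, 3, 5], [5, 2, 3, 3]
print(round(pearson(andrius, sofie), 2))  # → -0.69
```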

Page 8: Movie Recommendations using Collaborative Filtering


After we get the user-user similarity table (no matter what similarity metric we choose to use) we want to select the users most similar to us. In our example the top 3 neighbors for Andrius are still the same (see the table on the right hand side).

Now that we have selected the neighbors and have their ratings for the movie “District 9”, how do we combine these three numbers to produce a recommendation for Andrius?

There are several ways to do this:

First of all, we could just take a simple average of the three ratings. The obvious drawback of this average is that it ignores the fact that some users are more similar to us than others - in other words, it treats everybody the same.

In order to take the user-user similarity into consideration, we multiply each rating by the similarity index, which we use as our weighting function. Then we divide everything not by the number of users (3) but by the sum of their individual similarities to Andrius.

Until now we still haven't taken into consideration that all 4 people (the 3 neighbors and Andrius) may have different rating habits. To take those differences into account, we adjust our prediction by including each user's average rating in the equation. Here's how it looks...

1. Simple average:
$pred(a, i) = \frac{\sum_{u \in NN} R_{u,i}}{k}$

2. Weighted average:
$pred(a, i) = \frac{\sum_{u \in NN} R_{u,i} \cdot sim(a, u)}{\sum_{u \in NN} sim(a, u)}$

3. Adjusted weighted average:
$pred(a, i) = \bar{R}_a + \frac{\sum_{u \in NN} (R_{u,i} - \bar{R}_u) \cdot sim(a, u)}{\sum_{u \in NN} sim(a, u)}$

(k is the number of neighbors in the neighborhood NN.)

STEP TWO: calculating the prediction

Top 3 nearest neighbors for Andrius, with their similarity and their rating for “District 9”:

            similarity   rating for “District 9”
Gitte          0.95               5
Wivi           0.71               2
Michael        0.33               3

Now calculate the prediction for Andrius using all three methods. Can you see the difference? Does it make sense?
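The three combination methods can be sketched in Python so you can check your hand calculations. One assumption on my part: each neighbor's average rating is taken over the four shared movies only; all names are my own:

```python
# (similarity to Andrius, rating for "District 9", mean of the 4 shared ratings)
neighbors = {
    "Gitte":   (0.95, 5, 3.0),   # mean of 2, 4, 1, 5
    "Wivi":    (0.71, 2, 3.0),   # mean of 3, 3, 1, 5
    "Michael": (0.33, 3, 3.5),   # mean of 5, 4, 1, 4
}
andrius_mean = 4.0               # mean of 3, 5, 3, 5

def simple_average(nb):
    """Method 1: plain average, every neighbor counts the same."""
    return sum(r for _, r, _ in nb.values()) / len(nb)

def weighted_average(nb):
    """Method 2: weight each rating by the neighbor's similarity."""
    num = sum(s * r for s, r, _ in nb.values())
    return num / sum(s for s, _, _ in nb.values())

def adjusted_weighted(nb, own_mean):
    """Method 3: weight rating deviations, then shift by our own mean."""
    num = sum(s * (r - m) for s, r, m in nb.values())
    return own_mean + num / sum(s for s, _, _ in nb.values())

print(round(simple_average(neighbors), 2))                   # → 3.33
print(round(weighted_average(neighbors), 2))                 # → 3.6
print(round(adjusted_weighted(neighbors, andrius_mean), 2))  # → 4.52
```

Note how the adjusted prediction is noticeably higher: Gitte, the most similar neighbor, rated “District 9” well above her own average, and the third method is the only one that can see that.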

Page 9: Movie Recommendations using Collaborative Filtering


EXERCISE TWO: with your own ratings

here’s the plan:

1. work in groups
2. in the group, select 6 movies of your choice
3. each person rates all 6 of them (5-star rating scheme)
4. now you have 1 interaction matrix per group, all filled with ratings
5. each of you discards one movie of your choice (pretend that you haven't seen it)
6. now you have one interaction matrix per person, with one empty cell
7. the task for each of you is to calculate the predicted rating for the “missing” movie

1. calculate the user-user similarities using the Euclidean distance
2. select the top 3 closest neighbors to you
3. calculate the user-user similarities using the Pearson correlation
4. select the top 3 closest neighbors to you
5. each of you calculates the predictions for one “missing” movie
6. use all three ways to combine the ratings of your neighbors
7. now you have 6 predicted ratings: 3 using the Euclidean distance and 3 using the Pearson correlation

1. discuss in the group what seems to work best in your case
2. what weaknesses can you identify in the Collaborative Filtering approach?

PREPARATION PHASE

CALCULATION PHASE

DISCUSSION PHASE

[email protected]

THANK YOU FOR READING