recommenders systems

Recommender Systems

Collaborative Filtering &Content-Based Recommending

1

Recommender Systems

Systems for recommending items (e.g. books, movies, CD’s, web pages, newsgroup messages) to users based on examples of their preferences.

Many on-line stores provide recommendations (e.g. Amazon, CDNow).

Recommenders have been shown to substantially increase sales at on-line stores.

2

Book Recommender

3

RedMars

Juras-sicPark

LostWorld

2001

Foundation

Differ-enceEngine

Machine Learning

UserProfile

Neuro-mancer

2010

4

Utility matrixIn a recommendation-system application there are two classes of entities, which we shall refer to as users and items.Users have preferences for certain items, and these preferences must be teased out of the data. The data itself is represented as a utility matrix, giving for each user-item pair, a value that represents what is known about the degree of preference of that user for that item.We assume that the matrix is sparse, meaning that most entries are “unknown.” An unknown rating implies that we have no explicit information about the user’s preference for the item.

5

Example

The goal of a recommendation system is to predict the blanks in the utility matrix.

6

Long Tail Phenomena The distinction between the physical and on-line worlds has been called the long tail phenomenon.The long-tail phenomenon forces on-line institutions to recommend items to individual users. It is not possible to present all available items to the user, the way physical institutions can.

7

Populating the Utility MatrixThere are two way:1. We can ask users to rate items.2. We can make inferences from users’

behavior.

8

Approaches of recommender system

There are two basic approaches to recommending:1. Collaborative Filtering 2. Content-based Filtering

Collaborative Filtering

Maintain a database of many users’ ratings of a variety of items.

For a given user, find other similar users whose ratings strongly correlate with the current user.

Recommend items rated highly by these similar users, but not rated by the current user.

Almost all existing commercial recommenders use this approach (e.g. Amazon).

9


10

A 9B 3C: :Z 5

A B C 9: :Z 10

A 5B 3C: : Z 7

A B C 8: : Z

A 6B 4C: :Z

A 10B 4C 8. .Z 1

UserDatabase

ActiveUser

CorrelationMatch

A 9B 3C . .Z 5

A 9B 3C: :Z 5

A 10B 4C 8. .Z 1

ExtractRecommendations

C

Collaborative Filtering Method Weight all users with respect to similarity with the

active user. Select a subset of the users (neighbors) to use as

predictors. Normalize ratings and compute a prediction from

a weighted combination of the selected neighbors’ ratings.

Present items with highest predicted ratings as recommendations.

11

12

Measuring Similarity

The first question we must deal with is how to measure similarity of users or items from their rows or columns in the utility matrix.Jaccard Distance: lets assume A and B have an intersection of size 1 and a union of size 5.Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e.,they are very far apart.

13

Conti….Cosine Distance: We can treat blanks as a 0 value. This choice is questionable, since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it.Rounding the Data:We could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings. For instance, we could consider ratings of 3, 4, and 5 as a “1” and consider ratings 1 and 2 as unrated.

14

Conti….

Normalizing the rating:If we normalize ratings, by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers.

15

User based similarity

Let say,Sim(User1, User2)= Σj | rating (User1, Item j) – rating|(User2, Item j) /Num. of items

Problems with Collaborative Filtering Cold Start: There needs to be enough other

users already in the system to find a match. Sparsity: If there are many items to be

recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items.

First Rater: Cannot recommend an item that has not been previously rated. New items Esoteric items

Popularity Bias: Cannot recommend items to someone with unique tastes. Tends to recommend popular items.

16

Content-Based Recommending Recommendations are based on

information on the content of items rather than on other users’ opinions.

Uses a machine learning algorithm to induce a profile of the users preferences.

Some previous applications: Newsweeder (Lang, 1995) Syskill and Webert (Pazzani et al., 1996)

17

18

Item profileIn a content-based system, we must construct for each item a profile, which is a record or collection of records representing important characteristics of that item. 1. The set of actors of the movie. Some

viewers prefer movies with their favorite actors.

2. The director. Some viewers have a preference for the work of certain directors.

3. The year in which the movie was made. Some viewers prefer old movies; others watch only the latest releases.

4. The genre or general type of movie. Some viewers like only comedie, others dramas or romances.

19

Discovering Features of DocumentsThere are other classes of items where it is not immediately apparent what the values of features should be. We shall consider two of them: document collections and images. First, eliminate stop words –the several

hundred most common words, which tend to say little about the topic of a document.

For the remaining words, compute the TF.IDF score for each word in the document.

The ones with the highest scores are the words that characterize the document.

20

Obtaining Item Features From Tags The problem with images is that their

data, typically an array of pixels, does not tell us anything useful about their features.

There have been a number of attempts to obtain information about features of items by inviting users to tag the items by entering words or phrases that describe the item.

21

User Profiles

We need to create vectors with the same components that describe the user’s preferences. We have the utility matrix representing the connection between users and items.

22

Recommending Items to Users Based on Content

With profile vectors for both users and items, we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user’s and item’s vectors.

Advantages of Content-Based Approach

No need for data on other users. No cold-start or sparsity problems.

Able to recommend to users with unique tastes.

Able to recommend new and unpopular items No first-rater problem.

Can provide explanations of recommended items by listing content-features that caused an item to be recommended.

23

Disadvantages of Content-Based Method Requires content that can be

encoded as meaningful features. Users’ tastes must be represented as

a learnable function of these content features.

Unable to exploit quality judgments of other users. Unless these are somehow

included in the content features.24

LIBRALearning Intelligent Book Recommending Agent Content-based recommender for books using

information about titles extracted from Amazon.

Uses information extraction from the web to organize text into fields: Author Title Editorial Reviews Customer Comments Subject terms Related authors Related titles

25

LIBRA System

26

Amazon Pages

Rated Examples

User Profile

Machine Learning

Learner

InformationExtraction

LIBRA Database

Recommendations

1.~~~~~~2.~~~~~~~3.~~~~~::: Predictor

Sample Amazon Page

27

Age of Spiritual Machines

http://www.amazon.com/exec/obidos/ASIN/0140282025

Sample Extracted Information

28

Title: <The Age of Spiritual Machines: When Computers Exceed Human Intelligence>Author: <Ray Kurzweil>Price: <11.96>Publication Date: <January 2000>ISBN: <0140282025>Related Titles: <Title: <Robot: Mere Machine or Transcendent Mind> Author: <Hans Moravec> > …Reviews: <Author: <Amazon.com Reviews> Text: <How much do we humans…> > …Comments: <Stars: <4> Author: <Stephen A. Haines> Text:<Kurzweil has …> > …Related Authors: <Hans P. Moravec> <K. Eric Drexler>…Subjects: <Science/Mathematics> <Computers> <Artificial Intelligence> …

Libra Content Information

Libra uses this extracted information to form “bags of words” for the following slots: Author Title Description (reviews and comments) Subjects Related Titles Related Authors

29

Implementation Stopwords removed from all bags. A book’s title and author are added to its own related title and

related author slots. All probabilities are smoothed using Laplace estimation to

account for small sample size. Lisp implementation is quite efficient:

Training: 20 exs in 0.4 secs, 840 exs in 11.5 secs Test: 200 books per second

30

Experimental Data

Amazon searches were used to find books in various genres.

Titles that have at least one review or comment were kept.

Data sets: Literature fiction: 3,061 titles Mystery: 7,285 titles Science: 3,813 titles Science Fiction: 3.813 titles

31

Rated Data

4 users rated random examples within a genre by reviewing the Amazon pages about the title: LIT1 936 titles LIT2 935 titles MYST 500 titles SCI 500 titles SF 500 titles

32

Combining Content and Collaboration Content-based and collaborative methods have

complementary strengths and weaknesses. Combine methods to obtain the best of both. Various hybrid approaches:

Apply both methods and combine recommendations.

Use collaborative data as content. Use content-based predictor as another

collaborator. Use content-based predictor to complete

collaborative data.

33

Content-Boosted Collaborative Filtering

34

IMDbEachMovie Web Crawler

MovieContentDatabase

Full UserRatings Matrix


Active User Ratings

User RatingsMatrix (Sparse) Content-based

Predictor

Recommendations

Content-Boosted CF - I

35

Content-Based Predictor

Training Examples

Pseudo User-ratings Vector

Items with Predicted Ratings

User-ratings Vector

User-rated Items

Unrated Items

Conclusions

Recommending and personalization are important approaches to combating information over-load.

Machine Learning is an important part of systems for these tasks.

Collaborative filtering has problems. Content-based methods address these problems

(but have problems of their own). Integrating both is best.

36

recommenders systems

Technology