content-based book recommending using learning for text categorization

CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION

TRIVIKRAM BHAT

UNIVERSITY OF TEXAS AT ARLINGTON

DATA MINING

CSE6362

BASED ON PAPER

BY

RAYMOND J. MOONEY AND LORIENE ROY

UNIVERSITY OF TEXAS, AUSTIN

2

OVERVIEW

• Introduction

• Techniques

• Drawbacks of Existing Systems

• Advantages of Content Based Systems

• LIBRA

• System Description

• Experimental Results

• Future Work

• Conclusions

3

INTRODUCTION

General goal of a Recommender System

• Make personalized suggestions based on previous examples of a user's likes and dislikes

Types

• Existing systems that use Social Filtering methods (base recommendations on other users' preferences)

• Content Based systems (use information about an item itself to make suggestions)

4

INTRODUCTION

Companies

• Firefly

• Net Perceptions

• LikeMinds

• Amazon (Book Recommending)

• Barnes and Noble (Book Recommending)

5

TECHNIQUES

Social / Collaborative Filtering

• Maintain a Database of user preferences

• Find other users whose known preferences correlate significantly with a given user

Content Based Filtering

• Allows a system to uniquely characterize each user without having to match their interests to someone else’s

• Items are recommended based on the information of the item itself

6

DRAWBACKS OF EXISTING SYSTEMS

• Assume that a given user's tastes are generally the same as another user's

• Assume that there is a sufficient number of ratings

• Tend to recommend popular titles

• Need sufficient information about other users, which raises concerns about privacy and access to customer data

7

ADVANTAGES OF CONTENT BASED SYSTEMS

• Items are recommended based on the content of the item rather than on other users' preferences

• Provides a way to list content features that caused the item to be recommended

• Allows users to provide initial subject information to aid the system

8

LIBRA (Learning Intelligent Book Recommending Agent)

• A database of book information extracted from web pages at Amazon.com

• Users select a set of training books and rate them on a scale of 1-10

• System learns a profile of the user using a Bayesian learning algorithm

• Produces a ranked list of the most recommended additional titles from the system catalog

9

SYSTEM DESCRIPTION

Extracting information and building a database

• Perform Amazon subject search

• Download book description URLs

• Information Extraction using slots to get valuable information about each book

• Current slots used are title, authors, published reviews and many more

• A simple extraction system is sufficient as the layout of Amazon’s automatically generated pages is regular

• Some preprocessing is done (for example, author names are converted into unique tokens of the form first-initial_last-name)
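The author-name preprocessing step can be sketched as follows. This is only an illustration: the function name `author_token`, the upper-casing (suggested by the token style in the sample profile table), and the handling of single-word names are assumptions, not details taken from LIBRA itself.

```python
import re


def author_token(name: str) -> str:
    """Collapse an author name into one token: first initial + '_' + last name.

    Upper-casing and the single-word fallback are assumptions for this sketch.
    """
    # Split on whitespace and periods so "J. Doe" and "Jane Doe" behave alike.
    parts = [p for p in re.split(r"[\s.]+", name.strip()) if p]
    if not parts:
        raise ValueError("empty author name")
    if len(parts) == 1:
        return parts[0].upper()
    return f"{parts[0][0]}_{parts[-1]}".upper()


# Hypothetical author names, used only to show the token shape.
example_tokens = [author_token("Jane Doe"), author_token("J. Q. Public")]
```

Treating each author as a single token keeps two authors who share a last name from being conflated in the learned profile.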

10

SYSTEM DESCRIPTION

Learning a Profile

• User selects titles (for example, those by a particular author)

- Need not perform a random scan of the entire database

• Users rate the selected titles on a scale of 1-10

• A naïve Bayesian text classifier is used to classify a book title as either positive (6-10) or negative (1-5)

• N training books $B_e$ ($1 \le e \le N$)

• Each has 2 real weights

- Positive weight $\alpha_{e1} = (r - 1)/9$

- Negative weight $\alpha_{e0} = 1 - \alpha_{e1}$

- r = user rating ($1 \le r \le 10$)
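The mapping from a 1-10 rating to the two example weights is simple enough to check by hand. A minimal sketch follows; the function names `rating_weights` and `is_positive` are hypothetical.

```python
from __future__ import annotations


def rating_weights(r: int) -> tuple[float, float]:
    """Return (positive_weight, negative_weight) for a rating r in 1..10."""
    if not 1 <= r <= 10:
        raise ValueError("rating must be between 1 and 10")
    positive = (r - 1) / 9       # 1 -> 0.0, 10 -> 1.0
    negative = 1.0 - positive    # the two weights always sum to 1
    return positive, negative


def is_positive(r: int) -> bool:
    """Ratings 6-10 count as positive, 1-5 as negative."""
    return r >= 6
```

A rating of 10 therefore counts fully toward the positive category and a rating of 1 fully toward the negative one, while a middling rating contributes partly to both.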

11

SYSTEM DESCRIPTION

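Parameters

• $P(c_j) = \sum_{e=1}^{N} \alpha_{ej} / N$

• $P(w_k \mid c_j, s_m) = \sum_{e=1}^{N} \alpha_{ej} \, n_{kem} / L(c_j, s_m)$

– Where $\alpha_{ej}$ is the weight of training book $B_e$ for category $c_j$, and $n_{kem}$ is the number of times word $w_k$ appears in example $B_e$ in slot $s_m$

– $L(c_j, s_m) = \sum_{e=1}^{N} \alpha_{ej} \, |d_m|$ denotes the total weighted length of the documents in category $c_j$ and slot $s_m$

– $d_m$ = the document (bag of words) in slot $s_m$ of a book; each book is represented as a vector of such documents, one per slot

• Strength – It measures how much more likely a word in a slot is to appear in a positively rated book than a negatively rated book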
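The parameter estimates above can be sketched in a few lines of Python. This is a rough illustration rather than LIBRA's implementation: the data layout (a rating plus a slot-to-word-list dictionary), the function names, the add-one smoothing that avoids division by zero, and the choice to express strength as the log of the probability ratio are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


def learn_profile(books):
    """books: list of (rating, {slot: [word, ...]}) pairs; category 1 = positive."""
    n = len(books)
    prior = [0.0, 0.0]                          # P(c_0), P(c_1)
    counts = defaultdict(lambda: [0.0, 0.0])    # (slot, word) -> weighted counts
    length = defaultdict(lambda: [0.0, 0.0])    # slot -> L(c_j, s_m)
    vocab = defaultdict(set)                    # slot -> distinct words seen
    for rating, slots in books:
        pos = (rating - 1) / 9
        alpha = (1.0 - pos, pos)                # (negative, positive) weights
        for j in (0, 1):
            prior[j] += alpha[j] / n
        for slot, words in slots.items():
            for j in (0, 1):
                length[slot][j] += alpha[j] * len(words)
            for word in words:
                vocab[slot].add(word)
                for j in (0, 1):
                    counts[(slot, word)][j] += alpha[j]
    return prior, dict(counts), dict(length), dict(vocab)


def word_prob(profile, slot, word, j):
    """P(word | c_j, slot), with add-one smoothing (an assumption of this sketch)."""
    _, counts, length, vocab = profile
    weighted = counts.get((slot, word), (0.0, 0.0))[j]
    return (weighted + 1.0) / (length[slot][j] + len(vocab[slot]))


def strength(profile, slot, word):
    """Log ratio of the word's probability in positive vs. negative books."""
    return math.log(word_prob(profile, slot, word, 1) /
                    word_prob(profile, slot, word, 0))


# Two hypothetical training books: one rated 10, one rated 1.
toy_books = [
    (10, {"WORDS": ["mars", "space"]}),
    (1, {"WORDS": ["romance", "space"]}),
]
toy_profile = learn_profile(toy_books)
```

With this toy data, a word seen only in the highly rated book gets a positive strength, a word seen in both gets a strength of zero, and a word seen only in the low-rated book gets a negative strength.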

12

Sample Positive Profile Features

Slot              Word           Strength
WORDS             ZUBRIN         9.85
WORDS             SMOLIN         9.39
WORDS             TREFIL         8.77
WORDS             DOT            8.67
SUBJECTS          COMPARATIVE    8.39
AUTHOR            D_GOLDSMITH    8.04
WORDS             ALH            7.97
WORDS             MANNED         7.97
RELATED TITLES    SETTLE         7.91

13

SYSTEM DESCRIPTION

Producing, Explaining and Revising Recommendations

• Once a profile is learned, it is used to predict the preferred ranking of the remaining books

• Recommendations are reviewed by the user, who may assign their own rating to any examples they believe are incorrectly ranked

• The system is retrained on the revised ratings, and the cycle is repeated several times to produce the best results
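One way to picture the ranking step: under a naive Bayes model, if strength is taken as the log of the positive-to-negative probability ratio (an assumption here), the log-odds that a book is positive equals the prior log-odds plus the sum of the strengths of its words. The sketch below ranks unrated books by that score. The function name, the book titles, and the negative strength value are hypothetical; the positive strengths are borrowed from the sample profile table.

```python
def rank_books(candidates, strengths, prior_log_odds=0.0):
    """Rank (title, {slot: [word, ...]}) pairs by summed word strength.

    Words missing from the strengths table contribute nothing, which is
    a simplification made for this sketch.
    """
    scored = []
    for title, slots in candidates:
        score = prior_log_odds
        for slot, words in slots.items():
            for word in words:
                score += strengths.get((slot, word), 0.0)
        scored.append((title, score))
    # Highest score first: the most strongly recommended book leads the list.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


sample_strengths = {
    ("WORDS", "ZUBRIN"): 9.85,
    ("WORDS", "MANNED"): 7.97,
    ("WORDS", "ROMANCE"): -3.0,   # hypothetical negative feature
}
sample_candidates = [
    ("Book A", {"WORDS": ["ROMANCE"]}),
    ("Book B", {"WORDS": ["ZUBRIN", "MANNED"]}),
    ("Book C", {"WORDS": ["MANNED"], "SUBJECTS": ["UNSEEN"]}),
]
sample_ranking = rank_books(sample_candidates, sample_strengths)
```

After the user corrects any ratings they disagree with, the corrected books would be added to the training set, the profile relearned, and the ranking recomputed.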

14

EXPERIMENTAL RESULTS

Data Collection

• Several data sets were assembled (LIT1, LIT2, MYST, SCI, SF)

• In order to present a quantitative picture of performance on a realistic sample, books were selected at random

• If the user was not familiar with a book, the user was asked to give a rating based on the information provided by the Amazon page describing the book

15


Performance Evaluation

• Performed 10-fold cross validation on the examples

• Various metrics were used to measure the performance

– Classification accuracy (Acc): The percentage of examples correctly classified as positive or negative

– Precision (Pr): The percentage of examples classified as positive that are actually positive
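The two metrics and the 10-fold split can be sketched as below. The function names and the round-robin fold assignment are assumptions for illustration; the original experiments may have partitioned the data differently.

```python
def accuracy(y_true, y_pred):
    """Percentage of examples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def precision(y_true, y_pred):
    """Percentage of examples predicted positive that are truly positive."""
    true_labels_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not true_labels_of_predicted_pos:
        return float("nan")  # undefined when nothing was predicted positive
    return 100.0 * sum(true_labels_of_predicted_pos) / len(true_labels_of_predicted_pos)


def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs; each index is tested once."""
    for fold in range(k):
        test = list(range(fold, n, k))      # round-robin assignment to folds
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test
```

In each fold the profile would be learned from the training indices only and scored on the held-out indices, with the metrics averaged over the 10 folds.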

16


Discussion

• User-selected examples vs. randomly selected examples

– User-selected examples are better as the user can accurately rate the selection

– Randomly selected examples tend to cover the complete dataset

• Conclusion – Avoid prematurely committing to a specific methodology

17


• Can Collaborative and Content-Based approaches be combined to produce better results?

• Slots – related authors, related titles

• When the above slots were removed, performance degraded

Use of both approaches together produces better results

18

FUTURE WORK

• Web-Based interface (with a larger body of users)

• Compare LIBRA’s Content-Based Approach to a standard Collaborative Approach

• Maximize the utility of the small training set by using various Machine Learning techniques

– Unsupervised learning

– Active learning (incremental approach)

• One effective approach – provide highly rated examples, generate initial recommendations, review the results, provide low rating for bad items and retrain the system to get new recommendations

19

CONCLUSIONS

• Content-Based Approach holds the promise of being able to effectively recommend items that have not been rated

• Provides accurate information without any background knowledge of other users' preferences

• Combining Collaborative techniques with the Content-Based Approach does provide better results

• www.cs.utexas.edu/users/ml/recommender.html

• Partially supported by NSF

20

QUESTIONS??
