
© Flytxt. All rights reserved. Confidential. 18 August 2014

Improving Collaborative Filtering Based Recommenders using Topic Modelling

Jobin Wilson (1), Santanu Chaudhury (2) & Brejesh Lall (2)

(1) R&D Department, Flytxt, Trivandrum, India
(2) Dept. of Electrical Engineering, IIT Delhi, India


Agenda

Recommender Systems Overview

Proposed Approach

Experiments & Results

Conclusion


Recommender Systems Overview

An information filtering technique that locates products, services or information relevant and interesting to users based on their historical preferences, utilizing the “wisdom of the crowds”

E.g. Editorial, Aggregates (top views, top downloads, recent), Personalized recommendations

Formally:

U = set of users
I = set of items
Utility function F: relates U to I through a rating R (e.g. 0-5 stars, a real number)
Task: for each user, estimate her preference for items she has not yet seen, given all the existing user ratings.


Design Matrix

Sparse matrix representing each user’s preference for each item.

Algorithms need to predict values for empty cells based on the available cell values

The denser the matrix, the better the quality of recommendations

User | i1   i2   i3   i4   i5
u1   |      r12       r14  r15
u2   | r21  r22            r25
u3   |      r32       r34
u4   |           r43       r45
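A minimal sketch of the design matrix above, assuming a dictionary-of-dictionaries representation. The rating values are made up for illustration; only the pattern of filled cells follows the table. The helper names `density` and `empty_cells` are hypothetical.

```python
# Sparse user-item design matrix stored as a dict of dicts.
# Rating values are hypothetical; only the filled cells mirror the table.
ratings = {
    "u1": {"i2": 4.0, "i4": 3.0, "i5": 5.0},
    "u2": {"i1": 2.0, "i2": 5.0, "i5": 1.0},
    "u3": {"i2": 3.0, "i4": 4.0},
    "u4": {"i3": 5.0, "i5": 2.0},
}
items = ["i1", "i2", "i3", "i4", "i5"]


def density(ratings, items):
    """Fraction of user-item cells that hold a rating."""
    filled = sum(len(row) for row in ratings.values())
    return filled / (len(ratings) * len(items))


def empty_cells(ratings, items):
    """(user, item) pairs with no rating, i.e. the cells to be predicted."""
    return [(u, i) for u, row in ratings.items() for i in items if i not in row]


print(density(ratings, items))            # 10 filled cells out of 20
print(len(empty_cells(ratings, items)))   # cells left to predict
```

Real rating matrices are far sparser than this toy example, which is why only the observed cells are stored.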


Taxonomy of Recommender Algorithms

Collaborative Filtering (CF) Based

• Neighborhood Based (e.g. User Based, Item Based)

• Latent Factor Based (e.g. SVD; factorization of the user-item rating matrix to determine latent properties of users and items)

Content Based

• Constructing a user profile from history and matching content profiles to the learned user profile

Hybrid

• Combining multiple approaches to improve quality of recommendations


Proposed Approach

Intuition

• In many domains, considerable contextual data in text form is available, describing the items being recommended (e.g. movies, e-commerce)

• Standard CF algorithms do not consider latent properties of users/items which may be influencing a user’s rating decision on items

• Discovering such latent properties of users/items helps to address the sparsity problem, as similarity calculation is possible even if there are no overlapping ratings between a pair of users

Approach Overview

• Discover user profiles in a latent topic space by leveraging contextual data in text form and each user’s historical ratings.

• Build a hybrid neighborhood based on a similarity score that considers latent topic space similarity as well as rating overlap based similarity

• Recommend items not yet picked by the user based on their popularity within the hybrid neighborhood
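The last step of the overview can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "popularity" is taken to mean the number of neighbors who rated an item, the neighborhood is passed in as a ready-made list, and the name `recommend_top_k` is hypothetical.

```python
from collections import Counter


def recommend_top_k(user, neighbors, ratings, k):
    """Rank items the user has not rated by how many neighbors rated them.

    Assumption: popularity = count of neighbors with a rating for the item.
    """
    seen = set(ratings.get(user, {}))
    popularity = Counter()
    for neighbor in neighbors:
        for item in ratings.get(neighbor, {}):
            if item not in seen:
                popularity[item] += 1
    # Sort by count (descending), then by item id so ties are deterministic.
    ranked = sorted(popularity.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical toy data: u2 and u3 form the neighborhood of u1.
toy_ratings = {
    "u1": {"i1": 5},
    "u2": {"i1": 4, "i2": 5, "i3": 3},
    "u3": {"i2": 4, "i4": 2},
}
print(recommend_top_k("u1", ["u2", "u3"], toy_ratings, k=2))
```

A weighted variant could sum neighbor ratings or similarity scores instead of raw counts; the slides do not say which definition was used.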


Process Flow

[Flow diagram] Contextual Data Collection & Pre-processing: Extract Item Text Descriptions → Generate Item txt files (e.g. plot + genre for movies) → LDA on Item txt files → Item-Topic and User-Topic distributions → Topic similarity based Nbhd.

Rating Data → Rating based Nbhd.

Rating based Nbhd. + Topic similarity based Nbhd. → Generate Hybrid User Neighborhood → Hybrid User Neighborhood Based Recommender → Top-K un-rated items in Nbhd → Collect IR-Stats


Topic Modelling and LDA

LDA is a generative probabilistic model for analyzing large discrete datasets such as text corpora

Each document is represented as a random mixture of topics, which are latent

Each topic is represented as a distribution over words

Documents and the words within them are observable; the model has to infer the document-topic distributions and topic-word distributions

E.g. “Yesterday’s cricket match was good, we played well.” => More of “Sports”, less of “Negotiation”
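The generative story behind LDA can be sketched with the standard library alone. This is a toy illustration, assuming two hand-written topics and a six-word vocabulary (all hypothetical); it only generates a document and does not perform the inference step that recovers topics from real text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: pick a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-topic distribution
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


# Hypothetical vocabulary and topic-word distributions.
vocab = ["cricket", "match", "played", "contract", "offer", "deal"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "Sports"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "Negotiation"-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates `theta` for every document and the rows of `topic_word`.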


Discovering User Profile in Latent Topic Space

Load the matrix I, formed by the item-topic distribution vectors of all item documents, into memory.

For each user U, look up the items she has expressed interest in and load them into a list L.

Initialize U’s topic-distribution vector to zeros.

For each item i in L (each item the current user has expressed interest in), add the topic-distribution vector for i, multiplied by U’s rating for i normalized by the sum of all ratings from U, to U’s topic-distribution vector.

Persist U’s topic-distribution vector as her user profile.

Summary: For each user, sum the item-topic distributions of the items she has rated, each weighted by her normalized rating, to obtain her topic-distribution vector, which is her user profile in the latent topic space.
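The procedure above can be sketched in a few lines. The function name, item ids and topic vectors are hypothetical, and the sketch assumes every rated item has an item-topic vector available.

```python
def build_user_profile(user_ratings, item_topics, n_topics):
    """Rating-weighted sum of item-topic vectors for one user.

    user_ratings: {item_id: rating} for the items the user rated.
    item_topics:  {item_id: topic distribution of length n_topics}.
    """
    profile = [0.0] * n_topics
    total = sum(user_ratings.values())
    if total == 0:
        return profile  # no usable ratings: leave the profile at zeros
    for item, rating in user_ratings.items():
        weight = rating / total  # rating normalized by the user's rating sum
        for t, p in enumerate(item_topics[item]):
            profile[t] += weight * p
    return profile


# Hypothetical item-topic distributions over 3 topics.
item_topics = {"m1": [0.8, 0.2, 0.0], "m2": [0.1, 0.1, 0.8]}
profile = build_user_profile({"m1": 4, "m2": 1}, item_topics, n_topics=3)
print(profile)  # approximately [0.66, 0.18, 0.16]
```

Because the weights sum to one and each item vector is a probability distribution, the resulting profile is itself a distribution over topics.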


User Similarity Functions

Latent Topic Space Similarity,

Rating Overlap Based Similarity

We could use standard Pearson-correlation similarity or cosine similarity as well

Hybrid Similarity Function,
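The exact formulas are not reproduced in this text, so the following is only a sketch under assumptions: cosine similarity stands in for the latent topic space similarity, the Jaccard coefficient over rated item sets stands in for the rating overlap similarity, and a linear blend with a hypothetical weight `alpha` stands in for the hybrid function.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def overlap_similarity(items_u, items_v):
    """Jaccard coefficient of the two users' rated item sets (an assumption)."""
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0


def hybrid_similarity(profile_u, profile_v, items_u, items_v, alpha=0.5):
    """Linear blend of the two similarities; alpha is a hypothetical weight."""
    topic_sim = cosine_similarity(profile_u, profile_v)
    rating_sim = overlap_similarity(items_u, items_v)
    return alpha * topic_sim + (1 - alpha) * rating_sim


# Two users with no rated items in common but similar topic profiles.
score = hybrid_similarity([0.66, 0.18, 0.16], [0.6, 0.2, 0.2], {"m1", "m2"}, {"m3"})
print(score)
```

The example shows the sparsity argument in miniature: the rating overlap is zero, yet the pair still receives a non-zero hybrid score through the topic space.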


Datasets for Our Experiments

Movies.dat - movielens
ratings.dat - movielens
Plot.list - IMDB

Movielens 1M dataset (6040 users and 3706 movies)

Subset of Netflix Dataset with 2M ratings (5000 users and 17770 movies)

Metadata from IMDB interfaces & OMDBAPI


Topic - Keyword Distribution Sample - Movielens-IMDB

T0 black men man white
T1 apartment women boyfriend
T2 time kill train buddy deal
T3 story drama history real stories
T4 day face fate actions prove led
T5 father son mother child
T6 secret agent fbi government thriller
T7 british american indian africa
T8 children animation adventures musical
T9 job girlfriend worse
T10 evil king princess magic prince fantasy
T11 money fortune hard million
T12 police crime drug mafia
T13 documentary music band stage
T14 horror death mysterious ghost
T15 bond james british agent cia
T16 comedy kids vacation
T17 murder police thriller detective
T18 brother make academy lassard
T19 drama lawyer court attorney
T20 world group action battle
T21 town local people
T22 priest god church angel
T23 york city manhattan phone
T24 school friends girl college
T25 house home hill mansion
T26 wayne bruce batman gotham
T27 show comedy television network
T28 car truck steal run accident
T29 lives drama relationship childhood
T30 love romance marry girl
T31 prison escape jail
T32 ship crew island sea
T33 hospital suicide doctor psychiatrist
T34 plane airport flight rescue
T35 wife husband affair sexual
T36 friend private insurance
T37 coach player basketball winning
T38 drama death accident lonely
T39 comedy great farm
T40 company business career working
T41 find brothers gang members
T42 family daughter drama home
T43 professor scientist research doctor
T44 good home mind change
T45 war army vietnam nazi
T46 london paris england french
T47 friendship relationship
T48 night friends party stay
T49 fi sci planet alien


Sample Item-Topic Distributions Discovered

[Figure: item-topic distributions for Tomorrow Never Dies and Schindler's List]


Sample User Profiles Discovered

[Figure: user profiles for Movie User 5989 and Movie User 5988]


Results : Movielens 1M

Standard Item Based CF performs the worst, with precision values well below 1%

Standard User Based CF yields precision values below 5%

Proposed HUNR performs the best, with precision at 5 above 31%

Recall at 30 indicates HUNR is able to retrieve > 25% of the relevant items, whereas standard User Based CF is able to retrieve < 5% of the relevant items

F-measure analysis also confirms that HUNR significantly outperforms standard CF techniques
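For reference, precision, recall and F-measure at K can be computed as sketched below. This uses the common textbook definitions; how the relevant set was built in these experiments is not stated here, so the item ids and the relevant set are hypothetical.

```python
def precision_recall_f_at_k(recommended, relevant, k):
    """Precision@K, recall@K and their harmonic mean for one user.

    recommended: ranked list of item ids; relevant: set of held-out relevant ids.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical ranked recommendations and relevant set for one user.
p, r, f = precision_recall_f_at_k(["a", "b", "c", "d", "e"], {"a", "c", "x", "y"}, k=5)
print(p, r, f)  # 2 hits: precision 0.4, recall 0.5
```

Per-user values are typically averaged over all test users to obtain the curves plotted against K.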


[Figures: Precision, Recall and F-measure vs. K (K = 5, 10, 20, 30, 50, 75) on Movielens 1M for UBCF(LL), UBCF(P), IBCF(LL), IBCF(P), HUNR and UTNR]


Results : Netflix 2M

Standard Item Based CF performs the worst, with precision values well below 1%

Standard User Based CF yields precision values around 10%, whereas proposed HUNR performs the best, with precision at 5 above 38%

Recall at 75 indicates that HUNR is able to retrieve around 24% of the relevant items, whereas standard User Based CF is able to retrieve < 9% of the relevant items

F-measure analysis also indicates that HUNR performs much better than standard CF


[Figures: Precision, Recall and F-measure vs. K (K = 5, 10, 20, 30, 50, 75) on Netflix 2M for UBCF(LL), UBCF(P), IBCF(LL), HUNR and UTNR]



Conclusion

We proposed a novel hybrid recommender approach using LDA, utilizing similarity of users in a latent topic space along with rating overlap based similarity to refine neighborhood formation and improve the quality of recommendations.

Empirical evaluations indicate that the technique is well suited for recommender domains where contextual data describing the recommended items is available in text form.

The proposed approach significantly outperforms standard CF algorithms, which make use of rating data alone for generating recommendations.




Thank You

www.flytxt.com



Backup Slides


LDA – As a General Graphical Model
