improving collaborative filtering based recommenders using topic modelling
Confidential. © Flytxt. All rights reserved. 18 August 2014
Improving Collaborative Filtering Based Recommenders using Topic Modelling
Jobin Wilson (1), Santanu Chaudhury (2) & Brejesh Lall (2)
(1) R&D Department, Flytxt, Trivandrum, India
(2) Dept. of Electrical Engineering, IIT Delhi, India
Agenda
Recommender Systems Overview
Proposed Approach
Experiments & Results
Conclusion
Recommender Systems Overview
An information filtering technique that locates products, services or information that are relevant and exciting to users based on their historical preferences, utilizing the “wisdom of the crowds”
E.g. editorial picks, aggregates (top views, top downloads, recent), personalized recommendations
Formally:
U = set of users
I = set of items
Utility function F: relates U to I through a rating R (e.g. 0-5 stars, a real number)
Task: for each user, estimate her preference for items she has not yet seen, given all the existing user ratings.
Design Matrix
A sparse matrix representing each user’s preference for each item.
Algorithms need to predict values for the empty cells based on the available cell values.
The denser the matrix, the better the quality of recommendations.
User | Item   i1    i2    i3    i4    i5
u1            -     r12   -     r14   r15
u2            r21   r22   -     -     r25
u3            -     r32   -     r34   -
u4            -     -     r43   -     r45
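As a minimal sketch, the design matrix above can be stored as one rating dictionary per user, so that empty cells take no space. The numeric ratings are made-up placeholders for the symbolic entries (r12, r14, ...), and the helper names `density` and `empty_cells` are hypothetical.

```python
# Sparse design matrix: only observed cells are stored.
# Rating values are illustrative placeholders, not data from the experiments.
ratings = {
    "u1": {"i2": 4.0, "i4": 3.0, "i5": 5.0},
    "u2": {"i1": 2.0, "i2": 5.0, "i5": 1.0},
    "u3": {"i2": 3.0, "i4": 4.0},
    "u4": {"i3": 5.0, "i5": 2.0},
}
items = ["i1", "i2", "i3", "i4", "i5"]


def density(ratings, items):
    """Fraction of user-item cells that hold an observed rating."""
    filled = sum(len(row) for row in ratings.values())
    return filled / (len(ratings) * len(items))


def empty_cells(ratings, items):
    """(user, item) pairs whose rating a recommender would have to predict."""
    return [(u, i) for u, row in ratings.items() for i in items if i not in row]


print(density(ratings, items))           # share of filled cells
print(len(empty_cells(ratings, items)))  # number of cells to predict
```

In this toy matrix half of the cells are filled; real rating datasets are usually far sparser.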
Taxonomy of Recommender Algorithms
Collaborative Filtering (CF) Based
• Neighborhood Based (e.g. User Based, Item Based)
• Latent Factor Based (e.g. SVD; factorization of the user-item rating matrix to determine latent properties of users and items)
Content Based
• Constructing a user profile from history and matching content profiles to the learned user profile
Hybrid
• Combining multiple approaches to improve the quality of recommendations
Proposed Approach
Intuition
• In many domains, considerable contextual data in text form is available, describing the items being recommended (e.g. movies, e-commerce)
• Standard CF algorithms do not consider latent properties of users/items which may be influencing a user’s rating decision on items
• Discovering such latent properties of users/items helps to address the sparsity problem, as similarity calculation is possible even if there are no overlapping ratings between a pair of users
Approach Overview
• Discover user profiles in a latent topic space by leveraging contextual data in text form and each user’s historical ratings.
• Build a hybrid neighborhood based on a similarity score that considers latent topic space similarity as well as rating overlap based similarity
• Recommend items not yet picked by the user based on their popularity within the hybrid neighborhood
Process Flow
Content branch
• Contextual data collection & pre-processing
• Extract item text descriptions
• Generate item txt files (e.g. plot + genre for movies)
• LDA on item txt files
• Item-topic and user-topic distributions
• Topic similarity based neighborhood
Rating branch
• Rating data
• Rating based neighborhood
Combined
• Generate hybrid user neighborhood
• Hybrid User Neighborhood Based Recommender
• Top-K un-rated items in the neighborhood
• Collect IR stats
Topic Modelling and LDA
LDA is a generative probabilistic model for analyzing large discrete datasets such as text corpora
Each document is represented as a random mixture of topics, which are latent
Each topic is represented as a distribution over words
Documents and the words within them are observed; the model has to infer the document-topic distributions and the topic-word distributions
E.g. “Yesterday’s cricket match was good, we played well.” => More of “Sports”, less of “Negotiation”
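To make the generative view concrete, here is a small sketch of how LDA assumes a document is produced. It samples a document-topic mixture from a Dirichlet prior and then draws a topic and a word for each position. The two topics, their word probabilities and the prior `alpha` are invented for illustration; in full LDA the topic-word distributions are also Dirichlet draws, and fitting the model means inferring these hidden quantities from observed documents rather than sampling them.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    topics = list(topic_word)
    theta = sample_dirichlet(alpha, rng)  # document-topic distribution
    words = []
    for _ in range(length):
        z = rng.choices(topics, weights=theta)[0]  # latent topic of this word
        vocab = list(topic_word[z])
        weights = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])  # observed word
    return dict(zip(topics, theta)), words


# Illustrative topic-word distributions (fixed here, not learned).
topic_word = {
    "sports": {"cricket": 0.4, "match": 0.3, "played": 0.3},
    "negotiation": {"deal": 0.5, "offer": 0.3, "terms": 0.2},
}
rng = random.Random(0)
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```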
Discovering User Profile in Latent Topic Space
Load the matrix I, formed by the item-topic distribution vectors of all item documents, into memory.
For each user U, look up the items that she has expressed interest in and load them into a list L.
Initialize U’s topic-distribution vector to zeros.
For each item i in L (each item that the current user has expressed interest in):
Add the topic-distribution vector of i, multiplied by U’s rating of i normalized by the sum of all of U’s ratings, to U’s topic-distribution vector.
Persist U’s topic-distribution vector as her user profile.
Summary: add up the item-topic distributions of the items a user is interested in, each multiplied by the normalized user rating, to generate the user’s topic-distribution vector, which represents her profile in the latent topic space.
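The procedure above can be sketched in a few lines. The function name `user_profile`, the toy item-topic vectors and the ratings are hypothetical; persistence is left out.

```python
def user_profile(user_ratings, item_topics, n_topics):
    """Rating-weighted sum of item-topic vectors for one user."""
    profile = [0.0] * n_topics  # start from a zero vector
    total = sum(user_ratings.values())
    if total == 0:
        return profile  # no usable ratings: leave the profile at zeros
    for item, rating in user_ratings.items():
        weight = rating / total  # rating normalized by the user's rating sum
        for t, p in enumerate(item_topics[item]):
            profile[t] += weight * p
    return profile


# Illustrative item-topic distributions over three topics.
item_topics = {"m1": [0.8, 0.2, 0.0], "m2": [0.1, 0.1, 0.8]}
profile = user_profile({"m1": 4, "m2": 1}, item_topics, n_topics=3)
print(profile)
```

Because the normalized ratings sum to 1 and each item-topic vector sums to 1, the resulting profile is itself a probability distribution over topics.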
User Similarity Functions
Latent Topic Space Similarity,
Rating Overlap Based Similarity
We could use standard Pearson-correlation similarity or cosine similarity as well
Hybrid Similarity Function,
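The exact formulas are not reproduced in this transcript, so the following is only a sketch of how the three functions could fit together. Every concrete choice is an assumption: cosine similarity for the latent topic space, Jaccard overlap of rated item sets for the rating overlap, and a linear blend with a hypothetical mixing weight `alpha`.

```python
import math


def topic_similarity(p, q):
    """Cosine similarity between two user topic vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def overlap_similarity(items_u, items_v):
    """Jaccard overlap of the two users' rated item sets (assumed measure)."""
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0


def hybrid_similarity(p, q, items_u, items_v, alpha=0.5):
    """Weighted blend of both scores; `alpha` is a hypothetical mixing weight."""
    return (alpha * topic_similarity(p, q)
            + (1 - alpha) * overlap_similarity(items_u, items_v))


# Two users with identical topic profiles but no co-rated items.
score = hybrid_similarity([0.5, 0.5], [0.5, 0.5], {"i1", "i2"}, {"i3"})
print(score)
```

The usage example shows why the hybrid score helps with sparsity: the two users share no rated items, yet their topic profiles still yield a non-zero similarity.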
Datasets for Our Experiments
Movies.dat - movielens
ratings.dat - movielens
Plot.list - IMDB
Movielens 1M dataset (6040 users and 3706 movies)
Subset of Netflix Dataset with 2M ratings (5000 users and 17770 movies)
Metadata from IMDB interfaces & OMDBAPI
Topic - Keyword Distribution Sample - Movielens-IMDB
T0 black men man white
T1 apartment women boyfriend
T2 time kill train buddy deal
T3 story drama history real stories
T4 day face fate actions prove led
T5 father son mother child
T6 secret agent fbi government thriller
T7 british american indian africa
T8 children animation adventures musical
T9 job girlfriend worse
T10 evil king princess magic prince fantasy
T11 money fortune hard million
T12 police crime drug mafia
T13 documentary music band stage
T14 horror death mysterious ghost
T15 bond james british agent cia
T16 comedy kids vacation
T17 murder police thriller detective
T18 brother make academy lassard
T19 drama lawyer court attorney
T20 world group action battle
T21 town local people
T22 priest god church angel
T23 york city manhattan phone
T24 school friends girl college
T25 house home hill mansion
T26 wayne bruce batman gotham
T27 show comedy television network
T28 car truck steal run accident
T29 lives drama relationship childhood
T30 love romance marry girl
T31 prison escape jail
T32 ship crew island sea
T33 hospital suicide doctor psychiatrist
T34 plane airport flight rescue
T35 wife husband affair sexual
T36 friend private insurance
T37 coach player basketball winning
T38 drama death accident lonely
T39 comedy great farm
T40 company business career working
T41 find brothers gang members
T42 family daughter drama home
T43 professor scientist research doctor
T44 good home mind change
T45 war army vietnam nazi
T46 london paris england french
T47 friendship relationship
T48 night friends party stay
T49 fi sci planet alien
Sample Item-Topic Distributions Discovered
[Figure: item-topic distributions for “Tomorrow Never Dies” and “Schindler's List”]
Sample User Profiles Discovered
[Figure: latent topic space profiles for movie users 5989 and 5988]
Results : Movielens 1M
Standard Item Based CF performs the worst, with precision values well below 1%
Standard User Based CF yields precision values below 5%
The proposed HUNR (Hybrid User Neighborhood Based Recommender) performs the best, with precision at 5 above 31%
Recall at 30 indicates that HUNR retrieves > 25% of the relevant items, whereas standard User Based CF retrieves < 5% of the relevant items
F-measure analysis also confirms that HUNR significantly outperforms standard CF techniques
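For reference, precision, recall and F-measure at a cut-off K can be computed as sketched below. The slides do not state how relevance was defined in the experiments, so the sketch assumes a set of held-out items per user counts as relevant and takes F-measure to be the harmonic mean of precision and recall; the function name and the sample lists are hypothetical.

```python
def precision_recall_f(recommended, relevant, k):
    """Precision, recall and F-measure of the top-k recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of precision and recall; zero when there are no hits.
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Illustrative ranked recommendations and held-out relevant items for one user.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x", "y"}
p, r, f = precision_recall_f(recommended, relevant, k=5)
print(p, r, f)
```

Averaging these per-user values over all test users at each K would give curves like those reported for K = 5 to 75.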
[Figure: Precision, Recall and F-measure versus K (K = 5, 10, 20, 30, 50, 75) on Movielens 1M for UBCF(LL), UBCF(P), IBCF(LL), IBCF(P), HUNR and UTNR]
Results : Netflix 2M
Standard Item Based CF performs the worst, with precision values well below 1%
Standard User Based CF yields precision values around 10%, whereas the proposed HUNR performs the best, with precision at 5 above 38%.
Recall at 75 indicates that HUNR retrieves around 24% of the relevant items, whereas standard User Based CF retrieves < 9% of the relevant items
F-measure analysis also indicates that HUNR performs much better than standard CF
[Figure: Precision, Recall and F-measure versus K (K = 5, 10, 20, 30, 50, 75) on Netflix 2M for UBCF(LL), UBCF(P), IBCF(LL), HUNR and UTNR]
Conclusion
We proposed a novel hybrid recommender approach using LDA, utilizing similarity of users in a latent topic space along with rating overlap based similarity to refine neighborhood formation and improve the quality of recommendations.
Empirical evaluations indicate that the technique is well suited for recommender domains that have contextual data available in text form, describing the items being recommended.
The proposed approach significantly outperforms standard CF algorithms, which make use of rating data alone for generating recommendations.
Thank You
www.flytxt.com
Backup Slides
LDA – As a General Graphical Model