using graph partitioning techniques for neighbour selection in user-based collaborative filtering

1
1 Using Graph Partitioning Techniques for Neighbour Selection in User-Based Collaborative Filtering Alejandro Bellogín 1 , Javier Parapar 2 [email protected], [email protected] 1 Information Retrieval Group, Department of Computer Science, Universidad Autónoma de Madrid 2 Information Retrieval Lab, Department of Computer Science, University of A Coruña Normalised Cut for Collaborative Filtering The outline of the spectral clustering algorithm Normalised Cut (NCut) is as follows [3]: 1. Create a graph G =(V,E,W ) from the data collection to cluster the users, represented as the vertices V in the graph G 2. Solve a trace minimisation problem involving the weights W and the Laplacian matrix of the graph 3. The eigenvectors of the optimal matrix allow a projection of each data instance in R k based on its similarity to the other instances 4. Cluster the resulting projected data points 5. Backtrace the grouping found in the previous step to the original data points, obtaining the final outcome of the clustering algorithm In Collaborative Filtering, users play the role of the nodes of the graph to cut. ˜ r (u, i) = v NC (u) sim(u, v )r (v,i) v NC (u) |sim(u, v )| NC (u) outputs the elements who belong to the same cluster as the target user u sim(u, v ): similarity between users u and v r (v,i): rating given by user v to item i Notation: NC+P when sim(u, v ) is Pearson and NC when sim(u, v )=1 Introduction Neighbourhood identification in memory-based CF algorithms is based on selecting those users who are most similar to the active user according to a certain similarity metric, according to the principle that a particular user’s rating records are not equally useful to all other users as input to provide them with item suggestions [2]. Hypothesis: cluster-based CF techniques may be improved by using spectral clus- tering methods, instead of old-fashioned clustering methods such as k-Means or hierarchical clustering. Experiment: a spectral clustering tech- nique (NCut) has been introduced into a cluster-based CF algorithm which outper- forms other standard techniques in terms of ranking precision. Experiments and Results Evaluation methodology: TestItems [1] (for each user a ranking is generated by predicting a score for every item in the test set). Baselines: UB (user-based CF with Pearson’s correlation as similarity measure), MF (matrix factorization algorithm with a latent space of dimension 50), kM and kM+P (user-based CF using k-Means clustering alone or with Pearson’s similarity). Dataset: Movielens 100K. Method P@5 Coverage MF 0.081 u 100% UB 0.026 100% NC+P 50 (100) 0.075 u 100.00% NC+P 100 (150) 0.101 mu 97.82% NC+P 150 (200) 0.111 mu 93.68% NC+P 200 (250) 0.112 mu 79.74% NC+P 250 (300) 0.111 mu 69.06% NC+P 300 (350) 0.108 mu 59.26% NC+P 350 (400) 0.103 mu 54.25% NC+P 400 (450) 0.094 mu 43.36% NC+P 450 (500) 0.086 u 40.52% NC+P 500 (550) 0.079 u 35.95% NC+P 550 (600) 0.079 u 32.03% NC+P 600 (650) 0.079 u 25.93% 0 0.02 0.04 0.06 0.08 0.10 0.12 100 200 300 400 500 600 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% P @ 5 C o v e r a g e Number of clusters NC NC+P cov(NC) kM kM+P cov(kM) UB 4 4 4 4 4 4 4 4 4 4 4 4 4 MF N N N N N N N N N N N N N Performance results. In brackets, the number of eigenvectors used in the Normalised Cut for each cluster of size k . Sensitivity of the performance and coverage to the number of clusters. Conclusions The use of Normalised Cut (NCut) as a clustering method (based on graph partition- ing) for exploiting the neighbourhoods outperform other state-of-the-art approaches. The improvement in performance is consistent even when no similarity is used (method NC). The coverage of our method (measured as the number of users for which the system can recommend items) is higher than other clustering methods in similar conditions. In the future, we aim to investigate also item clusters along with item-based approaches and alternative clustering methods. References [1] B ELLOGÍN , A., C ASTELLS , P., AND C ANTA - DOR , I. Precision-oriented evaluation of rec- ommender systems: an algorithmic compari- son. In RecSys (2011), pp. 333–336. [2] D ESROSIERS , C., AND K ARYPIS , G. A com- prehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. 2011, pp. 107–144. [3] S HI , J., AND M ALIK , J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 8 (2000), 888–905. RecSys 2012, 6th ACM Conference on Recommender Systems. September 9-13, 2012. Dublin, Ireland.

Upload: alejandro-bellogin

Post on 27-Jun-2015

898 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Graph Partitioning Techniques for Neighbour Selection in User-Based Collaborative Filtering

1

Using Graph Partitioning Techniques for NeighbourSelection in User-Based Collaborative FilteringAlejandro Bellogín1, Javier Parapar2

[email protected], [email protected] Retrieval Group, Department of Computer Science, Universidad Autónoma de Madrid2Information Retrieval Lab, Department of Computer Science, University of A Coruña

Normalised Cut for Collaborative FilteringThe outline of the spectral clustering algorithm Normalised Cut (NCut) is as follows [3]:

1. Create a graph G = (V,E,W ) from the data collection to cluster the users, representedas the vertices V in the graph G

2. Solve a trace minimisation problem involving the weights W and the Laplacian matrixof the graph

3. The eigenvectors of the optimal matrix allow a projection of each data instance in Rk

based on its similarity to the other instances4. Cluster the resulting projected data points5. Backtrace the grouping found in the previous step to the original data points, obtaining

the final outcome of the clustering algorithm

In Collaborative Filtering, users play the role of the nodes of the graph to cut.

r̃(u, i) =

∑v∈NC(u) sim(u, v)r(v, i)∑

v∈NC(u) |sim(u, v)|

• NC(u) outputs the elements who belong to the same cluster as the target user u• sim(u, v): similarity between users u and v

• r(v, i): rating given by user v to item i

Notation: NC+P when sim(u, v) is Pearson and NC when sim(u, v) = 1

Introduction• Neighbourhood identification in

memory-based CF algorithms is basedon selecting those users who are mostsimilar to the active user according toa certain similarity metric, according tothe principle that a particular user’s ratingrecords are not equally useful to all otherusers as input to provide them with itemsuggestions [2].

• Hypothesis: cluster-based CF techniquesmay be improved by using spectral clus-tering methods, instead of old-fashionedclustering methods such as k-Means orhierarchical clustering.

• Experiment: a spectral clustering tech-nique (NCut) has been introduced into acluster-based CF algorithm which outper-forms other standard techniques in termsof ranking precision.

Experiments and ResultsEvaluation methodology: TestItems [1] (for each user a ranking is generated by predicting a score for every item in the test set).Baselines: UB (user-based CF with Pearson’s correlation as similarity measure), MF (matrix factorization algorithm with a latent space ofdimension 50), kM and kM+P (user-based CF using k-Means clustering alone or with Pearson’s similarity).Dataset: Movielens 100K.

Method P@5 CoverageMF 0.081u 100%UB 0.026 100%NC+P 50 (100) 0.075u 100.00%NC+P 100 (150) 0.101mu 97.82%NC+P 150 (200) 0.111mu 93.68%NC+P 200 (250) 0.112mu 79.74%NC+P 250 (300) 0.111mu 69.06%NC+P 300 (350) 0.108mu 59.26%NC+P 350 (400) 0.103mu 54.25%NC+P 400 (450) 0.094mu 43.36%NC+P 450 (500) 0.086u 40.52%NC+P 500 (550) 0.079u 35.95%NC+P 550 (600) 0.079u 32.03%NC+P 600 (650) 0.079u 25.93%

0

0.02

0.04

0.06

0.08

0.10

0.12

100 200 300 400 500 6000%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

P@5

Coverage

Number of clusters

NC

�� �

� �

��

�� �

NC+P

◦ ◦ ◦ ◦◦

◦◦ ◦ ◦

cov(NC)

kM

�� � �

� ��

� � ��

kM+P

• • • •• • • •

• ••

cov(kM)

UB

4 4 4 4 4 4 4 4 4 4 4 4

4

MF

N N N N N N N N N N N N

N

Performance results. In brackets, the number of eigenvectorsused in the Normalised Cut for each cluster of size k.

Sensitivity of the performance and coverage to the number ofclusters.

Conclusions• The use of Normalised Cut (NCut) as a clustering method (based on graph partition-

ing) for exploiting the neighbourhoods outperform other state-of-the-art approaches.

• The improvement in performance is consistent even when no similarity is used(method NC).

• The coverage of our method (measured as the number of users for which the systemcan recommend items) is higher than other clustering methods in similar conditions.

In the future, we aim to investigate also item clusters along with item-based approaches andalternative clustering methods.

References[1] BELLOGÍN, A., CASTELLS, P., AND CANTA-

DOR, I. Precision-oriented evaluation of rec-ommender systems: an algorithmic compari-son. In RecSys (2011), pp. 333–336.

[2] DESROSIERS, C., AND KARYPIS, G. A com-prehensive survey of neighborhood-basedrecommendation methods. In RecommenderSystems Handbook. 2011, pp. 107–144.

[3] SHI, J., AND MALIK, J. Normalized cutsand image segmentation. IEEE Trans. PatternAnal. Mach. Intell. 22, 8 (2000), 888–905.

RecSys 2012, 6th ACM Conference on Recommender Systems. September 9-13, 2012. Dublin, Ireland.