

Measuring the Concentration Reinforcement (Popularity) Bias of Recommender Systems

Panagiotis Adamopoulos, Alexander Tuzhilin, Peter Mountanos
Leonard N. Stern School of Business, New York University

{padamopo,atuzhili}@stern.nyu.edu, [email protected]

ABSTRACT
◦ We propose new metrics to accurately measure the concentration reinforcement of recommender systems and the enhancement of the “long tail”.
◦ We also conduct a comparative analysis of various recommender system algorithms, illustrating the usefulness of the proposed metrics.

MOTIVATION
◦ There is increasing interest in metrics that go beyond predictive accuracy [1, 2].
◦ Simply evaluating the generated recommendation lists in terms of dispersion and inequality does not provide any information about the concentration reinforcement and popularity bias of the recommendations:
◦ whether popular or long-tail items are more likely to be recommended.
◦ Existing metrics do not consider the prior popularity of the candidate items.

RELATED WORK
◦ Various metrics measure popularity biases without taking into consideration the prior popularity of items:
◦ Catalog coverage, aggregate diversity,
◦ Gini coefficient, Hoover index, Lorenz curve, etc.
◦ [3, 4] employ a popularity reinforcement measure $M$ to assess whether a RS follows or changes the prior popularity of items when generating recommendations.
◦ They measure the proportion of items that changed from “long-tail” in terms of prior sales to popular in terms of recommendation frequency as:

$$M = 1 - \sum_{i=1}^{K} \pi_i \rho_{ii},$$

where the vector $\pi$ denotes the initial distribution over the $K$ popularity categories and $\rho_{ii}$ the probability of staying in category $i$, given that $i$ was the initial category.
◦ The popularity categories, labeled “head” and “tail”, are based on the Pareto principle; hence, the “head” category contains the top 20% of items and the “tail” category the remaining 80%.
◦ However, this metric entails an arbitrary selection of popularity categories.
◦ Moreover, all items in the same popularity category contribute equally to the metric, despite any differences in popularity (see the sketch below).
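For concreteness, the following is a minimal Python sketch of how $M$ could be computed from per-item counts under the Pareto split above. It assumes $\pi$ is the share of items in each category; the function and variable names (popularity_reinforcement_M, prior_counts, rec_counts) are ours, not from [3, 4].

```python
import numpy as np

def popularity_reinforcement_M(prior_counts, rec_counts, head_frac=0.2):
    """Share of items that switch popularity category: M = 1 - sum_i pi_i * rho_ii."""
    n = len(prior_counts)
    k = max(1, int(head_frac * n))                        # size of the "head" (Pareto 20%)
    head_prior = set(np.argsort(prior_counts)[::-1][:k])  # top items by prior sales
    head_rec = set(np.argsort(rec_counts)[::-1][:k])      # top items by rec. frequency
    tail_prior = set(range(n)) - head_prior
    pi_head, pi_tail = k / n, (n - k) / n                 # assumed: pi = share of items
    rho_head = len(head_prior & head_rec) / k             # P(head stays head)
    rho_tail = len(tail_prior - head_rec) / (n - k)       # P(tail stays tail)
    return 1.0 - (pi_head * rho_head + pi_tail * rho_tail)
```

Note that the result depends only on category membership: swapping the most and least popular items within the “head” leaves $M$ unchanged, which is exactly the coarseness criticized above.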

CONCENTRATION REINFORCEMENT
To measure the concentration reinforcement (popularity) bias of RSes, we propose a new metric as follows:

$$CI@N = \sum_{i \in I} \left[ \frac{1}{2} \frac{s(i)}{\sum_{j \in I} s(j)} \ln \frac{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}}{\frac{r_N(i)+1}{N \cdot |U| + |I|}} + \frac{1}{2} \frac{r_N(i)}{N \cdot |U|} \ln \frac{\frac{r_N(i)+1}{N \cdot |U| + |I|}}{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}} \right],$$

where $s(i)$ is the number of positive ratings for item $i$ in the training set, $r_N(i)$ is the number of times item $i$ is included in the generated top-$N$ recommendation lists, and $U$ and $I$ are the sets of users and items, respectively.
◦ The proposed metric captures the distributional divergence between the popularity of each item in terms of prior sales (or number of positive ratings) and the number of times each item is recommended across all users.
◦ A score of zero denotes no change (i.e., the frequency of recommendation of an item is proportional to its prior popularity), whereas a (more) positive score denotes that recommendations deviate (more) from prior popularity. (A computational sketch follows below.)
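As a sketch, $CI@N$ can be computed directly from the two per-item count vectors. This is a minimal Python rendering of the formula above, with our own function and argument names, assuming items are indexed $0, \dots, |I|-1$.

```python
import numpy as np

def ci_at_n(s, r, N, num_users):
    """CI@N: symmetrised, add-one-smoothed divergence between prior popularity
    and recommendation frequency. s[i] = s(i), r[i] = r_N(i)."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    p = s / s.sum()                              # prior popularity distribution
    q = r / (N * num_users)                      # recommendation-frequency distribution
    p_sm = (s + 1) / (s.sum() + num_items)       # smoothed terms used inside the logs
    q_sm = (r + 1) / (N * num_users + num_items)
    return float(np.sum(0.5 * p * np.log(p_sm / q_sm) + 0.5 * q * np.log(q_sm / p_sm)))
```

The add-one smoothing keeps every logarithm finite even for items that are never rated or never recommended, which is what allows the metric to compare entire catalogs rather than arbitrary popularity categories.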

LONG-TAIL ENHANCEMENT
In order to measure whether the deviation of recommendations from the distribution of prior sales (or positive ratings) promotes long-tail rather than popular items, we also propose a measure of “long-tail enhancement” as follows:

$$LTI_\lambda@N = \frac{1}{|I|} \sum_{i \in I} \left[ \lambda \left( 1 - \frac{s(i)}{\sum_{j \in I} s(j)} \right) \ln \frac{\frac{r_N(i)+1}{N \cdot |U| + |I|}}{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}} + (1 - \lambda) \frac{s(i)}{\sum_{j \in I} s(j)} \ln \frac{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}}{\frac{r_N(i)+1}{N \cdot |U| + |I|}} \right],$$

where $\lambda \in (0, 1)$ controls which items are considered long-tail (i.e., the percentile of popularity below which a RS should increase the frequency of recommendation of an item).
◦ The proposed metric rewards a RS for increasing the frequency of recommendation of long-tail items while penalizing it for frequently recommending popular items (see the sketch below).
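A companion sketch for $LTI_\lambda@N$, mirroring the $CI@N$ code above; again, the names are ours and the smoothing follows the formula as written above.

```python
import numpy as np

def lti_at_n(s, r, N, num_users, lam=0.5):
    """LTI_lambda@N: contributions are positive when long-tail items are recommended
    more than their prior share and popular items less. s[i] = s(i), r[i] = r_N(i)."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    share = s / s.sum()                           # prior popularity share per item
    p_sm = (s + 1) / (s.sum() + num_items)        # smoothed prior distribution
    q_sm = (r + 1) / (N * num_users + num_items)  # smoothed recommendation distribution
    terms = (lam * (1 - share) * np.log(q_sm / p_sm)
             + (1 - lam) * share * np.log(p_sm / q_sm))
    return float(np.mean(terms))
```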

EXPERIMENTAL SETTING
◦ We use the MovieLens 100k (ML-100k), 1M (ML-1m), and “latest-small” (ML-ls) data sets, as well as the FilmTrust (FT) data set.
◦ We use association rules (AR) [8], item-based collaborative filtering (CF) nearest neighbors (ItemKNN), user-based CF nearest neighbors (UserKNN), a CF ensemble for ranking (RankSGD) [7], list-wise learning to rank with matrix factorization (LRMF) [10], Bayesian personalized ranking (BPR) [9], and Bayesian personalized ranking for non-uniformly sampled items (WBPR) [5], as implemented in [6].
◦ We employ a holdout evaluation scheme with 80/20 random splits into training and test sets (sketched below).
◦ We evaluate each model in terms of predictive accuracy, concentration reinforcement, and long-tail enhancement.
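The experiments themselves rely on the LibRec implementations [6]; the following stand-alone sketch only illustrates the 80/20 random holdout protocol on a ratings array (our own function name and data layout).

```python
import numpy as np

def holdout_split(ratings, train_frac=0.8, seed=0):
    """Randomly split a (num_ratings, 3) array of (user, item, rating) rows 80/20."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    cut = int(train_frac * len(ratings))
    return ratings[idx[:cut]], ratings[idx[cut:]]
```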

COMPARATIVE ANALYSIS

Figure 1: Divergence of recommendations from prior popularity.
Figure 2: Performance (ranking) of various RS algorithms.

Based on the experimental results, we can see that:
◦ the proposed metrics capture different performance dimensions of algorithms than existing metrics;
◦ even though, in aggregate, an algorithm might distribute the number of times each item is recommended more equally, it might still achieve this by deviating less from the prior number of sales (or positive ratings) for each item separately (e.g., green color for the Gini coefficient and red color for concentration reinforcement);
◦ even though some algorithms might recommend fewer (more) items than others, or distribute the number of times each item is recommended less (more) equally, they might achieve this by frequently recommending more (fewer) long-tail items rather than more (fewer) popular items (e.g., red color for the Gini coefficient and green color for $LTI$).

FUTURE DIRECTIONS
The proposed metrics should be used in combination in order to evaluate:
◦ how much the recommendations of a RS algorithm deviate from the prior popularity of items, and
◦ whether this deviation occurs by promoting long-tail rather than already popular items.

REFERENCES

[1] P. Adamopoulos. Beyond Rating Prediction Accuracy: On New Perspectives in RSes. In RecSys. ACM, 2013.
[2] P. Adamopoulos. On Discovering non-Obvious Recommendations: Using Unexpectedness and Neighborhood Selection Methods. In RecSys, 2014.
[3] P. Adamopoulos and A. Tuzhilin. Probabilistic Neighborhood Selection in CF Systems. http://hdl.handle.net/2451/31988, 2013.
[4] P. Adamopoulos and A. Tuzhilin. On Over-specialization and Concentration Bias of Recommendations: Probabilistic Neighborhood Selection. In RecSys, 2014.
[5] Z. Gantner, L. Drumond, et al. Personalized Ranking for Non-Uniformly Sampled Items. In KDD Cup, 2012.
[6] G. Guo, J. Zhang, et al. LibRec: A Java Library for Recommender Systems. In UMAP, 2015.
[7] M. Jahrer and A. Töscher. Collaborative Filtering Ensemble for Ranking. In KDD Cup, 2012.
[8] C. Kim and J. Kim. A Recommendation Algorithm Using Multi-Level Association Rules. In WI ’03. IEEE, 2003.
[9] S. Rendle, C. Freudenthaler, et al. Bayesian Personalized Ranking from Implicit Feedback. In UAI, 2009.
[10] Y. Shi, M. Larson, et al. List-wise Learning to Rank with Matrix Factorization for CF. In RecSys, 2010.