Efficient Similarity Computation for Collaborative Filtering in Dynamic Environments (ACM Recommender Systems)




Contribution

We introduce a novel algorithm, called “Dynamic Index”, for efficiently computing all exact pairwise similarities in a collection of sparse, high-dimensional vectors, which are typical for recommender systems. The algorithm learns incrementally, making it suitable for real-time environments.

We further exploit non-recommendable items to improve the computational efficiency of our method.

By presenting our algorithm in a MapReduce-inspired formulation, it is easily parallelised and scalable.

The algorithm is generally applicable, but we focus on neighbourhood-based collaborative filtering.

Efficient Similarity Computation for Collaborative Filtering in Dynamic Environments

Olivier Jeunen (1), Koen Verstrepen (2) and Bart Goethals (1,2,3)

(1) Adrem Data Lab, University of Antwerp, Antwerp, Belgium
(2) Froomle, Antwerp, Belgium
(3) Faculty of Information Technology, Monash University, Melbourne, Australia

Fig. 1: Runtime in seconds for a baseline tuned towards sparse data and our Dynamic Index algorithm (top plot). Runtime in seconds for our Dynamic Index algorithm, for a varying number of virtual cores (bottom plot).

Fig. 2: Runtime in seconds for Dynamic Index, when items only remain recommendable for δ time, along with the number of recommendable items that are left.

Intuition

In the case of binary implicit feedback, the cosine similarity between items can be computed as the number of users who have seen both items, normalised by their counts:

cos(i, j) = |U_i ∩ U_j| / (√|U_i| · √|U_j|)

As such, we compute |U_i ∩ U_j| and |U_i| incrementally in a matrix M and vector N, in a single pass over the data. Simple, unordered inverted indices ensure fast updates when new data arrives.
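As a concrete sketch of this intuition (a hypothetical toy example, not from the poster), the counts in M and N can be accumulated in a single pass and turned into cosine scores on demand:

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each user's set of seen items (binary implicit feedback).
pageviews = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}

M = defaultdict(int)  # M[(i, j)] = |U_i ∩ U_j|: users who saw both i and j
N = defaultdict(int)  # N[i] = |U_i|: users who saw item i

for items in pageviews.values():
    for i in items:
        N[i] += 1
    for i, j in combinations(sorted(items), 2):
        M[(i, j)] += 1
        M[(j, i)] += 1

def cos(i, j):
    # cos(i, j) = |U_i ∩ U_j| / (sqrt(|U_i|) * sqrt(|U_j|))
    return M[(i, j)] / (sqrt(N[i]) * sqrt(N[j]))
```

Here items “a” and “b” co-occur for two of the three users, so `cos("a", "b")` evaluates to 2 / (√2 · √3).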

Recommendability

For many different use cases, the set of recommendable items at a given time is a much smaller subset of the full item collection (due to recency, stock, business rules…). We exploit this imbalance by only keeping up-to-date scores for recommendable items. The set of recommendable items at time t is denoted by R_t.

Algorithm 1 Dynamic Index
Input: A set of pageviews P_t, a set of recommendable items R_t.
Output: A matrix of item intersections M, a vector of items’ l1-norms N, an inverted index of users to rec. items L_r, an inverted index of users to non-rec. items L_n.
1:  M ← 0, N ← 0, ∀u ∈ U : L_r[u] ← ∅, L_n[u] ← ∅
2:  for (u, i, s) ∈ P_t do
3:      for j ∈ L_r[u] do
4:          M_{i,j} += 1
5:      if i ∈ R_t then
6:          for j ∈ L_n[u] do
7:              M_{i,j} += 1
8:          L_r[u] ← L_r[u] ∪ {i}
9:          N_i += 1
10:     else
11:         L_n[u] ← L_n[u] ∪ {i}
12: return M, N, L_r, L_n
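The pseudocode transcribes almost line for line into Python. The sketch below assumes pageviews arrive as (user, item, score) triples and uses dicts of sets for the two inverted indices; it is an illustrative transcription, not the authors’ implementation:

```python
from collections import defaultdict

def dynamic_index(pageviews, recommendable):
    """Single-pass sketch of Algorithm 1 (Dynamic Index)."""
    M = defaultdict(int)    # item-intersection counts, M[(i, j)] = |U_i ∩ U_j|
    N = defaultdict(int)    # item counts, N[i] = |U_i|
    L_r = defaultdict(set)  # inverted index: user -> recommendable items seen
    L_n = defaultdict(set)  # inverted index: user -> non-recommendable items seen

    for u, i, _s in pageviews:
        # Lines 3-4: co-occurrences with recommendable items u has already seen.
        for j in L_r[u]:
            M[(i, j)] += 1
        if i in recommendable:
            # Lines 6-7: catch up on u's previously seen non-recommendable items.
            for j in L_n[u]:
                M[(i, j)] += 1
            L_r[u].add(i)
            N[i] += 1
        else:
            L_n[u].add(i)
    return M, N, L_r, L_n
```

Note that counts (rows of M and entries of N) are only maintained for items in the recommendable set; views of non-recommendable items are merely indexed in L_n, which is what makes the update cheap when R_t is small.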


Parallelisation

For two models ℳ and ℳ′ computed on data stemming from disjoint sets of users, the combined model is simply the sum of their item co-occurrence matrices and count vectors. As such, we can easily parallelise Dynamic Index in MapReduce fashion.
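A minimal sketch of that merge step (hypothetical helper name, assuming each model is an (M, N) pair of count dicts as above):

```python
from collections import defaultdict

def merge(model_a, model_b):
    """Combine two models built on disjoint user sets by entrywise summation."""
    M_a, N_a = model_a
    M_b, N_b = model_b
    M = defaultdict(int, M_a)  # start from a copy of the first model's counts
    N = defaultdict(int, N_a)
    for pair, count in M_b.items():
        M[pair] += count       # sum the item co-occurrence matrices
    for item, count in N_b.items():
        N[item] += count       # sum the item count vectors
    return M, N
```

In MapReduce terms, each mapper runs Dynamic Index over its own user partition and the reducer folds the partial models together with this summation.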

Conclusion

o Dynamic Index is especially fast for highly sparse data
o Incrementally computable, easily parallelisable
o Future work includes looking into other similarity functions (Jaccard, PMI, Pearson), and extensions for non-binary data.

Source code: