
  • 8/17/2019 KTH Kexjobb User Behavior Prediction


DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

    A comparative study of the

    conventional item-based

    collaborative filtering and the

    Slope One algorithms for

    recommender systems

    HENRIK SVEBRANT

    JOHN SVANBERG

    KTH ROYAL INSTITUTE OF TECHNOLOGY

    SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


    A comparative study of the conventional

    item-based collaborative filtering and the Slope

    One algorithms for recommender systems

    HENRIK SVEBRANT

    JOHN SVANBERG

    Degree Project in Computer Science, DD143X

Supervisor: Jeanette Hellgren Kotaleski

Examiner: Örjan Ekeberg

    CSC KTH 2016-05


    Abstract

Recommender systems are an important research topic in today's society as the amount of data increases across the globe. In order for commercial systems to give their users good and personalized recommendations on what data may be of interest to them in an effective manner, such a system must be able to give recommendations quickly and scale well as data increases. The purpose of this paper is to evaluate two such algorithms with this in mind.

The two algorithm families tested are both classified as item-based collaborative filtering but work very differently.

It is therefore of interest to see how their complexities affect their performance, accuracy and scalability. The Slope One family is much simpler to implement and proves to be equally efficient, if not more efficient, than the conventional item-based ones.

Both families require a precomputation stage before recommendations can be given; this is the stage where Slope One suffers in comparison to the conventional item-based one.

    The algorithms are tested using Lenskit, on data provided

    by GroupLens and their MovieLens project.


    Referat

Recommender systems are an important research area today. As the amount of data in society is growing rapidly, it is important for commercial systems to be able to give their users good, personalized recommendations of interesting items in an efficient manner. Besides giving recommendations quickly, they should also be scalable so that they remain usable as the amount of data grows further. The purpose of this report is to evaluate two types of recommendation algorithms with these points in mind.

The two algorithm families tested both belong to the item-based collaborative filtering type but work very differently at their core. It is therefore interesting to see how the complexity of one compares to the simplicity of the other with regard to performance, accuracy and scalability. The Slope One algorithm is much simpler to implement and proves to be as efficient as, possibly even more efficient than, the conventional item-based one.

Both types require a precomputation stage before recommendations can be given; in this stage Slope One performs worse than its counterpart.

The algorithms were tested using Lenskit, on data from GroupLens and their MovieLens project.


    Contents

1 Introduction
  1.1 Problem definition
  1.2 Scope and constraints
  1.3 Thesis overview

2 Background
  2.1 Applications of recommender systems
  2.2 Items
  2.3 Users
  2.4 Transactions
  2.5 Content-Based filtering
  2.6 Collaborative filtering
    2.6.1 Memory-based collaborative filtering
    2.6.2 Model-based collaborative filtering
  2.7 Tools and algorithms used in this study
    2.7.1 Lenskit
    2.7.2 Conventional item-based CF
    2.7.3 Slope One
    2.7.4 Weighted Slope One

3 Method
  3.1 Datasets
  3.2 Testing hardware
  3.3 Algorithm evaluation testing
    3.3.1 Conventional item-based CF
    3.3.2 Slope One
  3.4 Algorithm recommendation performance testing
  3.5 Validation of results
    3.5.1 K-fold cross-validation
    3.5.2 Root Mean Square Error

4 Results and analysis
  4.1 Algorithm evaluation results
  4.2 Recommendation performance results

5 Discussion
  5.1 Comparison of results with regards to accuracy
  5.2 Criticism of testing methods
    5.2.1 Lenskit
    5.2.2 Cross-folding
    5.2.3 RMSE
    5.2.4 Layout of testing
  5.3 Challenges with limited hardware
  5.4 Final conclusion

Bibliography

Appendices

A Evaluation result plots


    Chapter 1

    Introduction

The rapidly increasing use of the internet supplies its users with an incredible amount of information. Being able to quickly find what you need is becoming increasingly difficult as the amount of information available grows. To mitigate this effect, it is important to develop systems that make personalized recommendations and streamline the search process to make it more effective.

Recommender systems (RS) do exactly this and are mainly used to generate recommendations to the users of a commercial system. The recommendations made vary based upon what the system itself is used for; a few examples are books, movies, music, travel and other e-commerce items. The recommendations are based on the perceived interests of the users, with the intention of giving the user an idea of what the next interesting item could be.

There are two main approaches to how recommender systems are built: through collaborative filtering (CF) or through content-based filtering. Collaborative filtering can further be divided into user-based CF and item-based CF. User-based CF generates recommendations for a user u based on other users similar to u. Item-based CF instead investigates the similarity between items by representing each item's ratings as a vector in n dimensions. The vectors of two items can then be compared with, for example, cosine similarity. This information is later used to generate recommendations to user u based on items similar to u's previously liked items. A content-based RS generates recommendations based on the content of items; the system therefore needs data about the items, e.g. the actors, director and genre of a movie.

Each approach has its own advantages and disadvantages, making this a much-researched subject. The goal is to maximize the quality of the recommendations made in order to achieve greater user satisfaction, resulting in a larger customer base and greater income.

A big problem for recommender systems is to be fast and scalable and yet perform with high accuracy. It has been shown that a user-based RS does not scale very well with large datasets and systems with rapidly growing and changing user bases. Therefore this thesis will discuss implementations of two different item-based algorithms: the conventional item-based CF algorithm and the simpler Slope One algorithm. The focus is on comparing the accuracy of the algorithms with their complexity and run time performance in mind. Both algorithms are easily modified to adjust certain parameters that possibly affect both accuracy and run time performance.

A lot of research has been done in the field of recommender systems. The tests carried out in this thesis are based on Lenskit, a recommender system research toolkit. Lenskit is a credible set of tools developed to be useful in recommender system research. It was developed by researchers at Texas State University and GroupLens Research at the University of Minnesota.

    1.1 Problem definition

The need for faster and more scalable recommender systems is an existing problem that grows as the amount of data increases. It is therefore important to use well-performing algorithms that can manage the demands set by the users of such a system. As item-based recommender systems have been shown to be more efficient than user-based ones, this study will compare the performance and scalability of two algorithms of this type. The algorithms chosen for evaluation are the conventional item-based CF algorithm and the Slope One algorithm. They will be compared with regard to performance, scalability and implementation difficulty. The question to be discussed in this study is the following:

How does the Slope One algorithm compare to the more complex conventional item-based algorithm, with regard to performance and scalability?

    1.2 Scope and constraints

The data used in this study consists of datasets with movie ratings; however, the algorithms used and the results seen are well applicable to recommender systems in other fields, branches of e-commerce and other businesses where recommendations are needed.

The testing software used in this study is an open-source recommender system software developed by researchers at Texas State University and GroupLens Research at the University of Minnesota.

The datasets used are supplied by GroupLens Research, using data collected by their MovieLens project. Item ratings are made on a scale of 1-5 using integer increments.

    1.3 Thesis overview

•  Chapter 2 will provide the background needed to effectively understand the subject, combined with definitions and explanations for a set of field-specific terminology. It will also introduce the algorithms compared in this thesis.

    •  Chapter 3 explains the method used for this study.

•  Chapter 4 will provide the final results from the tests explained in the previous chapter.

•  Chapter 5 discusses the results and presents the problems encountered during this study.

    •  Chapter 6 provides the references used.

    •  The appendix provides plots for various evaluation test results.


    Chapter 2

    Background

To be able to implement a recommender system (RS) and do research about its performance, knowledge about the subject is needed. This section introduces the field of recommender systems and its applications. The focus is on collaborative filtering, especially item-based collaborative filtering; however, alternative approaches and their drawbacks will also be presented briefly.

    2.1 Applications of recommender systems

Recommender systems are widely used on the Internet. Systems of this kind can be implemented using many different kinds of algorithms and be applied in multiple types of fields. Common uses are giving recommendations for movies, music and items in online stores, or suggesting friends in social networks.

    In short, a recommender system is a system that helps its users to find informationof interest in environments where the amount of information is large.

    2.2 Items

The items being referred to in this report are the objects that are recommended by the RS. An item is distinguished by a unique id number and characterized by item-specific values, such as movie titles, genres and directors[3].

    2.3 Users

The users of an RS can have very diverse goals, personalities, tastes and characteristics. Therefore the RS needs to be generic and exploit user information with an unbiased approach that does not exclude any characteristics. A user in the RS has several variables and attributes that correlate with other users or items. One important such attribute is their ratings of previously used items[3].


    2.4 Transactions

The interactions between users and items are called transactions. The interactions are recorded and attributed to the owner of the interaction, namely the user who interacted with the item. Transactions are important data for the RS to be able to generate qualitative recommendations to a user. Transactions may be collected explicitly or implicitly: in a movie RS, typical explicit transactions are system-asked ratings, whereas an implicit transaction could be a user watching a movie[3].

    Transactions could be abstracted in a variety of forms:

•  User ratings 1-5.

•  Binary ratings, i.e. whether the item is good or bad.

•  User expressive ratings, e.g. "good", "beautiful", "emotional".

•  A user's previously seen movies.

    2.5 Content-Based filtering

Content-based (CB) recommender systems attempt to recommend items similar to those a given user has liked in the past. This is done by building a model representation of the user, based on the features of the objects he or she has previously rated. The model is then used to match the attributes of the user model against those of a content object[3].

Content-based filtering systems have some advantages when compared to collaborative filtering. One advantage is that it is user independent: it solely makes use of ratings provided by the given user in order to build its respective user model. The system is also able to give recommendations on items not yet rated by any user, which a collaborative filtering system is not.

It does, however, have a couple of shortcomings as well. One of these is the limited content analysis it provides: to give movie recommendations, the system needs to know, for example, the actors and directors. No system can give reliable recommendations if the data needed to distinguish items is lacking.

Content-based recommender systems also tend to be over-specialized when giving recommendations, as the recommended items are solely those whose scores match highly against the user model. For example, if a user previously liked movies by a certain director, the recommender system will recommend movies by this particular director and reject other movies, because the recommender is following a certain pattern that matches the user profile with the item specifics. Because of this, new users cannot be given accurate recommendations until the system knows and understands the user's preferences[3].

    2.6 Collaborative filtering

Collaborative filtering (CF) recommendation systems typically attempt to identify users whose preferences are similar to those of the given user and recommend items that those users have liked. Collaborative filtering algorithms can also take an item-based approach, and will then identify similarities between different items. Just as with content-based filtering systems, the most used rating representations are numeric scale ratings or binary (like/dislike) systems. In the field of e-commerce, unary ratings such as "has purchased" are also common[3][2].

A common technique in collaborative filtering systems is to build a user-item ratings matrix. The resulting matrix is often very sparse and becomes increasingly difficult to manage as the amount of data grows large[2].

    2.6.1 Memory-based collaborative filtering

One approach to CF systems is the so-called memory-based, or user-based, CF. Memory-based CF makes use of statistical techniques to find a set of users, called neighbors, that have a common preference and a history of similarly liked items.

The memory-based CF RS can then apply different types of algorithms to construct a top-N list of items that are predicted to be appreciated by the neighborhood.

The memory-based CF system has limitations, however. One of these is the fact that similarity values are calculated based on common items, which becomes unreliable when data is sparse and common items are few. Another disadvantage of memory-based CF is its bad scalability: when the dataset used by the RS grows, the computations in a memory-based CF increase with both the number of users and the number of items, so a system with millions of users and items will not scale very well[1].

    2.6.2 Model-based collaborative filtering

Another approach, which improves prediction reliability, is model-based CF, also called item-based CF. This approach builds a model based on rating data for the individual items, commonly using data mining and machine learning techniques. Item-based CF has been shown to scale better than user-based CF because the models are based on items, which are more static than the users in a typical dataset. The scalability is also better due to the better handling of sparse data with an item-based approach. This is of importance today as data grows large and scalability problems are important issues[8]. This approach is further presented in section 2.7.2.


    2.7 Tools and algorithms used in this study

This section introduces the software that was used to evaluate the algorithms studied in this report; both algorithms belong to the family of model-based collaborative filtering.

    2.7.1 Lenskit

Lenskit is free and open-source software developed by researchers at Texas State University and GroupLens Research at the University of Minnesota, with contributions from developers around the world[5]. Lenskit is designed to be useful for building production-quality recommender systems and to support many forms of research, including research on evaluation techniques and algorithms.

Lenskit is based on components that together build a functional recommender. The components can be changed, and it is possible to implement components of your own. It implements effective data structures optimized for sparse data, and linear algebra operations such as dot products. It also provides crossfolding techniques to split the dataset into N partitions for cross-validation[5]. In addition to this, it implements various performance measuring techniques, making it a justified resource for the research conducted in this project.

    2.7.2 Conventional item-based CF

An item-based approach creates a model of user ratings. The goal of the algorithm is a list of similar items based on the active user's previously liked items. This is done by evaluating the similarity to the target item i and selecting the k most similar items. A second list consists of those k items' similarity values to i[1].

The similarity between items can be calculated in several different ways, and there are benefits and disadvantages to every such method. Using a straightforward method would benefit the complexity but could possibly lack the accuracy of a more complex method.

Cosine-based similarity is a proven method to compute the similarity of two vectors in an N-dimensional space. In this case the two compared items are represented as the two vectors, and the similarity is computed by calculating the cosine of the angle between them. The dimension N is based on the number of users who rated the items: m users implies m dimensions.


A similarity function between items i and j, denoted Similarity(i, j), can be calculated by the formula below.

\mathrm{Similarity}(i, j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\| \times \|\vec{j}\|} \qquad (2.1)

Where \vec{i} and \vec{j} are the rating vectors of the two items and \cdot is the dot product of the two vectors.
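Formula 2.1 can be sketched in a few lines of Python; the function name and the example rating vectors below are our own, chosen for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two item rating vectors (Formula 2.1)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Two items, each rated by the same three users:
print(cosine_similarity([5, 3, 4], [4, 3, 5]))  # ≈ 0.98
```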

An adjusted version of the cosine-based similarity takes each user's average rating into account by subtracting the user's average from each co-rated pair. This can be beneficial because some users are more critical and others more positive in general, and therefore do not share a common rating scale. A co-rated case means that two or more users have all rated both item i and item j. The set of users who co-rated items i and j is denoted U. This adjusted version is given by the following formula.

\mathrm{Similarity}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}} \qquad (2.2)

Where R_{u,i} is u's rating of i and \bar{R}_u is the average rating of u.
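A minimal sketch of the adjusted cosine, assuming ratings are stored as user → {item: rating} dictionaries (a data layout chosen here for illustration, not taken from Lenskit):

```python
import math

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between items i and j (Formula 2.2).

    Every user's own average rating is subtracted, putting critical and
    generous raters on a common scale.
    """
    co_raters = [u for u, r in ratings.items() if i in r and j in r]
    num = den_i = den_j = 0.0
    for u in co_raters:
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        di = ratings[u][i] - mean_u
        dj = ratings[u][j] - mean_u
        num += di * dj
        den_i += di * di
        den_j += dj * dj
    return num / (math.sqrt(den_i) * math.sqrt(den_j))

# Two users who co-rated items "i" and "j" in opposite ways:
ratings = {"u1": {"i": 5, "j": 3}, "u2": {"i": 2, "j": 4}}
print(adjusted_cosine(ratings, "i", "j"))  # ≈ -1.0
```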

The similarity can also be calculated using the Pearson-r correlation. To make the correlation significant, the co-rated cases must be put in a set, similar to what was done for the previous adjusted cosine formula. This is illustrated by the formula below.

\mathrm{Similarity}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}} \qquad (2.3)

Where \bar{R}_i is the average rating of item i.
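The Pearson-r variant differs only in which averages are subtracted; a sketch assuming the same kind of user → {item: rating} dictionaries as above would be our own illustration, not a library API:

```python
import math

def pearson_similarity(ratings, i, j):
    """Pearson-r correlation between items i and j (Formula 2.3).

    Unlike the adjusted cosine, the *item* averages over the co-rating
    users are subtracted, not each user's own average.
    """
    co = [u for u, r in ratings.items() if i in r and j in r]
    mean_i = sum(ratings[u][i] for u in co) / len(co)
    mean_j = sum(ratings[u][j] for u in co) / len(co)
    num = sum((ratings[u][i] - mean_i) * (ratings[u][j] - mean_j) for u in co)
    den = (math.sqrt(sum((ratings[u][i] - mean_i) ** 2 for u in co))
           * math.sqrt(sum((ratings[u][j] - mean_j) ** 2 for u in co)))
    return num / den

# Ratings of item "j" move in lockstep with item "i":
ratings = {"a": {"i": 1, "j": 2}, "b": {"i": 2, "j": 4}, "c": {"i": 3, "j": 6}}
print(pearson_similarity(ratings, "i", "j"))  # ≈ 1.0
```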

Now that we have the similarities calculated, the first stage in the algorithm is done. The second stage in a conventional item-based CF algorithm is to generate the actual recommendations based on the similarities previously calculated and the ratings of the target user.

As with the similarity calculation between items, there are several methods to produce recommendations. One way to generate a prediction on item i for the target user u is to calculate the sum of u's ratings for the items similar to i. All ratings in the sum are weighted so that ratings corresponding to items with high similarity to i affect the prediction for i more. This method is called Weighted Sum and the formula is shown below.

\mathrm{Prediction}(u, i) = \frac{\sum_{\bar{i} \in N} S_{i,\bar{i}} \times R_{u,\bar{i}}}{\sum_{\bar{i} \in N} |S_{i,\bar{i}}|} \qquad (2.4)

Where N is the set of items similar to i and S_{i,\bar{i}} is i's similarity to \bar{i}.
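A hedged sketch of the Weighted Sum (Formula 2.4); the dictionary layout and the example numbers are ours, chosen only to illustrate the weighting:

```python
def weighted_sum(similarities, user_ratings):
    """Weighted Sum prediction for a target item (Formula 2.4).

    similarities: {neighbour item: similarity to the target item}
    user_ratings: {item: the target user's rating}
    Only neighbours the user has actually rated contribute.
    """
    rated = [i for i in similarities if i in user_ratings]
    num = sum(similarities[i] * user_ratings[i] for i in rated)
    den = sum(abs(similarities[i]) for i in rated)
    return num / den

# Hypothetical neighbourhood of the target item ("c" is unrated, so skipped):
print(weighted_sum({"a": 0.5, "b": 0.25, "c": 0.9}, {"a": 4, "b": 2}))  # ≈ 3.33
```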

The Weighted Sum can also be approximated with a regression model; this way there is no need to directly use the ratings of similar items. A regression model can be more accurate than Weighted Sum because the Euclidean distance between the compared rating vectors can be large even when the actual similarity is high. The approximated regression model can use the same formula as the Weighted Sum, where R_{u,\bar{i}} is exchanged for the approximation \acute{R}_{u,\bar{i}}, obtained by the following formula[1].

\acute{R}_{u,i} = \alpha \bar{R}_i + \beta + \epsilon \qquad (2.5)

The parameters \alpha and \beta are acquired from the rating vectors, whereas \epsilon is the error of the model.

    2.7.3 Slope One

The Slope One algorithm is a model-based CF algorithm that is both simple to implement and simple to understand. It was developed and presented by Daniel Lemire and Anna Maclachlan in 2005[4]. Among CF systems it can be argued to be one of the simplest forms of algorithm that is both non-trivial and item-based.

The algorithm uses information about the items that the user has rated previously, together with ratings from other users who have rated the same items. Only the ratings made by users who have some items in common with the predictee user are used, and of those ratings only the ones for items also rated by the predictee. This builds rating pairs which are used for the prediction.

The prediction has the form f(x) = x + b, a simplified version of the linear regression f(x) = ax + b. The constant b is defined as the mean difference between the two items' ratings, and x is a variable representing a rating value.


Table 2.1. Movie rating example

|        | Vikings | Game of Thrones | Breaking Bad | Band of Brothers | Suits |
|--------|---------|-----------------|--------------|------------------|-------|
| John   | 5       | 4               | 5            | -                | 4     |
| Henrik | 5       | 5               | 4            | 3                | 2     |
| Lisa   | 4       | -               | 4            | 2                | 4     |
| Sophie | ?       | 3               | 2            | -                | -     |
| Anna   | 4       | 3               | -            | -                | 5     |

For example, if we wish to predict Sophie's rating for Vikings in Table 2.1, it goes as follows.

Using the common pairs of ratings for Vikings and Game of Thrones we get the mean difference: ((5-4) + (5-5) + (4-3)) / 3 = 2/3. This value is our b; adding b to x, where x is Sophie's rating for Game of Thrones (3), gives the rating 3 + 2/3.

Doing the same for Breaking Bad we get ((5-5) + (5-4) + (4-4))/3 = 1/3, which added to Sophie's Breaking Bad rating of 2 gives us 2 + 1/3.

Using both of those predictions we can get a better one by calculating the mean, resulting in ((3 × (3 + 2/3)) + (3 × (2 + 1/3)))/(3 + 3) = 18/6 = 3. Sophie's predicted rating for Vikings, using the available pairs for Game of Thrones and Breaking Bad, is therefore 3 with the standard Slope One algorithm. The mean difference can more formally be written as

b = \frac{1}{n} \sum_{i=1}^{n} (w_i - v_i) \qquad (2.6)

where v and w are the two items, w is the item being predicted, and v_i and w_i are the ratings given by user i for those items.

To get the best prediction of the form f(x) = x + b, given two arrays v_i and w_i with i = 1, 2, ..., n, we minimize

\sum_{i=1}^{n} (v_i + b - w_i)^2 \qquad (2.7)

Deriving with respect to b and setting the derivative to zero implies that b is equal to Formula 2.6. This result leads us to the following scheme.

Given a user evaluation u with ratings u_j and u_i, where i and j are items, and given a training set X, the average deviation between the two items is defined as:

\mathrm{dev}_{j,i} = \sum_{u \in S_{j,i}(X)} \frac{u_j - u_i}{\mathrm{card}(S_{j,i}(X))} \qquad (2.8)


Where S_{j,i}(X) is the set of all user evaluations in the training set X that contain both i and j. In other words, the deviation only takes into account those users that have given a rating to both of those specific items. The calculated \mathrm{dev}_{j,i} values are saved in a symmetric matrix, which can easily be updated when new data is entered.

Given the fact that \mathrm{dev}_{j,i} + u_i is a prediction for u_j given u_i, a reasonable predictor is the average of all such predictions.

P(u)_j = \frac{\sum_{i \in R_j} (\mathrm{dev}_{j,i} + u_i)}{\mathrm{card}(R_j)} \qquad (2.9)

Where R_j is the set of all relevant items and P(u)_j is the prediction for item j. Worth noting is that this predictor does not depend strongly on how the user has rated individual items, but rather on the user's average rating and on which items the user has rated.
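The full scheme (deviations per Formula 2.8, predictions per Formula 2.9) can be sketched as below. The class and data layout are our own; as a sanity check, feeding it Table 2.1 reproduces the worked example for Sophie and Vikings:

```python
from collections import defaultdict

class SlopeOne:
    """Plain Slope One: precompute pairwise deviations (2.8), then
    predict a rating as the average of (deviation + known rating) (2.9)."""

    def fit(self, evaluations):
        # evaluations: iterable of {item: rating} dicts, one per user.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for u in evaluations:
            for j in u:
                for i in u:
                    if i != j:
                        sums[(j, i)] += u[j] - u[i]
                        counts[(j, i)] += 1
        # dev[(j, i)] is the average of u_j - u_i over co-rating users.
        self.dev = {pair: s / counts[pair] for pair, s in sums.items()}
        return self

    def predict(self, user, j):
        # Average dev_(j,i) + u_i over the items i the user has rated.
        preds = [user[i] + self.dev[(j, i)]
                 for i in user if i != j and (j, i) in self.dev]
        return sum(preds) / len(preds)

# Table 2.1 as training data; "-" entries are simply omitted.
table = [
    {"Vikings": 5, "Game of Thrones": 4, "Breaking Bad": 5, "Suits": 4},   # John
    {"Vikings": 5, "Game of Thrones": 5, "Breaking Bad": 4,
     "Band of Brothers": 3, "Suits": 2},                                   # Henrik
    {"Vikings": 4, "Breaking Bad": 4, "Band of Brothers": 2, "Suits": 4},  # Lisa
    {"Game of Thrones": 3, "Breaking Bad": 2},                             # Sophie
    {"Vikings": 4, "Game of Thrones": 3, "Suits": 5},                      # Anna
]
model = SlopeOne().fit(table)
sophie = {"Game of Thrones": 3, "Breaking Bad": 2}
print(model.predict(sophie, "Vikings"))  # ≈ 3, matching the worked example
```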

    2.7.4 Weighted Slope One

The Slope One scheme suffers from the drawback that the number of ratings observed is not taken into consideration when predicting ratings. Consider the example where we wish to predict user Adam's rating of item A given Adam's ratings of items B and C. If 3000 users have rated the pair of items B and A, whereas only 50 users have rated the pair C and A, then Adam's rating of item B is likely a far better predictor for item A than his rating of item C is. With the Weighted Slope One algorithm it is possible to increase the weight of the more relevant ratings in order to mitigate this effect.

P(u)_j = \frac{\sum_{i \in S(u) - \{j\}} (\mathrm{dev}_{j,i} + u_i) \, c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}} \qquad (2.10)

Where c_{j,i} = \mathrm{card}(S_{j,i}(X)) is the number of users who rated both items j and i, considered to be the weight.

Using Table 2.1, the weight between Vikings and Game of Thrones is represented by the number of users that have rated both of those items, formally c_{j,i}. In Table 2.1 this was done by the users John, Henrik and Anna, resulting in a weight of 3. A table representation of all weights from this example is:


Table 2.2. Table of weights between TV shows from Table 2.1

|                  | Vikings | Game of Thrones | Breaking Bad | Band of Brothers | Suits |
|------------------|---------|-----------------|--------------|------------------|-------|
| Vikings          | 4       | 3               | 3            | 2                | 4     |
| Game of Thrones  | 3       | 4               | 3            | 1                | 3     |
| Breaking Bad     | 3       | 3               | 4            | 2                | 3     |
| Band of Brothers | 2       | 1               | 2            | 2                | 2     |
| Suits            | 4       | 3               | 3            | 2                | 4     |

Using this table and formula 2.10, Sophie's predicted rating for Vikings using her ratings for Game of Thrones and Breaking Bad can be calculated as follows:

\frac{3 \times (3 + 2/3) + 3 \times (2 + 1/3)}{3 + 3} = \frac{18}{6} = 3 \qquad (2.11)

Since both co-rating weights happen to be 3, the weighted prediction here coincides with the unweighted one.
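Formula 2.10 can be sketched as follows; the deviation and count values below are hypothetical, chosen so the effect of unequal weights is visible:

```python
def weighted_slope_one(dev, counts, user, j):
    """Weighted Slope One prediction (Formula 2.10): each pairwise
    prediction dev[(j, i)] + u_i is weighted by c_ji, the number of
    co-ratings the deviation was computed from."""
    items = [i for i in user if i != j and (j, i) in dev]
    num = sum((dev[(j, i)] + user[i]) * counts[(j, i)] for i in items)
    den = sum(counts[(j, i)] for i in items)
    return num / den

# Hypothetical deviations and co-rating counts for a target item "t":
dev = {("t", "a"): 1.0, ("t", "b"): 0.0}
counts = {("t", "a"): 3, ("t", "b"): 1}
# The pair ("t", "a") has three co-ratings, so it dominates the result.
print(weighted_slope_one(dev, counts, {"a": 2, "b": 4}, "t"))  # 3.25
```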


    Chapter 3

    Method

This section describes the method of the study. It is mainly a literature study with the addition of running research software in order to test and examine results.

    3.1 Datasets

The data used in this research is collected and supplied by GroupLens, generated using their MovieLens website. MovieLens is a web-based research recommender system launched in 1997. In order to evaluate the chosen algorithms effectively, datasets of different sizes have been used. MovieLens supplies datasets with a total of 100,000, 1M, 10M and 20M ratings. The ratings are on a scale of 1 to 5 and have been collected from 1,000 up to 138,000 users, depending on the dataset. Each user included in those datasets has rated at least 20 movies.

    3.2 Testing hardware

The test results presented in this report were produced using the following computer hardware:

Table 3.1. Computer specifications

| CPU         | Intel Core i5-3570K (4 cores @ 3.4 GHz) |
| GPU         | GeForce GTX 580                         |
| RAM         | DDR3 8 GB 2133 MHz                      |
| SSD         | 128 GB Samsung                          |
| Disk        | 1 TB Western Digital 7200 rpm           |
| Motherboard | ASUS P8Z77-V PRO                        |
| OS          | Windows 10 Pro 64-bit                   |


    3.3 Algorithm evaluation testing

This section presents the evaluation tests that have been done, and which algorithms and parameters were used. The evaluation testing was done using MovieLens's datasets consisting of 100K and 1M ratings[6]. The accuracy of the algorithms is examined and tested with the crossfolding evaluation technique explained in section 3.5.1. The crossfolding produces an error, calculated as a root mean square error (RMSE). The error is a measure of how accurate a prediction is compared with the actual value.

    3.3.1 Conventional item-based CF

The first tests were run on the conventional item-based CF algorithm as presented in section 2.7.2. Tests were run with various algorithm modifications, with neighborhood sizes ranging from 1 up to 250. When too many neighbors are taken into account the results have been shown to be less accurate, which is why the upper limit of 250 was chosen. The algorithm was also modified by testing similarity measurements using either Pearson correlation, cosine similarity or adjusted cosine similarity.
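As a sketch of the adjusted cosine similarity measurement mentioned above, the following Python function is an illustrative implementation under the standard definition (each user's mean rating is subtracted before the cosine between the two item columns is taken); the tiny ratings matrix is hypothetical, and this is not the Lenskit code used in the tests.

```python
import math

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between items i and j.

    `ratings` maps user -> {item: rating}; each user's mean rating is
    subtracted first, which compensates for users who rate
    systematically high or low.
    """
    num = si = sj = 0.0
    for user_ratings in ratings.values():
        if i in user_ratings and j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            a = user_ratings[i] - mean
            b = user_ratings[j] - mean
            num += a * b
            si += a * a
            sj += b * b
    if si == 0 or sj == 0:
        return 0.0
    return num / (math.sqrt(si) * math.sqrt(sj))

# Tiny hypothetical ratings matrix:
data = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 2, "C": 3},
    "u3": {"A": 1, "B": 5, "C": 2},
}
print(adjusted_cosine(data, "A", "B"))
```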

    3.3.2 Slope One

Both the weighted and unweighted versions of the Slope One algorithm, presented in section 2.7.3 and section 2.7.4, were tested. They were tested using deviation damping levels ranging from 0 to 6. Deviation damping is a value that is added to the number of coratings when calculating the deviation of item pairs. The damping levels used in this study were chosen because increasingly high values appeared to affect the algorithms' accuracy negatively.
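The effect of deviation damping can be illustrated with a small sketch. This is an assumption-laden illustration rather than Lenskit's implementation: `damped_deviation` is a hypothetical helper and the numbers are invented.

```python
def damped_deviation(diff_sum, corating_count, damping=5):
    """Average rating deviation for an item pair, with damping.

    `diff_sum` is the summed rating difference over users who rated
    both items; `damping` is added to the corating count, pulling the
    deviation toward 0 when the pair has few coratings.
    """
    return diff_sum / (corating_count + damping)

# With only 2 coratings, damping=5 shrinks the raw deviation of 1.5:
print(damped_deviation(3.0, 2, damping=0))  # undamped: 1.5
print(damped_deviation(3.0, 2, damping=5))  # ≈ 0.43, pulled toward 0
```

The intuition is that a deviation estimated from very few coratings is unreliable, so damping shrinks it toward a neutral value of 0.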

    3.4 Algorithm recommendation performance testing

The performance evaluation was carried out as a run time comparison between the algorithms and parameter settings. This was done using MovieLens's datasets consisting of 1M and 10M ratings, running each algorithm 100 times in order to calculate a mean value of its run time. The algorithms were set to recommend 10 movies for two randomly chosen users: user u with few rated items and user v with many, where u and v were used as the test users for all tests. Model building times were also noted, as they differ between the algorithms.

    3.5 Validation of results

This section describes the techniques used to validate the results acquired from the tests described in section 3.3.


    3.5.1 K-fold cross-validation

An implementation of k-fold cross-validation is present in the Lenskit evaluation framework. The dataset is partitioned into an arbitrary number of partitions, where one of the k partitions is selected to be the validation partition and the other k-1 partitions are used as training data. Each partition is used for validation exactly once, and the results from the k runs can then be averaged to produce one estimate for the entire dataset [7].

The number of partitions used in our tests was 5 for all tests, meaning that one partition was used for validation and the rest for training. This is also the default value used in Lenskit.

    3.5.2 Root Mean Square Error

During the cross-validation there are several ways to calculate the error between estimates and actual values. All tests in this study used RMSE as the error measure [9]. The formula for RMSE is presented below.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (w_i - v_i)^2}   (3.1)

where w is the set of actual ratings and v is the set of predicted ratings. The closer the RMSE value is to 0, the smaller the error, meaning that a value closer to 0 is more accurate than a higher one.
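The RMSE formula translates directly into code. A minimal sketch with invented ratings:

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted ratings."""
    n = len(actual)
    return math.sqrt(sum((w - v) ** 2 for w, v in zip(actual, predicted)) / n)

print(rmse([3, 4, 5], [3.5, 4, 4]))  # ≈ 0.645
```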


    Chapter 4

    Results and analysis

This chapter presents the results acquired from testing. The tests that have been run are accuracy evaluations and recommendation performance measurements.

    4.1 Algorithm evaluation results

The accuracy of an algorithm is an important measurement to take into account when the overall benefits of an algorithm are assessed. The following plot presents the accuracy, with RMSE on the y-axis, for the most accurate configurations from each algorithm family. See Appendix A for more evaluation results.

    Figure 4.1.   Evaluation results


    4.2 Recommendation performance results

The time it takes to recommend items to a user is a critical factor when choosing which algorithm to use for a project. Therefore it was decided to test the runtime performance, measured in seconds, for the algorithms found to be the most accurate in the previous section.

The following table presents the runtime required to build the prediction model for each of the tested algorithms, for datasets of 1M and 10M ratings.

    Table 4.1.   Performance testing results - model-building runtime

Algorithms                                 1M build time (s)   10M build time (s)
Conventional item-based, 20 neighbors      17.266              300.598
Adjusted cosine item-based, 20 neighbors   17.278              299.992
Slope One, deviation damping 5             23.895              419.467
Weighted Slope One, deviation damping 5    23.661              415.717

The conventional item-based algorithm and the adjusted cosine item-based algorithm perform better than the regular Slope One and the weighted Slope One algorithms, with both the 1M and the larger 10M dataset. The mean difference between the item-based and Slope One families, calculated over both the 1M and 10M datasets, shows that the item-based family only took 72% of the Slope One build time. The build time difference between the item-based and adjusted item-based algorithms is slight, as is the difference between Slope One and the weighted Slope One.

The results presented in the following table were all collected using MovieLens's 1M dataset, where recommendations were made for the user with id 4, who has rated 20 different movies, and for user 53, who has rated 683 movies. The runtime is a mean value computed from running each algorithm for 100 iterations.

    Table 4.2.  Performance testing results on dataset 1M

Algorithms                                 Runtime user 4 (s)   Runtime user 53 (s)
Conventional item-based, 20 neighbors      1.36722              1.60939
Adjusted cosine item-based, 20 neighbors   1.34965              1.59801
Slope One, deviation damping 5             1.21249              1.52287
Weighted Slope One, deviation damping 5    1.21609              1.52383

The results from this test show that the algorithms from the Slope One family are noticeably faster for users with few ratings. The tests run for user 53 show less of a difference. The difference between the conventional item-based algorithm and the adjusted cosine item-based algorithm is small: the adjusted cosine variant only differs by a few milliseconds from the conventional one. A similar pattern is seen in the difference between the weighted Slope One and the regular Slope One.

    Table 4.3.  Performance testing results on dataset 10M

Algorithms                                 Runtime user 4 (s)   Runtime user 53 (s)
Conventional item-based, 20 neighbors      13.62649             13.96945
Adjusted cosine item-based, 20 neighbors   13.52653             13.72841
Slope One, deviation damping 5             13.04692             12.70557
Weighted Slope One, deviation damping 5    12.42521             12.56744

The results from this test show that the performance differences seen in Table 4.2 increase as the amount of data grows. This holds both for the differences between the two conventional algorithms and for the differences between Slope One and the weighted Slope One. The difference between the two algorithm families is also more noticeable.


    Chapter 5

    Discussion

The testing methods chosen have not been perfect; problems have been encountered which may affect our test results and thereby our final conclusions. This chapter discusses the results, the testing methods used in this study, the problems that were encountered and the conclusions that could be drawn.

    5.1 Comparison of results with regards to accuracy

The algorithm evaluation results depicted in figure 4.1 are the best algorithms from the conventional item-based family and the Slope One family. Note that the Pearson correlation version of the conventional item-based family is not listed, as it was not as accurate as the ones depicted. Among the four algorithms listed, the weighted Slope One algorithm has the smallest error and the least spread, meaning that it is the most accurate algorithm, together with the adjusted cosine item-based algorithm, for the datasets and evaluation method previously described. The Slope One algorithm proved to be efficient in run time performance while being both simple to implement and accurate in its recommendations.

Table 4.1 shows that the Slope One algorithms are slower to build, independent of dataset size. It is therefore recommended to use the algorithm that fits your needs, as having to rebuild the model often will affect system performance negatively. If simplicity of implementation is desirable, however, the Slope One algorithm is preferred.

The recommendation performance depicted in tables 4.2 and 4.3 shows that the Slope One algorithms are slightly, sometimes even noticeably, faster than the item-based family. This shows that the much simpler algorithm still performs very well, making it a good alternative to the other.

The most accurate deviation damping parameter value for the Slope One algorithm was found to be 5; values past this point appeared to lose accuracy and were therefore not tested thoroughly. It would be of interest to test how this parameter affects the algorithm's behavior on various other datasets, which could be considered an improvement if the tests were to be remade.

Regarding the conventional item-based algorithm, one parameter which presumably would have affected our results is the minimum neighborhood size, that is, the smallest number of neighbors to consider in the trials. The neighborhood size parameter used only sets an upper limit on the number of neighbors actually considered for each prediction. Testing more specifically defined size intervals would have given interesting results, which would most likely vary heavily depending on the dataset used.

    5.2 Criticism of testing methods

Although the results provided in this study have been thoroughly examined and stem from multiple tests and attempts, they may not be perfect. Problems have been met along the way that have limited the testing capabilities of this study, which may have affected the final results. This section introduces those problems and discusses how they may have interfered with the final results presented.

5.2.1 Lenskit

Lenskit is credible and well documented software that has been cited in a number of published papers. The tests made in this thesis could also have been implemented in other software to avoid biased results.

    5.2.2 Cross-folding

We chose to use the default value of 5 partitions for cross-validation for all tests, independent of data size, because [5] did the same for the same sets of data. Other ratios of training to validation data, as well as different evaluation techniques, might have been interesting to test.

    5.2.3 RMSE

The error metric used in this study was chosen to be root mean square error (RMSE) at an early stage, as it was found that only the size of the error differed between RMSE and MAE (mean absolute error), while their distributions along the y-axis were mostly consistent. Time constraints for this study were also an important factor in this decision. RMSE is also a popular metric in many studies, as well as being the one used in the Netflix Prize competition.


The metric has been criticized by researchers, however, and a combination of RMSE and MAE may have been preferable.

    5.2.4 Layout of testing

The tests could very well be refined further to achieve higher confidence in the results. One improvement would be to run each test more times than was done in this study, as well as running tests on different datasets and on data of larger volume, a problem discussed in the next section.

However, as each of the tests was run under the same conditions with the same hardware and software, it can be argued to be sufficient to draw a conclusion for the study, with respect to the given problem.

    5.3 Challenges with limited hardware

Testing with the hardware presented previously brought along one problem critical to showcasing a confident result: the lack of RAM. It was shown that 8GB of RAM was not enough to run our tests on the bigger datasets, as the testing software requires a large amount of memory to be allocated on the heap. This limited our evaluation testing to the datasets consisting of 100K and 1M ratings, and the recommendation performance testing to datasets of up to 10M ratings, whereas two additional datasets of 10M and 20M ratings were available from the MovieLens project.

The lack of memory resulted in thrashing, as well as some tests not running at all. As a result, some tests had to be run multiple times in order to acquire usable data. There is therefore a risk of errors in the data; however, as all tests were run under the same circumstances and only the good result data have been used, this should not be a major issue.

    5.4 Final conclusion

This study shows that Slope One is a good alternative to the more complex conventional item-based approach. Slope One delivers accurate recommendations and performs well despite its much greater simplicity relative to the other. This has been shown on small datasets as well as on bigger datasets with 10M ratings. The conventional item-based algorithm has a noticeably faster build time, something that may increase its attractiveness compared to Slope One. However, all tests were carried out on datasets from the same source; the results may have differed if tests had been run on other datasets. The lack of hardware support in our testing setup prevented us from including datasets of greater size. This report therefore fails to draw a fully confident conclusion regarding scalability and performance in big data environments, and further testing on greater datasets with appropriate hardware is recommended.


    Bibliography

[1] Sarwar, Badrul et al. "Item-based collaborative filtering recommendation algorithms". In: Proceedings of the 10th International Conference on World Wide Web (WWW '01). 2001, pp. 285–295. doi: http://dx.doi.org/10.1145/371920.372071.

[2] Ekstrand, Michael D., Riedl, John T., and Konstan, Joseph A. "Collaborative Filtering Recommender Systems". In: Foundations and Trends in Human–Computer Interaction 4.2 (2010), pp. 81–173. url: http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf.

[3] Ricci, Francesco et al. Recommender Systems Handbook. Springer, 2011. isbn: 978-0-387-85820-3.

[4] Lemire, Daniel and Maclachlan, Anna. "Slope One Predictors for Online Rating-Based Collaborative Filtering". In: Proceedings of the 2005 SIAM International Conference on Data Mining (2005). url: http://lemire.me/fr/documents/publications/lemiremaclachlan_sdm05.pdf.

[5] Ekstrand, Michael et al. "Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit". In: Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA (2011), pp. 130–140. doi: 10.1145/2043932.2043958.

[6] MovieLens Datasets. url: http://grouplens.org/datasets/movielens/.

[7] Schneider, Jeff. Cross Validation. 2007. url: https://www.cs.cmu.edu/~schneide/tut5/node42.html.

[8] Su, Xiaoyuan and Khoshgoftaar, Taghi M. "A Survey of Collaborative Filtering Techniques". In: Advances in Artificial Intelligence 2009, 421425 (2009), pp. 1–19. url: http://www.hindawi.com/journals/aai/2009/421425/.

[9] Chai, T. and Draxler, R. "Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature". In: Geoscientific Model Development 7 (2014). url: http://www.geosci-model-dev.net/7/1247/2014/gmd-7-1247-2014.pdf.


    Appendix A

    Evaluation result plots

    Figure A.1.  Results from the chosen four algorithms


    Figure A.2.   Results from evaluating item-based


    Figure A.3.   Results from evaluating adjusted item-based


Figure A.4. Results from evaluating Pearson item-based


Figure A.5. Results from evaluating adjusted Pearson item-based


Figure A.6. Results from evaluating Slope One


Figure A.7. Results from evaluating weighted Slope One


    Figure A.8.  All results in one plot, for overview purposes
