hybrid collab fr p2p

Future Generation Computer Systems 26 (2010) 1409–1417
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/fgcs

A hybrid collaborative filtering recommendation mechanism for P2P networks

Zhaobin Liu a, Wenyu Qu a,*, Haitao Li b, Changsheng Xie c

a School of Information Science and Technology, Dalian Maritime University, Dalian, 116026, PR China
b Institute for Photogrammetry and Remote Sensing, Chinese Academy of Surveying and Mapping, Beijing, 100039, PR China
c Wuhan National Lab for Optoelectronics (WNLO), Huazhong University of Science and Technology, Wuhan, 430074, PR China

* Corresponding author. E-mail addresses: [email protected] (Z. Liu), [email protected], [email protected] (W. Qu), [email protected] (H. Li), [email protected] (C. Xie).

Article history: Received 30 November 2009. Received in revised form 6 April 2010. Accepted 16 April 2010. Available online 2 May 2010.

Keywords: Collaborative filtering; Recommendation; Sparse matrix; Eigenvalue matrix; Peer-to-peer (P2P) networks

Abstract

With the increasing number of commerce facilities using peer-to-peer (P2P) networks, challenges exist in recommending interesting or useful products and services to a particular customer. Collaborative Filtering (CF) is one of the most successful techniques that attempts to recommend items (such as music, movies, web sites) which are likely to be of interest to the people. However, conventional collaborative filtering encounters a number of challenges on its recommendation accuracy. One of the most important challenges may be due to the sparse attributes inherent to the rating data. Another important challenge is that existing CF methods consider mainly user-based or item-based ratings respectively. In this paper a P2P-based hybrid collaborative filtering mechanism for the support of combining user-based and item attribute-based ratings is considered. We take advantage of the inherent item attributes to construct a Boolean matrix to predict the blank elements for a sparse user–item matrix. Furthermore, a Hybrid collaborative filtering (HCF) algorithm is presented to improve the predictive accuracy. Case studies and experiment results illustrate that our approaches not only contribute to predicting the unrated blank data for a sparse matrix but also improve the prediction accuracy as expected.

© 2010 Elsevier B.V. All rights reserved.
0167-739X/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2010.04.002

1. Introduction

In recent years, peer-to-peer (P2P) file-sharing networks have become a popular new way to exchange resources, information and services across a large number of autonomous peers [1,2]. Examples of P2P file sharing systems are: Gnutella, BitTorrent and P2P Music streaming systems like iTunes [3]. These systems enable users to form communities for sharing different types of files. However, due to the explosive growth of the volume of information, such as in the web, users should be able to make choices without knowing all of the alternatives [4,5]. Moreover, both the users and the data are distributed and dynamically changing, which makes it difficult to filter (and search) and localize the available content within the P2P network [6].

These significant promotions and the associated requirement challenges have motivated the development of recommendation systems. Collaborative filtering (CF) [7] is such a personalized recommendation technique that has been very promising both in research and industry. CF leverages the usage history of groups of similar users in order to make recommendations to a target user [8,9]. Nowadays, CF technology has gradually been implemented in various applications (e.g. Netflix, TiVo, Google news and Amazon [10,11]).

However, regardless of its success in many application settings, conventional collaborative filtering encounters a number of limitations which influence its recommendation accuracy. One of the most important limitations may be the data sparsity problem. The underlying assumption of CF is that the active user will prefer to choose those items which similar users prefer [12]. In many real collaborative filtering systems there are many potentially recommendable items and many user profiles, but a typical user may have rated only a tiny percentage of these items [13]. In other words, the user–item matrix is a sparse matrix populated primarily with blanks. Another important limitation is that existing CF methods consider mainly user-based or item-based ratings respectively, and this is a major issue that limits the quality of recommendations and the applicability of collaborative filtering in general.

To alleviate the sparsity issue, in this paper our first contribution focuses on the sparse data problem in CF. We propose to bring in the user-based and item attribute-based ratings. Using the item attributes Boolean matrix, a new item similarity computing mechanism is presented to predict the blank elements in the sparse user–item matrix. For this reason, we take advantage of the inherent item attributes to construct a Boolean matrix. By comparing the Euclidean distance between two items, we



    seek a novel blank element prediction approach to compute the similarity of items. Case studies demonstrate that our methodology contributes to effectively predicting the blank elements for a sparse matrix. Test results also show that the filling-in accuracy is acceptable and reasonable.

    Our second contribution is a Hybrid collaborative filtering (HCF) mechanism that is presented to improve the predictive accuracy. In this paper, we pre-classify the user clusters based on the P2P user attributes, that is, we select the most similar users into the same cluster. Then we apply the k-means clustering algorithm, searching within the similar users instead of the whole database. Our experiment results illustrate that Hybrid collaborative filtering achieves reasonable quality performance.

    The remainder of the manuscript is organized as follows. The problem statement and related works are reviewed in Section 2. Section 3 investigates the theoretical analysis for unrated rating prediction and proposes a new filling-in methodology for a sparse user–item matrix. The case study and results analysis are also described. In Section 4, a Hybrid collaborative filtering mechanism is presented to improve the predictive accuracy. Our experimental results are analyzed in Section 5. The conclusion and future works are given in Section 6.

    2. Problem statement and related works

    Many schemes have been proposed for efficient peer-to-peer recommendation. In this section, we review the major research works on the limitations of existing sparsity and clustering algorithms in P2P collaborative filtering technology.

    Before problem formulation, we first introduce some notations and abbreviations used in this paper. Given a set of users $U = \{user_1, user_2, \ldots, user_m\}$ and a set of items $T = \{item_1, item_2, \ldots, item_n\}$, the user–item rating matrix is represented as a $U \times T$ (that is, $m \times n$) matrix $R = (R_{i,j})$. Each value in this matrix is either a real number within a range (from 1 to 5) or a blank, the symbol for an unrated rating, and $R_{i,j}$ denotes the rating of user $i$ for item $j$.

    Due to the promotion of recommendation systems, many researchers and experts are currently focusing on the CF problem and have made great progress. The number of users and items in major P2P network e-commerce systems is very large [14]. Even very popular items may have been rated by only a few of the users available in the database [15]. Moreover, new items may result in a cold-start problem: they cannot be recommended unless they have been rated by a substantial number of users. Many approaches have been proposed to alleviate the sparsity problem [16–18]. The basic CF method fills the unrated elements of a sparse user–item matrix with a simple average rating, which has been shown to perform poorly on prediction accuracy. Dimensionality reduction approaches [19–21] address the sparsity problem by removing unrepresentative or insignificant users or items from the user–item matrix. Unfortunately, potentially useful information might be lost during this reduction process. One approach combines CF with a content-based method to solve the sparsity problem [22,23]. Most studies using this approach have demonstrated improvement in quality. However, this approach requires additional information about the items. Another solution is to use user-based or item-based collaborative filtering separately to predict the blank unrated data. Nevertheless, filling unrated data only through user-based or only through item-based approaches will potentially ignore valuable information that would make the prediction more accurate.

    In 1967, MacQueen presented the k-means algorithm, which assigns each node to the cluster whose center (also called centroid) is nearest. The algorithm steps are [24]: (1) Choose the number of clusters, k. (2) Randomly generate k clusters and determine the cluster centers, or directly generate k random nodes as cluster centers. (3) Assign each node to the nearest cluster center. (4) Recompute the new cluster centers. (5) Repeat the two previous steps until some convergence criterion is met (usually that the assignment has not changed). Because the resulting clusters depend on the initial random assignments, one disadvantage is that the algorithm may not yield the same result with each run [25]. Another disadvantage is that searching for neighbors among the whole P2P network may decrease the performance quality. The traditional k-means algorithm for collaborative filtering is mainly applied to item-based or user-based data separately. For this reason, in [26] the authors propose a hybrid predictive algorithm with smoothing to consider both user aspects and item aspects. In [27] eSciGrid was presented to take into account the physical distance between peers and the amount of traffic carried by each node. Wang et al. present a unified probabilistic model for collaborative filtering using Parzen-window density estimation for acquiring the probabilities of the proposed unified relevance model [28,29]. However, these approaches may not suit P2P network application scenarios well, or they ignore real-time performance quality while finding closer neighbors.
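The five k-means steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `kmeans`, the optional `init_centers` argument, the `max_iter` cap and the sample points are all assumptions made for illustration, and nodes are taken to be plain numeric tuples compared by Euclidean distance.

```python
import math
import random


def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Plain k-means over a list of equal-length numeric tuples."""
    # Steps 1-2: k is given; start from k random nodes unless centers are supplied.
    if init_centers:
        centers = list(init_centers)
    else:
        centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each node to the nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 5: stop when the assignment has not changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical data: two well-separated groups of three nodes each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centers = kmeans(points, k=2, init_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Passing `init_centers` makes the run reproducible. Leaving it out falls back to a random choice of k nodes, which is where the run-to-run variation mentioned above comes from.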

    3. Sparsity limitation solution

    In this section we present a method for alleviating the sparsity challenge in collaborative filtering based on the item attributes Boolean matrix.

    3.1. Collaborative filtering based on item attributes

    The first approach is to compose the eigenvalue matrix of items. Its methodology is as follows. Each item can be divided into several dimensions, and each dimension is an attribute of the item. At the same time, each attribute has its initial eigenvalue. To simplify the problem, we use Boolean variables (uniformly 0 or 1) to construct the eigenvalue matrix. We assume that a 1 indicates the item has that attribute and a 0 indicates it does not. In order to determine which items are similar, we need to define a similarity function. We take advantage of the Euclidean distance to calculate the degree of similarity among items,

    $$ d(item_i, item_j) = \sqrt{\sum_{m=1}^{n} \left(p^{m}_{item_i} - p^{m}_{item_j}\right)^{2}} \qquad (1) $$

    where $d(item_i, item_j)$ is the Euclidean distance between $item_i$ and $item_j$, $p^{m}_{item_i}$ is the $m$th index value of $item_i$, and $p^{m}_{item_j}$ is the $m$th index value of $item_j$.

    Then the similarity between $item_i$ and $item_j$ can be defined as

    $$ sims(item_i, item_j) = \frac{1}{1 + d(item_i, item_j)}. \qquad (2) $$

    For all blank (unrated) entries, $R(user, item)$ can be generated according to

    $$ R_i = \frac{\sum_{j=1}^{n} R_j \times sims(item_i, item_j)}{\sum_{j=1}^{n} sims(item_i, item_j)} \qquad (3) $$

    where $R_i$ is the predicted rating of the user for $item_i$, $R_j$ is the rating of the user for $item_j$, and $sims(item_i, item_j)$ is the similarity between $item_i$ and $item_j$. In the case study of Section 3.2, the sums are taken over the items that the user has already rated.

    The above equations give the formulas for computing the similarity between two items and for predicting a rating from it. In other words, we can predict all the unrated entries to fill in the sparse user–item matrix.

    3.2. Case study

    We use a typical movie recommender system as a specific example of collaborative filtering. Generally, we assume there are m users and n items (movies), and then we can get the user–item matrix (as shown in Table 1). Each movie can be regarded as an item. Each item often has detailed information on primitive concept levels. If a rating exists, the element of the matrix is the user's rating of the item; otherwise the element is blank, which means that there has been no such rating. In this paper, each movie item contains four preferences: genre, language, release year and country. Each preference has several primitive values. For instance, genre contains {action, adventure, mystery, drama, documentary, romance, comedy, etc.}. Language contains {Chinese, Cantonese, English, Japanese, Korean, etc.}. Release year contains {less than two years, less than five years, more than five years}. Country contains {Chinese Mainland, HK & Taiwan, Occident, Japan, Korea, etc.}.

    Table 1
    The user–item matrix (m × n) before data prediction (– means unrated rating).

             item1  item2  item3  itemn
    user1    5      –      4      5
    user2    3      4      –      –
    user3    –      3      –      4
    userm    –      4      3      –

    Each user of the system expresses his opinions about movies which he loves or dislikes by rating them. The opinion of a customer can be divided into five ratings of the preference (from 1 to 5). All of these ratings are captured in the user–item matrix. In this matrix, users are in the rows and items (movies) are in the columns. Each cell contains the user's rating of that item, as shown in Table 1. For example, 1 star expresses that the user feels awful and 5 stars expresses that the user feels excellent. From Table 1, we can see that user1 rates the movie item1 5 stars, and user2 rates the same movie 3 stars.

    Obviously, most of the elements in the user–item matrix are unrated blank data. To improve the accuracy of filling-in, we first construct the Boolean matrix in terms of item attributes. In this specific instance, item1 is Crouching Tiger, Hidden Dragon, item2 is Tomb Raider, item3 is The Lord of the Rings and item4 is Mr and Mrs Smith. As a result, we can get the eigenvalue matrix of each item (as shown in Table 2(a–d)).

    To compute the predicted rating of user1 for item2, according to Eq. (1), we first calculate the Euclidean distance between item2 and item1, item3, item4 respectively.

    $$ d(item_2, item_1) = \sqrt{7} = 2.646, \qquad d(item_2, item_3) = 1, \qquad d(item_2, item_4) = 2. $$

    Hence, from Fig. 1 we can see that the closest distance is that from item2 to item3.

    In order to compute the similarity between item2 and item1, item3, item4 respectively, according to Eq. (2) we get:

    $$ sims(item_2, item_1) = 0.27, \qquad sims(item_2, item_3) = 0.5, \qquad sims(item_2, item_4) = 0.33. $$

    As a result, in terms of Eq. (3), the predicted rating of user1 for item2 is:

    $$ \frac{5 \times 0.27 + 4 \times 0.5 + 5 \times 0.33}{0.27 + 0.5 + 0.33} = 4.54 \approx 5 \quad \text{(after rounding to the nearest integer)}. $$

    Similarly, we can obtain the rest of the unrated data in the sparse user–item matrix. To assess the prediction accuracy of our methodology, we also compute predictions for the rated data. After filling in all the elements, the new user–item matrix (m × n) is shown in Table 3.
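The case-study arithmetic can be reproduced with a short Python sketch of Eqs. (1)–(3). The function names (`distance`, `similarity`, `predict`) and the flattening of each item's four attribute rows into one 28-entry vector are illustrative assumptions. The Boolean values are copied from Table 2, and the ratings of user1 are those used in the calculation above.

```python
import math

# Boolean attribute vectors from Table 2, with the Genre, Language, Year and
# Country rows (7 entries each) concatenated into a single vector per item.
ITEMS = {
    "item1": [1, 0, 0, 0, 0, 1, 0,  1, 0, 1, 0, 0, 0, 0,
              0, 1, 0, 0, 0, 0, 0,  1, 0, 0, 0, 0, 0, 0],
    "item2": [1, 0, 1, 0, 0, 0, 0,  0, 0, 1, 0, 0, 0, 0,
              0, 0, 1, 0, 0, 0, 0,  0, 0, 1, 0, 0, 0, 0],
    "item3": [1, 0, 0, 0, 0, 0, 0,  0, 0, 1, 0, 0, 0, 0,
              0, 0, 1, 0, 0, 0, 0,  0, 0, 1, 0, 0, 0, 0],
    "item4": [1, 0, 0, 0, 0, 0, 1,  0, 0, 1, 0, 0, 0, 0,
              1, 0, 0, 0, 0, 0, 0,  0, 0, 1, 0, 0, 0, 0],
}


def distance(a, b):
    """Eq. (1): Euclidean distance between two Boolean attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Eq. (2): similarity derived from the distance."""
    return 1.0 / (1.0 + distance(a, b))


def predict(target, known_ratings, items=ITEMS):
    """Eq. (3): similarity-weighted average over the items the user has rated."""
    weights = {name: similarity(items[target], items[name]) for name in known_ratings}
    weighted_sum = sum(known_ratings[name] * w for name, w in weights.items())
    return weighted_sum / sum(weights.values())


user1 = {"item1": 5, "item3": 4, "item4": 5}  # item2 is the blank entry
print(round(distance(ITEMS["item2"], ITEMS["item1"]), 3))  # 2.646
print(round(predict("item2", user1), 2))  # 4.55
```

With unrounded similarities the prediction is about 4.55. The 4.54 in the text comes from first rounding the similarities to two decimals (0.27, 0.5, 0.33). Both values round to 5.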

    Table 2
    Eigenvalue matrix of items.

    (a) Eigenvalue matrix of item1
    Genre     1 0 0 0 0 1 0
    Language  1 0 1 0 0 0 0
    Year      0 1 0 0 0 0 0
    Country   1 0 0 0 0 0 0

    (b) Eigenvalue matrix of item2
    Genre     1 0 1 0 0 0 0
    Language  0 0 1 0 0 0 0
    Year      0 0 1 0 0 0 0
    Country   0 0 1 0 0 0 0

    (c) Eigenvalue matrix of item3
    Genre     1 0 0 0 0 0 0
    Language  0 0 1 0 0 0 0
    Year      0 0 1 0 0 0 0
    Country   0 0 1 0 0 0 0

    (d) Eigenvalue matrix of item4
    Genre     1 0 0 0 0 0 1
    Language  0 0 1 0 0 0 0
    Year      1 0 0 0 0 0 0
    Country   0 0 1 0 0 0 0

    Fig. 1. The Euclidean distance between item2 and the others.

    Table 3
    The user–item matrix (m × n) after data prediction.

             item1  item2  item3  itemn
    user1    4.48   4.54   5      4.22
    user2    4      3.64   3.64   3.58
    user3    3.5    4      3.41   3.45
    userm    3.64   3.35   4      3.47

    As for the rated elements in the user–item matrix, to assess the prediction accuracy of our method, Fig. 2 compares the predicted data with the initial data. The result after rounding to the nearest integer is shown in Fig. 3. All the predicted values are close to the initial rating data, and the rounded result also matches our expectations.

    Fig. 2. Comparison of initial rating and after prediction (before rounding down). [Plot: y-axis R(user,item), 0 to 6; x-axis R(u1,i1), R(u1,i3), R(u1,i4), R(u2,i1), R(u2,i2), R(u3,i2), R(u3,i4), R(u4,i2), R(u4,i3); series: initial, prediction.]

    Fig. 3. Comparison of initial rating and after prediction (after rounding down). [Plot: same axes and series as Fig. 2.]

    4. A hybrid collaborative filtering mechanism

    There are a number of papers on the technical aspects of P2P collaborative filtering clustering algorithms. However, in many ways, the large number of users and items in a P2P network could add time to the search for the closer neighbors. It could also increase the space complexity of the clustering algorithm. On the other hand, on a large-scale P2P network, the number of active users may affect network congestion. Therefore, determining the right network scale is very important for P2P collaborative filtering clustering algorithm operations.

    Considering P2P scalability and clustering efficiency, in this paper the P2P users are classified into different groups (clusters) with respect to the user personality features. In other words, users within a cluster are similar to each other and dissimilar to the users belonging to other clusters. Our hybrid collaborative filtering algorithm based on k-means will search for neighbors within the similar user cluster instead of searching the whole user space. As a result, it can not only reduce the algorithm complexity but also improve the prediction accuracy, because it considers the preferences of users.

    4.1. A quantitative approach for P2P user attributes

    Before proposing our new hybrid collaborative filtering algorithm, the personality features of P2P users should be expressed quantitatively. Generally, when a new customer registers with a P2P network, we can get a user profile, such as age, gender, career, character and preference, which is usually stored in a database. Obviously, age is a numerical value and gender is a binary value (for example, 0 means male, 1 means female). Educational background can be divided into elementary school, middle school, bachelor, master and Ph.D., which can be described from 1 to 5 respectively. To quantify profession and character, we can adopt a hierarchical tree.

    Following is the quantifying stage of profession and character for P2P users. In his theory of career choice, psychologist John L. Holland created Holland Codes to measure an individual's type and match it with a list of career choices [30]. In this section, we use Holland Codes to classify careers into 6 types: Realistic, Investigative, Artistic, Social, Enterprising and Conventional. Based on the Holland Codes classification, we propose the profession hierarchical tree for P2P networks.

    As shown in Fig. 4, every profession may also contain several sub-categories. We set serial number tags for each layer of the profession tree from the top down. The profession type is 0, the realistic type is 1, the investigative type is 2, the artistic type is 3, the social type is 4, the enterprising type is 5 and the conventional type is 6. Additionally, for the next level, the technical operation is 011, the operator is 0111, the manual operation is 012, locksmith is 0121, carpenter is 0122 and so forth. Each layer can be deduced in the same manner, with the number tags 1, 2, 3, etc. set from the top down. As a result we can quantify the professional information registered on the website by P2P users.

    Fig. 4. A quantitative approach for P2P users' profession.

    Just like the profession hierarchical tree, we can also get the character hierarchical tree for P2P users. Fig. 5 illustrates the character categories. We can see that it has been partitioned into humorous, enthusiastic, serious type, slow-witted, hot inside cold outside, cold inside hot outside and so on.

    Fig. 5. A quantitative approach for P2P users' character traits.

    Putting all of the above together, we can quantify the users' information when they register as users of the P2P network service. For instance, User A (Gender: Male, Age: 28, Educational qualification: College, Profession: Pianist, Character trait: Lively type) is quantified as {0, 28, 3, 0311, 0121}. Another user B (Gender: Female, Age: 21, Educational background: College, Profession: Dancer, Character trait: Lively type) is quantified as {1, 21, 3, 0322, 0121}. So we can quantify all the users' registered information according to gender, age, educational qualification, profession tree and character tree. Therefore, we can use our hybrid collaborative filtering k-means cluster algorithm to classify the users, sorting similar users into one cluster and searching for the nearest neighbors within this user cluster, which will shorten the clustering time and improve the recommendation precision.
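The quantification of a user profile described in Section 4.1 can be sketched as a small lookup in Python. This is only an illustration under stated assumptions: the function name `quantify` and the dictionary layout are hypothetical, the lookup tables are deliberately partial and contain only codes quoted in the text, and tree codes are kept as strings so that leading zeros such as `0311` survive.

```python
# Partial lookup tables built only from codes quoted in Section 4.1.
GENDER = {"male": 0, "female": 1}
EDUCATION = {
    "elementary school": 1,
    "middle school": 2,
    "bachelor": 3,
    "college": 3,  # the worked example maps "College" to 3
    "master": 4,
    "ph.d.": 5,
}
# Tree codes are strings so that the leading 0 (the root tag) is preserved.
PROFESSION = {"pianist": "0311", "dancer": "0322",
              "locksmith": "0121", "carpenter": "0122"}
CHARACTER = {"lively type": "0121"}


def quantify(gender, age, education, profession, character):
    """Map a registered profile to (gender, age, education, profession, character)."""
    return (
        GENDER[gender.lower()],
        age,
        EDUCATION[education.lower()],
        PROFESSION[profession.lower()],
        CHARACTER[character.lower()],
    )


user_a = quantify("Male", 28, "College", "Pianist", "Lively type")
user_b = quantify("Female", 21, "College", "Dancer", "Lively type")
print(user_a)  # (0, 28, 3, '0311', '0121')
print(user_b)  # (1, 21, 3, '0322', '0121')
```

How the tree codes enter the Euclidean distance between users is not spelled out in this part of the paper, so the sketch stops at the encoding step.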

    4.2. Algorithm implementation description

    following hybrid collaborative filtering k-means algorithm for aP2P network.Fig. 6 depicts the operation of our new hybrid Collaborative

    Filtering algorithm. The nodes represent the P2P users, the edgesbetween nodes represent the distance, which are calculated byEuclidean distance methodology. Therefore, a connected graphZ. Liu et al. / Future Generation CompUsing the quantified P2P users features described above,considering the personalities of P2P customers, we have theuter Systems 26 (2010) 14091417 1413N = (V , {E}) is a representation of a set of nodes and edgeswhere they are joined up, where V is the set of all users, and

  • rv1 and v3) with the minimum distance are connected by an edge.The rest can be done in the same manner until the state Fig. 7(f).We can see all the nodes compose two loops. They are separatedin two clusters as shown in Fig. 7(g) and (h) respectively. In otherwords, elements in the same cluster are similar in some sense. Theaverage distance of each clusterwill be treated as the new centroid,then the classical k-means algorithm is executed.Through the similar user clustering method, similar customers

    with similar attributes or behaviors will be gathered into the samecluster. This will bemore effective and precise through performingthe clustering algorithm directly only with each cluster vectorto determine the relative nearest neighbors. It also conformsthe real-time requirements in the recommendation system.Considering the new users problem appearing in collaborativefiltering algorithm, theoretically speaking, users with similarinformation dont have large differences about their interests.Therefore, we can recommend the average-scored item from thesame cluster to the newuser. Incidentally, this resolves cold startproblems effectively.

    5. Experiments evaluation

    5.1. Test dataset

    To test the efficiency of our methodology, in this section, weexperimented with classical movie-rating datasets: the Movie-

    tasy, Horror, Romance etc.). User.dat has 943 users features, suchas user ID, gender, age and profession. Rating.dat contains ratingsby 943 users for 1682 movies (items). Each user had rated at least20 movies. The rating scale takes values from 1 (lowest rating) to 5(highest rating). As a result, the sparsity of the MovieLens datasetis 1 1000001682943 = 0.936993 = 93.7%.

    5.2. Item attributes impact

    In order to evaluate how close forecasts or predictions withexperiments are, we report our results using the mean absoluteerror (MAE) evaluation metric. Just as its name implies, the meanabsolute error is an average of the absolute errors pi qi, where piis the prediction set and qi is the true value set. For all test datasets,we have,

    MAE =

    Ni=1|pi qi|N

    (4)

    where N denotes the number of tested ratings. MAE expresses the average absolute deviation of the predictions from the actual data. Note that a smaller value indicates better performance.

    Fig. 6. A hybrid collaborative filtering algorithm.

    E denotes the set of all edges between the nodes. The initial state is an unconnected graph T = (V, {}) with n nodes and no edges, so every node forms a connected component by itself. If the nodes joined by the minimum-cost edge belong to different connected components of T, the edge is put into T; otherwise the edge is discarded and the next minimum-cost edge is selected. The procedure continues in the same way. Whenever some connected nodes form a loop, all nodes associated with this loop are added to the same cluster M and, at the same time, removed from the set T. The above procedure is repeated until all nodes are allocated to k clusters. After recalculating the centroid of each cluster as the new centroid, we apply the conventional k-means to finish the operation. The algorithm stops when all the distances become less than the initialized threshold.

    A motivating example is illustrated in Fig. 7. Suppose there are 6 users in a P2P network. As shown in Fig. 7(a), two nodes (user

    MovieLens [31] dataset. The MovieLens dataset was collected by the GroupLens group through the MovieLens Web site during the period between September 1997 and April 1998. The basic characteristics of MovieLens datasets with different sizes are summarized in Table 4. The dataset contains three sets: Movies.dat, Rating.dat and User.dat. Movies.dat contains 1682 movies (items), including the detailed information: movie code, name, type (for example Action, Adventure, Animation, Comedy, Crime, Documentary, Drama, Fan-

    Table 4
    The basic characteristics of the test dataset.

                        MovieLens
    Number of users     943
    Number of items     1682
    Sparsity            93.7%
    Rating scales       1–5
    Training set        80%
    Test set            20%

    Fig. 7. Demonstration of the solution procedure (panels a–h).

    The influence on recommendation prediction in CF has been mainly attributed to two factors: one is the sparsity level of the dataset, and the other is the number of neighbors. Based on these two parameters, we conduct two experiments to evaluate our proposed algorithm.

    To evaluate the sensitivity of the traditional CF and the item attributes based CF algorithms under diverse sparse rating levels, we implemented our first experiment, in which the sparsity level takes the values 0.90, 0.84, 0.80, 0.75 and 0.72 respectively.

    Looking at the results in Fig. 8, we can see that the MAE of both algorithms declines as the density of the rating matrix increases. The item attributes based CF algorithm also has lower MAE values than the traditional CF algorithm over the whole range of sparsity levels. Moreover, as sparsity increases, the gap between them becomes larger. That is, the sparser the rating matrix is, the greater the MAE advantage of the item attributes based CF algorithm. The reason is that using the item attributes to predict the unrated entries of a sparse matrix enriches the user–item matrix and results in more accurate prediction.

    Our second experiment in this section was designed to evaluate the effect of the number of neighbors on MAE performance. For all experiments, the sparsity level was set to 0.8. Fig. 9 compares the results of the traditional CF algorithm and the item attributes based CF algorithm when the number of neighbors ranges from 5 to 40 in steps of 5.
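The two quantities that drive these experiments, the MAE of Eq. (4) and the sparsity level of the rating matrix, can be sketched in a few lines. This is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, the toy ratings are made up, and the figure of 100,000 ratings for this MovieLens release is an assumption that does not appear in the text (the 943 users and 1682 items are taken from Table 4).

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Eq. (4): average of |p_i - q_i| over the N tested ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one tested rating is required")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)


def sparsity_level(num_ratings: int, num_users: int, num_items: int) -> float:
    """Fraction of blank cells in a num_users x num_items rating matrix."""
    return 1.0 - num_ratings / (num_users * num_items)


if __name__ == "__main__":
    # Toy predictions against true ratings on a 1-5 scale (made-up values).
    print(mean_absolute_error([3.0, 4.0, 5.0], [3.0, 5.0, 3.0]))  # 1.0

    # 943 users and 1682 items come from Table 4; 100,000 ratings is an assumption.
    print(round(sparsity_level(100_000, 943, 1682), 3))  # 0.937
```

Under that assumption the computed sparsity agrees with the 93.7% reported in Table 4.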

    Table 5
    The number of neighbors found by conventional CF algorithm.

    Num. of clusters    User ID                          Avg. (%)
                        16    121   317   608   912
    2                   11    11    11    10    11       90
    3                   11    10    10    9     10       83.3
    4                   10    9     9     8     9        75
    5                   10    9     8     8     8        71.7

    Table 6
    The number of neighbors found by HCF algorithm.

    Num. of clusters    User ID                          Avg. (%)
                        16    121   317   608   912
    2                   12    12    11    12    11       96.7
    3                   12    11    10    11    11       91.7
    4                   11    10    9     9     10       81.7
    5                   10    9     8     8     8        71.7
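The "Avg. (%)" column of Tables 5 and 6 can be reproduced from the per-user counts. The sketch below assumes that each percentage is the mean count over the five users divided by the closest-neighbor threshold of 12 given in Section 5.3. The paper does not state this formula explicitly, so treat it as an inference. The function and variable names are hypothetical.

```python
NEIGHBOR_THRESHOLD = 12  # closest-neighbor threshold stated in Section 5.3


def average_found_percent(counts, threshold=NEIGHBOR_THRESHOLD):
    """Mean share (in %) of the `threshold` closest neighbors that were found."""
    return 100.0 * sum(counts) / (len(counts) * threshold)


# Rows copied from Tables 5 and 6:
# number of clusters -> counts for users 16, 121, 317, 608 and 912.
conventional_cf = {
    2: [11, 11, 11, 10, 11],
    3: [11, 10, 10, 9, 10],
    4: [10, 9, 9, 8, 9],
    5: [10, 9, 8, 8, 8],
}
hybrid_cf = {
    2: [12, 12, 11, 12, 11],
    3: [12, 11, 10, 11, 11],
    4: [11, 10, 9, 9, 10],
    5: [10, 9, 8, 8, 8],
}

if __name__ == "__main__":
    for name, table in (("conventional CF", conventional_cf), ("HCF", hybrid_cf)):
        for num_clusters, counts in table.items():
            print(name, num_clusters, round(average_found_percent(counts), 1))
```

With these inputs the rounded values match the tabulated column: 90.0, 83.3, 75.0 and 71.7 for the conventional CF, and 96.7, 91.7, 81.7 and 71.7 for HCF.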

    For both algorithms, the MAE changes approximately linearly with the number of neighbors. Fig. 9 also shows that the item attributes based CF has a smaller MAE than the traditional CF in all cases. This means that our algorithm has better accuracy under the same sparsity level. Furthermore, when the number of neighbors is small, the gap between the two algorithms is larger than in the other situations. As the number of neighbors increases, the difference changes smoothly. Thus, the item attributes based CF algorithm performs better than the traditional CF.
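To make the role of the neighbor count concrete, the following sketch shows a traditional user-based CF prediction with a configurable number of neighbors. It assumes Pearson correlation over co-rated items and a mean-centered weighted average in the style of GroupLens [7]. The exact similarity and prediction formulas used in the experiments are not restated in this section, so the code is an illustration rather than the authors' implementation. All names and the toy ratings are hypothetical.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(ratings, active_user, item, num_neighbors):
    """Predict active_user's rating of item from the most similar users who rated it."""
    active = ratings[active_user]
    active_mean = sum(active.values()) / len(active)
    # Only users who actually rated the target item can contribute.
    candidates = [
        (pearson(active, other), user)
        for user, other in ratings.items()
        if user != active_user and item in other
    ]
    neighbors = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_neighbors]
    norm = sum(abs(sim) for sim, _ in neighbors)
    if norm == 0:
        return active_mean  # no informative neighbor: fall back to the user's mean
    weighted = 0.0
    for sim, user in neighbors:
        other = ratings[user]
        other_mean = sum(other.values()) / len(other)
        weighted += sim * (other[item] - other_mean)
    return active_mean + weighted / norm


if __name__ == "__main__":
    # Made-up ratings on a 1-5 scale: user -> {item: rating}.
    toy = {
        "u1": {"i1": 5, "i2": 3, "i3": 4},
        "u2": {"i1": 4, "i2": 2, "i3": 3, "i4": 4},
        "u3": {"i1": 1, "i2": 5, "i3": 2, "i4": 1},
    }
    print(predict(toy, "u1", "i4", num_neighbors=1))  # 4.75
    print(predict(toy, "u1", "i4", num_neighbors=2))
```

Sweeping `num_neighbors` from 5 to 40 on a real training set and scoring each run with MAE would give curves of the kind compared in Fig. 9.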

    5.3. Hybrid collaborative filtering clustering algorithm

    To demonstrate the hybrid collaborative filtering clustering algorithm, we split the MovieLens dataset into two sets by randomly selecting 20% of the data as the test set and the remaining 80% as the training set. The training set is used to make predictions, while the test set is used to measure prediction accuracy.

    We consider two test methodologies. First, we use the test dataset to compare the performance of the traditional CF algorithm and our hybrid algorithm, reporting how many neighbors each algorithm finds and how much of the user space it has to search. Second, we compare the MAE evaluation metric of the different algorithms under different numbers of neighbors.

    Without loss of generality, we select five users at random: 16, 121, 317, 608, and 912. We set the threshold on the number of closest neighbors to 12 and the number of clusters to 2, 3, 4 and 5 respectively. In our hybrid algorithm, each active user searches for the closest neighbors only within its own cluster. The result of the conventional CF algorithm is given in Table 5.

    From Table 5 we can see that, when the number of clusters is 2, the traditional CF algorithm can find 90% of the neighbors in 60.12% of the user space. When the number of clusters is 3, it can find 83.3% of the neighbors in 35.21% of the user space. When the number of clusters is 4, it can find 75% of the neighbors in 29.34% of the user space. When the number of clusters is 5, it can find 71.7% of the neighbors in 23.46% of the user space. In summary, it can find 79.98% of the neighbors in 37% of the user space.

    Table 6 shows that with the hybrid collaborative filtering clustering algorithm, when the number of clusters is 2, the hybrid algorithm can find 96.7% of the neighbors in 59.21% of the user space. When the number of clusters is 3, it can find 91.7% of the neighbors in 34% of the user space. When the number of clusters is 4, it can find 81.7% of the neighbors in 27.21% of the user space. When the number of clusters is 5, it can find 71.7% of the neighbors in 23.72% of the user space. In summary, it can find 85.35% of the neighbors in 36% of the user space. Thus it is clear that the hybrid algorithm can find more neighbors in less user space than the traditional CF algorithm. In addition, it can also improve the efficiency and precision of the search for the closest neighbors.

    Fig. 10. MAE comparison of different CF for a different number of neighbors. (Plot of MAE against the number of neighbors, 5 to 40; series: Traditional CF, K-means based CF, Hybrid CF.)

    To show the sensitivity of the recommendation performance to the neighbor parameter, we plot the MAE against the number of neighbors in Fig. 10. When the number of neighbors ranges from 5 to 40 in steps of 5, the algorithms behave similarly. As the number of neighbors increases, the MAE decreases because more information is provided for prediction. Another observation is that the hybrid algorithm has a lower MAE than the traditional CF and item-based CF algorithms. The reason is that the hybrid collaborative filtering algorithm only searches for similar users within the cluster sets and still gives a reasonably good prediction estimate.

    Fig. 8. MAE impact of different CF for different sparsity levels. (Plot of MAE against the sparsity level, 0.9 to 0.72; series: Traditional CF, Item_based CF.)

    Fig. 9. MAE impact of different CF for a different number of neighbors. (Plot of MAE against the number of neighbors, 5 to 40; series: Traditional CF, Item_based CF.)

    6. Conclusions and future works

    Collaborative filtering is employed to build the recommendation system for P2P network services, which allows prediction of interesting information for an active user from the rating data of a set of similar users or items. In this study, for the sparse user–item matrix problem, we proposed a novel mechanism to fill in the unrated entries of a sparse matrix. We considered both the user-based rating and the item attribute-based eigenvalue matrix to compute the item similarity. Moreover, a Hybrid Collaborative Filtering (HCF) framework, which extends the traditional CF algorithm with the attribute-based mechanism, is proposed to improve the predictive accuracy. We describe an effective mechanism for finding similar users with similar purchasing motives. Case studies and experimental results illustrate that our approach is a feasible technique for recommendation in a P2P network. The prediction of our hybrid mechanism mainly depends on the user similarity in P2P networks. In future work, we intend to deal with fraudulent behavior, anonymity and privacy problems under P2P network conditions.

    Acknowledgements

    This work has been partially supported by the National Natural Science Foundation of China (Grant No. 90818002, 60973115 and 60933002), Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20070151020), National Basic Research Program of China (973 Program) under the Grant No. 2006CB701303 and Hi-Tech Research and Development Program of China (863 Program) under the Grant No. 2007AA12Z151 and 2009AA01A402.

    References

    [1] Amund Tveit, Peer-to-peer based recommendations for mobile commerce, in: Proceedings of the 1st International Workshop on Mobile Commerce, 2001, pp. 26–29.

    [2] Fuyong Yuan, Jian Liu, Chunxia Yin, Yulian Zhang, Nan Shen, A novel collaborative filtering mechanism for product recommendation in P2P networks, in: Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, 2007, pp. 254–261.

    [3] S. Eyheramendy, D. Lewis, D. Madigan, On the naive bayes model for text categorization, in: Proc. of Artificial Intelligence and Statistics, 2003.

    [4] Giancarlo Ruffo, Rossano Schifanella, A peer-to-peer recommender system based on spontaneous affinities, ACM Transactions on Internet Technology 9 (1) (2009) Article 4.

    [5] Keqiu Li, Hong Shen, Francis Y.L. Chin, Si-Qing Zheng, Optimal methods for coordinated enroute web caching for tree networks, ACM Transactions Internet Technology 5 (3) (2005) 480–507.

    [6] Jun Wang, Johan Pouwelse, Reginald L. Lagendijk, Marcel J.T. Reinders, Distributed collaborative filtering for peer-to-peer file sharing systems, in: Proceedings of the 2006 ACM Symposium on Applied Computing, 2006, pp. 1026–1030.

    [7] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, 1994.

    [8] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, J. Riedl, Grouplens: applying collaborative filtering to usenet news, Communications of the ACM 40 (3) (1997) 77–87.

    [9] B. Smyth, P. Cotter, Personalized electronic programme guides, Artificial Intelligence Magazine 21 (2) (2001).

    [10] Greg Linden, Brent Smith, Jeremy York, Amazon.com recommendations: item-to-item collaborative, in: IEEE Internet Computing, vol. 7, IEEE Computer Society, 2003, pp. 76–80.

    [11] Keqiu Li, Hong Shen, Francis Y.L. Chin, Weishi Zhang, Multimedia object placement for transparent data replication, IEEE Transactions on Parallel and Distributed System 18 (2) (2007) 212–224.

    [12] Hao Ma, Irwin King, Michael R. Lyu, Effective missing data prediction for collaborative filtering, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 39–46.

    [13] Derry O'Sullivan, David Wilson, Barry Smyth, Preserving recommender accuracy and diversity in sparse datasets, in: FLAIRS Conference 2003, pp. 139–143.

    [14] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Computing (January) (2003).

    [15] Manos Papagelis, Dimitris Plexousakis, Themistoklis Kutsuras, Alleviating the sparsity problem of collaborative filtering using trust inferences, in: iTrust International Conference 2005, in: LNCS, vol. 3477, 2005, pp. 224–239.

    [16] Arnaud De Bruyn, C. Lee Giles, David M. Pennock, Offering collaborative-like recommendations when data is sparse: the case of attraction-weighted information filtering, in: International Conference on Adaptive Hypermedia and Adaptive Web-based Systems, in: Lecture Notes in Computer Science, vol. 3137, 2004, pp. 393–396.

    [17] Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, Steve Lawrence, Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments, in: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001, pp. 437–444.

    [18] Jun Wang, Arjen P. de Vries, Marcel J.T. Reinders, Unified relevance models for rating prediction in collaborative filtering, ACM Transactions on Information Systems 26 (3) (2008) 1–42. Article 16.

    [19] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: a constant time collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133–151.

    [20] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (6) (1990).

    [21] T. Hofmann, Collaborative filtering via Gaussian probabilistic latent semantic analysis, in: Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.

    [22] M. Balabanovic, Y. Shoham, Fab: content-based, collaborative recommendation, Communications of the ACM 40 (1997) 66–72.

    [23] C.-N. Ziegler, G. Lausen, L. Schmidt-Thieme, Taxonomy-driven computation of product recommendations, in: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, 2004.

    [24] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, pp. 281–297.

    [25] http://en.wikipedia.org/wiki/Clusteranalysis.

    [26] Rong Hu, Yansheng Lu, A hybrid user and item-based collaborative filtering with smoothing on sparse data, in: 16th International Conference on Artificial Reality and Telexistence, 2006, pp. 184–189, doi:10.1109/ICAT.2006.12.

    [27] Marc Sánchez-Artigas, Pedro García-López, eSciGrid: A P2P-based e-science Grid for scalable and efficient data sharing, Future Generation Computer Systems 26 (5) (2010) 704–719.

    [28] J. Wang, A.P. De Vries, M.J.T. Reinders, A user–item relevance model for log-based collaborative filtering, in: Proceedings of the European Conference on IR Research, Springer, London, 2006, pp. 37–48.

    [29] J. Wang, A.P. De Vries, M.J.T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, 2006, pp. 501–508.

    [30] http://www.absoluteastronomy.com/topics/Holland_Codes.

    [31] GroupLens, EachMovie dataset, MovieLens, 2003. http://www.grouplens.org/.

    Zhaobin Liu is an Associate Professor in the School of Information Science and Technology, Dalian Maritime University, China. He received his Ph.D. in Computer Science from Huazhong University of Science and Technology in China in 2004. His research areas include Parallel/Distributed/Cloud computing, File and Storage I/O Systems, Peer-to-Peer Computing, Multi-core Systems, Performance Evaluation and Modeling, Computer Networks, Embedded Systems, and Trusted Computing. He has more than 40 publications in international conferences or journals, and has successfully coordinated several research projects funded by various funding agencies across China.

    Wenyu Qu is a Professor at the School of Information and Technology, Dalian Maritime University, China. She received her bachelor's and master's degrees from Dalian University of Technology, China in 1994 and 1997, and her doctorate degree from Japan Advanced Institute of Science and Technology in 2006. She was a lecturer at Dalian University of Technology from 1997 to 2003. Wenyu Qu's research interests include mobile agent-based technology, distributed computing, computer networks, and grid computing. Wenyu Qu has published more than 50 technical papers in international journals and conferences. She is on the committee board for a couple of international conferences.

    Haitao Li received the Ph.D. degree from Huazhong University of Science and Technology, China, majoring in Pattern Recognition and Artificial Intelligence. He is currently an associate researcher at the Key Laboratory of Geo-informatics of the State Bureau of Surveying and Mapping, Chinese Academy of Surveying and Mapping, and a member of the National Standardization Technical Committee for Geomatics (SAC/TC230). His research interests include photogrammetry and remote sensing, pattern recognition, and high performance processing for remote sensing imagery.

    Changsheng Xie received the B.S. and M.S. degrees in computer science from Huazhong University of Science and Technology (HUST), China, in 1982 and 1998, respectively. Presently, he is a professor in the Department of Computer Engineering at Huazhong University of Science and Technology. He is also the director of the Data Storage Systems Laboratory of HUST and the deputy director of the Wuhan National Laboratory for Optoelectronics. His research interests include computer architecture, disk I/O system, networked data storage system, and digital media technology. He is the vice chair of the expert committee of Storage Networking Industry Association (SNIA), China.
