
Page 1: [IEEE 2012 4th International Conference on Computational Intelligence and Communication Networks (CICN) - Mathura, Uttar Pradesh, India (2012.11.3-2012.11.5)] 2012 Fourth International

An Effective Hybrid Algorithm in Recommender Systems based on Fast Genetic k-means and Information Gain.

Mohd Abdul Hameed, M.A. Malik Dept. of C.S.E.

University college of Engg., Osmania University Hyderabad, AP, India.

[email protected], [email protected]

Syeda Fouzia Sayeedunnisa, Husna Imroze Dept. of I.T.

Muffakham Jah C.E.T, Osmania University Hyderabad, AP, India.

[email protected], [email protected]

ABSTRACT Personalization in a recommender system means customizing content for users based on their preferences and interests. For a new user, such systems face the cold-start problem: the system knows nothing about the user and so cannot present recommendations. An existing technique, Information Gain through Clustered Neighbors (IGCN), has proved productive for this problem, but it uses the k-means algorithm to build user clusters. The problem with k-means is that it may get stuck at local optima and depends on its initial values. The Genetic k-means Algorithm (GKA), a hybrid clustering technique, converges to the global optimum faster than traditional Genetic Algorithms (GAs), and its performance was further improved by the Fast Genetic K-means Algorithm (FGKA). Since these GAs have been shown to overcome the disadvantages of k-means, this paper uses a GA, viz. FGKA, for clustering instead of k-means, owing to its better performance. Accordingly, the proposed algorithm is named Information Gain Clustering through Fast Genetic k-means Algorithm (IGCFGKA). We show through our results that IGCFGKA not only overcomes the disadvantages of k-means but also provides high-quality recommendations and an optimal or near-optimal solution. Ours is the first paper to compare IGCFGKA with various information-gain strategies in recommender systems.

Keywords Clustering, Genetic algorithm, K-means algorithm, global optimization, Information Gain, Fast Genetic k-means.

1. INTRODUCTION Personalization of web sites and many other applications has become a trend over the past few years. This trend benefits both vendor and customer: customers easily find products of interest in a huge pool of items, while vendors increase their sales by recommending the right item to the right customer. Strategies for personalization demand a good clustering algorithm to divide users into categories [2]. One such strategy, IGCN [1], was proposed against Popularity, Entropy, Entropy0, and the Harmonic mean of Entropy and Logarithm of Frequency (HELF) as a way of learning a new user's preferences [1]. IGCN uses collaborative filtering to provide recommendations to a particular user: it draws on the opinions of like-minded users, which requires placing users in the right clusters before recommendations are made. The clustering algorithm used in IGCN is k-means, which may stop at a local solution rather than find the optimal one [7]. Genetic Algorithms have recently been found to be a good option for building user clusters, as they successfully reach the global optimum [4, 5, 8, 11]. Genetic algorithms encode the solution set rather than the parameters; the encoded solutions are called chromosomes. Operators such as initialization, selection, crossover, and mutation are applied to these chromosomes for a finite number of iterations, called generations [2-5]. The final iteration satisfies a terminating condition, and the algorithm terminates with an optimal or near-optimal solution as output. The main advantage of genetic algorithms is that, unlike k-means, they provide an optimal solution to the problem [11]. For this very reason we use the Fast Genetic K-means Algorithm (FGKA) [10], an improved GA version of the Genetic K-means Algorithm (GKA) [9], as the new clustering technique for the personalization problem in recommender systems.

We take the same information-theoretic approach as IGCN and others, but with an improved clustering mechanism. We have implemented and experimented with Popularity, Entropy, and IGCN alongside our algorithm. Using Mean Absolute Error and Expected Utility as metrics, we compared all these techniques and also calculated the Percentage Efficiency of IGCFGKA. In this comparison our proposed algorithm showed the best results on both metrics and was more efficient, as demonstrated under Experimental Results.

The remaining sections present an overview of strategies for recommender systems (Section 2), the GKA algorithm (Section 3), FGKA (Section 4), a detailed view of IGCFGKA, the proposed algorithm (Section 5), the experimental data (Section 6), and the experimental results and discussion (Section 7).

2. AN OVERVIEW OF VARIOUS STRATEGIES FOR RECOMMENDER SYSTEM

1. Popularity: It indicates how frequently an item gets rated by users. The advantage of this strategy is its simplicity and cost effectiveness. On closer analysis, however, this single factor does not carry much information: a popular item is likely to be rated by almost everyone, so such items do not reflect any unique interest or characteristic of a user. In addition, the strategy has little information about rarely rated items and hence never gets a chance to recommend them to anyone, thereby failing to gain much information about less popular items [1].

2012 Fourth International Conference on Computational Intelligence and Communication Networks

978-0-7695-4850-0/12 $26.00 © 2012 IEEE

DOI 10.1109/CICN.2012.42

2. Entropy: This indicates how much dispersion there is in users' opinions on an item. Entropy is formally defined as:

H(X) = - Σ_{i=1}^{n} p(x_i) log2 p(x_i)    (1)

where p(x_i) is the probability mass function of outcome x_i [2].

A limitation of entropy is that it often selects very obscure items. For example, an item a_i that has been rated by a moderate number of people (say, 2000 of the total 6000 members in the system) with a rating distribution of (400/2000, 400/2000, 400/2000, 400/2000, 400/2000) over the rating scale (1, 2, 3, 4, 5) receives the maximum entropy score. However, a second item a_f that was rated only 5 times, with a rating distribution of (1/5, 1/5, 1/5, 1/5, 1/5), possesses the same entropy score, even though many members may find the former item familiar and very few the latter [1]. Moreover, entropy reflects neither the popularity nor the rating frequency of an item.
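Eq. (1) and the two-item example above can be checked with a short Python sketch (the distributions are those from the example; the function is a straightforward implementation of Shannon entropy):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x_i) * log2 p(x_i), as in Eq. (1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Item a_i: 2000 ratings spread uniformly over the 5-point scale.
a_i = [400 / 2000] * 5
# Item a_f: only 5 ratings, also uniform.
a_f = [1 / 5] * 5

print(entropy(a_i))  # 2.321928... = log2(5), the maximum for 5 outcomes
print(entropy(a_f))  # identical score despite far fewer raters
```

Both items score log2(5) bits, which is exactly the indistinguishability the paragraph above complains about.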

3. IGCN: This algorithm uses collaborative filtering to provide recommendations to a user. The problem with the strategies discussed above is that none of them considers a user's rating history. IGCN repeatedly computes the Information Gain (IG) [12] of items, where the ratings data is taken from those users who are so far the best l neighbors of the target user based on the profile. IGCN then gives recommendations based on the quality of the clustering [1]. It takes the following steps:

Steps in IGCN algorithm:

1. The algorithm begins with creating c clusters for the users in the data set. It uses k-means clustering algorithm because of its simplicity.

2. Then Information gains of items are calculated.

3. Next, n items are presented to the user, in descending order of their IG scores, for rating. The items rated by the user are added to the non-personalized profile.

4. Step 3 repeats until the user has rated i items. Once the ith item is rated, the personalization step begins.

5. In the personalization step, the best l neighbors of the target user are found. The IG of the items is then recomputed, and n items are presented to the user in descending order of their IG scores. Any item rated by the user is added to the profile. With this updated profile the current step repeats until the best l neighbors no longer change.
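The information-gain computation behind steps 2 and 5 can be sketched as follows. The users, cluster labels, and ratings below are hypothetical, and the formulation (entropy of the cluster distribution before vs. after conditioning on the item's rating) follows the standard definition in Mitchell [12]:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(clusters, item_ratings):
    """IG of an item: how much knowing a user's rating of the item reduces
    uncertainty about which cluster the user belongs to.

    clusters[u]     -- cluster label of user u
    item_ratings[u] -- u's rating of the item
    """
    users = list(clusters)
    before = entropy([clusters[u] for u in users])
    after = 0.0
    # Partition users by rating value; weight each partition's entropy.
    for r in set(item_ratings.values()):
        part = [clusters[u] for u in users if item_ratings[u] == r]
        after += len(part) / len(users) * entropy(part)
    return before - after

clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
ratings = {"u1": 5, "u2": 5, "u3": 1, "u4": 1}  # ratings perfectly separate the clusters
print(information_gain(clusters, ratings))      # 1.0
```

An item whose ratings perfectly separate the clusters has maximal IG, which is why such items are the most useful ones to ask a new user about.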

3. GKA The premise of this genetic algorithm is that, despite its disadvantages, k-means is certainly a simple and computationally very feasible option for clustering [9]. GKA combines the strengths of both GAs and k-means in the sense that it reaches the global solution at lower cost than other GAs. The central idea is to replace the costly crossover function with a k-means operator (discussed under Section 5). It also uses a biased mutation operator, called the distance-based mutation operator, which is defined as follows:

Let d_j = d(x_i, c_j) be the Euclidean distance between pattern x_i and centroid c_j. Then the allele is replaced with a value chosen randomly from the following distribution:

p_j = p{s_wr(i) = j} = (c_m · d_max − d_j) / Σ_{m=1}^{K} (c_m · d_max − d_m)    (2)

where d_max is the maximum Euclidean distance between pattern x_i and the centroids c_j, and c_m ≥ 1 is a constant.

An important point to notice here is that the initialization, mutation, and k-means operator steps of this algorithm can result in empty clusters, which it calls illegal strings. The algorithm converts these into legal singleton clusters by assigning each such cluster some pattern with the maximum total within-cluster variation (TWCV) [4].
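TWCV, the quantity both GKA and FGKA minimize, can be sketched directly (the points and labels below are illustrative):

```python
def twcv(patterns, labels, k):
    """Total within-cluster variation: sum over clusters of squared
    Euclidean distances from each pattern to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(patterns, labels) if l == c]
        if not members:
            continue  # an empty cluster: an "illegal" string in GKA/FGKA terms
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(sum((m[d] - centroid[d]) ** 2 for d in range(dim))
                     for m in members)
    return total

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
print(twcv(pts, [0, 0, 1, 1], 2))  # 4.0: each pair contributes 2 * 1^2
print(twcv(pts, [0, 0, 0, 0], 2))  # 104.0: one big cluster is far worse
```

Lower TWCV means tighter clusters, so a string's TWCV is the natural basis for its fitness.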

4. FGKA – Fast GKA This algorithm is inspired by GKA but makes several improvements to it. It starts by initializing the chromosomes, then produces each next generation by applying the selection, mutation, and k-means operators to the current generation, and finally terminates when the terminating condition is satisfied. Unlike GKA, which puts considerable effort into eliminating illegal strings, FGKA does not avoid them: it allows them, assigning them TWCV = +∞ and a low fitness value. Because of their low fitness and high TWCV, these strings will not participate in later steps anyway, so FGKA wastes no effort eliminating them [10]. As a result FGKA works faster than GKA with no compromise on reaching the global optimum.

5. IGCFGKA – Proposed hybrid Algorithm based on FGKA and Information Gain Profile building in recommender systems raises two important points. First, presenting initial recommendations to a new user is not only a cold-start problem but also what decides whether or not the user returns to the site. Second, a lengthy signup process aimed at finding out more about a new user is a bad idea: such an approach can exhaust users and make them give up the signup process. These were the two points considered by IGCN [1]. Information Gain through Clustered Neighbors (IGCN) is a technique that requires building user clusters not only for new users but also each time a recommendation is to be made to an existing user. One problem with this technique is its use of k-means, which may return a local solution. A strong clustering technique that overcomes the local-optimum problem while remaining equally fast in finding the solution would improve it. Keeping this under consideration, we propose in this paper a novel hybrid algorithm, IGCFGKA, which, applied to the personalization problem in recommender systems, yielded better results than IGCN. The algorithm takes the following steps to create the above-mentioned clusters:

1. Representation of chromosomes (strings) in N-dimensional space: The chromosomes are coded using string-of-group encoding. Say there are k clusters and P chromosomes in the current population. Each chromosome is a set of k N-dimensional genes, one per cluster, each representing a cluster center.

2. Initialization: During initialization, each cluster center of a chromosome is assigned a randomly chosen point from the data set. The process repeats for all P chromosomes. Since each cluster center has N dimensions, every chromosome has length N*k.

3. Selection: A proportional selection strategy is used to select the desired number of chromosomes from the population, on which the further steps are applied. Chromosomes are selected based on their fitness values, and the selected ones go into the mating pool, where they act as parents for the next generation. Under this strategy, the number of copies of a selected chromosome entering the mating pool is proportional to its fitness. The probability that chromosome c is selected is given by

p_c = f_c / Σ_{c'=1}^{l} f_{c'}    (3)

where c ∈ {1, 2, ..., l} indexes the chromosomes that can enter the mating pool via the selection operator and f_c is the fitness of chromosome c.

In FGKA, the technique used in our hybrid algorithm, the main aim is to reduce the TWCV, which is why it assigns greater fitness values to solutions with lower TWCV. Illegal chromosomes are assigned the maximum TWCV value of the current generation.
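A minimal sketch of proportional selection under Eq. (3), with the FGKA-style convention that lower TWCV means higher fitness. The exact fitness scaling for illegal strings is simplified here to a tiny positive value; the population and TWCV numbers are hypothetical:

```python
import random

def fitness_values(twcvs):
    """Sketch of FGKA-style fitness (simplified): legal strings score
    higher for lower TWCV; illegal strings (TWCV = +inf) get only a tiny
    positive fitness, so the roulette wheel all but ignores them."""
    finite = [t for t in twcvs if t != float("inf")]
    t_max = max(finite)
    eps = 1e-6
    return [eps if t == float("inf") else t_max - t + eps for t in twcvs]

def select(population, twcvs, rng):
    """Proportional (roulette-wheel) selection, Eq. (3): chromosome c is
    drawn with probability f_c / sum of all fitness values."""
    return rng.choices(population, weights=fitness_values(twcvs),
                       k=len(population))

pop = ["s1", "s2", "s3"]
mating_pool = select(pop, [10.0, 5.0, float("inf")], random.Random(0))
print(mating_pool)  # almost certainly three copies of "s2", the lowest-TWCV string
```

The illegal string s3 keeps a nonzero weight only so the distribution stays well defined; in practice it never reproduces, which is exactly FGKA's argument for not repairing it.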

4. Mutation: This step helps the algorithm reach the global optimum because it provides diversity in the solution space, exactly what a plain k-means algorithm lacks. To achieve this, we randomly change the value of a gene of each of the fittest chromosomes. The mutation operator used in this algorithm is stated as follows: during mutation, a_n is replaced by a_n' for n = 1, ..., N simultaneously, where a_n' is a cluster number randomly selected from {1, ..., K} with the probability distribution {p_1, p_2, ..., p_K} defined by

P_k = (1.5 · d_max(X_n) − d(X_n, c_k) + 0.5) / Σ_{k'=1}^{K} (1.5 · d_max(X_n) − d(X_n, c_{k'}) + 0.5)    (4)

where d(X_n, c_k) is the Euclidean distance between pattern X_n and the centroid c_k of the kth cluster, and d_max(X_n) = max_k{d(X_n, c_k)}. If the kth cluster is empty, then d(X_n, c_k) is defined as 0. The bias 0.5 is introduced to avoid a divide-by-zero error in the case that all patterns are equal and are assigned to the same cluster in the given solution [10].
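Eq. (4) can be computed directly; the distances below are hypothetical:

```python
def mutation_distribution(dists):
    """P_k per Eq. (4). dists[k] = d(X_n, c_k); pass 0.0 for empty clusters."""
    d_max = max(dists)
    weights = [1.5 * d_max - d + 0.5 for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical distances from pattern X_n to three centroids.
p = mutation_distribution([1.0, 2.0, 4.0])
print(p)  # [0.44, 0.36, 0.2]: biased toward the nearest cluster
```

Note that even the farthest cluster keeps a nonzero probability, which is what lets mutation escape local optima.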

5. K-means operator: A normal GA would make mutation the last step of an iteration, but this algorithm applies a k-means operator to compute the new values of the chromosome. It replaces the costly crossover function that is part of a normal GA. It works like one step of the k-means algorithm: calculate the new cluster centers for the chromosome obtained from the mutation step, then reassign each pattern to the nearest cluster based on Euclidean distance. For illegal strings, the Euclidean distance is taken as +∞. The operator speeds up the convergence process [9, 10].
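The five steps above can be combined into a toy FGKA-flavoured loop. This is a deliberate simplification, not the paper's implementation: the population is reduced to a single elitist string and the exact fitness/selection machinery is replaced by "keep the child if its TWCV is lower"; illegal strings are scored +∞ rather than repaired:

```python
import random

def centroids(patterns, labels, k):
    """Centroid of each non-empty cluster."""
    cents = {}
    for c in range(k):
        members = [p for p, l in zip(patterns, labels) if l == c]
        if members:
            dim = len(members[0])
            cents[c] = tuple(sum(m[d] for m in members) / len(members)
                             for d in range(dim))
    return cents

def twcv(patterns, labels, k):
    """Total within-cluster variation; an illegal string (one with an
    empty cluster) is scored +inf instead of being repaired, as in FGKA."""
    cents = centroids(patterns, labels, k)
    if len(cents) < k:
        return float("inf")
    return sum(sum((p[d] - cents[l][d]) ** 2 for d in range(len(p)))
               for p, l in zip(patterns, labels))

def kmeans_operator(patterns, labels, k):
    """Step 5: recompute centers, then reassign each pattern to the
    nearest existing centroid (the replacement for crossover)."""
    cents = centroids(patterns, labels, k)
    def d2(p, c):
        return sum((p[i] - c[i]) ** 2 for i in range(len(p)))
    return [min(cents, key=lambda c: d2(p, cents[c])) for p in patterns]

def fgka_sketch(patterns, k, generations=30, seed=0):
    rng = random.Random(seed)
    best = [rng.randrange(k) for _ in patterns]               # step 2: random init
    for _ in range(generations):
        child = list(best)
        child[rng.randrange(len(child))] = rng.randrange(k)   # step 4: mutation
        child = kmeans_operator(patterns, child, k)           # step 5
        if twcv(patterns, child, k) < twcv(patterns, best, k):
            best = child                                      # elitist stand-in for step 3
    return best

pts = [(0.0,), (1.0,), (9.0,), (10.0,)]
print(fgka_sketch(pts, 2))  # typically recovers the two natural pairs
```

Even this stripped-down loop shows the division of labour: mutation supplies diversity, while the k-means operator pulls each candidate quickly toward a nearby local optimum.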

ALGORITHM: IGCFGKA

1. Create c user clusters using FGKA

2. Compute information gain (IG) of the items

/* Non-personalized step:*/

3. repeat

4. Present next top n items ordered by their IG value

5. Add items the user is able to rate into his profile

6. until the user has rated at least i items

/* Personalized step: creating a richer profile */

7. repeat

8. Find best l neighbors based on the profile so far

9. Re-compute IG based on the l users’ ratings only

10. Present next top n items ordered by their IG values

11. Add items the user is able to rate into his profile

12. until best l neighbors do not change.

6. EXPERIMENTAL DATA For the experiments we used data sets made available by the MOVIELENS website. Briefly, the website works as follows. A new user completes a simple registration followed by a signup process, during which the user is presented some movies, chosen by their IGs, to rate. After rating these movies the user enters the actual website, where he/she can do many activities, including rating movies. A rough, preliminary profile of the target user is built from the ratings given during signup. Then the personalization step begins, in which a strong profile of the user is built: the system finds the best l neighbors of the target user based on the similarity of their profiles, presents items for rating ordered by their IG values, and adds the items rated by the user to his/her profile. The personalization step continues until the best l neighbors no longer change. For testing the performance of IGCFGKA against IGCN we used a data set of 100,000 ratings from 943 users on 1682 movies; each of these users has rated at least 20 movies. We used the demographic data of the users, viz. gender, age, and occupation, for clustering. The results are based on the following evaluation metrics:

1. Mean Absolute Error (MAE): A common metric for forecast error, it measures the difference between the actual and predicted values of the outcomes under consideration. MAE is the mean of all the absolute errors between the actual and predicted values, given by:

MAE = (1/n) Σ_{i=1}^{n} |f_i − y_i| = (1/n) Σ_{i=1}^{n} |e_i|    (5)

where f_i is the prediction, y_i is the actual value, and e_i is the absolute error.

We use this metric because it indicates recommendation quality. Its limitation is that it considers only the absolute error and sees no difference in the actual and predicted values themselves, although those real values matter more than just their difference. This is why MAE can show false positives, i.e. results that appear positive when they are actually negative. The second metric is included because it penalizes false positives more than false negatives.
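Eq. (5) in code, with hypothetical predicted and actual ratings:

```python
def mae(predicted, actual):
    """Eq. (5): mean of the absolute errors |f_i - y_i|."""
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(actual)

# Hypothetical predicted vs. actual ratings on a 1-5 scale.
print(mae([4.0, 3.5, 2.0], [5.0, 3.0, 2.0]))  # 0.5
```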

2. Expected Utility (EU): The utility that an entity is expected to yield over any number of circumstances. The expected-utility hypothesis chooses between uncertain prospects by comparing their expected utility values, which are weighted sums obtained by adding the utility values of the outcomes multiplied by their respective probabilities. Because it penalizes false positives more than false negatives, we use this metric for its accuracy.
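A minimal expected-utility computation (the probabilities and utility values below are hypothetical, not the paper's):

```python
def expected_utility(outcomes):
    """Weighted sum of utilities: sum over outcomes of p * u."""
    return sum(p * u for p, u in outcomes)

# Hypothetical recommendation: 70% chance of a hit worth utility 10,
# 30% chance of a miss worth utility -2.
print(expected_utility([(0.7, 10.0), (0.3, -2.0)]))  # 6.4
```

Because a miss carries negative utility, a recommender that produces many false positives is penalized directly, which is the property that makes EU complementary to MAE here.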

The results of the experiments are discussed in next section.

7. EXPERIMENTAL RESULTS AND DISCUSSION The data set discussed above was used to compare IGCFGKA against Popularity, Entropy, and IGCN. On the evaluation metrics, the results of IGCFGKA are better than all the other strategies. Taking MAE first, Fig. 1 makes clear that IGCFGKA achieves a large reduction in Mean Absolute Error relative to the others. Popularity shows the highest, i.e. worst, MAE value for all recommendation sizes; the next worst MAE values are those of IGCN, with Entropy third. Analyzing further, the MAE of IGCFGKA falls to a very low value as the total number of items increases. This shows that websites or apps with a large number of movies/items on offer can attain a very low MAE using IGCFGKA.

Figure 1. MEAN ABSOLUTE ERROR VS NUMBER OF ITEMS FOR POPULARITY, ENTROPY, IGCN AND IGCFGKA.

Table 1 compares the MAE of all the algorithms. For every item-set size, IGCFGKA has the lowest MAE value and Popularity the highest. As the size of the item set increases, the MAE values of all strategies decrease almost exponentially, with IGCFGKA remaining the best.

Table 1. VALUES OF MEAN ABSOLUTE ERROR FOR POPULARITY, ENTROPY, IGCN AND IGCFGKA.

No. of movies presented | 15     | 30     | 45     | 60     | 75
Popularity              | 0.8762 | 0.8039 | 0.7813 | 0.7075 | 0.6842
Entropy                 | 0.6945 | 0.6798 | 0.6703 | 0.6539 | 0.6320
IGCN                    | 0.7561 | 0.7133 | 0.6856 | 0.6598 | 0.6341
IGCFGKA                 | 0.6843 | 0.6591 | 0.6189 | 0.5953 | 0.5547

The limitation of MAE is that it shows false positives, which is why we have taken Expected Utility as our second metric: EU penalizes false positives and thereby gives more accurate results. Fig. 2 makes clear that expected utility also favors IGCFGKA: its EU is well above that of Popularity, Entropy, and IGCN. Although the values of IGCN and IGCFGKA are close for some item-set sizes, the latter again proves better. IGCFGKA is the best performer on expected utility and thus gives high-quality recommendations, followed by IGCN, then Entropy, with Popularity showing the worst utility.


Figure 2. EXPECTED UTILITY VS NUMBER OF ITEMS FOR POPULARITY, ENTROPY, IGCN AND IGCFGKA.

Table 2 shows the expected utility values of Popularity, Entropy, IGCN, and IGCFGKA for different numbers of movies presented. IGCFGKA shows the maximum utility for every item-set size, and even its lowest utility is better than that of IGCN and Popularity. Although Entropy shows the highest value at the beginning, for larger item sets its expected utility does not improve as much as that of IGCN and IGCFGKA.

Table 2. VALUES OF EXPECTED UTILITY FOR POPULARITY, ENTROPY, IGCN AND IGCFGKA.

No. of movies presented | 15     | 30     | 45     | 60     | 75
Popularity              | 0.1768 | 2.8901 | 3.5518 | 4.3977 | 5.2303
Entropy                 | 4.9265 | 5.3309 | 5.5086 | 5.7813 | 6.2943
IGCN                    | 3.2010 | 5.4128 | 6.0115 | 6.2548 | 6.5219
IGCFGKA                 | 4.8237 | 5.9750 | 6.1142 | 6.3043 | 6.9917

One point to notice here is that MAE and EU disagree about Entropy and IGCN: according to MAE, Entropy is better than IGCN, while EU shows the reverse. The reason is that MAE admits false positives, which are exposed in the EU results. We therefore conclude that IGCN is better than Entropy on the strength of the EU results, since EU is the metric that overcomes the limitation of MAE and shows accurate results. Our main criterion, however, is to compare IGCN with our algorithm, IGCFGKA. For this comparison, apart from the experimental results, we calculated the percentage efficiency of IGCFGKA against IGCN as follows. For Mean Absolute Error:

E_IGCFGKA = |T_MAE(IGCFGKA) − T_MAE(IGCN)| / T_MAE(IGCFGKA)    (6)

where T_MAE represents the sum of the MAE values.

The percentage efficiency of IGCFGKA is 10.056 more than that of IGCN for MAE. This shows that the recommendations of IGCFGKA are of better quality than those of IGCN as per MAE. For Expected Utility:

E_IGCFGKA = |T_EU(IGCFGKA) − T_EU(IGCN)| / T_EU(IGCFGKA)    (7)

where T_EU represents the sum of the EU values. The increase in percentage efficiency per the EU values is 9.2916. Although the efficiency by MAE is greater than that by EU, we must not forget that MAE includes false positives, whereas EU shows accurate and correct results. From this discussion it is evident that IGCFGKA is the best of all, performing well on both metrics by giving low MAE and high EU values.
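Assuming Eqs. (6) and (7) are applied to the summed metric values and scaled to a percentage, the computation looks like this. The totals below are hypothetical, not the paper's measured sums:

```python
def percentage_efficiency(t_new, t_old):
    """Eqs. (6)/(7): |T(new) - T(old)| / T(new), scaled to a percentage."""
    return abs(t_new - t_old) / t_new * 100

# Hypothetical summed-MAE totals (not the paper's measured values).
print(percentage_efficiency(3.0, 3.3))  # ≈ 10.0
```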

8. CONCLUSION This paper proposes a new hybrid algorithm with improved clustering, aimed at the cold-start personalization problem in recommender systems. Through various information-gain techniques we have studied how these systems provide item recommendations to their users based on users' interests. The main challenge for such systems is giving personalized recommendations to newly arrived users, hence the cold-start problem: the system does not have much information about these users and cannot burden them by asking them to state all their interests and choices, so gaining information about a new user is difficult. The existing technique IGCN uses the profiles of neighbors to give recommendations, but it relies on the k-means algorithm to distribute users into clusters, which may find only a local solution and whose result depends on the initial cluster centers. We have replaced k-means, for its disadvantages, with the proposed new algorithm, IGCFGKA, which uses a GA to build the user clusters. We then implemented and compared all four strategies: Popularity, Entropy, IGCN, and IGCFGKA. Finally, via our experimental results, we have shown that the proposed algorithm converges to the global solution with high-quality recommendations.

9. REFERENCES

[1] Al Mamunur Rashid, George Karypis, and John Riedl, "Learning Preferences of New Users in Recommender Systems: An Information Theoretic Approach," SIGKDD Workshop on Web Mining and Web Usage Analysis (WEBKDD), 2008.

[2] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1989.

[3] D. Bhandari, C. A. Murthy, and S. K. Pal, "Genetic Algorithm with Elitist Model and Its Convergence," Int. J. Pattern Recognition Artif. Intell., 10 (1996) 731-747.

[4] D. R. Jones and M. A. Beltramo, “Solving partitioning problems with genetic algorithms,” in Proc. 4th Int. Conf. Genetic Algorithms. San Mateo, CA: Morgan Kaufman, 1991.

[5] G. P. Babu, “Connectionist and evolutionary approaches for pattern clustering,” Ph.D. dissertation, Dept. Comput. Sci. Automat., Indian Inst. Sci., Bangalore, Apr. 1994.


[6] Ihara, Shunsuke (1993). Information theory for continuous systems. World Scientific. p. 2. ISBN 978-981-02-0985-8.

[7] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, 1974.

[8] J. N. Bhuyan, V. V. Raghavan, and V. K. Elayavalli, “Genetic algorithm for clustering with an ordered representation,” in Proc. 4th Int. Conf. Genetic Algorithms. San Mateo, CA: Morgan Kaufman, 1991.

[9] K. Krishna and M. Narasimha Murty, “Genetic K-means algorithm,” IEEE Trans. Systems, Man, and Cybernetics, 29(3), June 1999.

[10] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. J. Brown, “FGKA: A fast genetic K-means clustering algorithm,” Proc. ACM Symposium on Applied Computing, 2004.

[11] Ujjwal Maulik, Sanghamitra Bandyopadhyay, “Genetic algorithm-based clustering technique”, Pattern Recognition Society. Published by Elsevier Science Ltd.

[12] T. M. Mitchell. Machine Learning. McGraw-Hill Higher Education, 1997.
