[ieee 2011 third pacific-asia conference on circuits, communications and system (paccs) - wuhan,...



Page 1: [IEEE 2011 Third Pacific-Asia Conference on Circuits, Communications and System (PACCS) - Wuhan, China (2011.07.17-2011.07.18)] 2011 Third Pacific-Asia Conference on Circuits, Communications

The Study on Data Preprocessing Based on Collaborative Filtering

Ji Liang-hao 1

1 College of Computer Sci. and Tech., Chongqing University of Posts and Telecommunications, Chongqing, China. E-mail: [email protected]

Abstract. As the volume of information on the internet grows, providing personalized service becomes increasingly necessary. Collaborative filtering is one of the most successful techniques in the field of personalized service, but the quality of its recommendations directly affects how widely it can be applied. We therefore preprocess the rating data before making recommendations. Experimental results show that this method improves the quality of the recommendation algorithm.

Keywords: collaborative filtering; information recommendation; data preprocessing; personalized service

INTRODUCTION

With the development of the internet, e-commerce is entering more and more people's lives and changing their lifestyles. The internet is increasingly used as a major channel for sales and marketing. At the same time, as the amount of web information grows, we have to spend more time finding the content we actually need. Traditional search engines can no longer satisfy people's daily queries, so providing personalized service has become essential. A personalized service learns users' habits by collecting and analyzing their personal information, and from this discovers the hidden habits common to a group of users. Based on these habits, it can provide the needed information automatically [1]. Among the various algorithms for such personalized services, content-based filtering and collaborative filtering are notable. The former is more suitable for filtering textual items; the latter is the most popular choice for personalized recommender systems and is mostly used to recommend items in e-commerce.

Many approaches have been proposed to improve the performance of collaborative filtering, such as GroupLens and other recommender systems. Despite their success, they still suffer from major problems [2-4]: they all treat customers with different buying patterns identically. For example, customers who tend to buy a wide variety of products many times and customers who purchase only one or two products repeatedly represent two opposite buying patterns. The buying pattern is very important to a recommender system and affects the quality of its recommendations.

If the data are preprocessed beforehand, the quality of the recommender system can be improved. Our experimental results show that the method improves the quality of the recommendation algorithm.

COLLABORATIVE FILTERING

Collaborative filtering (CF), also called social filtering, is based on the following assumption: if users give similar ratings to some items, they will give similar ratings to other items too. The main idea is that the target user is likely to enjoy the items liked by other users with common interests. Therefore, finding the neighbors of the target user is the key step of CF.

CF analyzes user-to-user relationships and provides information according to the similarity of users' interests. Whether a rating vector is explicit or implicit, it can be treated as a measure of interest. An explicit rating uses a numeric score to express the user's stated interest. An implicit (hidden) rating is derived by analyzing web logs and similar data; this approach can provide personalized service to unregistered users.

Both explicit and implicit ratings ultimately yield a matrix that describes the users' interest in the items. Assume there are M users and N items. The ratings form an M×N matrix R, where each entry Rm,n = r indicates that user m gave item n the rating r. If the entry is NULL, user m has not rated item n. The matrix is shown in Table 1.

Table 1. Rating matrix R (M×N)

User \ Item   Item 1   ...   Item j   ...   Item N
User 1        R1,1     ...   R1,j     ...   R1,N
...           ...      ...   ...      ...   ...
User i        Ri,1     ...   Ri,j     ...   Ri,N
...           ...      ...   ...      ...   ...
User M        RM,1     ...   RM,j     ...   RM,N
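As a small illustration of the rating matrix described above, the sketch below stores R as a NumPy array and uses `np.nan` for NULL entries. The 3×4 values and the choice of NaN as the NULL marker are assumptions made for illustration; the paper does not prescribe a storage format.

```python
import numpy as np

# Hypothetical 3-user x 4-item rating matrix; np.nan stands in for NULL
# (the user has not rated the item).
R = np.array([
    [5.0, np.nan, np.nan, np.nan],
    [5.0, 4.0, 5.0, 3.0],
    [5.0, 4.0, 5.0, np.nan],
])

M, N = R.shape                        # M users, N items
rated = ~np.isnan(R)                  # True where a rating exists
ratings_per_user = rated.sum(axis=1)  # number of rated items per user
```

Keeping NULL distinct from a numeric zero matters because an unrated item is not the same as an item rated zero.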

CF methods can be divided into memory-based and model-based methods.

Memory-based methods use the user-item rating matrix directly to produce recommendations. Generally, the system finds the nearest neighbors of the target user with a statistical technique; this is also called user-based CF. Model-based methods build a rating model through machine learning and use it to predict, probabilistically, the ratings that the target user would give to unseen items.

Model-based filtering is not well suited to users whose interests are unstable. Moreover, when the amount of data is large, building and testing the model is costly. Therefore, user-based filtering is commonly used.
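The user-based approach described above can be sketched roughly as follows. The paper does not state which similarity measure or prediction formula it uses, so cosine similarity over co-rated items and a similarity-weighted average are assumptions here, and the names `cosine_sim` and `predict` are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the items both users have rated (NaN = unrated)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    x, y = a[both], b[both]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def predict(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated the item."""
    candidates = [v for v in range(R.shape[0])
                  if v != user and not np.isnan(R[v, item])]
    # Sort (similarity, neighbor) pairs so the nearest neighbors come first.
    sims = sorted(((cosine_sim(R[user], R[v]), v) for v in candidates),
                  reverse=True)[:k]
    total = sum(s for s, _ in sims)
    if total <= 0:
        return float("nan")           # no usable neighbor
    return sum(s * R[v, item] for s, v in sims) / total

# Toy data: user 0 has not rated item 2.
R = np.array([
    [5.0, 4.0, np.nan],
    [5.0, 4.0, 2.0],
    [1.0, 2.0, 5.0],
])
p_one = predict(R, user=0, item=2, k=1)   # nearest neighbor is user 1
p_two = predict(R, user=0, item=2, k=2)   # blends users 1 and 2
```

With k=1 the prediction simply copies the closest neighbor's rating; larger k averages over more neighbors, which is the parameter varied in the experiments.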

DRAWBACKS OF TRADITIONAL WAYS

The Tapestry system relied on the explicit opinions of people from a close-knit community. The GroupLens research system uses a nearest-neighbor approach to find a subset of users whose preference history is most similar to that of the active user. Both Tapestry and GroupLens use the preference values on each item directly [5]. The drawbacks of these conventional methods are easy to demonstrate. Assume that there are three customers: customer 1, customer 2, and customer 3.

978-1-4577-0856-5/11/$26.00 ©2011 IEEE

Customer 1 purchased item A five times but has never bought anything else. Customer 2 purchased all four items A, B, C, and D: 5 times, 4 times, 5 times, and 3 times respectively. Customer 3 purchased items A, B, and C: 5 times, 4 times, and 5 times respectively. Although the purchase patterns of customers 1, 2, and 3 are very different, most conventional collaborative filtering algorithms use the buying frequencies (e.g., customer 1 bought product A 5 times) directly as the basis for calculating similarity between customers. The conventional algorithms thus treat every customer equally, which does not match real-life situations. Customer 2 is clearly the heaviest shopper of the three, while customer 1 has only bought product A repeatedly. We therefore believe that the buying frequency of 5 for product A should carry more weight for customer 1 than the same frequency of 5 for customer 3. In our weighted sifting method, we reflect these buying patterns in each customer's preference values by adjusting the original purchase frequencies. More specifically, we give a low weight to customers who purchase many items evenly and a high weight to customers who purchase the same item repeatedly. We believe that this contributes to improved accuracy.

METHOD OF DATA PREPROCESSING

We compute a weight matrix based on the number of distinct products that each customer purchased. First, for each customer i, the normalized preference value is calculated by (1), where $\max[r_{i,1}, r_{i,2}, \ldots, r_{i,N}]$ is the largest preference value among the items of customer i, and $rf_{i,j}$ is the preference value normalized by that maximum.

$$ rf_{i,j} = \frac{r_{i,j}}{\max[r_{i,1}, r_{i,2}, \ldots, r_{i,N}]} \qquad (1) $$

Each entry $w_{i,j}$ of the weight matrix W (M×N) is calculated as shown in (2), where N is the total number of items and $n_i$ is the number of distinct items purchased by customer i. This gives customers who purchased many products evenly a low weight and customers who purchased a specific product repeatedly a high weight.

$$ w_{i,j} = rf_{i,j} \times \log\left(\frac{N}{n_i}\right) \qquad (2) $$

Finally, we add W to R to obtain the final matrix R′ and use this newly generated matrix as the basis for collaborative filtering, as shown in (3).

$$ R'_{i,j} = R_{i,j} + W_{i,j} \qquad (3) $$
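To make the three preprocessing steps concrete, here is a minimal sketch applied to the three-customer example from the previous section. It assumes the natural logarithm (the paper does not state the base), assumes that 0 marks an item never purchased, and leaves customers with no purchases unchanged; the function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(R):
    """Apply equations (1)-(3) to a purchase-frequency matrix R (0 = not purchased)."""
    R = np.asarray(R, dtype=float)
    N = R.shape[1]                              # total number of items
    row_max = R.max(axis=1, keepdims=True)      # largest preference per customer
    n_i = (R > 0).sum(axis=1, keepdims=True)    # distinct items per customer
    # Guard against customers with no purchases (assumption: they stay all-zero).
    safe_max = np.where(row_max > 0, row_max, 1.0)
    safe_n = np.where(n_i > 0, n_i, N)
    rf = R / safe_max                           # equation (1)
    W = rf * np.log(N / safe_n)                 # equation (2), natural log assumed
    return R + W                                # equation (3)

# Rows: customers 1-3; columns: items A-D (purchase counts from the example).
R = [[5, 0, 0, 0],
     [5, 4, 5, 3],
     [5, 4, 5, 0]]
R_prime = preprocess(R)
```

Under these assumptions the adjusted value for item A becomes about 6.386 for customer 1, stays 5 for customer 2 (since log(4/4) = 0), and becomes about 5.288 for customer 3, so the single-item buyer receives the largest boost, as the text intends.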

EXPERIMENTAL ANALYSIS

We use the MovieLens collaborative filtering data set to evaluate performance. The MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. The data set consists of 100,000 ratings from 943 users on 1682 movies, with every user having at least 20 ratings. It also includes time information for the ratings, which reflects changes in user interest. We use the mean absolute error (MAE) metric to compare the prediction quality of our proposed approach with that of other collaborative filtering methods.

MAE is a commonly used metric for measuring recommendation accuracy. The lower the MAE, the more accurately the recommendation engine predicts users' interests. MAE is defined as [6]:

$$ MAE = \frac{\sum_{u \in T} \left| R_u(t_j) - P_u(t_j) \right|}{|T|} \qquad (4) $$

In (4), $|T|$ is the number of items in the test set T, and $R_u(t_j)$ and $P_u(t_j)$ are the actual rating and the predicted rating that customer u gives to item $t_j$.

The ratings data set is divided into five equal portions, denoted ML_1, ML_2, ML_3, ML_4, and ML_5. Each portion in turn is used as the test set, with the remaining data as the training set, for cross-validation. In each experiment, the number of nearest neighbors of the target user is varied over 5, 10, 15, 20, 25, 30, and 35. Figures 1 to 5 compare the resulting MAE with that of traditional CF.
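The evaluation procedure can be sketched as below: an MAE function following (4) and a five-way split in which each portion serves once as the test set. The paper does not say how ML_1 to ML_5 were formed, so the seeded random shuffle is an assumption, and the names `mae` and `five_fold` are hypothetical.

```python
import random

def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def five_fold(ratings, seed=0):
    """Yield (train, test) pairs; each of five portions is the test set once."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)        # assumed: random assignment
    folds = [shuffled[i::5] for i in range(5)]   # five near-equal portions
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

example_error = mae([4, 3, 5], [3.5, 3, 4])      # (0.5 + 0 + 1) / 3
splits = list(five_fold(range(10)))
```

In a full experiment, each `(train, test)` pair would be run once per neighborhood size (5 to 35), with and without the preprocessing step, and the MAE values compared.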

Figure 1. MAE comparison (x-axis: No. of nearest neighbors, 5 to 35; y-axis: MAE, 0.80 to 0.84; series: weighted, no weighted)

Figure 2. MAE comparison (x-axis: No. of nearest neighbors, 5 to 35; y-axis: MAE, 0.76 to 0.81; series: weighted, no weighted)

Figure 3. MAE comparison (x-axis: No. of nearest neighbors, 5 to 35; y-axis: MAE, 0.75 to 0.80; series: weighted, no weighted)


Figure 4. MAE comparison (x-axis: No. of nearest neighbors, 5 to 35; y-axis: MAE, 0.77 to 0.82; series: weighted, no weighted)

Figure 5. MAE comparison (x-axis: No. of nearest neighbors, 5 to 35; y-axis: MAE, 0.75 to 0.80; series: weighted, no weighted)

Figures 1 to 5 show that the MAE of the weighted algorithm is smaller than that of traditional CF, so the weighted algorithm is more accurate.

SUMMARY

Using data preprocessing to weight users' buying patterns reflects users' interests more accurately. Experimental results show that the proposed method provides more accurate recommendations.

ACKNOWLEDGMENT

This study was funded by the project Research on Personalized Service Used in E-commerce (10SKF10).

REFERENCES

[1] Zeng Chun, Xing Chun-xiao, Zhou Li-zhu. A Survey of Personalized Technology[J]. Journal of Software, 2002.

[2] Kim, G., Chun, J., & Lee, S. A preprocessing method for improving effectiveness of collaborative filtering[C]. In: Proceedings of the 3rd International Conference on Electronic Business (ICEB 2003), Singapore, 2003.

[3] Bharath Kumar Mohan, Benjamin J. Keller, Naren Ramakrishnan. Scouts, Promoters, and Connectors: The Roles of Ratings in Nearest-Neighbor Collaborative Filtering. ACM Trans., 2007.

[4] Hao Ma, Irwin King, Michael R. Lyu. Routing and filtering: Effective missing data prediction for collaborative filtering. In: Proc. of ACM SIGIR 2007. Amsterdam: ACM Press, 2007.

[5] Tuguldur Sumiya, Jonghoon Chun, Sang-goo Lee, et al. A weighted sifting method to improve the effectiveness of collaborative filtering. 2004 IEEE Region 10 Conference, Chiang Mai, Thailand, Volume B, 21-24 Nov. 2004.

[6] Sarwar B, Karypis G, Konstan J, et al. Item-based collaborative filtering recommendation algorithms[C]. Proceedings of the 10th International WWW Conference. New York: ACM Press, 2001.