

PREDICTING THE USER BEHAVIOR FROM WEBLOGS BY IMPROVED SPAN CLASSIFICATION

P.G.Om Prakash 1, Dr.A.Jaya 2

1 Research Scholar, Department of Computer Science and Engineering, B.S.Abdur Rahman Crescent Institute of Science and Technology, Chennai - 600 048, INDIA

2 Professor and Head, Department of Computer Applications, B.S.Abdur Rahman Crescent Institute of Science and Technology, Chennai - 600 048, INDIA

Abstract: A weblog is a record of user navigation patterns. Whenever a user accesses a website, a log entry is generated on the server; this log is called a weblog. The weblog is in an unstructured format, which is not useful for identifying user behavior, so a preprocessor converts it into a structured format. The preprocessor consists of data cleaning, user identification and session identification. Pattern discovery is then carried out on the structured data: it takes the sequences of patterns produced by the preprocessor and, using a clustering technique, converts them into subsequence patterns by which user navigation is grouped. The aim of this research is to identify user behavior from weblogs. The improved span classification algorithm uses the PAFI clustering algorithm to identify interested users. The classification algorithm uses previous results to analyze user behavior.

Key Words: User navigation, mining, path traversal, prediction, web mining, weblog, pattern classification.

1. Introduction

Nowadays a large number of products are available on the internet, so the number of users accessing the internet increases day by day. Users are less interested in visiting shops to obtain products. As the population grows, access to the internet grows with it. Users spend a lot of time on the internet, where many product offers are available, so they are interested in online shopping. Classifying interested and not interested customers is a difficult task in online shopping. While a user accesses the website, the user's transactions are monitored and stored as a log. The log is in an unstructured format and is converted into a structured format by a preprocessing technique. The weblog consists of various entries such as the IP address of the user, date/time, categories of product and status code. Preprocessing is a three-step process. Data cleaning removes irrelevant data from the log. User identification identifies the user by the IP address of the system. Session identification classifies accesses based on session management.

International Journal of Pure and Applied Mathematics, Volume 119, No. 16, 2018, 4187-4199. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ Special Issue

Manufacturers create demand based on user needs. If a product is not available, the user will switch to another product. If a product is highly available, selling it is the bigger challenge. The demand therefore differs for every product, and the organization has to monitor user needs and interests. With the help of the log, user needs and interests can be classified.

Identifying user needs and user interest is a difficult task. With the help of the log, they can be identified from the user navigation pattern, which is analyzed based on the previous user navigation stored in the weblog. Using pattern discovery, the processed log is converted into sequences and subsequences of similar patterns. Subsequences are generated using forward and backward techniques. The subsequence patterns are grouped to form clusters, which help to identify user needs. Identifying user interest is a major task for every organization. The improved span algorithm is used to classify user interest.

The paper is organized as follows. Section II deals with related work, Section III deals with problem identification, Section IV describes the architecture diagram, and Section V gives the implementation results. Finally, Section VI concludes the paper.

2. Related Work

Web usage mining predicts user behavior patterns based on the navigation patterns stored in the weblog. Wangshu Liu et al. [1] presented a two-stage preprocessing approach: the first stage eliminates irrelevant and redundant data, and the second uses ranking to remove irrelevant features from the clustered data. Xin Ruan et al. [2] showed that behavior in online social networks (OSN) can be classified into extroversive and introversive behavior. Extroversive behavior reflects how a user interacts with friends online, while introversive behavior reflects how a user gathers and consumes social information.

Guoshuai Zhao et al. [3] presented a user rating prediction service approach that takes the user's personal interest into account. Ruili Geng et al. [4] suggested that user navigation pattern analysis uses previous results to identify actual and anticipated usage patterns. The actual usage patterns are extracted from the weblog to identify user sessions and user transactions. The anticipated usage contains the path of the user and session. Surbhi Huria et al. [5] showed that the Back Navigation approach uses forward and backward paths to identify frequent and semi-frequent patterns.

Zliobaite et al. [6] showed that adaptive preprocessing mechanisms for streaming data algorithms affect the prediction accuracy on the stream.

Prakash et al. [7] suggested that user interest can be classified based on the previous user navigation stored in the weblog. User behavior is analyzed against a benchmark from the clustered data: a user above the saturation level is considered an interested user, and a user below it is considered not interested.

Dr.A.R.Patel et al. [8] noted that the weblog initially contains the raw log, which is converted into a processed log by preprocessing. Preprocessing is a three-step process: data cleaning, user identification and session identification.

Hong Cheng et al. [9] showed that traversal pattern analysis breaks sequences of patterns into subsequence patterns based on navigation pattern modeling. The modeling clusters similar user interests taken from the log. The classification algorithm classifies frequent, semi-frequent and infrequent sequences based on pattern generation.

D.Kerana et al. [10] suggested that the traversal pattern uses database transactions: sequences of transactions are converted into subsequences, and similar transactions are grouped to form clusters. The clustering technique classifies frequent, semi-frequent and infrequent itemsets, from which user buying behavior is analyzed.

Mobasher et al. [11] described the use of web usage data and the hyperlinks of the website for web personalization. The unprocessed weblogs are converted into processed logs by preprocessing. After preprocessing, the traversal pattern uses the forward path to break the sequence of paths into subsequence paths. The clustering algorithm reduces the number of paths by grouping them into clusters. The classifier then separates interested and not interested users.

2. Problem Definition

At present more and more users are attracted to the internet, so the number of users accessing it grows day by day. Online access reduces shopping time, and the size of the log data increases as a result.

It is difficult to capture individual user behavior. Predicting user buying behavior is especially difficult.

The mining time needs to be reduced while capturing user behavior from weblogs.

3. Framework for User Navigation System

Figure 1. Framework for Analyzing User Behavior (components: Web Browser, Web Server, Retrieval of Logs, Acquisition of Weblog, Preprocessor, Navigation Pattern Modeling, Classification, Knowledge Base, Identify User Behavior)

Figure 1 shows the framework for analyzing user behavior patterns. It consists of four phases:

Acquisition of weblogs
Preprocessing
Navigation Pattern Modeling
Classifier

A. Acquisition of Weblogs

Whenever a user accesses the website, the server records the access in a log. The log contains details of the user's history, such as IP address, categories, date and time, number of bytes transmitted and status code. The log consists of the following user attributes:

1. IP address
2. Date and Time
3. Request method
4. URL of the page
5. Categories
6. Number of bytes transmitted
7. Status Code

B. Preprocessing

Whenever a user accesses the website, the user's information is stored as a log, which holds the user transactions in an unstructured format. Preprocessing is a three-step process:

Data Cleaning
User Identification
Session Identification

Figure.2. Weblog

a. Data Cleaning

The logs were recorded on the server of the vehicle website over the period 23/Oct/2015 to 30/Oct/2015. The log file has 3578 records, each containing the IP address, categories and status code. Data cleaning removes irrelevant information left by earlier visits to the site, as well as incomplete entries. When a user moves from one page to another, a status code is generated. The status code helps to clean irrelevant information from the log. After the cleaning process, 1578 records remain.

E.g.: Figure 2 shows “172.16.1.7 - [23/Oct/2015-0600] "GET 7.gif HTTP/1.0" 200 441”; the status code 200 indicates a successfully served page.
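The cleaning step can be sketched as below. This is a minimal illustration, not the authors' implementation: the regular expression is an assumption fitted to the single log line quoted above, the names `LOG_PATTERN`, `LogRecord`, `parse_line` and `clean` are hypothetical, and the second and third sample lines are invented for illustration only. The sketch keeps only well-formed entries whose status code is 200.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout, fitted to the one log line quoted in the text:
#   ip - [timestamp] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


class LogRecord(NamedTuple):
    ip: str
    timestamp: str
    method: str
    url: str
    status: int
    size: int


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a structured record, or None if the line is malformed."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    return LogRecord(m["ip"], m["timestamp"], m["method"], m["url"],
                     int(m["status"]), int(m["size"]))


def clean(lines: List[str]) -> List[LogRecord]:
    """Drop malformed entries and entries whose status code is not 200."""
    records = []
    for line in lines:
        record = parse_line(line)
        if record is not None and record.status == 200:
            records.append(record)
    return records


sample = [
    '172.16.1.7 - [23/Oct/2015-0600] "GET 7.gif HTTP/1.0" 200 441',       # from the text
    '172.16.1.9 - [23/Oct/2015-0600] "GET missing.html HTTP/1.0" 404 0',  # hypothetical
    'corrupted entry',                                                    # hypothetical
]
cleaned = clean(sample)
```

Filtering on status 200 alone is the simplest reading of the text; a fuller cleaner might also drop requests for images or other embedded resources, which the paper does not specify.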

b. User Identification

User identification is based on the user's IP address, request time, requested URL, and date and time.

E.g.: in Figure 2, the user IP address 172.16.1.7 is identified from the weblog.

c. Session Identification

Session identification groups each user's accesses into time-based sessions: if the same user visits the website twice, two sessions are recorded. E.g.: in Figure 2, the same IP address accesses the site in two time ranges, 13:41:41 to 13:45:46 and 14:05:31 to 14:08:54, which are considered two sessions.
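One common way to realize this step is a timeout rule: a new session starts whenever the gap between two consecutive requests from the same IP address exceeds a limit. The sketch below assumes that rule. The paper does not state its session boundary criterion, so the 15-minute `SESSION_GAP`, the function name `identify_sessions` and the date attached to the timestamps are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, datetime, time, timedelta

# Hypothetical timeout; the paper does not give a value.
SESSION_GAP = timedelta(minutes=15)


def identify_sessions(hits, gap=SESSION_GAP):
    """Group (ip, timestamp) hits into per-user sessions.

    A new session starts when the time since the user's previous
    request is larger than `gap`.
    """
    by_user = defaultdict(list)
    for ip, stamp in hits:
        by_user[ip].append(stamp)

    sessions = {}
    for ip, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for stamp in stamps[1:]:
            if stamp - user_sessions[-1][-1] > gap:
                user_sessions.append([stamp])      # gap too long: new session
            else:
                user_sessions[-1].append(stamp)    # same session continues
        sessions[ip] = user_sessions
    return sessions


def at(clock):
    """Attach an assumed date (23 Oct 2015) to an HH:MM:SS clock time."""
    return datetime.combine(date(2015, 10, 23), time.fromisoformat(clock))


# The four time stamps mentioned in the example above.
hits = [("172.16.1.7", at(c))
        for c in ("13:41:41", "13:45:46", "14:05:31", "14:08:54")]
sessions = identify_sessions(hits)
```

With the assumed 15-minute gap, the roughly 20-minute pause between 13:45:46 and 14:05:31 splits the four requests into the two sessions described in the text.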

C. Navigation Pattern Modelling

Figure 3. Tree Generation for Interested User and Not Interested User (tree over page nodes T1 to T25)

User and session identification are produced by preprocessing. As Figure 3 shows, tree generation helps to identify interested and not interested users from the traversal path.

The tree is generated from the sequence and subsequence patterns of the vehicle website. These patterns separate interested users from not interested users through navigation pattern modeling. Pattern modeling supports the organization's decision making and helps to improve its business.
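The paper does not spell out how a sequence is split into subsequences. A well-known technique for this in web usage mining is the maximal forward reference: a path is cut each time the user steps back to a page already on the current path. The sketch below uses that technique as a stand-in, not as the authors' method, and the function name `maximal_forward_references` is hypothetical. It is applied to pattern 3 of Table I, which contains a backward move from T17 to T7.

```python
def maximal_forward_references(path):
    """Split a traversal path into maximal forward subsequences.

    A backward move (revisiting a page already on the current path)
    closes the current forward path and rewinds to the revisited page.
    """
    current = []            # pages on the current forward path
    result = []
    moving_forward = False
    for page in path:
        if page in current:
            if moving_forward:
                result.append(list(current))   # emit the path reached so far
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        result.append(list(current))           # emit the final forward path
    return result


pattern_3 = ["T1", "T2", "T5", "T7", "T17", "T7", "T16"]
subsequences = maximal_forward_references(pattern_3)
```

For pattern 3 this yields the two forward subsequences T1,T2,T5,T7,T17 and T1,T2,T5,T7,T16.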

TABLE I. BROWSING PATTERN OF USER BEHAVIOR

Pattern Num   Browsing Pattern
1             <T1,T2,T3,T6,T8,T18,T20,T21,T22>
2             <T1,T2,T3,T6,T10,T18,T20,T21,T23>
3             <T1,T2,T5,T7,T17,T7,T16>
4             <T1,T2,T4,T7,T16,T19,T20,T21,T25>
5             <T1,T2,T4,T7,T15,T19,T20,T21,T24>
6             <T1,T2,T3,T7,T11,T6,T9>
7             <T1,T2,T4,T6,T4,T7>
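To illustrate how browsing patterns can be organized into a tree, the sketch below builds a prefix tree from the seven patterns of Table I, so that patterns sharing the same leading pages share a branch. Using a prefix tree is an assumption: the exact construction behind Figure 3 is not given in the text, and the names `build_prefix_tree` and `count_paths` are hypothetical.

```python
# Browsing patterns copied from Table I.
PATTERNS = [
    ["T1", "T2", "T3", "T6", "T8", "T18", "T20", "T21", "T22"],
    ["T1", "T2", "T3", "T6", "T10", "T18", "T20", "T21", "T23"],
    ["T1", "T2", "T5", "T7", "T17", "T7", "T16"],
    ["T1", "T2", "T4", "T7", "T16", "T19", "T20", "T21", "T25"],
    ["T1", "T2", "T4", "T7", "T15", "T19", "T20", "T21", "T24"],
    ["T1", "T2", "T3", "T7", "T11", "T6", "T9"],
    ["T1", "T2", "T4", "T6", "T4", "T7"],
]


def build_prefix_tree(patterns):
    """Nest dictionaries so that each page maps to the pages that follow it."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
    return root


def count_paths(node):
    """Count root-to-leaf paths in the tree."""
    if not node:
        return 1
    return sum(count_paths(child) for child in node.values())


tree = build_prefix_tree(PATTERNS)
branches_after_t2 = sorted(tree["T1"]["T2"])   # pages reached right after T1, T2
```

All seven patterns start with T1, T2 and then branch to T3, T4 or T5, which is where the tree first separates groups of users.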

D. Classification

Using the Improved Span algorithm, the browsing patterns are classified into frequent, semi-frequent and infrequent sequences. User behavior is classified based on the pattern count. The upper threshold point is 10 and the lower threshold point is 5. If the pattern count is above the upper threshold, the sequence is a frequent sequence and the user is considered an interested user who is likely to purchase the vehicle. If the pattern count is below the lower threshold, the sequence is infrequent and the user is considered not interested in buying. If the pattern count lies between the two thresholds, the sequence is semi-frequent and the user is considered semi-interested.
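The threshold rule can be written down directly. The sketch below uses the thresholds from the text (upper 10, lower 5) and the pattern counts reported in Table II. The function name `classify_sequence` is hypothetical, and treating counts exactly equal to a threshold as semi-frequent is an assumption, since the text does not say how boundary values are handled.

```python
UPPER_THRESHOLD = 10   # from the text
LOWER_THRESHOLD = 5    # from the text


def classify_sequence(count, lower=LOWER_THRESHOLD, upper=UPPER_THRESHOLD):
    """Label a pattern by its count.

    Counts equal to a threshold fall into the semi-frequent band
    (an assumption; the paper does not specify boundary handling).
    """
    if count > upper:
        return "frequent"        # interested user
    if count < lower:
        return "infrequent"      # not interested user
    return "semi-frequent"       # semi-interested user


# Pattern counts as reported in Table II (pattern id -> count).
table_ii_counts = {1: 16, 2: 17, 3: 11, 4: 16, 5: 2}
labels = {pid: classify_sequence(count) for pid, count in table_ii_counts.items()}
```

With these counts, patterns 1 to 4 come out as frequent and pattern 5 as infrequent, which agrees with the buying interest listed in Table III.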

TABLE II. PATTERN COUNT

Pattern Id   Pattern                               Count
1            <T1,T2,T3,T6,T8,T18,T20,T21,T22>      16
2            <T1,T2,T4,T7,T15,T19,T20,T21,T24>     17
3            <T1,T2,T4,T7,T16,T19,T20,T21,T25>     11
4            <T1,T2,T3,T6,T10,T18,T20,T21,T23>     16
5            <T1,T2,T5,T7,T17,T7,T16>              2

TABLE III. INTERESTED AND NOT INTERESTED USERS

Pattern Id   Pattern                               Buying Interest
1            <T1,T2,T3,T6,T8,T18,T20,T21,T22>      Interested
2            <T1,T2,T4,T7,T15,T19,T20,T21,T24>     Interested
3            <T1,T2,T4,T7,T16,T19,T20,T21,T25>     Interested
4            <T1,T2,T3,T6,T10,T18,T20,T21,T23>     Interested
5            <T1,T2,T5,T7,T17,T7,T16>              Not Interested

Table I shows the browsing patterns of the majority of users. Table II shows the clustered count for each pattern. Table III shows the buying interest for each pattern. If the pattern count is above the threshold point, the user is interested in buying a vehicle; otherwise the user is not interested. The first four pattern ids show interest in buying a vehicle from the manufacturer's website. Pattern id 5 does not show interest in buying. The classification steps are:

step 1: add the new sequence in F'
step 2: if F' in IU_SEQ then goto step 3 else step 9
step 3: cmp [F' in Seq_DB_His]
step 4: if F' is greater than Succ_count[i] Seq_DB_His
step 5: then it is "Interested User"
step 6: IU_SEQ = F'
step 7: else "Not Interested User"
step 8: NIU_SEQ = F'
step 9: if F' in NIU_SEQ then goto step 3
step 10: Seq_DB_His = F'
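The steps are terse, so the sketch below is only one possible reading of them, not the authors' implementation. It assumes that Seq_DB_His is a history of how often each sequence has been seen, that Succ_count is a single numeric threshold, and that IU_SEQ and NIU_SEQ are sets of sequences labeled interested and not interested. The class name `SpanClassifier` and the method `observe` are hypothetical, and the history update (step 10) is applied before the comparison so that the new occurrence is counted.

```python
from collections import Counter


class SpanClassifier:
    """Incremental interested / not interested labeling of sequences."""

    def __init__(self, success_count):
        self.success_count = success_count   # Succ_count (assumed to be one number)
        self.history = Counter()             # Seq_DB_His: sequence -> times seen
        self.interested = set()              # IU_SEQ
        self.not_interested = set()          # NIU_SEQ

    def observe(self, sequence):
        key = tuple(sequence)                # step 1: take the new sequence F'
        self.history[key] += 1               # step 10: record F' in the history
        # Steps 2, 3 and 9 all lead to the same comparison against the history.
        if self.history[key] > self.success_count:       # step 4
            self.not_interested.discard(key)
            self.interested.add(key)                      # steps 5 and 6
            return "Interested User"
        self.interested.discard(key)
        self.not_interested.add(key)                      # steps 7 and 8
        return "Not Interested User"


classifier = SpanClassifier(success_count=2)   # hypothetical threshold
sequence = ["T1", "T2", "T4", "T7"]            # hypothetical new sequence
results = [classifier.observe(sequence) for _ in range(3)]
```

In this reading a sequence moves from the not interested set to the interested set once it has been seen more often than the success count.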


4. Competing Methods

Prefix Span is taken as the existing web page prediction method for comparison, to demonstrate the effectiveness of the proposed method.

5. Comparative Analysis

The experiments are performed using vehicle data. The implementation uses the Weka tool, and the performance is compared with the existing algorithm based on precision, recall and F-measure.
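For reference, the standard definitions of these measures can be computed from the counts of true positives, false positives and false negatives, as sketched below. The paper does not state which F-measure variant it uses, so the balanced F1 score here is an assumption, the function name `precision_recall_f1` is hypothetical, and the counts in the usage lines are invented purely for illustration.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from confusion counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = harmonic mean of precision and recall
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 interested users predicted correctly,
# 2 predicted interested by mistake, 2 interested users missed.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```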

Figure 3.1 Comparison graphs of Existing and Proposed Algorithm

Figure 3.1 compares the existing and proposed methods. The graph shows that the precision, recall and F-measure of the proposed method are 0.8742, 0.8534 and 0.8775, respectively, which are higher than those of the existing method.

6. Comparative Discussion

Datasets          Methods       Precision   Recall   F-measure   Prediction accuracy
Vehicle Dataset   Prefix Span   0.7903      0.7396   0.6784      0.5934
Vehicle Dataset   Proposed      0.8242      0.8534   0.8775      0.7375

The table compares the existing and proposed methods for vehicle prediction. The analysis using the vehicle data confirms that the proposed method performs better than the existing method, Prefix Span. The precision, recall and F-measure of the proposed method are 0.8742, 0.8534 and 0.8775, respectively, which are higher than those of the existing method. The precision, recall and F-measure of Prefix Span are 0.7903, 0.7396 and 0.6784, respectively.

7. Implementation

The proposed architecture is implemented using the Improved Span algorithm with a knowledge base of 1578 log instances. For compactness, the maximum is 119, the mean is 93.678, the standard deviation is 8.234 and the minimum is 73. In total, 19 attributes were generated. The accuracy was calculated as the ratio between the number of instances generated by the system and the total number of attributes generated. Interested and not interested users are classified based on the maximum and minimum thresholds of the system. Figures 4, 5, 6 and 7 show sample output screens for the search results.

Figure 4. Vehicle Compactness

Figure 5. Circularity of Vehicle

Figure 6. Categories of Individual Vehicles

Figure 7. Different Class of Vehicles

Figure 8 shows the vehicle prediction for various classes of people and compares attributes such as Fuel, Color, Brand, Cost, Technology and Compact, which raises the prediction level. The six major attributes are calculated as the ratio between the number of instances generated and the total number of attributes. User predictions are generated from these six attributes, from which individual user behavior is analyzed.

Figure.8. Vehicle Prediction

8. Conclusion

Web-based systems that rely on weblogs are low-cost. The proposed architecture is implemented using the Improved Span Classification algorithm. The system shows the prediction level of a user based on attributes such as Fuel, Cost, Technology, Compact, Color and Brand, from which user interest is derived and prediction accuracy is improved. The prediction accuracy is calculated from the previous user navigation paths taken from the weblog, based on which interested and not interested users are classified. The system focuses on identifying buying behavior from weblogs.

References

[1] Wangshu Liu et al., “Two stage data preprocessing approach for software fault prediction”, IEEE Transactions on Reliability, Vol. 65, March 2016.

[2] Xin Ruan et al., “Profiling online social behaviors for compromised account detection”, IEEE Transactions on Information Forensics and Security, Vol. 11, January 2016.

[3] Guoshuai Zhao et al., “User service rating prediction by exploring social users’ rating behaviors”, IEEE Transactions on Multimedia, Vol. 18, March 2016.

[4] Ruili Geng et al., “Improving web navigation usability by comparing actual and anticipated usage”, IEEE Transactions on Human-Machine Systems, Vol. 45, February 2015.

[5] Surbhi Huria et al., “Implementation of dynamic association rule mining using back navigation approach”, Fifth International Conference on Communication Systems and Network Technologies, 2015.

[6] Indrė Žliobaitė et al., “Adaptive preprocessing for streaming data”, IEEE Transactions on Knowledge and Data Engineering, Vol. 26, February 2014.

[7] Prakash, P. O. et al., “Analyzing and predicting user behavior pattern from weblogs”, International Journal of Applied Engineering Research, 11.9 (2016): 6278-6283.

[8] Dr.A.R.Patel et al., “Process of web usage mining to find interesting patterns from web usage data”, www.ijctonline.com, Vol. 3, No. 1, Aug 2012.

[9] Hong Cheng et al., “IncSpan: Incremental Mining of Sequential Patterns in Large Database”.

[10] D.Kerana Hanirex et al., “Efficient Algorithm for Mining Frequent Itemsets Using Clustering Technique”, International Journal on Computer Science and Engineering, Vol. 3, No. 3, March 2011.

[11] Mobasher et al., “Automatic personalization based on Web usage mining”, Communications of the ACM, Vol. 43, pp. 142-151, 2000.
