
Tutorial Paper

© 2013 ACEEE
DOI: 03.LSCS.2013.2.563

Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013

Advance Clustering Technique Based on Markov Chain for Predicting Next User Movement

Harish Kumar1, Dr. Anil Kumar Solanki2

1 PhD Scholar, Mewar University; 2 Professor, BIT Jhansi
Email id: [email protected]

Abstract - Aim: According to the survey, India is one of the leading countries in the world for technical and management education. The number of students is increasing day by day, at a growth rate of 45% per annum. Advances in technology have a marked effect on the education system and help in upgrading higher education. Some universities and colleges already use these technologies, and the web log is one of them. The main aim of this paper is to represent web logs using a clustering technique for predicting the next user movement and for user behavior analysis. The paper centers on a web log clustering technique based on Markov chain results. We present an approach to web clustering (clustering web site users) and to predicting their behavior on the next visit. Methodology: To generate effective results, web usage data from approximately 14 engineering colleges is used, and an advance clustering approach is presented after optimizing the other clustering approaches. Results: User behavior is predicted with the help of the advance clustering approach based on FPCM and k-means. The proposed algorithm is used to mine and predict users' preferred paths. Existing approaches have been used to predict user behavior, but they are not sufficient because of their sensitivity to noise. With the help of ACM, noise is reduced, which provides more accurate results for predicting user behavior. Approach Implementation: The algorithm was implemented in MATLAB, DTRG and Java. The experimental results show that this method is very effective in predicting user behavior and validate the method's effectiveness in comparison with some previous studies.

Keywords - Markov chain, web logs, clustering, FPCM (Fuzzy Possibilistic C-Means algorithm), K-means algorithm.

I. INTRODUCTION

A recent study by Google found that Indians are second only to Americans when it comes to searching online about educational institutions and courses. According to the survey, the details of which were released by the online search giant, over 45% of Indian students use the internet to research education [10]. This generates massive data related to students' interactions with educational web sites, in the form of web logs or server log files. This research focuses on web log analysis and on methods for processing this web data. Finding hidden information in web log data is called web usage mining, which is a part of data mining. Data mining and knowledge discovery is a research discipline involving the study of techniques to search for patterns in large collections of data. The application of data mining techniques to the web, called web data mining, was a natural step, and it is now the focus of an increasing number of researchers.

Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. After the completion of these three phases the user can find the required usage patterns and use this information for specific needs. The reliability of the previously developed methods for finding similar patterns is only up to 50%. Zidrina's research introduced a combined approach which uses users' browsing history and the text of links to analyse users' behavior. Tanasa's research proposed several approaches for extracting sequential patterns with low support from web usage data. These approaches were also instantiated in concrete methods such as "Cluster & Discover" and "Divide & Discover". The aim of all the previous research on discovering similar patterns in web log data is to obtain information about the navigational behavior of users.

Web usage mining, from the data mining aspect, is the task of applying data mining techniques to discover usage patterns from web data in order to understand and better serve the needs of users navigating the web. Here, its aim is to extract useful information from educational web logs; the resulting patterns are used to analyze user behavior. The objective of this work is to generate similar patterns with the help of Markov chains, using web log data preparation methods, data mining algorithms for prediction and classification tasks, and web text mining. The key target of the paper is to develop methods that improve the knowledge discovery steps on web log data and reveal new prospects to the data analyst. To forecast the next user movement effectively, this study proposes a web-based recommendation system for predicting next user movement, named WebAstro.

According to the findings, WebAstro helps in web site reorganization. While performing web log analysis, it was found that insufficient attention has been paid to the web log data cleaning process. By reducing the number of redundant records, the data mining process becomes much more effective and faster. Therefore a new cleaning framework was introduced which retains only the records that correspond to real user clicks. This cleaning method, named Duster, performs "query based" cleaning. The clean data is used for designing the web graph. This method helps us draw web graphs that are modeled in the form of a Markov chain and generate a new friend function for calculating the probability for next page prediction and behavior analysis [8][9]. The K-means clustering algorithm is used for predicting user behavior. Its advanced counterpart, Fuzzy C-means (FCM), is a well-known soft clustering algorithm that allows for overlapping clusters [1]. Overlapping clusters can be useful in applications where the restriction imposed by crisp clustering, which forces every object to be assigned to a unique cluster, may not be practical. This paper emphasizes the K-means and FCM algorithms for clustering web navigation patterns on educational sites of NCR colleges.

II. RELATED WORK

G. Sudhamathy et al. [1] presented a survey of various web clustering algorithms. They provide a brief overview of the fuzzy clustering algorithm, the Temporal Cluster Migration Matrices algorithm and a PSO-based clustering algorithm. They find that the temporal cluster migration matrices approach serves to categorize web users into different clusters and to study their cluster migration behavior over a period of time. The fuzzy clustering approach can be applied to study aspects of e-commerce web sites, starting from ranking the users based on their visit time and visit frequency. The PSO optimization technique applied to web session clustering is used for identifying more accurate session clusters. After this analysis they conclude that the fuzzy clustering algorithm is simple, effective and practical to apply.

J. Vellingiri et al. [2] proposed a fuzzy possibilistic C-means algorithm for clustering in web usage mining to predict user behavior. In recent times, C-Means has been found to be superior because of its embedded fuzzy logic. In a noisy environment, however, the memberships of FCM do not always correspond well to the degree of belonging of the data, and may be inexact. Their paper uses a novel clustering algorithm called fuzzy-possibilistic C-Means (FPCM), which integrates extended partition entropy and inter-class resemblance computed from the fuzzy set point of view. The proposed approach uses FPCM to find the user behavior since it needs only the membership matrix and the possibilistic matrix, and is free from heavy distance computing.

Tasawar et al. [3] proposed a connectivity-based clustering approach for web usage mining (WUM), using agglomerative and divisive approaches for clustering. Swarm-based web session clustering helps in many ways to manage web resources effectively, such as web personalization, schema modification, web site modification and web server performance. The authors propose web session clustering at the second level of web usage mining (the preprocessing level). The framework covers the data preprocessing steps to prepare the web log data and convert the categorical web log data into numerical data. A session vector is obtained from the web data, and swarm optimization can be applied to cluster the web log data. The hierarchical cluster-based approach enhances the existing web session techniques to provide more structured information about the user sessions.

Vinita et al. [4] proposed the use of the learning capabilities of neural networks to classify web traffic data. The discovery of useful knowledge, user information and server access patterns allows web-based organizations to mine user access patterns and helps in future developments, maintenance planning and also in targeting more rigorous advertising campaigns aimed at groups of users. According to the authors, as the popularity of the web continues to increase, there is a growing need to develop tools and techniques that will help improve its overall usefulness. They use the k-means algorithm to reduce the computational intensity of the neural network by reducing the input set of samples. This is achieved by clustering the input dataset using the k-means algorithm and then taking only discriminant samples from the resulting clustering schema to perform the learning process.

Lee et al. [5] proposed a two-way prediction model based on Markov models and the Bayesian theorem. The prediction results can be used for personalization, building proper web sites, promotion, obtaining marketing information, forecasting market trends, etc. The Markov model is treated as a probability model by which users' browsing behaviors can be predicted at category level. The Bayesian theorem can also be applied to represent and infer users' browsing behaviors at web page level. With the Markov model, the system can effectively filter the possible categories of web sites, and the Bayesian theorem helps to predict web pages accurately.

R. Khanchana et al. [6] proposed a modified version of Lee's prediction model based on Markov models and the Bayesian theorem. They focus on the preprocessing step and make a few changes to the prediction. The authors use a hierarchical agglomerative clustering algorithm on browsing patterns and obtain several user clusters. The cluster data can be projected as a cluster view that replaces the global view. As a result, the authors present an altered prediction model. In the new model, view selection is used to match the user's browsing patterns, which are then used for forecasting with improved accuracy.

III. METHODOLOGY

A. Web Log File

Web Mining: Web mining may be classified into three categories, namely web log mining, web content mining, and web structure mining.

Fig. 1. Categorization of Web Data mining

Web content mining (WCM) aims to find useful information in the content of web pages [4], e.g. free semi-structured data such as HTML code, pictures, and various unloaded files.

Web structure mining (WSM) is used to generate a structural summary of the web site and its web pages [7][11]. It tries to discover the link structure of the hyperlinks at the inter-document level, whereas web content mining mainly focuses on the structure within a document. Web usage mining (WUM) is applied to the data generated by visits to a web site, especially those contained in web log files. Only the research issues involved in web usage data mining are highlighted and discussed here. In web usage mining (WUM), or web log mining, users' behavior or interests are revealed by applying data mining techniques to web logs. Web log files are of different types:
1. Access Log File
2. Agent Log File
3. Referer Log File
4. Error Log File

Access Log File: It records information about which files are requested from the web server. It is located in the directory www/logs/.

Agent Log File: It records information about the web clients that make requests to the server.

Referer Log File: It records information about the URL that the web browser was viewing immediately before making the request to the server. This is particularly useful for determining where requests to the web server come from and which web sites refer traffic to it. It is located in the www/logs/ directory.

Error Log File: It records information about failed requests to the server. If someone tries to access a file that does not exist on the server, the server automatically generates an error message. Each of these error messages is recorded in the error log. It is located in the www/logs/ directory.

The three main sources of web log files are:
1. Client Log File
2. Proxy Log File
3. Server Log File

A log file contains the following fields:
- The client's host name or its IP address
- The client id (generally empty and represented by a "-")
- The user login (if applicable)
- The date and time of the request
- The operation type (GET, POST, HEAD, etc.)
- The requested resource name
- The request status
- The requested page size
- The user agent (a string identifying the browser and the operating system used)
- The referrer of the request, which is the URL of the web page containing the link that the user followed to get to the current page

User behavior can be best analyzed from the client log file because log files collected from clients are more reliable and accurate than server log files and proxy log files.

An extended log file contains a sequence of lines containing ASCII characters terminated by either the sequence LF or CRLF. Log file generators should follow the line termination convention for the platform on which they are executed. Analyzers should accept either form. Each line may contain either a directive or an entry. Entries consist of a sequence of fields relating to a single HTTP transaction [8]. Fields are separated by whitespace; the use of tab characters for this purpose is encouraged. If a field is unused in a particular entry, a dash "-" marks the omitted field. Directives record information about the logging process itself. Lines beginning with the # character contain directives. The following directives are defined:

Version: <integer>.<integer>
    The version of the extended log file format used [7][8]. This draft defines version 1.0.
Fields: [<specifier>...]
    Specifies the fields recorded in the log.
Software: string
    Identifies the software which generated the log.
Start-Date: <date> <time>
    The date and time at which the log was started.
End-Date: <date> <time>
    The date and time at which the log was finished.
Date: <date> <time>
    The date and time at which the entry was added.
Remark: <text>
    Comment information. Data recorded in this field should be ignored by analysis tools.

A sample web log format is shown in Figure 2.
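The rules above (LF or CRLF line endings, directives on lines beginning with #, whitespace-separated fields, and "-" for an omitted field) can be illustrated with a minimal Python sketch. The function name parse_extended_log, the sample log text and its field names are assumptions made for this illustration only; they are not taken from the WebAstro implementation, and quoted field values that contain spaces are deliberately not handled.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Dict[str, Optional[str]]


def parse_extended_log(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Entry]]:
    """Split an extended-format log into its directives and its entries."""
    directives: Dict[str, str] = {}
    fields: List[str] = []
    entries: List[Entry] = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # accept either LF or CRLF termination
        if not line.strip():
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: time cs-method cs-uri"
            name, _, value = line[1:].partition(":")
            name, value = name.strip(), value.strip()
            directives[name] = value
            if name == "Fields":
                fields = value.split()
            continue
        # Entry line: fields separated by whitespace, "-" marks an omitted field.
        values = line.split()
        entries.append({f: (None if v == "-" else v) for f, v in zip(fields, values)})
    return directives, entries


# Hypothetical sample log, used only to exercise the parser.
SAMPLE_LOG = (
    "#Version: 1.0\r\n"
    "#Date: 12-Jan-1996 00:00:00\n"
    "#Fields: time cs-method cs-uri\n"
    "00:34:23 GET /foo/bar.html\n"
    "12:21:16 GET -\n"
)

directives, entries = parse_extended_log(SAMPLE_LOG.splitlines(keepends=True))
```

Each entry is returned as a dictionary keyed by the names given in the Fields directive, so later steps can select only the columns they need.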

B. Markov's Model

The pages and hyperlinks of the World-Wide Web may be viewed as nodes and arcs in a directed graph. The relationship between sites and pages indicated by these hyperlinks gives rise to what is called a web graph. Viewed as a purely mathematical object, each page forms a node in this graph and each hyperlink forms a directed edge from one node to another. The trails that users follow through this graph are called navigation patterns, and they can be used to decide the next likely web page request based on statistically significant correlations. If a sequence occurs very frequently, it indicates the most likely traversal pattern. Because these patterns are sequential, Markov chains have been used to represent the navigation patterns of a web site [8][9].

Important properties of Markov chains:
1. A Markov chain is successful in sequence matching generation.
2. A Markov model depends on the previous state.
3. The Markov chain model is generative.
4. A Markov chain is a discrete-time stochastic process.

The Markov chain model is treated as a probability model and is used to provide the probability of the next link chosen when viewing a web page, taking into account the trail followed to reach that page. Our measure of the summarization ability of the model answers a question we have often been asked about the adequacy of Markov models in representing user web trails. We use three types of Markov model.

1. First Order Markov Model

Suppose we have a state space $S = \{S_1, S_2, \ldots, S_n\}$. The state at time t is denoted $S_t$ and the transition probability from state i to state j is denoted $P_{i,j}$. In a first-order Markov chain model the state probability depends only on the previous state; for example, the probability of state j depends on the previous state i. The transition probabilities are therefore given by

$$P_{i,j} = \Pr(S_t = j \mid S_{t-1} = i) \qquad (1)$$

Alternatively, if we consider states at different instants of time t, the state can be written as $S(t)$. If T denotes the number of states in a sequence, then for T = 4 an example sequence is $S^T = \{S_1, S_3, S_5, S_1\}$. This model uses the transition probability given by

$$P(S_j(t+1) \mid S_i(t)) = P_{ij} \qquad (2)$$

(3)

(4)
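As a minimal sketch of how first-order transition probabilities can be estimated in practice, the following Python fragment counts page-to-page transitions in the navigation patterns and occurrence counts reported in Table I and normalizes each row. Treating S and T as artificial start and terminal states, and the names first_order_probabilities and predict_next, are assumptions of this sketch rather than part of the published method.

```python
from collections import defaultdict

# Navigation patterns and their occurrence counts, as listed in Table I.
# Assumption: S and T are artificial start and terminal states of a session.
PATTERNS = [
    ("S A B C D T", 4), ("S E F G T", 8), ("S B C E F T", 4),
    ("S A C D T", 4), ("S B C D T", 6), ("S A C E T", 14),
    ("S B C T", 4), ("S D F G T", 2), ("S D F T", 10),
    ("S D T", 12), ("S B C D F T", 6), ("S E F T", 2),
]


def first_order_probabilities(patterns):
    """Estimate P[i][j] as count(i -> j) divided by count(i -> any page)."""
    counts = defaultdict(lambda: defaultdict(int))
    for trail, occurrences in patterns:
        pages = trail.split()
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += occurrences
    return {
        state: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for state, successors in counts.items()
    }


P = first_order_probabilities(PATTERNS)


def predict_next(state):
    """Return the most probable next page after the given state."""
    return max(P[state], key=P[state].get)
```

With these counts, 24 of the 76 recorded sessions move from S to D, so P["S"]["D"] equals 24/76 and predict_next("S") returns "D".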

2. Second Order Transition Probabilistic Model

We let $P_{i,kj}$ be the second-order transition probability, that is, the probability of the transition $(A_k, A_j)$ given that the previous transition that occurred was $(A_i, A_k)$. The second-order probabilities are estimated as follows:

(5)

Fig. 2. Web logs

We consider the same navigation patterns used in the previous paper, listed in Table I.

TABLE I. USER NAVIGATION PATTERN AND THEIR FREQUENCIES

Navigation Pattern    Occurrence
S A B C D T           4
S E F G T             8
S B C E F T           4
S A C D T             4
S B C D T             6
S A C E T             14
S B C T               4
S D F G T             2
S D F T               10
S D T                 12
S B C D F T           6
S E F T               2

With this model we found some problems; for example, the probabilities out of state C do not accurately reflect its actual behavior. The accuracy of the transition probabilities from a state can be increased by separating its in-paths.

3. Nth Order Markov Model

The Nth order Markov model solves the above problems. $P_{i,j}^{n}$ is the probability that the state at time t is j given that the state at time t-n was i. The n-order transition probability of the Markov model is denoted by

$$P_{i,j}^{n} = \Pr\{S_t = j \mid S_{t-n} = i\} \qquad (6)$$
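The remark about state C can be made concrete with a short Python sketch that estimates second-order probabilities from the Table I patterns by conditioning on the previous transition. The estimator used here, count(i, k, j) divided by count(i, k, any page), is the usual relative-frequency estimate and is assumed for illustration; the function name second_order_probabilities and the reading of S and T as start and terminal states are likewise assumptions.

```python
from collections import defaultdict

# Navigation patterns and their occurrence counts, as listed in Table I.
PATTERNS = [
    ("S A B C D T", 4), ("S E F G T", 8), ("S B C E F T", 4),
    ("S A C D T", 4), ("S B C D T", 6), ("S A C E T", 14),
    ("S B C T", 4), ("S D F G T", 2), ("S D F T", 10),
    ("S D T", 12), ("S B C D F T", 6), ("S E F T", 2),
]


def second_order_probabilities(patterns):
    """Estimate P2[(i, k)][j]: probability of k -> j given the previous step i -> k."""
    counts = defaultdict(lambda: defaultdict(int))
    for trail, occurrences in patterns:
        pages = trail.split()
        for i, k, j in zip(pages, pages[1:], pages[2:]):
            counts[(i, k)][j] += occurrences
    return {
        pair: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for pair, successors in counts.items()
    }


P2 = second_order_probabilities(PATTERNS)

# State C is reached from A and from B; the two in-paths behave differently.
after_a_c = P2[("A", "C")]   # next page after the transition A -> C
after_b_c = P2[("B", "C")]   # next page after the transition B -> C
```

In these counts, users who reach C from A continue to E in 14 of 18 cases, whereas users who reach C from B continue to D in 16 of 24 cases. A first-order model merges both in-paths into a single distribution for C, which is the inaccuracy that separating the in-paths is meant to remove.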

Fig. 3. Second Order Markov Model

C. Bayesian Theorem

Bayes' theorem is a theorem of probability. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. Bayesian networks (BNs), also known as belief networks, belong to the family of probabilistic graphical models (GMs) [5]. Their graphical structures represent knowledge about an uncertain domain. Each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. The theorem has been used in a wide variety of contexts; here, it is used to predict the user's most probable next request. Assume that X and Y are two events in a sample space S. Then

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \qquad (7)$$

In equation (7), X stands for a theory or hypothesis that we are interested in testing, and Y represents a new piece of evidence that seems to confirm or disconfirm the theory. In particular, P(X) represents our best estimate of the probability of the next user page request. It is known as the prior probability of X. What we want to discover is the probability that X is true supposing that our new piece of evidence is true. This is a conditional probability, the probability that one proposition is true provided that another proposition is true. Using this idea of conditional probability to express what we want to use Bayes' theorem to discover, we say that P(X|Y), the probability that X is true given that Y is true, is the posterior probability of X. The idea is that P(X|Y) represents the probability assigned to X after taking into account the new piece of evidence, Y.

To calculate this we need, in addition to the prior probability P(X), two further conditional probabilities indicating how probable our piece of evidence is depending on whether our theory is or is not true, because P(Y) = P(Y|X)P(X) + P(Y|~X)P(~X). We can represent these as P(Y|X) and P(Y|~X), where ~X is the negation of X, i.e. the proposition that X is false. The following procedure is used for predicting user behavior and for web site organization.

Experimental Methodology

Fig. 4. WebAstro Block Diagram

The WebAstro procedure for cleaning and analysis is as follows:

Step 1: Read the web log from the web log database (web server log file).
Step 2: Apply the DUSTER algorithm for refining the web logs:
- Clean HTML, XML, CSS and other tags from the web logs.
- Remove all jpeg, jpg and gif entries.
- Delete words like "and", "an", "is", etc.
- Keep the reduced-size log file in a separate folder named WEBASTRO.
Step 3: Sort the clean and refined web logs on the basis of date and time of visits.
Step 4: Prepare separate tables based on the following fields:
1. User IP Table (User Identification Table)
2. Pages Navigation Table (Transaction Identification Table)
3. Duration Table (Session Identification Table)
Step 5: Normalize the data tables.
Step 6: Initialize the IPADDRESS field to zero (0). Check whether the IP address is in the IP table or not. If yes, increment the IPADDRESS counter by one; else insert the IPADDRESS in the IP table.
Step 7: Initialize the PAGEVISIT field to zero (0). Check whether the PAGE address is in the PAGENAVIGATION table or not. If yes, increment the PAGEVISIT counter by one; else the page is invalid and Step 7 is repeated.
Step 8: Prepare the Transaction Matrix, Similarity Matrix and Relevance Matrix from Steps 4, 5, 6 and 7 until all data sets are in matrix form.
Step 9: Apply the K-means clustering algorithm on the refined testing data set and generate the proper clusters.

Let X = (X1, X2, X3, ..., Xn) be the set of n distinct users who visit P distinct pages in session Si.
Specific user = Xi
K = number of web pages visited by user Xi in the session
Select another user Xj from the set X.
If Xi and Xj belong to the same session, they have a common interest in the same web session; then
Session_count = Session_count + 1 (increment the session counter by 1)
and generate the matrix named VISITij for the number of times a web page is visited:
VISITij = [Matrix] {page i visited by web user j}

Similarly, generate the matrices for the following:
Page_count = Page_count + 1 (increment the page counter by 1)
Generate the matrix for the ith page visited by the jth user.
Time_count = Time_count + 1 (increment the time counter by 1)
Generate the matrix for the time spent by a user on a web page.
Assign the initial mean value for cluster K.
Plot the clusters using the specified matrices on the basis of session membership, page visits and time spent on the page.

Fig. 5. User Visit per hour Graph

Fig. 6. Page view Graph

Set the threshold value δ for the centroid and calculate the distance between different clusters.
Step 10: Apply fuzzy C-means clustering on the refined testing data set and generate the proper clusters.
Consider an unlabelled pattern set X = (X1, X2, X3, ..., Xn). The objective function is used to calculate the within-group sum of squares (WGSS):

$$\min J_m(U, W) = \sum_{i=1}^{N} \sum_{j=1}^{C} \mu_{ij}^{m}\, d_{ij}^{2}$$

where
N = number of patterns in X
C = number of clusters
W = cluster center vector
U = membership function matrix, whose elements are $\mu_{ij}$
$\mu_{ij}$ = degree of membership of $X_i$ in cluster j
$d_{ij}^{2} = \lVert X_i - C_j \rVert^{2}$
m = any real number greater than 1, with $1 \le m < \infty$
$C_j$ = the d-dimensional center of cluster j

Step 11: Find the optimized solution and predict the user behavior on the basis of the cluster results, density of clusters and distance between clusters, and compare with the Markov prediction model and the Bayesian model (two-way model).
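Step 10 can be illustrated with a minimal pure-Python sketch of fuzzy C-means. The paper gives only the objective function; the alternating center and membership updates used below are the standard FCM updates and are assumed here rather than taken from the WebAstro implementation. The function names, the random initialization and the sample data (pages visited and minutes spent per user) are hypothetical.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iterations=100, tol=1e-6, seed=0):
    """Alternate the standard FCM center and membership updates.

    Returns (centers, u), where u[i][j] is the degree of membership
    of points[i] in cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iterations):
        # Center update: mean of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
        shift = 0.0
        for i, p in enumerate(points):
            dists = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_row = [
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u


def objective(points, centers, u, m=2.0):
    """J_m(U, W): sum over i and j of u_ij ** m times the squared distance."""
    return sum(
        u[i][j] ** m * math.dist(p, ctr) ** 2
        for i, p in enumerate(points)
        for j, ctr in enumerate(centers)
    )


# Hypothetical users described by (pages visited, minutes spent on the site).
USERS = [(2, 1.0), (3, 1.5), (2, 2.0), (12, 9.0), (11, 10.0), (13, 9.5)]
centers, u = fuzzy_c_means(USERS, c=2)
labels = [max(range(2), key=row.__getitem__) for row in u]
```

Unlike K-means, every user keeps a graded membership in each cluster; the hard labels derived in the last line are shown only for comparison with a crisp clustering.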

D. EXPERIMENTAL RESULT

For evaluating the proposed technique, the database is selected from 14 colleges of Northern India universities and engineering colleges, in the form of web logs. The program is implemented in MATLAB and in Java. Only one week of data is taken here for the experimental results. We also check the complexity of the algorithm to show that the output of our approach is up to the mark and more efficient than the other approaches. The data contains a total of 256789 results per web log file, approximately 4503 visits per file. Before cleaning, the size of a single file is approximately 1.288 KB, and after cleaning all fields its size reduces to 498 KB. The proposed approach is developed in Java and the clustering technique is applied to the testing data set in MATLAB. After final optimization we find that our approach is simpler and more refined than the other approaches, and it gives more effective results for user behavior analysis.

Fig. 7. Page visit Graph

Fig. 8. Cluster Generation based on user identification

CONCLUSION AND FUTURE WORKS

The web is one of the main sources of information. The results are based on the evaluation of web log files of 14 colleges on busy and normal working days. After evaluation we find that the fuzzy logic approach defines the clusters more accurately and provides more accurate results, and that the prediction model based on the Markov chain and Bayesian theorem is more accurate with fuzzy clustering than with K-means clustering. For future work we should try to explore the use of these techniques in automated software for predicting users' next visit. This helps us in analyzing user behavior and understanding the nature of user navigation. The proposed approach helps in web site modification on the basis of user interest.

AUTHOR PROFILE:

Harish Kumar completed his M.Tech (IT) at Guru Gobind Singh Indraprastha University, Delhi. He is currently pursuing his Ph.D at Mewar University, Chittorgarh.

Prof. (Dr.) Anil Kumar Solanki did his PhD in CSE at Bundelkhand University. He has published a good number of papers in national and international journals.

REFERENCES

[1] G. Sudhamathy, C. J. Venkateswaran, "Web log clustering approaches - a survey", IJCSE, ISSN 0975-3397, Vol. 3, No. 7, July 2011.

[2] J. Vellingiri, S. Chenthur Pandian, "Fuzzy Possibilistic C-Means Algorithm for Clustering on Web Usage Mining to Predict the User Behavior", European Journal of Scientific Research, ISSN 1450-216X, Vol. 58, No. 2 (2011), pp. 222-230.

[3] Hussain Tasawar, Asghar Sohail and Fong Simon, "A hierarchical cluster based preprocessing methodology for Web Usage Mining", 6th International Conference on Advanced Information Management and Service (IMS), pp. 472-477, 2010.

[4] Vinita Shrivastava, Neetesh Gupta, "Performance Improvement of Web Usage Mining by Using Learning Based K-Mean Clustering", International Journal of Computer Science and its Applications, ISSN 2250-3765.

[5] Chu-Hui Lee, Yu-Hsiang Fu, "Two level prediction model for user's browsing behavior", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, Vol. I, IMECS 2008, 19-21 March 2008, Hong Kong.

[6] R. Khanchana and M. Punithavalli, "Web Usage Mining for Predicting Users' Browsing Behaviors by using FPCM Clustering", IACSIT International Journal of Engineering and Technology, Vol. 3, No. 5, October 2011.

[7] Harish, Anil Kumar, "Effective Cleaning of Educational Web Site Usage Patterns and Predicting their Next Visit", International Journal of Computer Applications (0975-8887), Volume 53, No. 4, September 2012.

[8] Harish, Anil Kumar, "Analysis of Educational Web Pattern Using Adaptive Markov Chain For Next Page Access Prediction", International Journal of Computer Science and Information Security, July 2011, Volume 9, No. 7.

[9] Bindu Madhuri, Dr. Anand Chandulal J., Ramya K., Phanidra M., "Analysis of Users' Web Navigation Behavior using GRPA with Variable Length Markov Chains", IJDKP.2011.1201.

[10] B. Ramesh Babu, R. Jeyshankar, "Websites of central university in India: A webometric Analysis", DESIDOC Journal of Library and Information Technology, Vol. 30, No. 4, July 2010.

[11] Harish, Anil Kumar, "Clustering algorithm employed in web usage mining: An overview", INDIACOMM-2011, ISSN 0973-7529, ISBN 978-93-80544-00-7.
