
CLUSTERING OF WEB USAGE DATA USING FUZZY TOLERANCE ROUGH SET SIMILARITY AND TABLE FILLING ALGORITHM

T. VIJAYA KUMAR & H. S. GURUPRASAD

Department of IS & E, BMS College of Engineering, Bull Temple Road, Bangalore, Karnataka, India

ABSTRACT

Web Usage Mining is the application of data mining techniques to learn usage patterns from Web server log files in order to understand and better serve the requirements of Web-based applications. It comprises three main steps: data preprocessing, pattern discovery, and analysis of the discovered patterns. One of the most important tasks in Web usage mining is to find groups of users exhibiting similar browsing patterns. Grouping Web transactions into clusters is important for understanding users' navigational behavior. Different types of clustering algorithms, such as partition-based, distance-based, density-based, grid-based, hierarchical, and fuzzy clustering algorithms, are used to find clusters in Web usage data. In this paper we propose an approach for clustering Web usage data based on fuzzy tolerance rough set theory and the table filling algorithm. First, we construct sessions using concept hierarchy and link information. The similarity between two sessions is approximated using a rough set tolerance relation. The tolerance relation is then reformulated into an equivalence relation using fuzzy tolerance. The clusters are obtained using a modified table filling algorithm. We provide experimental results of fuzzy rough set similarity and the table filling algorithm on the MSNBC Web navigation data set. We have also considered the server log files of the Website www.enggresources.com for overall study and analysis.

KEYWORDS: Web Usage Mining, Concept Hierarchy, Website Ontology, Rough Set Similarity, Fuzzy Tolerance,

Table Filling Algorithm

INTRODUCTION

The growth of the World Wide Web in terms of Web sites and their users over the last two decades has resulted in a large amount of data related to users' interactions with Web sites. This data is recorded in the log files of Web servers and is referred to as Web usage data. Web usage mining (WUM) uses data mining techniques to discover valuable information from Web usage data. WUM deals with the automatic discovery of user access patterns from one or more Web servers. Web usage mining comprises three main tasks: data preprocessing, cluster discovery, and cluster analysis. Data preprocessing consists of data cleaning, data transformation, and data reduction. Data cleaning routines clean the data by filling in missing values, smoothing noisy data, and resolving inconsistencies. In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data reduction techniques obtain a reduced representation of the data that is much smaller in volume yet closely maintains the integrity of the original data. Cluster discovery deals with forming groups of users exhibiting similar browsing patterns and obtaining groups of pages that are accessed together. Cluster analysis filters out uninteresting patterns from the user clusters and page clusters found in the cluster discovery phase. Clustering is a data mining technique that groups together items having similar characteristics. In the Web usage domain, two kinds of interesting clusters can be discovered: user clusters and page clusters.

International Journal of Computer Science Engineering

and Information Technology Research (IJCSEITR)

ISSN 2249-6831

Vol. 3, Issue 2, Jun 2013, 107-116

© TJPRC Pvt. Ltd.

This paper presents a new approach for finding session similarity using fuzzy rough set theory. Rough set theory deals with uncertainty and vagueness. The building block of rough set theory is the assumption that with every object of the universe we can associate some information in the form of data and knowledge. Objects clustered by the same information are similar with respect to the available information about them. The set similarity considered for two sessions is a tolerance relation, which is reflexive and symmetric but not necessarily transitive. Fuzzy tolerance is used to reform the tolerance relation into an equivalence relation. The indiscernibility-based fuzzy tolerance rough set similarity is then combined with the table filling algorithm to form the clusters. The table filling algorithm is used to minimize deterministic finite automata. The minimization problem is to find the unique minimal deterministic finite automaton that accepts the same language as a given deterministic finite automaton. Algorithms solving this problem are used in applications ranging from compiler construction to hardware circuit minimization.

The rest of the paper is organized as follows. Section 2 gives a brief description of the related work. Section 3 explains the proposed model. Section 4 covers the details of data preprocessing using concept hierarchy and Web site topology. Rough set theory and fuzzy tolerance are discussed in Section 5. The proposed approach using fuzzy tolerance rough set theory and the table filling algorithm is explained in Section 6. The experimental design and results are discussed in Section 7. Finally, we give our conclusion in Section 8.

LITERATURE SURVEY

Several researchers are working on Web usage mining and have contributed various methodologies and tools. A number of data mining methods have been used to generate models of usage patterns. Models based on association rules, clustering algorithms, sequential analysis, and Markov models have been used for discovering knowledge from Web usage data. These models are predominantly based on usage information from Web usage data alone. Significant improvement can be achieved by making use of domain knowledge, which is usually available from domain experts, content providers, and Web designers. Cooley et al. [1, 2] covered the Web usage mining process and the various steps involved in it; these works serve as primary references for understanding the fundamentals of Web usage mining. Along with the server log file, other sources of knowledge such as site content or structure and semantic domain knowledge can be used in Web usage mining [3]. In [4], Murat Ali Bayir et al. proposed a framework called Smart-Miner for the Web usage mining problem, which uses link information to produce accurate user sessions and frequent navigation patterns. Norwati Mustapha et al. [5] proposed a model for mining users' navigation patterns based on the Expectation Maximization algorithm, which is used for finding maximum likelihood estimates of parameters in probabilistic models. A complete framework for mining evolving user profiles in dynamic Websites is proposed in [6]; the authors also describe how to enrich the discovered user profiles with explicit information needs inferred from search queries extracted from Web log data. In [7], T. Vijaya Kumar et al. proposed a framework for finding useful information from Web usage data that uses Self Organizing Maps (SOM). Sessions are constructed using the concept hierarchy and the link information, and SOM is then used to form the clusters. In [8], Hannah et al. proposed an approach to obtain user profiles based on intelligent rough clustering techniques.

Their method provides efficient algorithms for finding hidden patterns in Web log data and is able to learn the number of clusters automatically from the given data. They give a two-fold approach for clustering user access patterns and retrieving effective user profiles from Web logs using Gaussian Rough (GR) clustering, Gaussian Rough Fuzzy (GRF) clustering, rough clustering, and rough fuzzy clustering. In [9], Rajhans Mishra et al. adopted similarity upper approximation based clustering of Web logs using various similarity metrics. In [10], K. Santhisree et al. presented a technique to cluster Web transactions based on set similarity measures from Web log data, which identifies the behavior of users' page visits and the order of occurrence of visits. They formed the Web data clusters using similarity upper approximations. In [11], Philip Hingston presented a method for mining interesting sequential patterns from large sequential data sets. In the first step, the data set is modeled in terms of a stochastic grammar or automaton. Queries about frequently occurring patterns in the data set are then answered by converting pattern frequencies into formulae concerning the model. In [12], Sunil Joshi et al. proposed a new algorithm, PMFLT (Pattern Mining using Formal Language Tools), for sequential pattern mining using formal language tools such as regular expression constraints. The algorithm finds only user-specific frequent sequences, in an efficient and optimized way compared to other existing algorithms.

SYSTEM DESIGN

The main goal of the proposed system is to find Web user clusters from Web server log files. We have adopted the Web usage mining system shown in Figure 1. The WUM system in our approach is partitioned into two modules. The first module covers the data preprocessing phase: data cleaning, user identification, and session construction. Sessions are constructed using Web site ontology and concept hierarchy. In the second module, Web usage clusters are formed using rough set theory, fuzzy tolerance, and the table filling algorithm.

Figure 1: Web Usage Clustering Process

DATA PREPROCESSING

Data preprocessing [13] comprises merging of log files from different Web servers, data cleaning, identification of users, sessions, and visits, data formatting, and summarization. Data cleaning consists of removing superfluous data from the log file. User identification deals with identifying unique clients of the Web server. A combination of IP address and user agent is used to identify a user uniquely. User identification can also be done using client-side cookies, but cookies can be disabled by the user for privacy reasons, and not every Website employs cookies. Session identification is the next step. A session is a sequence of requests made by a single user with a unique IP address on a particular Web domain during a specified period of time.

Time Oriented Approach

The most basic session definition comes from time-oriented heuristics, which are based on limits on total session time or page-stay time. They are divided into two categories with respect to the thresholds they use.

In the first, the duration of a session is limited by a predefined upper bound, usually taken as 30 minutes. A new page can be appended to the current session if its time difference from the first page does not violate the total session duration; otherwise, a new session is assumed to start with the new page request.

In the second, the time spent on any page is limited by a threshold, taken here as 10 minutes. If the difference between the timestamps of two consecutively accessed pages is greater than the threshold, the current session is terminated after the former page and a new session starts with the latter page.
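The two time-oriented heuristics can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `split_sessions`, the `(timestamp, page)` tuple layout, and the choice to apply both limits at once are assumptions made for the example.

```python
from datetime import datetime, timedelta

SESSION_LIMIT = timedelta(minutes=30)    # upper bound on total session duration
PAGE_STAY_LIMIT = timedelta(minutes=10)  # upper bound on time spent on one page


def split_sessions(requests, session_limit=SESSION_LIMIT, page_stay_limit=PAGE_STAY_LIMIT):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    current = []
    for timestamp, page in requests:
        if current:
            first_time = current[0][0]
            previous_time = current[-1][0]
            too_long = timestamp - first_time > session_limit
            too_idle = timestamp - previous_time > page_stay_limit
            if too_long or too_idle:
                # Close the current session; the new request starts the next one.
                sessions.append(current)
                current = []
        current.append((timestamp, page))
    if current:
        sessions.append(current)
    return [[page for _, page in session] for session in sessions]


start = datetime(2013, 1, 1, 9, 0)
requests = [
    (start, "A"),
    (start + timedelta(minutes=5), "B"),
    (start + timedelta(minutes=20), "C"),  # 15 minutes after B: page-stay limit exceeded
    (start + timedelta(minutes=25), "D"),
]
print(split_sessions(requests))  # [['A', 'B'], ['C', 'D']]
```

Passing a very large value for one of the two limits reduces the function to the other heuristic alone.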

Navigation Oriented Approach

The navigation-oriented approach [14, 15] uses the link information of the Website graph, which is present in the concept-based Website graph constructed using Website knowledge. In this approach, it is necessary to have a hyperlink between every two consecutive Web page requests.

Let $S = [p_1, p_2, \ldots, p_n]$ be a session containing Web pages in timestamp order. In this session, for every page $p_i$ except the initial page $p_1$, there must be at least one page $p_j$ in the session that refers to $p_i$ and has a smaller timestamp than $p_i$. This topology constraint forces user navigation to be considered according to some path in the Website graph.
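As a small illustration of the topology constraint, the check below tests whether every page after the first is referred to by some earlier page of the same session. The function name `satisfies_topology`, the `links` dictionary (a hypothetical adjacency view of the Website graph), and the page names are assumptions made for this sketch.

```python
def satisfies_topology(session, links):
    """Return True if every page after the first is linked from an earlier page.

    `session` is a list of pages in timestamp order; `links` maps a page to the
    set of pages it links to.
    """
    for i in range(1, len(session)):
        page = session[i]
        # Look for any earlier page in the session that refers to this page.
        if not any(page in links.get(earlier, set()) for earlier in session[:i]):
            return False
    return True


links = {"home": {"news", "sports"}, "news": {"article"}}
print(satisfies_topology(["home", "news", "article"], links))  # True
print(satisfies_topology(["home", "article"], links))          # False
```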

Concept-Matching Approach: This approach considers the concepts of Web pages from the concept-based Website graph. A page $p_{i+1}$ is added to a session as follows: if the concept names of pages $p_i$ and $p_{i+1}$ are the same, add $p_{i+1}$ to the current session; otherwise create a new session and add $p_{i+1}$ to it. That is, concept switching is taken as one more criterion for breaking a session [16].
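The concept-matching rule can be sketched as follows. The mapping `concept_of` from page to concept name and the example page paths are hypothetical; in the paper this information comes from the concept-based Website graph.

```python
def split_on_concept_switch(pages, concept_of):
    """Start a new session whenever the concept of the next page differs."""
    sessions = []
    for page in pages:
        if sessions and concept_of[sessions[-1][-1]] == concept_of[page]:
            sessions[-1].append(page)   # same concept: extend the current session
        else:
            sessions.append([page])     # concept switch: break the session
    return sessions


concept_of = {"/c/intro": "C", "/c/pointers": "C", "/java/intro": "Java"}
visited = ["/c/intro", "/c/pointers", "/java/intro"]
print(split_on_concept_switch(visited, concept_of))
# [['/c/intro', '/c/pointers'], ['/java/intro']]
```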

FUZZY TOLERANCE ROUGH SET SIMILARITY

In this section we present a fuzzy rough set theoretic approach to cluster user access transactions over the Web. The presented approach is based on the table filling algorithm. Rough set theory is based on the assumption that with every object of the universe, some information in the form of data and knowledge can be associated. Objects clustered by the same information are similar, and the similarity generated from this information forms the basis of rough set theory. Given two transactions $x$ and $y$, the sequence and set similarity measure proposed in [17] is considered for our study. Sequence similarity measures the amount of similarity in the order of occurrence of pages within two page sequences.

The sequence similarity measure is given in equation (1). The length of the longest common subsequence (LLCS) relative to the length of the longer sequence determines the sequence similarity of two sequences. LLCS can be calculated by a dynamic programming approach [18]. Set similarity is defined as the ratio of the number of common pages to the number of unique pages in two page sequences. The set similarity measure is given in equation (2). The sequence set similarity metric satisfies non-negativity, symmetry, and normalization, and hence qualifies as a proper similarity metric [19]. The sequence set similarity measure is given in equation (3).

\[ \mathrm{SeqSim}(x, y) = \frac{\mathrm{LLCS}(x, y)}{\max(|x|, |y|)} \qquad (1) \]

\[ \mathrm{SetSim}(x, y) = \frac{|x \cap y|}{|x \cup y|} \qquad (2) \]

\[ S(x, y) = p \cdot \mathrm{SeqSim}(x, y) + q \cdot \mathrm{SetSim}(x, y) \qquad (3) \]

In equation (2), $|x \cap y|$ is the number of distinct pages common to both sequences and $|x \cup y|$ is the number of distinct pages appearing in either. Here $p + q = 1$ and $p, q \geq 0$. The values of $p$ and $q$ determine the relative weights of sequence similarity and set similarity. From the above definition, $S(x, y) \in [0, 1]$: $S(x, y) = 1$ when the two transactions $x$ and $y$ are exactly the same, and $S(x, y) = 0$ when they have no items in common. This measure of similarity gives information about users' access patterns related to their common areas of interest and common navigational patterns. The navigation of any two users over a Web site may not be exactly identical but may have some common interesting patterns. Moreover, the same user can navigate the same pattern in different ways.
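The sequence set similarity can be computed directly from its definition: LLCS over the length of the longer sequence, combined with the ratio of common to unique pages. The sketch below is an illustration only. The function names `llcs` and `s3m` are chosen for the example, and the weights p = q = 0.5 are an assumption, since the text does not state the values used. With these weights, the pairs checked below round to the corresponding entries listed in Figure 3. Both sequences are assumed to be non-empty.

```python
def llcs(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(y) + 1)
    for a in x:
        current = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def s3m(x, y, p=0.5, q=0.5):
    """Sequence set similarity: p * SeqSim + q * SetSim, with p + q = 1."""
    seq_sim = llcs(x, y) / max(len(x), len(y))
    set_sim = len(set(x) & set(y)) / len(set(x) | set(y))
    return p * seq_sim + q * set_sim


# Sessions T1, T2, T4 and T5 from the sample data in Figure 2.
t1 = [6, 7, 7, 7, 6, 7]
t2 = [2, 12, 3, 4, 12, 12]
t4 = [1, 1, 12, 2, 2, 4]
t5 = [6, 8, 8, 8, 8, 12]
print(round(s3m(t4, t2), 2))  # 0.47
print(round(s3m(t5, t1), 2))  # 0.21
```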

The similarity between the navigational behaviors of two users is modeled by a binary relation $R$ defined on the set of transactions $T$. For any threshold value $\delta \in (0, 1]$ and any two user transactions $x, y \in T$, the binary relation $R$ on $T$, denoted $xRy$, is defined by $xRy$ iff $S(x, y) \geq \delta$. The similarity class of $t$, denoted SimClass(t), is the set of transactions that are similar to $t$. It is given by SimClass(t) = {s ∈ T : sRt}. Different threshold values give different similarity classes. A domain expert can choose the threshold based on experience to get proper similarity classes. For a fixed threshold $\delta \in [0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.

The relation $R$ is a tolerance relation: it is both reflexive and symmetric, but transitivity may not always hold. Let a, b and c be three different transactions. For a specified threshold, if a is similar to b and b is similar to c, a may still not be similar to c. A tolerance (or proximity) relation is a relation that exhibits only the properties of reflexivity and symmetry. A tolerance relation $R$ can be reformed into an equivalence relation by at most $(n-1)$ compositions with itself, where $n$ is the cardinal number of the set defining $R$. A fuzzy relation $R$ on a single universe $X$ is also a relation from $X$ to $X$. It is a fuzzy equivalence relation if all three of the following properties hold for its matrix representation:

Reflexivity: $\mu_R(x_i, x_i) = 1$

Symmetry: $\mu_R(x_i, x_j) = \mu_R(x_j, x_i)$

Transitivity: if $\mu_R(x_i, x_j) = \lambda_1$ and $\mu_R(x_j, x_k) = \lambda_2$, then $\mu_R(x_i, x_k) = \lambda$ where $\lambda \geq \min[\lambda_1, \lambda_2]$
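One common way to carry out the composition of a fuzzy tolerance relation with itself is max-min composition. The text does not name the composition operator, so max-min is an assumption here, as are the function names and the 3 x 3 example matrix. The sketch composes the relation with itself until it stops changing.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def to_equivalence(relation):
    """Compose a fuzzy tolerance relation with itself until it is transitive."""
    current = relation
    # Each pass squares the relation, so n - 1 passes is a safe upper bound.
    for _ in range(len(relation) - 1):
        composed = max_min_compose(current, current)
        if composed == current:
            break
        current = composed
    return current


tolerance = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.4],
    [0.0, 0.4, 1.0],
]
for row in to_equivalence(tolerance):
    print(row)
# [1.0, 0.8, 0.4]
# [0.8, 1.0, 0.4]
# [0.4, 0.4, 1.0]
```

In the example, the entry relating the first and third elements rises from 0 to 0.4, which is min(0.8, 0.4), so the transitivity condition holds for the result.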

MODIFIED TABLE FILLING ALGORITHM

The indiscernibility-based fuzzy tolerance rough set similarity is combined with the table filling algorithm to form the clusters. The table filling algorithm is used to minimize deterministic finite automata. The minimization problem is to find the unique minimal deterministic finite automaton that accepts the same language as a given deterministic finite automaton.

Algorithm: Modified table filling algorithm with Fuzzy tolerance rough set based similarity.

Input: A set $T$ of $n$ transactions

Threshold: $\delta \in (0, 1]$

Similarity measure: $S(x, y)$

Output: Web usage clusters

Procedure Mark

Step 1: Remove sessions with session length < minimum session length.

Step 2: Consider all pairs of sessions $(x, y)$ and construct the similarity matrix using the similarity measure.

Step 3: Repeat the following until no previously unmarked pairs are marked.

For all pairs $(x, y)$, if $S(x, y) > \delta$ then sessions $x$ and $y$ are indistinguishable or equivalent.

The sessions are placed in the same cluster.

Procedure Reduce

Construct Session Clusters

Step 1: Use procedure Mark to find all pairs of similar sessions. Use the fuzzy tolerance rough set similarity measure to find the similarity class of each session using SimClass(t) = {s ∈ T : sRt}.

Step 2: Each group of equivalent sessions must be placed in a single cluster to form session clusters.
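The listing leaves several details open, so the following is only one possible reading of the Mark and Reduce procedures, not the authors' implementation. It marks a pair when its similarity exceeds the threshold, keeps marking pairs that are connected through an already marked pair until nothing changes, and then groups the marked sessions. The function names and the 4 x 4 similarity matrix are hypothetical, and the matrix is assumed to be symmetric. Because the marks are closed under transitivity, the clusters produced by this reading are disjoint.

```python
def mark(similarity, delta):
    """Return a table where same[i][j] is True if sessions i and j are marked equivalent."""
    n = len(similarity)
    same = [[i == j or similarity[i][j] > delta for j in range(n)] for i in range(n)]
    changed = True
    while changed:  # repeat until no previously unmarked pair is marked
        changed = False
        for i in range(n):
            for j in range(n):
                if not same[i][j] and any(same[i][k] and same[k][j] for k in range(n)):
                    same[i][j] = True
                    changed = True
    return same


def reduce_to_clusters(same):
    """Place each group of equivalent sessions in a single cluster of session indices."""
    clusters, assigned = [], set()
    for i in range(len(same)):
        if i not in assigned:
            cluster = [j for j in range(len(same)) if same[i][j]]
            clusters.append(cluster)
            assigned.update(cluster)
    return clusters


similarity = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.5, 0.0],
    [0.1, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(reduce_to_clusters(mark(similarity, delta=0.3)))  # [[0, 1, 2], [3]]
```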

EXPERIMENTAL DESIGN AND RESULTS

Description of the Dataset

The data come from the UCI dataset repository and consist of Internet Information Server (IIS) logs for msnbc.com and the news-related portions of msn.com. Each sequence in the dataset corresponds to the page views of a user. Each event in the sequence corresponds to a user's request for a page. Requests are recorded at the level of page categories, as determined by the site administrator.

There are 17 page categories, namely 'front page', 'news', 'tech', 'local', 'opinion', 'on-air', 'misc', 'weather', 'health', 'living', 'business', 'sports', 'summary', 'bulletin board service', 'travel', 'msn-news', and 'msn-sports'. Each page category is represented by an integer label.

For example, 'front page' is coded as 1, 'news' is coded as 2, 'tech' is coded as 3, etc. Each row describes the hits of a single user. Figure 2 shows an example of the Web navigation data. The similarity table is computed using sequence set similarity and is shown in Figure 3.

T1: 6 7 7 7 6 7

T2: 2 12 3 4 12 12

T3: 14 14 14 14 14 14

T4: 1 1 12 2 2 4

T5: 6 8 8 8 8 12

T6: 6 6 6 6 3 14

T7: 1 14 14 1 1 2

T8: 1 1 1 1 1 14

T9: 2 2 15 5 5 16

T10: 1 11 1 2 2 14

Figure 2: Sample Web Navigation Data




      T1    T2    T3    T4    T5    T6    T7    T8    T9
T2    0
T3    0     0
T4    0     0.47  0
T5    0.21  0.17  0     0.17
T6    0.29  0.17  0.25  0     0.18
T7    0     0.17  0.33  0.45  0     0.18
T8    0     0     0.33  0.27  0     0.21  0.58
T9    0     0.15  0     0.24  0     0     0.17  0
T10   0     0.15  0.21  0.5   0     0.17  0.62  0.5   0.24

Figure 3: Similarity Table Using Sequence Set Similarity

Assuming a threshold value of 0.2, the similarity classes are as follows.

SimClass(T1) = {T1, T5, T6}

SimClass(T2) = {T2, T4}

SimClass(T3) = {T3, T6, T7, T8, T10}

SimClass(T4) = {T2, T4, T7, T8, T9, T10}

SimClass(T5) = {T1, T5}

SimClass(T6) = {T1, T3, T6, T8}

SimClass(T7) = {T3, T4, T7, T8, T10}

SimClass(T8) = {T3, T4, T6, T7, T8, T10}

SimClass(T9) = {T4, T9, T10}

SimClass(T10) = {T3, T4, T7, T8, T9, T10}
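The similarity classes can be recomputed from the lower-triangular table in Figure 3. The sketch below copies the table values as printed and collects, for each session, every session whose similarity reaches the threshold. The names `ROWS`, `similarity`, and `sim_class` are chosen for this example. The comparison is written as S >= δ; on this data a strict comparison gives the same classes, because no entry equals 0.2.

```python
# Lower-triangular similarity table from Figure 3: the row is the later
# session, the columns are T1, T2, ... up to the session before it.
ROWS = {
    "T2": [0],
    "T3": [0, 0],
    "T4": [0, 0.47, 0],
    "T5": [0.21, 0.17, 0, 0.17],
    "T6": [0.29, 0.17, 0.25, 0, 0.18],
    "T7": [0, 0.17, 0.33, 0.45, 0, 0.18],
    "T8": [0, 0, 0.33, 0.27, 0, 0.21, 0.58],
    "T9": [0, 0.15, 0, 0.24, 0, 0, 0.17, 0],
    "T10": [0, 0.15, 0.21, 0.5, 0, 0.17, 0.62, 0.5, 0.24],
}
NAMES = ["T%d" % i for i in range(1, 11)]


def similarity(a, b):
    """Look up S(a, b) in the table; a session is fully similar to itself."""
    if a == b:
        return 1.0
    i, j = sorted((NAMES.index(a), NAMES.index(b)))
    return ROWS[NAMES[j]][i]


def sim_class(t, delta):
    """SimClass(t) = {s in T : S(s, t) >= delta}."""
    return [s for s in NAMES if similarity(s, t) >= delta]


print(sim_class("T1", 0.2))  # ['T1', 'T5', 'T6']
print(sim_class("T9", 0.2))  # ['T4', 'T9', 'T10']
```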

Initially, {T1, T5, T6} and {T2, T4} form two separate clusters.

Based on the Fuzzy tolerance rough set similarity we get the following clusters.

C1 = {T1, T5, T6}

C2 = {T2, T4, T7, T8, T9, T10}

C3 = {T3, T4, T6, T7, T8, T9, T10}

Some transactions belong to multiple clusters. Different clusters can be formed by choosing different threshold values. We have considered the Web server log file from the Web site www.enggresources.com for our experimental study, and a concept-based Website graph is constructed as additional input. Error records and requests for images and multimedia files are removed from the server log file using a tool called Web log filter. This process removes requests concerning non-analyzed resources such as images, multimedia files, and page style files (*.css). IP address, timestamp, user agent, request, and referrer are retained for further processing. In user identification, a combination of IP address and user agent is used to identify a unique user.
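The cleaning and user identification steps can be sketched as below. This is not the Web log filter tool mentioned in the text. The record layout (dictionaries with `ip`, `agent`, `request`, and `status` keys), the list of skipped extensions, and the rule that a status of 400 or above counts as an error record are all assumptions made for the illustration.

```python
import os
from urllib.parse import urlparse

# Illustrative list of non-analyzed resource types (images, multimedia, style files).
SKIPPED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".mp3", ".avi"}


def is_page_request(record):
    """Keep non-error requests that are not for images, multimedia or style files."""
    path = urlparse(record["request"]).path
    extension = os.path.splitext(path)[1].lower()
    return record["status"] < 400 and extension not in SKIPPED_EXTENSIONS


def group_by_user(records):
    """Identify a user by the (IP address, user agent) pair."""
    users = {}
    for record in filter(is_page_request, records):
        users.setdefault((record["ip"], record["agent"]), []).append(record["request"])
    return users


records = [
    {"ip": "10.0.0.1", "agent": "Firefox", "request": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "Firefox", "request": "/logo.png", "status": 200},
    {"ip": "10.0.0.1", "agent": "Chrome", "request": "/notes.html?unit=2", "status": 200},
    {"ip": "10.0.0.2", "agent": "Firefox", "request": "/missing.html", "status": 404},
]
print(group_by_user(records))
# {('10.0.0.1', 'Firefox'): ['/index.html'], ('10.0.0.1', 'Chrome'): ['/notes.html?unit=2']}
```

Two requests from the same IP address with different user agents are treated as two users, which matches the identification rule described in the text.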

In session construction, we have combined two basic approaches, the time-oriented approach and the navigation-oriented approach, along with the concept name match approach for identifying user sessions. The page-stay time threshold and session timeout threshold are set to 10 and 30 minutes respectively. Each Web page is assigned a unique index, and every unique session is also given a unique index. 10217 users and 25814 sessions were discovered from preprocessing. The similarity table is constructed using sequence set similarity. Experiments are conducted by randomly selecting 100, 200, 300, 400, and 500 sessions from the preprocessed data with threshold values $\delta = 0.2$ and $\delta = 0.3$.

As the number of records increases, the number of clusters formed also increases. The graphs of the number of sessions versus the number of clusters with threshold values $\delta = 0.2$ and $\delta = 0.3$ are shown in Figure 4(a) and 4(b) respectively.

Figure 4(a): Graph Depicting Number of Sessions versus Number of Clusters with $\delta = 0.2$

Figure 4(b): Graph Depicting Number of Sessions versus Number of Clusters with $\delta = 0.3$

CONCLUSIONS

Clustering of Web user transactions can be used to find interesting user access patterns from Web server log files. In this paper we have proposed an approach for finding clusters of Web sessions using fuzzy tolerance rough set theory and the table filling algorithm. These clusters represent groups of users exhibiting similar browsing patterns. These patterns can be used to provide a set of recommendations that the Web site administrator can deploy for Website enhancement. Traditional clustering methods create clusters by describing the members of each cluster, whereas rough set based clustering techniques create clusters by describing the main characteristics of each cluster. In this work, we introduced the fuzzy tolerance rough set similarity measure along with the table filling algorithm. The proposed approach allows merging of two or more clusters. We investigated our approach on the MSNBC Web navigation data set and also conducted experiments on the server log files of the Website www.enggresources.com to form clusters.






REFERENCES

1. R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web", Ninth IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, USA, 1997, Pages 558-567.

2. J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, "Web usage mining: discovery and applications of usage patterns from Web data", ACM SIGKDD Explorations Newsletter, Volume 1, Pages 12-23, 2000.

3. Bamshad Mobasher, Chapter 12, "Web Usage Mining in Data Collection and Pre-Processing", ACM SIGKDD 2007, Pages 450-483.

4. Murat Ali Bayir, Ismail Hakki Toroslu, Guven Fidan, and Ahmet Cosar, "Smart Miner: A New Framework for Mining Large Scale Web Usage Data", ACM 2009.

5. Norwati Mustapha, Manijeh Jalali, and Mehrdad Jalali, "Expectation Maximization Clustering Algorithm for User Modeling in Web Usage Mining Systems", European Journal of Scientific Research, ISSN 1450-216X, Volume 32, Number 4 (2009), Pages 467-476.

6. Olfa Nasraoui, Maha Soliman, Esin Saka, Antonio Badia, and Richard Germain, "Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites", IEEE Transactions on Knowledge and Data Engineering, Volume 20, Number 2, February 2008.

7. T. Vijaya Kumar, Dr. H. S. Guruprasad, "Clustering Web Usage Data using Concept hierarchy and Self Organizing Maps", International Journal of Computer Applications (0975-8887), Volume 56, No. 18, October 2012, www.ijcaonline.org.

8. H. Hannah Inbarani, K. Thangavel, "Rough set based User profiling for Web Personalization", International Journal of Recent Trends in Engineering, Vol. 2, No. 1, November 2009.

9. Rajhans Mishra and Pradeep Kumar, "Clustering Web Logs Using Similarity Upper Approximation with Different Similarity Measures", International Journal of Machine Learning and Computing, Vol. 2, No. 3, June 2012.

10. K. Santhisree and Dr. A. Damodaram, "Clustering on Web usage data using Approximations and Set Similarities", International Journal of Computer Applications (0975-8887), Volume 1, No. 4, 2010.

11. Philip Hingston, "Using Finite State Automata for Sequence Mining", Proceedings of the twenty-fifth Australian conference on computer science, Volume 4, Pages 105-110, Australian Computer Science Communications, Vol. 24, Issue 1, Jan-Feb 2002.

12. Sunil Joshi, Dr. R. S. Jadon, and Dr. R. C. Jain, "Sequential Pattern Mining Using Formal Language Tools", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No. 2, September 2012, ISSN (Online): 1694-0814.

13. G. T. Raju and P. S. Satyanarayana, "Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology", IJCSNS International Journal of Computer Science and Network Security, Volume 8, Number 1, January 2008.

14. C. Shahabi and F. B. Kashani, "Efficient and anonymous Web-usage mining for Web personalization", INFORMS Journal on Computing, 15(2), Pages 123-147, 2003.

15. M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, "A framework for the evaluation of session reconstruction heuristics in Web usage analysis", INFORMS Journal on Computing, 15(2), Pages 171-190, 2003.

16. T. Vijaya Kumar, Dr. H. S. Guruprasad, Bharath Kumar K. M., Irfan Baig, and Kiran Babu S., "A New Web Usage Mining approach for Website recommendations using Concept hierarchy and Website Graph", International Journal of Computer and Electrical Engineering (IJCEE), ISSN: 1793-8198 (online version); 1793-8163 (print version).

17. P. Kumar, M. V. Rao, P. R. Krishna, R. S. Bapi, and A. Laha, "Intrusion detection system using sequence and set preserving metric", Proceedings of IEEE International Conference on Intelligence and Security Informatics, LNCS, Springer Verlag, Atlanta, 2005, pp. 498-504.

18. L. Bergroth, H. Hakonen, and T. Raita, "A survey of longest common subsequence algorithm", Seventh International Symposium on String Processing and Information Retrieval, Atlanta, 2000, pp. 39-48.

19. Pradeep Kumar, P. Radha Krishna, Raju S. Bapi, and Supriya Kumar De, "Rough clustering of sequential data", Data & Knowledge Engineering 63 (2007) 183-199, www.elsevier.com/locate/datak.
