Model-Based Clustering and Visualization of Navigation
Patterns on a Web Site
I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White
Presented by Motaz El Saban
04/19/23, Data Mining, Spring '03
Outline of the talk
Introduction and problem definition
Model-based clustering
Model learning
Application to Msnbc.com IIS log data
Data visualization
Scalability
Why mixtures of first-order Markov models?
Conclusions
Future work
Introduction
A new methodology for analyzing web navigation patterns (a form of human behavior in digital environments).
Patterns: sequence of URL categories traversed by users, stored in web-server logs for a duration of 24 hours on msnbc.com.
Functionality:
– Clustering users based on navigation patterns.
– Visualization (the WebCANVAS tool).
Web data analysis approach
Clustering:
– Partition users into clusters so that users with similar dynamic behavior fall in the same cluster.
Visualization:
– Display the behavior of the users within each cluster.
Related Work
Most previous work on Web navigation patterns and visualization uses non-probabilistic methods [YAN96] [CHE98], mostly finding rules that govern navigation patterns.
Other work used probabilistic methods to predict user behavior on Web pages, but not for clustering: random-walk models [HUB97], Markov models for pre-fetching pages [PAD96], and kth-order Markov models for predicting the next probable link [BOR00].
These approaches use a single Markov model for all users rather than first clustering the users.
Related Work
On the clustering side, [FU00] applied BIRCH to cluster user web navigation patterns.
No prior work is known that uses probabilistic clustering for sequence-based clustering and visualization of Web navigation.
Rather, user history has been visualized using visual metaphors of maps, paths, and signposts [WEX99].
[MIN99] use planar graphs to visualize crowds of users at particular web pages.
What do we mean by pattern?
Challenges
Web navigation patterns are dynamic: no static technique, such as a histogram, can capture them → Markov models.
Different users have heterogeneous dynamic behavior → a mixture of models.
Large data size.
– The proposed algorithm for learning the mixture of 1st-order Markov models has runtime O(KNL + KM²), where K = # clusters, N = # sequences, L = average sequence length, and M = # of Web page categories. For typically small M, the algorithm scales linearly with N and K.
– Hierarchical clustering methods scale as O(N²).
Model-Based Clustering
Assume the data are generated as follows:
– A user arrives at the web site and is assigned to one of K clusters with some probability, and
– given that the user is in this cluster, his behavior is generated from a statistical model specific to that cluster.
Let X be a multivariate random variable taking on values corresponding to the behavior of individual users.
Let C be a discrete-valued variable taking on values c_1, ..., c_K, corresponding to the unknown cluster assignment for a user.
Model-Based Clustering
A mixture model for X with K components has the form:

p(X | θ) = Σ_{k=1}^{K} p(c_k | θ) p(X | c_k, θ_k)

where p(c_k | θ) is the marginal probability of the kth cluster, p(X | c_k, θ_k) is the statistical model describing the distribution over the behavior of users in the kth cluster, and θ denotes the parameters of the model.
Model-Based Clustering
In our case X = (X1,…,XL) is a sequence of variables describing the user’s path through the website.
Xi takes on some value xi from the M different page categories.
Each component of the model is a 1st-order Markov model:

p(X | c_k, θ_k) = p(x_1 | θ_k^I) Π_{i=2}^{L} p(x_i | x_{i-1}, θ_k^T)

where θ_k^I denotes the parameters of the probability distribution over the initial page-category request among users in cluster k, and θ_k^T denotes the parameters of the probability distributions over transitions from one category to the next by a user in cluster k.
Both distributions are taken to be multinomial.
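As an illustration, the per-component likelihood can be computed as follows (a minimal Python sketch; the function and variable names are my own, not from the paper):

```python
import numpy as np

def markov_log_likelihood(seq, init_probs, trans_probs):
    """Log-likelihood of a category sequence under one 1st-order
    Markov component: log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    ll = np.log(init_probs[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        ll += np.log(trans_probs[prev, cur])
    return ll

# Toy example with M = 2 page categories
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
print(markov_log_likelihood([0, 0, 1], init, trans))
```

Working in log space avoids underflow for long sequences, which matters since the msnbc.com data contain sequences with many events.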
Model-Based Clustering
The EM algorithm is used to learn the model parameters. Once learned, the model can assign a user to the cluster k that maximizes the membership probability:

p(c_k | X, θ) = p(c_k | θ) p(X | c_k, θ_k) / Σ_{j=1}^{K} p(c_j | θ) p(X | c_j, θ_j)

The user class assignment may be soft or hard.
Learning Mixture Models from Data
Assume the number of clusters K is known. Training data d_train = {x_1, ..., x_N}, with an i.i.d. assumption.
MAP estimate of θ:

θ_MAP = argmax_θ p(θ | d_train)
      = argmax_θ p(d_train | θ) p(θ) / p(d_train)
      = argmax_θ [ Π_{i=1}^{N} p(x_i | θ) ] p(θ) / p(d_train)
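If, as is standard for multinomial parameters, Dirichlet priors are used, the MAP estimate has a simple closed form. A sketch (the helper name and the toy prior are illustrative, not taken from the paper):

```python
import numpy as np

def map_multinomial(counts, alpha):
    """MAP estimate of multinomial parameters theta_j under a
    Dirichlet(alpha) prior: theta_j = (n_j + alpha_j - 1) / (N + sum(alpha) - M)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    num = counts + alpha - 1.0   # posterior-mode numerator per category
    return num / num.sum()

# Toy example: 3 and 1 observed requests, a Dirichlet(2, 2) prior
print(map_multinomial([3, 1], [2, 2]))  # → [0.66666667 0.33333333]
```

With alpha = 1 everywhere this reduces to the maximum-likelihood estimate (raw relative frequencies); alpha > 1 smooths the estimate away from zero counts.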
EM learning algorithm (briefly)
An iterative method to find local maxima of the MAP problem for θ.
The problem at hand involves two sub-problems:
– Compute user class assignments (membership probabilities).
– Compute class parameters.
A chicken-and-egg problem!
EM learning algorithm (briefly)
EM approach:
– E-step: given the current value of the parameters θ, assign a user with behavior X to cluster c_k using the membership probabilities.
– M-step: pretend that these assignments correspond to real data, and reassign θ to be the MAP estimate given this fictitious data.
– Stop iterating when two consecutive iterations produce log likelihoods on the training data that differ by less than p% (0.01% in the paper).
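The E-step/M-step loop above can be sketched for a mixture of 1st-order Markov chains. This is a simplified version with light additive smoothing in place of the paper's MAP prior; all names are illustrative, not from the paper:

```python
import numpy as np

def em_markov_mixture(seqs, K, M, n_iter=50, tol=1e-4, seed=0):
    """Fit a K-component mixture of 1st-order Markov chains over an
    alphabet of M page categories to a list of integer sequences."""
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)                   # p(c_k)
    init = rng.dirichlet(np.ones(M), size=K)        # p(x_1 | c_k)
    trans = rng.dirichlet(np.ones(M), size=(K, M))  # p(x_i | x_{i-1}, c_k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log p(c_k, x) for every sequence and component
        log_r = np.zeros((len(seqs), K))
        for n, s in enumerate(seqs):
            for k in range(K):
                ll = np.log(weights[k]) + np.log(init[k, s[0]])
                for a, b in zip(s[:-1], s[1:]):
                    ll += np.log(trans[k, a, b])
                log_r[n, k] = ll
        norm = np.logaddexp.reduce(log_r, axis=1)
        r = np.exp(log_r - norm[:, None])           # soft memberships
        # M-step: re-estimate parameters from soft counts (+ smoothing)
        weights = r.mean(axis=0)
        init_c = np.full((K, M), 1e-3)
        trans_c = np.full((K, M, M), 1e-3)
        for n, s in enumerate(seqs):
            init_c[:, s[0]] += r[n]
            for a, b in zip(s[:-1], s[1:]):
                trans_c[:, a, b] += r[n]
        init = init_c / init_c.sum(axis=1, keepdims=True)
        trans = trans_c / trans_c.sum(axis=2, keepdims=True)
        # Stop when the relative change in log-likelihood is small
        ll_total = norm.sum()
        if abs(ll_total - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll_total
    return weights, init, trans, r
```

Per iteration this loop touches every event of every sequence once for each of the K components, matching the O(KNL) term of the stated runtime; the M-step's transition tables account for the O(KM²) term.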
How to choose K?
Letting the site administrator try several values of K and choose the one most convenient for visualization is too time consuming. Rather,
choose K by finding the model that accurately predicts N_t new test cases d_test = {x_{N+1}, ..., x_{N+N_t}}. That is, choose the model with K clusters that minimizes the out-of-sample predictive log score:

Score(K, d_test) = - Σ_{i=N+1}^{N+N_t} log_2 p(x_i | θ^K) / Σ_{i=N+1}^{N+N_t} length(x_i)

i.e., the negative log-likelihood of the test data under the K-cluster model, normalized by the total number of page requests.
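The score can be sketched as follows (illustrative names; assumes integer-coded category sequences and per-component `init`/`trans` multinomial tables):

```python
import numpy as np

def predictive_log_score(test_seqs, weights, init, trans):
    """Out-of-sample predictive log score: negative total log2-likelihood
    of the test sequences divided by the total number of page requests."""
    total_bits, total_len = 0.0, 0
    for s in test_seqs:
        # log p(c_k, x) for each mixture component
        comp = np.log(weights) + np.log(init[:, s[0]])
        for a, b in zip(s[:-1], s[1:]):
            comp += np.log(trans[:, a, b])
        total_bits -= np.logaddexp.reduce(comp) / np.log(2)
        total_len += len(s)
    return total_bits / total_len

# Toy check: a single fair-coin-like component costs 1 bit per symbol
w = np.array([1.0])
init_p = np.array([[0.5, 0.5]])
trans_p = np.array([[[0.5, 0.5], [0.5, 0.5]]])
print(predictive_log_score([[0, 1, 0]], w, init_p, trans_p))  # → 1.0
```

The score is in bits per page request, so models with different K are directly comparable on the same test set.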
Application to Msnbc.com
Each sequence in the data set corresponds to page views of a user during a twenty-four hour period.
Each event in the sequence corresponds to a user request for a page. The event denotes a page category rather than a URL.
Example categories are: frontpage, news, tech, …
The number of URLs per category ranges from 10 to 5000.
Only the order in which pages are requested is modeled (duration is not modeled).
Page requests served via a caching mechanism were not recorded in the server logs and hence are not present in the data.
Application to Msnbc.com
The full data set consists of approximately one million sequences (users), with an average of 5.7 events per sequence.
Model learning for various cluster sizes K is done with a training set size of 100,023.
Model evaluation was done using the out-of-sample predictive log score on a different sample of 98,687 sequences drawn from the original data.
Observation on the model components
Some of the individual model components encode two or more clusters.
Example: consider two clusters: a cluster of users who initially request category a and then choose between categories b and c, and a cluster of users who initially request category d and then choose between categories e and f.
These two clusters can be encoded in a single component of the mixture model, although the sequences for the separate clusters do not contain common elements.
The presence of multi-cluster components does not affect the out-of-sample predictive log score of a model.
However, it is problematic for visualization purposes.
Observation on the model components
Solutions:
– One method is to run the EM algorithm and then post-process the resulting model, separating any multi-cluster components found.
– A second method is to allow only one state (category) to have a non-zero probability of being the initial state in each of the 1st-order Markov models.
The second method has the drawback that a cluster of users who have different initial states but similar paths after the initial state is divided into separate clusters.
Nonetheless, this potential problem was fairly insignificant for the Msnbc.com data.
Constrained models
Experimentally, constrained models have a predictive power almost equal to that of the unconstrained models.
However, with this constraint, more components are needed to represent the data than in the unconstrained case.
For this particular data, the constrained 1st-order Markov models reach their limit in predictive accuracy around K = 100, compared with the unconstrained models, which reach their limit around K = 60.
Out of sample results
Data Visualization:WebCANVAS tool
Displays a twenty-four-hour period using 100 clusters.
Each window corresponds to a cluster.
Each row of squares in a cluster corresponds to a user sequence.
WebCANVAS uses hard clustering, assigning each user to a single cluster.
Each square in a row encodes a page request in a particular category, indicated by the color of the square.
Note that the use of color to encode URL category limits the utility of this tool to domains where the number of categories can be limited to fifty or so.
WebCANVAS Display
Discovering unexpected facts
Large groups of people enter msnbc.com on tech and local pages;
a large group of people navigates from on-air to local;
there is little navigation between the tech and business sections;
and there are a large number of hits to the weather pages.
WebCANVAS tool (model-direct sampling)
The WebCANVAS display performed better, subjectively, than two other methods:
– showing the 0th-order and 1st-order Markov models of a cluster;
– the "traffic flow movie" of Microsoft Site Server 3.0.
Advantage of model-directed sampling over displaying the models themselves is that the former approach is not as sensitive to errors in modeling.
That is, by displaying sampled raw data, behaviors in the data not consistent with the model used can still be seen and appreciated.
Alternative: Displaying models themselves
Scalability
Memory requirements of the algorithm are O(NL + KM² + KM), which typically reduces to O(NL) - i.e. the data size - for data sets where M is relatively small.
The runtime of the algorithm per iteration is linear in N and K.
Scalability in K
Scalability in N
Mixtures of 1st order Markov Models: Too simple model?
Sen and Hansen (2001) and Deshpande and Karypis (2001) have shown the 1st-order Markov model to be an inadequate model for empirically observed page-request sequences.
This is not surprising; for example:
– if a user visits a particular page, there tends to be a greater chance of him returning to that same page at a later time;
– a 1st-order Markov model cannot capture this type of long-term memory.
However:
– Though the mixture model is 1st-order Markov within a cluster, the overall unconditional model is NOT 1st-order Markov.
– The Msnbc data differ from typical raw page-request sequences: URL categories yield a relatively small alphabet compared with uncategorized URLs.
Mixtures of 1st order Markov Models: Too simple model?
The combined effects of clustering and a small alphabet tend to produce low-entropy clusters in the sense that a few (two or three) categories often dominate the sequences within each cluster.
Thus, the tendency to return to a specific page that was visited earlier in a session can be well approximated by the simple mixture of 1st order Markov models.
Mixture of 1st order Markov Models vs 1st order Markov Models
Mixture model:

p(X) = Σ_{k=1}^{K} p(X | c_k) p(c_k)

with 1st-order Markov components:

p(X | c_k) = p(x_1 | c_k) Π_{i=2}^{L} p(x_i | x_{i-1}, c_k)

The predictive distribution for the next symbol x_{l+1} under the mixture model is:

p(x_{l+1} | x_{[1,l]}) = Σ_{k=1}^{K} p(x_{l+1} | x_l, c_k) p(c_k | x_{[1,l]})

Thus the probability of the next symbol is a weighted combination of the transition probabilities p(x_{l+1} | x_l, c_k) from each of the individual 1st-order component models.
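This weighted-combination prediction can be sketched directly (illustrative names; `init` and `trans` hold the per-component initial and transition multinomials):

```python
import numpy as np

def next_symbol_dist(history, weights, init, trans):
    """Predictive distribution p(x_{l+1} | x_[1,l]) under a mixture of
    1st-order Markov chains: each component's transition row out of the
    last symbol, weighted by the posterior membership p(c_k | x_[1,l])."""
    # log p(c_k, x_[1,l]) for each component
    log_post = np.log(weights) + np.log(init[:, history[0]])
    for a, b in zip(history[:-1], history[1:]):
        log_post += np.log(trans[:, a, b])
    post = np.exp(log_post - np.logaddexp.reduce(log_post))  # p(c_k | x)
    return post @ trans[:, history[-1], :]                   # mix the rows

# With a single component the prediction is just that component's
# transition row out of the last symbol.
w1 = np.array([1.0])
i1 = np.array([[0.5, 0.5]])
t1 = np.array([[[0.7, 0.3], [0.2, 0.8]]])
print(next_symbol_dist([0, 1], w1, i1, t1))  # → [0.2 0.8]
```

With K > 1, the posterior `post` shifts toward the components that best explain the history, which is exactly how the mixture escapes the memoryless behavior of a single chain.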
Mixture of 1st order Markov Models vs 1st order Markov Models
The weights p(c_k | x_{[1,l]}) are the partial membership probabilities given the prefix (history) subsequence x_{[1,l]}.
These weights are in turn a function of the history of the sequence (via Bayes' rule), and typically depend strongly on the pattern of behavior before x_l.
This prediction behavior contrasts with the simple predictive distribution of a single 1st-order Markov model:

p(x_{l+1} | x_{[1,l]}) = p(x_{l+1} | x_l)
Empirical check of the 1st-order Markov model
Diagnostic check: empirically calculate the run lengths of page categories for several of the most likely clusters.
If the data are being generated by a 1st order Markov model, then the distribution of these run lengths will obey a geometric distribution.
Results are shown, for each cluster, for the three most frequently visited categories that had at least one run length of four or greater. (Categories with run lengths of three or fewer provide relatively uninformative diagnostic plots.)
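The run-length diagnostic can be sketched as follows (illustrative helper names; the geometric run-length probability P(r) = p^(r-1) (1-p) follows from r-1 self-transitions followed by one exit):

```python
import numpy as np

def run_lengths(seq, category):
    """Lengths of maximal runs of `category` in a sequence."""
    runs, cur = [], 0
    for x in seq:
        if x == category:
            cur += 1
        elif cur:
            runs.append(cur)
            cur = 0
    if cur:
        runs.append(cur)
    return runs

def expected_geometric_counts(runs, p_self, max_len):
    """Expected count of each run length 1..max_len if run lengths are
    geometric with self-transition probability p_self."""
    n = len(runs)
    return np.array([n * p_self ** (r - 1) * (1 - p_self)
                     for r in range(1, max_len + 1)])

print(run_lengths([0, 0, 1, 0, 0, 0], 0))  # → [2, 3]
```

Comparing the observed run-length histogram against these expected counts (plus a sampling band) is the diagnostic plotted in the paper.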
Empirical check of the 1st-order Markov model
Asterisks mark the empirically observed counts.
The center dotted line on each plot is the expected count as a function of run length under a geometric model, using the empirically estimated self-transition probability of the Markov chain for the corresponding cluster.
The upper and lower dotted lines represent the plus and minus three-sigma sampling deviations for each count under the model.
Conclusions
A model-based clustering approach is used to cluster users based on their web navigation patterns.
A visualization tool (WebCANVAS) was developed that enables web administrators to better understand user behavior.
Mixtures of 1st-order Markov models are used for clustering, taking into account the order of page requests.
Experiments suggest that 1st-order Markov mixture components are appropriate for the msnbc.com data.
The learning algorithm's time scales linearly with sample size; in contrast, agglomerative distance-based methods scale quadratically with sample size.
Future Work
Modeling the duration of each visit.
Avoiding the limitation of the proposed method to small M by modeling page visits at the URL level.
In one such extension, Markov models could characterize both the transitions among categories and the transitions among pages within a given category.
Alternatively, a hidden-Markov mixture model could learn categories and category transitions simultaneously.