Model-Based Clustering and Visualization of Navigation Patterns on a Web Site


Model-Based Clustering and Visualization of Navigation Patterns on a Web Site

I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White

Presented by Motaz El Saban


04/19/23 Data Mining Spring '03

Outline of the talk

Introduction and problem definition. Model-based clustering. Model learning. Application to Msnbc.com IIS log data. Data Visualization. Scalability. Why mixtures of first-order Markov models? Conclusions. Future work.


Introduction

New methodology for analyzing web navigation patterns (a form of human behavior in digital environments).

Patterns: sequences of URL categories traversed by users, stored in web-server logs covering a 24-hour period on msnbc.com.

Functionality:
– Clustering users based on navigation patterns.
– Visualization (WebCANVAS tool).


Web data analysis approach

Clustering:
– Partition users into clusters so that users with similar dynamic behavior fall in the same cluster.

Visualization:
– Display the behavior of the users within each cluster.


Related Work

Most previous work on Web navigation patterns and visualization uses non-probabilistic methods [YAN96] [CHE98], mostly finding rules that govern navigation patterns.

Other work used probabilistic methods to predict the behavior of users on Web pages, but not for clustering: random walk models [HUB97], Markov models for pre-fetching pages [PAD96], and kth-order Markov models for predicting the next probable link [BOR00].

These approaches use a single Markov model for all users, as opposed to clustering the users first.


Related Work

On the clustering side, [FU00] applied BIRCH to cluster user web navigation patterns.

No previously known work uses probabilistic clustering for sequence-based clustering and visualization of Web navigation.

Rather, user history has been visualized using visual metaphors of maps, paths, and signposts [WEX99].

[MIN99] use planar graphs to visualize crowds of users at particular web pages.


What do we mean by pattern?


Challenges

Web navigation patterns are dynamic. Static techniques such as histograms cannot capture them, which motivates Markov models.

Different users have heterogeneous dynamic behavior, which motivates a mixture of models.

Large data size.
– The proposed algorithm for learning the mixture of 1st order Markov models has runtime O(KNL + KM^2), where K: # clusters, N: # sequences, L: average length of a sequence, M: # of Web page categories. For typically small M, the algorithm scales linearly with N and K.
– Hierarchical clustering methods scale as O(N^2).


Model-Based Clustering

Assume the data is generated as follows:
– A user arrives at the web site and is assigned to one of K clusters with some probability, and
– given that a user is in this cluster, his behavior is generated from some statistical model specific to that cluster.

Let X be a multivariate random variable taking on values corresponding to the behavior of individual users.

Let C be a discrete-valued variable taking on values c1, ..., cK, corresponding to the unknown cluster assignment for a user.
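The two-step generative story above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_user`, the toy parameter values, and the choice of a first-order Markov chain as the cluster-specific model (the model the talk adopts later) are assumptions made for the example.

```python
import random

def sample_user(weights, initial, transition, length, rng):
    """Draw one user's category sequence from a mixture of Markov chains."""
    K = len(weights)
    # Step 1: assign the user to one of K clusters with probability p(c_k).
    k = rng.choices(range(K), weights=weights)[0]
    # Step 2: generate the behavior from the model specific to cluster k.
    M = len(initial[k])
    seq = [rng.choices(range(M), weights=initial[k])[0]]
    while len(seq) < length:
        seq.append(rng.choices(range(M), weights=transition[k][seq[-1]])[0])
    return k, seq

# Hypothetical toy parameters: 2 clusters, 3 page categories.
rng = random.Random(0)
weights = [0.7, 0.3]
initial = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
transition = [
    [[0.8, 0.2, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.3, 0.7]],
]
cluster, path = sample_user(weights, initial, transition, length=5, rng=rng)
print(cluster, path)
```

In clustering, the cluster label `k` is exactly what is not observed; only `path` is.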


Model-Based Clustering

A mixture model for X with K components has the form:

$$p(X \mid \theta) = \sum_{k=1}^{K} p(c_k \mid \theta)\, p(X \mid c_k, \theta_k)$$

where $p(c_k \mid \theta)$ is the marginal probability of the kth cluster, $p(X \mid c_k, \theta_k)$ is the statistical model describing the distribution for the variables for users in the kth cluster, and $\theta$ denotes the parameters of the model.


Model-Based Clustering

In our case X = (X1,…,XL) is a sequence of variables describing the user’s path through the website.

Xi takes on some value xi from the M different page categories.

Each component in the model obeys the 1st order Markov model:

$$p(X \mid c_k, \theta_k) = p(x_1 \mid \theta_k^I) \prod_{i=2}^{L} p(x_i \mid x_{i-1}, \theta_k^T)$$

where $\theta_k^I$ denotes the parameters of the probability distribution over the initial page-category request among users in cluster k, and $\theta_k^T$ denotes the parameters of the probability distributions over transitions from one category to the next by a user in cluster k.

Both distributions are taken to be multinomial distributions.
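As a rough sketch of how the component likelihood is evaluated, the snippet below computes the log of the product above for one cluster. The function name and the toy parameters are hypothetical. Categories are assumed to be encoded as integers 0..M-1, and all probabilities involved are assumed to be strictly positive.

```python
import math

def markov_log_likelihood(seq, initial, transition):
    """log p(X | c_k, theta_k) for one first-order Markov component."""
    # First factor: probability of the initial page category x_1.
    logp = math.log(initial[seq[0]])
    # Remaining factors: transition probabilities p(x_i | x_{i-1}).
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical component over two categories.
initial = [0.6, 0.4]
transition = [[0.9, 0.1], [0.5, 0.5]]
# The sequence 0 -> 0 -> 1 has probability 0.6 * 0.9 * 0.1.
likelihood = math.exp(markov_log_likelihood([0, 0, 1], initial, transition))
print(likelihood)
```

Working in log space avoids numerical underflow for long sequences.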


Model-Based Clustering

The EM algorithm is used to learn the model parameters. Once they are learned, we can use the model to assign users to clusters by finding the cluster $c_k$ that maximizes the membership probability:

$$p(c_k \mid X, \theta) = \frac{p(c_k \mid \theta)\, p(X \mid c_k, \theta_k)}{\sum_{j=1}^{K} p(c_j \mid \theta)\, p(X \mid c_j, \theta_j)}$$

The user class assignment may be soft or hard.
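A minimal sketch of the membership computation, assuming first-order Markov components with strictly positive parameters. The helper names and toy values are made up for illustration. The normalization is done in log space (the log-sum-exp trick), an implementation choice that the slides do not discuss.

```python
import math

def sequence_log_prob(seq, initial, transition):
    """log p(X | c_k, theta_k) for one Markov component."""
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

def membership_probabilities(seq, weights, initials, transitions):
    """Soft assignment p(c_k | X, theta) for every cluster k."""
    log_joint = [
        math.log(w) + sequence_log_prob(seq, pi, T)
        for w, pi, T in zip(weights, initials, transitions)
    ]
    shift = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - shift) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-cluster model over two categories.
weights = [0.5, 0.5]
initials = [[0.9, 0.1], [0.1, 0.9]]
transitions = [
    [[0.9, 0.1], [0.9, 0.1]],
    [[0.1, 0.9], [0.1, 0.9]],
]
soft = membership_probabilities([0, 0, 0], weights, initials, transitions)
hard = max(range(len(soft)), key=soft.__getitem__)  # hard assignment
print(soft, hard)
```

The list `soft` is the soft assignment. Taking its argmax gives the hard assignment.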


Learning Mixture Models from Data

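For a known number of clusters K.

Training data $d_{train} = \{X_1, \dots, X_N\}$, with the i.i.d. assumption.

MAP estimate of $\theta$:

$$\theta^{MAP} = \arg\max_{\theta}\, p(\theta \mid d_{train}) = \arg\max_{\theta}\, \frac{p(d_{train} \mid \theta)\, p(\theta)}{p(d_{train})} = \arg\max_{\theta}\, \frac{\prod_{i=1}^{N} p(X_i \mid \theta)\; p(\theta)}{p(d_{train})}$$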


EM learning algorithm (briefly)

An iterative method to find a local maximum of the MAP objective for θ.

The problem at hand involves two sub-problems:
– Compute the user class assignments (membership probabilities).
– Compute the class parameters.

Chicken-egg problem!


EM learning algorithm (briefly)

EM approach:
– E-step: given the current value of the parameters θ, assign a user with behavior X to cluster c_k using the membership probabilities.
– M-step: pretend that these assignments correspond to real data, and reassign θ to be the MAP estimate given this fictitious data.
– Stop iterating when two consecutive iterations produce log likelihoods on the training data that differ by less than p% (0.01% in the paper).
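The E-step / M-step loop can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code. Categories are integers 0..M-1, the parameters are initialized at random, and a small pseudo-count `alpha` stands in for the prior in the MAP update (the slides do not specify the prior). The default tolerance `1e-4` corresponds to the 0.01% relative change mentioned above.

```python
import math
import random

def log_prob(seq, init, trans):
    lp = math.log(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[a][b])
    return lp

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def em_markov_mixture(data, K, M, alpha=0.01, tol=1e-4, max_iter=200, seed=0):
    rng = random.Random(seed)
    # Random, strictly positive starting parameters.
    weights = [1.0 / K] * K
    inits = [normalize([rng.random() + 0.1 for _ in range(M)]) for _ in range(K)]
    trans = [[normalize([rng.random() + 0.1 for _ in range(M)])
              for _ in range(M)] for _ in range(K)]
    prev_ll = None
    for _ in range(max_iter):
        # Pseudo-counts (assumed prior) keep every estimate positive.
        w_cnt = [alpha] * K
        i_cnt = [[alpha] * M for _ in range(K)]
        t_cnt = [[[alpha] * M for _ in range(M)] for _ in range(K)]
        ll = 0.0
        for seq in data:
            # E-step: membership probabilities under the current parameters.
            lj = [math.log(weights[k]) + log_prob(seq, inits[k], trans[k])
                  for k in range(K)]
            top = max(lj)
            norm = top + math.log(sum(math.exp(v - top) for v in lj))
            ll += norm  # log likelihood of this sequence before the update
            for k in range(K):
                r = math.exp(lj[k] - norm)
                # Treat fractional memberships as if they were observed counts.
                w_cnt[k] += r
                i_cnt[k][seq[0]] += r
                for a, b in zip(seq, seq[1:]):
                    t_cnt[k][a][b] += r
        # M-step: re-estimate the parameters from the fractional counts.
        weights = normalize(w_cnt)
        inits = [normalize(c) for c in i_cnt]
        trans = [[normalize(row) for row in t_cnt[k]] for k in range(K)]
        # Stop when the relative change in log likelihood falls below tol.
        if prev_ll is not None and abs(ll - prev_ll) <= tol * abs(prev_ll):
            break
        prev_ll = ll
    return weights, inits, trans, ll

# Hypothetical toy data: one group stays on category 0, one alternates.
data = [[0, 0, 0, 0]] * 20 + [[0, 1, 0, 1]] * 20
weights, inits, trans, ll = em_markov_mixture(data, K=2, M=2)
print(weights, ll)
```

Because EM only finds a local maximum, the result depends on the random start. Several restarts with different `seed` values are a common precaution.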


How to choose K?

One option is to let the site administrator try several values of K and choose the one most convenient for visualization, but this is too time consuming. Rather,

Choose K by finding the model that most accurately predicts $N_t$ new test cases $d_{test} = \{X_{N+1}, \dots, X_{N+N_t}\}$. That is, choose the model with K clusters that minimizes the out-of-sample predictive log score:

$$Score(K, d_{test}) = -\,\frac{\sum_{j=1}^{N_t} \log_2 p(X_{N+j} \mid \theta^K)}{\sum_{i=1}^{N_t} length(X_{N+i})}$$
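The score is a negative log-probability per page request, measured in bits, so lower is better. A small sketch follows, assuming a mixture of first-order Markov chains with integer-coded categories. The function names and the toy model are hypothetical.

```python
import math

def mixture_log2_prob(seq, weights, inits, trans):
    """log2 p(X | theta) under a mixture of first-order Markov chains."""
    total = 0.0
    for w, pi, T in zip(weights, inits, trans):
        p = w * pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= T[a][b]
        total += p
    return math.log2(total)

def predictive_log_score(test_data, weights, inits, trans):
    """Negative log2-probability of the test set per page request."""
    bits = -sum(mixture_log2_prob(s, weights, inits, trans) for s in test_data)
    events = sum(len(s) for s in test_data)
    return bits / events

# Sanity check with a single uniform component over M = 4 categories:
# every request then costs exactly log2(4) = 2 bits.
M = 4
uniform = [1.0 / M] * M
score = predictive_log_score(
    [[0, 1, 2], [3, 3]], [1.0], [uniform], [[uniform] * M]
)
print(score)
```

For real data the direct product in `mixture_log2_prob` can underflow on long sequences, so summing logs per component would be safer. Model selection then amounts to computing this score for each candidate K and keeping the smallest.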


Application to Msnbc.com

Each sequence in the data set corresponds to the page views of a user during a twenty-four hour period.

Each event in the sequence corresponds to a user request for a page. The event denotes a page category rather than a URL.

Example categories are: frontpage, news, tech, …

The number of URLs per category ranges from 10 to 5000.

Only the order in which the pages are requested is modeled (no duration is modeled).

Page requests served via a caching mechanism were not recorded in the server logs and, hence, are not present in the data.


Application to Msnbc.com

The full data set consists of approximately one million sequences (users), with an average of 5.7 events per sequence.

Model learning for various cluster sizes K is done with a training set size of 100,023.

Model evaluation was done using the out-of-sample predictive log score on a different sample of 98,687 sequences drawn from the original data.


Observation on the model components

Some of the individual model components encode two or more clusters.

Example: consider two clusters: a cluster of users who initially request category a and then choose between categories b and c, and a cluster of users who initially request category d and then choose between categories e and f.

These two clusters can be encoded in a single component of the mixture model, although the sequences for the separate clusters do not contain common elements.

The presence of multi-cluster components does not affect the out-of-sample predictive log score of a model.

However, it is problematic for visualization purposes.


Observation on the model components

Solutions:
– One method is to run the EM algorithm and then post-process the resulting model, separating any multi-cluster components found.
– A second method is to allow only one state (category) to have a non-zero probability of being the initial state in each of the 1st-order Markov models.

The second method has the drawback that a cluster of users who have different initial states but similar paths after the initial state is divided into separate clusters.

Nonetheless, this potential problem was fairly insignificant for the Msnbc.com data.


Constrained models

Experimentally, constrained models have a predictive power almost equal to that of the unconstrained models.

However, with this constraint, more components are needed to represent the data than in the unconstrained case.

For this particular data, the constrained 1st-order Markov models reach their limit in predictive accuracy around K = 100, as compared to the unconstrained models, which reach their limit around K = 60.


Out of sample results


Data Visualization: WebCANVAS tool

Display of a twenty-four hour period using 100 clusters.

Each window corresponds to a cluster.

Each row of squares in a cluster corresponds to a user sequence.

WebCANVAS uses hard clustering, assigning each user to a single cluster.

Each square in a row encodes a page request in a particular category, encoded by the color of the square.

Note that the use of color to encode URL category limits the utility of this tool to domains where the number of categories can be limited to fifty or so.


WebCANVAS Display


Discovering unexpected facts

Large groups of people enter msnbc.com on tech and local pages;

Large group of people navigating from on-air to local;

Little navigation between tech and business sections;

and large number of hits to the weather pages.


WebCANVAS tool (model-directed sampling)

The WebCANVAS display performed better subjectively than two other methods:
– Showing the 0th-order and 1st-order Markov models of a cluster.
– The “traffic flow movie” produced by Microsoft Site Server 3.0.

The advantage of model-directed sampling over displaying the models themselves is that the former approach is not as sensitive to errors in modeling.

That is, by displaying sampled raw data, behaviors in the data not consistent with the model used can still be seen and appreciated.


Alternative: Displaying models themselves


Scalability

Memory requirements of the algorithm are O(NL + KM^2 + KM), which typically reduces to O(NL), i.e. the data size, for data sets where M is relatively small.

The runtime of the algorithm per iteration is linear in N and K.


Scalability in K


Scalability in N


Mixtures of 1st order Markov Models: Too simple model?

Sen and Hansen (2001) and Deshpande and Karypis (2001) have shown the 1st-order Markov model to be an inadequate model for empirically observed page-request sequences.

This is not surprising, because for example:
– If a user visits a particular page, there tends to be a greater chance of the user returning to that same page at a later time.
– A 1st order Markov model cannot capture this type of long-term memory.

However:
– Though the mixture model is 1st order Markov within a cluster, the overall unconditional model is NOT 1st order Markov.
– The Msnbc data is different from typical raw page-request sequences. Namely, URL categories result in a relatively small alphabet size as compared to working with uncategorized URLs.


Mixtures of 1st order Markov Models: Too simple model?

The combined effects of clustering and a small alphabet tend to produce low-entropy clusters in the sense that a few (two or three) categories often dominate the sequences within each cluster.

Thus, the tendency to return to a specific page that was visited earlier in a session can be well approximated by the simple mixture of 1st order Markov models.


Mixture of 1st order Markov Models vs 1st order Markov Models

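Mixture Model:

$$p(X) = \sum_{k=1}^{K} p(X \mid c_k)\, p(c_k), \qquad p(X \mid c_k) = p(x_1 \mid c_k) \prod_{i=2}^{L} p(x_i \mid x_{i-1}, c_k)$$

Looking at the predictive distribution for the next symbol $x_{l+1}$ under the mixture model, i.e. $p(x_{l+1} \mid x_{[l,1]})$, where $x_{[l,1]}$ denotes the history $x_1, \dots, x_l$:

$$p(x_{l+1} \mid x_{[l,1]}) = \sum_{k=1}^{K} p(x_{l+1} \mid x_{[l,1]}, c_k)\, p(c_k \mid x_{[l,1]}) = \sum_{k=1}^{K} p(x_{l+1} \mid x_l, c_k)\, p(c_k \mid x_{[l,1]})$$

The second equality holds because each component is 1st order Markov. Thus the probability of the next symbol is a weighted combination of the transition probabilities $p(x_{l+1} \mid x_l, c_k)$ from each of the individual 1st order component models.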


Mixture of 1st order Markov Models vs 1st order Markov Models

The weights are determined by the partial membership probabilities $p(c_k \mid x_{[l,1]})$ of the prefix (history) subsequence $x_{[l,1]}$.

These weights are in turn a function of the history of the sequence (via Bayes rule), and typically depend strongly on the pattern of behavior before $x_l$.

This prediction behavior for $x_{l+1}$ is opposed to the simple predictive distribution of the 1st order Markov model:

$$p(x_{l+1} \mid x_{[l,1]}) = p(x_{l+1} \mid x_l)$$
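To make the contrast concrete, the sketch below computes the mixture's next-symbol distribution for two histories that end in the same category. Under a single 1st order Markov model both predictions would be identical, while under the mixture they differ. The two components ("sticky" and "alternating") and all numbers are invented for illustration.

```python
def next_symbol_distribution(history, weights, inits, trans):
    """p(x_{l+1} | history) under a mixture of first-order Markov chains."""
    # Joint p(c_k) * p(history | c_k) for each component.
    joint = []
    for w, pi, T in zip(weights, inits, trans):
        p = w * pi[history[0]]
        for a, b in zip(history, history[1:]):
            p *= T[a][b]
        joint.append(p)
    total = sum(joint)
    post = [p / total for p in joint]  # p(c_k | history) via Bayes rule
    last = history[-1]
    n_categories = len(inits[0])
    # Weighted combination of the per-component transition probabilities.
    return [
        sum(post[k] * trans[k][last][j] for k in range(len(weights)))
        for j in range(n_categories)
    ]

weights = [0.5, 0.5]
inits = [[0.5, 0.5], [0.5, 0.5]]
trans = [
    [[0.9, 0.1], [0.1, 0.9]],  # "sticky" users tend to stay put
    [[0.1, 0.9], [0.9, 0.1]],  # "alternating" users tend to switch
]
after_staying = next_symbol_distribution([0, 0, 0], weights, inits, trans)
after_switching = next_symbol_distribution([1, 0, 1, 0], weights, inits, trans)
print(after_staying, after_switching)
```

Both histories end in category 0, yet the first predicts another visit to category 0 with high probability and the second predicts a switch, because the history changes the membership weights.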


Empirical proof of 1st order Markov Model

Diagnostic check: empirically calculate the run lengths of page categories for several of the most likely clusters.

If the data are being generated by a 1st order Markov model, then the distribution of these run lengths will obey a geometric distribution.

Results are shown in each cluster for the three most frequently visited categories that had at least one run length of four or greater. (Categories that have run lengths of three or fewer provide relatively uninformative diagnostic plots.)


Empirical proof of 1st order Markov Model

Asterisks mark the empirically observed counts.

The center dotted line on each plot is the expected count as a function of run length under a geometric model using the empirically estimated self-transition probability of the Markov chain for the corresponding cluster.

The upper and lower dotted lines represent the plus and minus three-sigma sampling deviations for each count under the model.
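A rough sketch of this diagnostic is given below. Two simplifications are assumed here and are not taken from the paper: the self-transition probability `q` is estimated as the geometric maximum-likelihood value from the observed runs (runs cut short by the end of a session are treated like any other run), and the sampling deviation of each count is taken to be binomial.

```python
import math
from collections import Counter
from itertools import groupby

def run_length_counts(sequences, category):
    """Count how often each run length of `category` occurs."""
    counts = Counter()
    for seq in sequences:
        for value, run in groupby(seq):
            if value == category:
                counts[len(list(run))] += 1
    return counts

def geometric_expectation(counts, max_len):
    """Expected count and three-sigma band per run length."""
    n_runs = sum(counts.values())
    total_len = sum(r * c for r, c in counts.items())
    # Assumed estimator: geometric MLE of the self-transition probability.
    q = 1.0 - n_runs / total_len
    table = {}
    for r in range(1, max_len + 1):
        p_r = (1.0 - q) * q ** (r - 1)  # P(run length = r)
        expected = n_runs * p_r
        sigma = math.sqrt(n_runs * p_r * (1.0 - p_r))  # binomial std. dev.
        table[r] = (expected, expected - 3 * sigma, expected + 3 * sigma)
    return q, table

# Hypothetical sequences assigned to one cluster; diagnose category 0.
cluster_sequences = [[0, 0, 1, 0], [1, 1, 1], [0, 2, 0, 0, 0]]
counts = run_length_counts(cluster_sequences, category=0)
q, table = geometric_expectation(counts, max_len=4)
print(dict(counts), q, table)
```

Observed counts that fall outside the band for many run lengths would suggest that the 1st order Markov assumption fits that cluster poorly.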


Conclusions

A model-based clustering approach is used to cluster users based on web navigation patterns.

A visualization tool was developed that enables web administrators to better understand user behavior.

A mixture of 1st order Markov models is used for clustering, taking into account the order of page requests.

Experiments suggest that 1st order Markov model mixture components are appropriate for the msnbc.com data.

The algorithm's learning time scales linearly with sample size. In contrast, agglomerative distance-based methods scale quadratically with sample size.


Future Work

Modeling the duration of each visit.

Avoiding the limitation of the proposed method to small M, modeling page visits at the URL level.

In one such extension, we can use Markov models to characterize both the transitions among categories and the transitions among pages within a given category.

Alternatively, we can use a hidden-Markov mixture model to learn categories and category transitions simultaneously.