Model-Based Clustering and Visualization of Navigation
Patterns on a Web Site
I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White
Presented by Motaz El Saban
04/19/23, Data Mining, Spring '03
Outline of the talk
Introduction and problem definition
Model-based clustering
Model learning
Application to Msnbc.com IIS log data
Data visualization
Scalability
Why mixtures of first-order Markov models?
Conclusions
Future work
Introduction
A new methodology for analyzing web navigation patterns (a form of human behavior in digital environments).
Patterns: sequence of URL categories traversed by users, stored in web-server logs for a duration of 24 hours on msnbc.com.
Functionality:
– Clustering users based on navigation patterns.
– Visualization (the WebCANVAS tool).
Web data analysis approach
Clustering:
– Partition users into clusters so that users with similar dynamic behavior fall in the same cluster.
Visualization:
– Display the behavior of the users within each cluster.
Related Work
Most previous work on Web navigation patterns and visualization uses non-probabilistic methods [YAN96] [CHE98], mostly finding rules that govern navigation patterns.
Other work used probabilistic methods to predict user behavior on Web pages, but not for clustering: random-walk models [HUB97], Markov models for pre-fetching pages [PAD96], and kth-order Markov models for predicting the next probable link [BOR00].
These approaches use a single Markov model for all users rather than first clustering the users.
Related Work
On the clustering side, [FU00] applied BIRCH to cluster user web navigation patterns.
No prior work is known that uses probabilistic clustering for sequence-based clustering and visualization of Web navigation.
Rather, user history has been visualized using visual metaphors of maps, paths, and signposts [WEX99].
[MIN99] use planar graphs to visualize crowds of users at particular web pages.
What do we mean by pattern?
Challenges
Web navigation patterns are dynamic: no static technique, such as a histogram, can capture them → Markov models.
Different users have heterogeneous dynamic behavior → a mixture of models.
Large data size.
– The proposed algorithm for learning the mixture of 1st-order Markov models has runtime O(KNL + KM²), where K = # clusters, N = # sequences, L = average sequence length, and M = # of Web page categories. For typically small M, the algorithm scales linearly with N and K.
– Hierarchical clustering methods scale as O(N²).
Model-Based Clustering
Assume the data are generated as follows:
– A user arrives at the web site and is assigned to one of K clusters with some probability, and
– given that the user is in this cluster, his behavior is generated from a statistical model specific to that cluster.
Let X be a multivariate random variable taking on values corresponding to the behavior of individual users.
Let C be a discrete-valued variable taking on values c_1, ..., c_K, corresponding to the unknown cluster assignment for a user.
Model-Based Clustering
A mixture model for X with K components has the form:

p(X | θ) = Σ_{k=1}^{K} p(c_k | θ) p(X | c_k, θ_k)

where p(c_k | θ) is the marginal probability of the kth cluster, p(X | c_k, θ_k) is the statistical model describing the distribution over the behavior of users in the kth cluster, and θ denotes the parameters of the model.
Model-Based Clustering
In our case X = (X1,…,XL) is a sequence of variables describing the user’s path through the website.
Xi takes on some value xi from the M different page categories.
Each component of the model is a 1st-order Markov model:

p(X | c_k, θ_k) = p(x_1 | θ_k^I) Π_{i=2}^{L} p(x_i | x_{i-1}, θ_k^T)

where θ_k^I denotes the parameters of the probability distribution over the initial page-category request among users in cluster k, and θ_k^T denotes the parameters of the probability distributions over transitions from one category to the next by a user in cluster k.
Both distributions are taken to be multinomial.
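As an illustration, the per-component likelihood can be computed as follows (a minimal Python sketch; the function and variable names are my own, not from the paper):

```python
import numpy as np

def markov_log_likelihood(seq, init_probs, trans_probs):
    """Log-likelihood of a category sequence under one 1st-order
    Markov component: log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    ll = np.log(init_probs[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        ll += np.log(trans_probs[prev, cur])
    return ll

# Toy example with M = 2 page categories
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
print(markov_log_likelihood([0, 0, 1], init, trans))
```

Working in log space avoids underflow for long sequences, which matters since the msnbc.com data contain sequences with many events.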
Model-Based Clustering
The EM algorithm is used to learn the model parameters. Once learned, the model can assign a user to the cluster k that maximizes the membership probability:

p(c_k | X, θ) = p(c_k | θ) p(X | c_k, θ_k) / Σ_{j=1}^{K} p(c_j | θ) p(X | c_j, θ_j)

The user class assignment may be soft or hard.
Learning Mixture Models from Data
Assume the number of clusters K is known. Training data d_train = {x_1, ..., x_N}, with an i.i.d. assumption.
MAP estimate of θ:

θ_MAP = argmax_θ p(θ | d_train)
      = argmax_θ p(d_train | θ) p(θ) / p(d_train)
      = argmax_θ [ Π_{i=1}^{N} p(x_i | θ) ] p(θ) / p(d_train)
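If, as is standard for multinomial parameters, Dirichlet priors are used, the MAP estimate has a simple closed form. A sketch (the helper name and the toy prior are illustrative, not taken from the paper):

```python
import numpy as np

def map_multinomial(counts, alpha):
    """MAP estimate of multinomial parameters theta_j under a
    Dirichlet(alpha) prior: theta_j = (n_j + alpha_j - 1) / (N + sum(alpha) - M)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    num = counts + alpha - 1.0   # posterior-mode numerator per category
    return num / num.sum()

# Toy example: 3 and 1 observed requests, a Dirichlet(2, 2) prior
print(map_multinomial([3, 1], [2, 2]))  # → [0.66666667 0.33333333]
```

With alpha = 1 everywhere this reduces to the maximum-likelihood estimate (raw relative frequencies); alpha > 1 smooths the estimate away from zero counts.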
EM learning algorithm (briefly)
An iterative method to find local maxima of the MAP problem for θ.
The problem at hand involves two sub-problems:
– Compute user class assignments (membership probabilities).
– Compute class parameters.
A chicken-and-egg problem!
EM learning algorithm (briefly)
EM approach:
– E-step: given the current value of the parameters θ, assign a user with behavior X to cluster c_k using the membership probabilities.
– M-step: pretend that these assignments correspond to real data, and reassign θ to be the MAP estimate given this fictitious data.
– Stop iterating when two consecutive iterations produce log likelihoods on the training data that differ by less than p% (0.01% in the paper).
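The E-step/M-step loop above can be sketched for a mixture of 1st-order Markov chains. This is a simplified version with light additive smoothing in place of the paper's MAP prior; all names are illustrative, not from the paper:

```python
import numpy as np

def em_markov_mixture(seqs, K, M, n_iter=50, tol=1e-4, seed=0):
    """Fit a K-component mixture of 1st-order Markov chains over an
    alphabet of M page categories to a list of integer sequences."""
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)                   # p(c_k)
    init = rng.dirichlet(np.ones(M), size=K)        # p(x_1 | c_k)
    trans = rng.dirichlet(np.ones(M), size=(K, M))  # p(x_i | x_{i-1}, c_k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log p(c_k, x) for every sequence and component
        log_r = np.zeros((len(seqs), K))
        for n, s in enumerate(seqs):
            for k in range(K):
                ll = np.log(weights[k]) + np.log(init[k, s[0]])
                for a, b in zip(s[:-1], s[1:]):
                    ll += np.log(trans[k, a, b])
                log_r[n, k] = ll
        norm = np.logaddexp.reduce(log_r, axis=1)
        r = np.exp(log_r - norm[:, None])           # soft memberships
        # M-step: re-estimate parameters from soft counts (+ smoothing)
        weights = r.mean(axis=0)
        init_c = np.full((K, M), 1e-3)
        trans_c = np.full((K, M, M), 1e-3)
        for n, s in enumerate(seqs):
            init_c[:, s[0]] += r[n]
            for a, b in zip(s[:-1], s[1:]):
                trans_c[:, a, b] += r[n]
        init = init_c / init_c.sum(axis=1, keepdims=True)
        trans = trans_c / trans_c.sum(axis=2, keepdims=True)
        # Stop when the relative change in log-likelihood is small
        ll_total = norm.sum()
        if abs(ll_total - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll_total
    return weights, init, trans, r
```

Per iteration this loop touches every event of every sequence once for each of the K components, matching the O(KNL) term of the stated runtime; the M-step's transition tables account for the O(KM²) term.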
How to choose K?
Letting the site administrator try several values of K and choose the one most convenient for visualization is too time consuming. Rather,
choose K by finding the model that accurately predicts N_t new test cases d_test = {x_{N+1}, ..., x_{N+N_t}}. That is, choose the model with K clusters that minimizes the out-of-sample predictive log score:

Score(K, d_test) = - Σ_{i=N+1}^{N+N_t} log_2 p(x_i | θ^K) / Σ_{i=N+1}^{N+N_t} length(x_i)

i.e., the negative log-likelihood of the test data under the K-cluster model, normalized by the total number of page requests.
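The score can be sketched as follows (illustrative names; assumes integer-coded category sequences and per-component `init`/`trans` multinomial tables):

```python
import numpy as np

def predictive_log_score(test_seqs, weights, init, trans):
    """Out-of-sample predictive log score: negative total log2-likelihood
    of the test sequences divided by the total number of page requests."""
    total_bits, total_len = 0.0, 0
    for s in test_seqs:
        # log p(c_k, x) for each mixture component
        comp = np.log(weights) + np.log(init[:, s[0]])
        for a, b in zip(s[:-1], s[1:]):
            comp += np.log(trans[:, a, b])
        total_bits -= np.logaddexp.reduce(comp) / np.log(2)
        total_len += len(s)
    return total_bits / total_len

# Toy check: a single fair-coin-like component costs 1 bit per symbol
w = np.array([1.0])
init_p = np.array([[0.5, 0.5]])
trans_p = np.array([[[0.5, 0.5], [0.5, 0.5]]])
print(predictive_log_score([[0, 1, 0]], w, init_p, trans_p))  # → 1.0
```

The score is in bits per page request, so models with different K are directly comparable on the same test set.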
Application to Msnbc.com
Each sequence in the data set corresponds to page views of a user during a twenty-four hour period.
Each event in the sequence corresponds to a user request for a page. The event denotes a page category rather than a URL.
Example categories are: frontpage, news, tech, …
The number of URLs per category ranges from 10 to 5000.
Only the order in which pages are requested is modeled (duration is not modeled).
Page requests served via a caching mechanism were not recorded in the server logs and hence are not present in the data.
Application to Msnbc.com
The full data set consists of approximately one million sequences (users), with an average of 5.7 events per sequence.
Model learning for various cluster sizes K is done with a training set size of 100,023.
Model evaluation was done using the out-of-sample predictive log score on a different sample of 98,687 sequences drawn from the original data.
Observation on the model components
Some of the individual model components encode two or more clusters.
Example: consider two clusters: a cluster of users who initially request category a and then choose between categories b and c, and a cluster of users who initially request category d and then choose between categories e and f.
These two clusters can be encoded in a single component of the mixture model, although the sequences for the separate clusters do not contain common elements.
The presence of multi-cluster components does not affect the out-of-sample predictive log score of a model.
However, it is problematic for visualization purposes.
Observation on the model components
Solutions:
– One method is to run the EM algorithm and then post-process the resulting model, separating any multi-cluster components found.
– A second method is to allow only one state (category) to have a non-zero probability of being the initial state in each of the 1st-order Markov models.
The second method has the drawback that a cluster of users who have different initial states but similar paths after the initial state is divided into separate clusters.
Nonetheless, this potential problem was fairly insignificant for the Msnbc.com data.
Constrained models
Experimentally, constrained models have a predictive power almost equal to that of the unconstrained models.
However, with this constraint, more components are needed to represent the data than in the unconstrained case.
For this particular data, the constrained 1st-order Markov models reach their limit in predictive accuracy around K = 100, compared with the unconstrained models, which reach their limit around K = 60.
Out of sample results
Data Visualization:WebCANVAS tool
Displays a twenty-four-hour period using 100 clusters.
Each window corresponds to a cluster.
Each row of squares in a cluster corresponds to a user sequence.
WebCANVAS uses hard clustering, assigning each user to a single cluster.
Each square in a row encodes a page request in a particular category, indicated by the color of the square.
Note that the use of color to encode URL category limits the utility of this tool to domains where the number of categories can be limited to fifty or so.
WebCANVAS Display
Discovering unexpected facts
Large groups of people enter msnbc.com on tech and local pages;
a large group of people navigates from on-air to local;
there is little navigation between the tech and business sections;
and there are a large number of hits to the weather pages.
WebCANVAS tool (model-direct sampling)
The WebCANVAS display performed better, subjectively, than two other methods:
– showing the 0th-order and 1st-order Markov models of a cluster;
– the "traffic flow movie" of Microsoft Site Server 3.0.
Advantage of model-directed sampling over displaying the models themselves is that the former approach is not as sensitive to errors in modeling.
That is, by displaying sampled raw data, behaviors in the data not consistent with the model used can still be seen and appreciated.
Alternative: Displaying models themselves
Scalability
Memory requirements of the algorithm are O(NL + KM² + KM), which typically reduces to O(NL) - i.e. the data size - for data sets where M is relatively small.
The runtime of the algorithm per iteration is linear in N and K.
Scalability in K
Scalability in N
Mixtures of 1st order Markov Models: Too simple model?
Sen and Hansen (2001) and Deshpande and Karypis (2001) have shown the 1st-order Markov model to be an inadequate model for empirically observed page-request sequences.
This is not surprising; for example:
– if a user visits a particular page, there tends to be a greater chance of him returning to that same page at a later time;
– a 1st-order Markov model cannot capture this type of long-term memory.
However:
– Though the mixture model is 1st-order Markov within a cluster, the overall unconditional model is NOT 1st-order Markov.
– The Msnbc data differ from typical raw page-request sequences: URL categories yield a relatively small alphabet compared with uncategorized URLs.
Mixtures of 1st order Markov Models: Too simple model?
The combined effects of clustering and a small alphabet tend to produce low-entropy clusters in the sense that a few (two or three) categories often dominate the sequences within each cluster.
Thus, the tendency to return to a specific page that was visited earlier in a session can be well approximated by the simple mixture of 1st order Markov models.
Mixture of 1st order Markov Models vs 1st order Markov Models
Mixture model:

p(X) = Σ_{k=1}^{K} p(X | c_k) p(c_k)

with 1st-order Markov components:

p(X | c_k) = p(x_1 | c_k) Π_{i=2}^{L} p(x_i | x_{i-1}, c_k)

The predictive distribution for the next symbol x_{l+1} under the mixture model is:

p(x_{l+1} | x_{[1,l]}) = Σ_{k=1}^{K} p(x_{l+1} | x_l, c_k) p(c_k | x_{[1,l]})

Thus the probability of the next symbol is a weighted combination of the transition probabilities p(x_{l+1} | x_l, c_k) from each of the individual 1st-order component models.
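This weighted-combination prediction can be sketched directly (illustrative names; `init` and `trans` hold the per-component initial and transition multinomials):

```python
import numpy as np

def next_symbol_dist(history, weights, init, trans):
    """Predictive distribution p(x_{l+1} | x_[1,l]) under a mixture of
    1st-order Markov chains: each component's transition row out of the
    last symbol, weighted by the posterior membership p(c_k | x_[1,l])."""
    # log p(c_k, x_[1,l]) for each component
    log_post = np.log(weights) + np.log(init[:, history[0]])
    for a, b in zip(history[:-1], history[1:]):
        log_post += np.log(trans[:, a, b])
    post = np.exp(log_post - np.logaddexp.reduce(log_post))  # p(c_k | x)
    return post @ trans[:, history[-1], :]                   # mix the rows

# With a single component the prediction is just that component's
# transition row out of the last symbol.
w1 = np.array([1.0])
i1 = np.array([[0.5, 0.5]])
t1 = np.array([[[0.7, 0.3], [0.2, 0.8]]])
print(next_symbol_dist([0, 1], w1, i1, t1))  # → [0.2 0.8]
```

With K > 1, the posterior `post` shifts toward the components that best explain the history, which is exactly how the mixture escapes the memoryless behavior of a single chain.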
Mixture of 1st order Markov Models vs 1st order Markov Models
The weights p(c_k | x_{[1,l]}) are the partial membership probabilities given the prefix (history) subsequence x_{[1,l]}.
These weights are in turn a function of the history of the sequence (via Bayes' rule), and typically depend strongly on the pattern of behavior before x_l.
This prediction behavior contrasts with the simple predictive distribution of a single 1st-order Markov model:

p(x_{l+1} | x_{[1,l]}) = p(x_{l+1} | x_l)
Empirical check of the 1st-order Markov model
Diagnostic check: empirically calculate the run lengths of page categories for several of the most likely clusters.
If the data are being generated by a 1st order Markov model, then the distribution of these run lengths will obey a geometric distribution.
Results are shown, for each cluster, for the three most frequently visited categories that had at least one run length of four or greater. (Categories with run lengths of three or fewer provide relatively uninformative diagnostic plots.)
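The run-length diagnostic can be sketched as follows (illustrative helper names; the geometric run-length probability P(r) = p^(r-1) (1-p) follows from r-1 self-transitions followed by one exit):

```python
import numpy as np

def run_lengths(seq, category):
    """Lengths of maximal runs of `category` in a sequence."""
    runs, cur = [], 0
    for x in seq:
        if x == category:
            cur += 1
        elif cur:
            runs.append(cur)
            cur = 0
    if cur:
        runs.append(cur)
    return runs

def expected_geometric_counts(runs, p_self, max_len):
    """Expected count of each run length 1..max_len if run lengths are
    geometric with self-transition probability p_self."""
    n = len(runs)
    return np.array([n * p_self ** (r - 1) * (1 - p_self)
                     for r in range(1, max_len + 1)])

print(run_lengths([0, 0, 1, 0, 0, 0], 0))  # → [2, 3]
```

Comparing the observed run-length histogram against these expected counts (plus a sampling band) is the diagnostic plotted in the paper.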
Empirical check of the 1st-order Markov model
Asterisks mark the empirically observed counts.
The center dotted line on each plot is the expected count as a function of run length under a geometric model, using the empirically estimated self-transition probability of the Markov chain for the corresponding cluster.
The upper and lower dotted lines represent the plus and minus three-sigma sampling deviations for each count under the model.
Conclusions
A model-based clustering approach is used to cluster users based on their web navigation patterns.
A visualization tool (WebCANVAS) was developed that enables web administrators to better understand user behavior.
Mixtures of 1st-order Markov models are used for clustering, taking into account the order of page requests.
Experiments suggest that 1st-order Markov mixture components are appropriate for the msnbc.com data.
The learning algorithm's time scales linearly with sample size; in contrast, agglomerative distance-based methods scale quadratically with sample size.
Future Work
Modeling the duration of each visit.
Avoiding the limitation of the proposed method to small M by modeling page visits at the URL level.
In one such extension, Markov models could characterize both the transitions among categories and the transitions among pages within a given category.
Alternatively, a hidden-Markov mixture model could learn categories and category transitions simultaneously.