notices homework 2 due in class may 21 uncollected homework 1 coursera class on machine learning by...

90
Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): https://class.coursera.org/ml-003/lecture/index (particularly video lectures from Week 3+, Logistic Regression) Coursera class on Data Science by Bill Howe (UW): https:// class.coursera.org/datasci-001/lecture/index Predict y in {1,2,3,4,5} ... use SVR (SVM in regression mode) No lecture next Tuesday ... continue to work on Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Upload: molly-webb

Post on 17-Dec-2015

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Notices• Homework 2 due in class May 21

• Uncollected Homework 1

• Coursera class on Machine Learning by Andrew Ng (Stanford):– https://class.coursera.org/ml-003/lecture/index– (particularly video lectures from Week 3+, Logistic Regression)

• Coursera class on Data Science by Bill Howe (UW):– https://class.coursera.org/datasci-001/lecture/index

• Predict y in {1,2,3,4,5} ... use SVR (SVM in regression mode)

• No lecture next Tuesday ... continue to work on projects• (No office hours Monday)

Page 2: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Re-cap on Page Rank

Page 3: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Matrix-Vector form

• Recall rj = importance of node j

rj = Si wij ri i,j = 1,….n

e.g., r2 = 1 r1 + 0 r2 + 0.5 r3 + 0.5 r4

= dot product of r vector with column 2 of W

Let r = n x 1 vector of importance values for the n nodes Let W = n x n matrix of link weights

=> we can rewrite the importance equations as r = WT r

Page 4: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Eigenvector Formulation

Need to solve the importance equations for unknown r, with known W

r = WT r

We recognize this as a standard eigenvalue problem, i.e.,

A r = l r (where A = WT)

with l = an eigenvalue = 1 and r = the eigenvector corresponding to l = 1

Page 5: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Eigenvector Formulation

Need to solve for r in

(WT – l I) r = 0 Note: W is a stochastic matrix, i.e., rows are non-negative and sum to 1

Results from linear algebra tell us that: (a) Since W is a stochastic matrix, W and WT have the same

eigenvectors/eigenvalues

(b) The largest of these eigenvalues l is always 1

(c) the vector r corresponds to the eigenvector corresponding to the largest eigenvector of W (or WT)

Page 6: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Solution for the Simple Example

1 2 3

4

1

0.5

0.5

0.5

0.50.5 0.50 1 0 0

0.5

0 0 0.5

0 0.5

0 0.5

0 0.5

0.5

0

W =

Solving for the eigenvector of W we get r = [0.2 0.4 0.133 0.2667]

Results are quite intuitive, e.g., 2 is “most important”

Page 7: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Complexity of Page Rank

• Complexity of eigenvector problemAx = lx(A is an n*n matrix)

• Complexity of fixed point updatex A x

• For the web, A is 10^9 x 10^9

Page 8: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

CS 277: Data Mining

Analysis of Web User Data

Padhraic SmythDepartment of Computer Science

University of California, Irvine

Page 9: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Data Sources for Web Mining

• Web content– Text and HTML content on Web pages, e.g., categorization of content

• Web connectivity– Hyperlink/directed-graph structure of the Web– e.g., using PageRank to infer importance of Web pages– e.g., using links to improve accuracy in classification of Web pages

• Web user data (focus of this set of slides)– Data on how users interact with the Web

• Navigation data, aka “clickstream” data• Search query data (keywords for users)• Online transaction data (e.g., purchases at an ecommerce store)

– Volume of data?• Large portals (e.g., Yahoo!, MSN) report 100’s of millions of users per month

Page 10: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Introduction to Web Mining

Useful to study user behavior on the Web

e.g. search engine data can be used for– Exploration: e.g. # of queries per session?– Modeling: e.g. time of day dependence?– Prediction: e.g. which pages are relevant?

Applications– Understand social implications of Web usage– Design of better tools for information access– E-commerce applications

Page 11: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

“Computational Advertising”

• Revenue of many internet companies is driven by advertising

• Key problem:– Given user data:

• Pages browsed• Keywords used in search• Demographics

– Determine the most relevant ads (in real-time)– About 50% of keyword searches can not be matched effectively to any

ads– (other aspects include bidding/pricing of ads)

• Another major problem: “click fraud”– Algorithms that can automatically detect when online advertisements

are being manipulated (this is a major problem for Internet advertising)

• New research area of “computational advertising”– See link to Stanford class by Andrei Broder on class Web site– http://www.stanford.edu/class/msande239/

Page 12: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Networks of Instant MessagersLeskovec and Horvitz, 2007

• Network Data– 240 IM users over 1 month– 1 billion conversations per day– 1.3 billion edges in the graph

Page 13: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Linking Demograpics with IM Usagefrom Leskovec and Horvitz, 2007

Page 14: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Login and Logout Durations for Instant Messenger Sessions

from Leskovec and Horvitz, 2007

Regularities such as power-laws are abundant in Web data

Highly non-Gaussian

Aggregate behavior – very predictable

Individual behavior – much less so

Page 15: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Query Data(source: Dan Russell, Google)

Research Question:Predict the age and

gender of an individual given their query history

More difficult:Predict how many

people are using one account, and their ages

and genders

Page 16: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Eye-Tracking: The Golden Triangle for Search from Hotchkiss, Alston, Edwards, 2005;

EnquiroResearch

Research Question:If a user clicks on the 5th item on the list should we

treat items 1 to 4 as “negatives” for learning?

Page 17: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Number of Terms per Query

- Observed query Length Distributions (bars)

- Poisson Model (lines)

(from Xie and O Halloran, 2002)

Page 18: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Power-law Characteristics of Common Queries

• Frequency f(r) of Queries with Rank r– 110000 queries from Vivisimo– 1.9 Million queries from Excite

• Presence of many rare queries makes prediction/ad-matching difficult

Approximately Power-Law

(from Xie and O Halloran, 2002)

Page 19: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

From N. Parikh and N. SundaresanInferring Semantic Query Relations from Collective User BehaviorProceedings of CIKM, 2008

Sample of 17 Million Unique Queries from eBay

Log Rank of Query in Sample

LogQueryFrequency

Page 20: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

From: M. Richardson, Microsoft ResearchLearning about the World through Long-Term Query LogsACM Transactions on the Web, October 2008

Page 21: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

CASE STUDY: DETECTING FLU OUTBREAKS FROM SEARCH ENGINE DATA

J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. BrilliantDetecting influenza epidemics using search engine query dataNature, Februrary 2009

google for “google flu trends”

Page 22: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Detecting Flu Outbreaks

• Problem:– Influenza epidemics cause 250k to 500k deaths per year

(worldwide)– New strains can emerge quickly and be very dangerous– A key factor in reducing risk is quick detection of an outbreak

• How are flu outbreaks currently detected?– Center for Disease Control (CDC) gathers counts of “influenza-

like illness” (ILI) physician visits– Surveillance data published nationally and regionally, weekly

basis– Problem: 1 to 2 week reporting lag

• Other approaches– Monitor over-the-counter flu medication sales– Monitor calls to health advice lines

Page 23: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Using Search Queries

• Idea:– Discover influenza related search queries and use these to

predict ILI counts– Motivation: should be much faster than 1-2 week lag of CDC

data

• Data:– Look at all search queries in Google from 2003 to 2008

• Several hundred billion individual searches in the United States• Keep track of only the 50 million most common queries• Keep a weekly count for each query• Also keep counts of each query by geographic region

(requires use of geo-location from IP addresses: >95% accurate)

So counts for 50 million queries x 170 weeks x 9 regions

Page 24: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Building a Predictive Model

• Target variable to be predicted:– For each week, for each region I(t) = percentage physician visits that are ILI (as compiled by

CDC)

• Input variable: Q(t) = sum of top n highest correlated queries

/ total number of queries that week

• “Model learning”: log( I(t) / [1 – I(t)] ) = a log ( Q(t)/ [1 – Q(t) ] ) + noise

(This is really just calibrating log-odds Q(t) to predict log-odds I(t))

Page 25: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Addition of terms that generally correlatewith flu season, but not with specific outbreaks,e.g., “high school basketball”

Page 26: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page 27: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Evaluating the Model

• Model fit to weekly data between 2003 and 2007– 128 weeks

• Predictions made on 42 weeks of unseen test data in 2007-2008 – 9 regions– Correlations per region of predicted ILI with actual ILI

• Mean = 0.97• Min = 0.92• Max = 0.99

Page 28: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page 29: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Key point in thesegraphs is that the CDCdata was lagging Google predictions by1 to 2 weeks

Page 30: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

MEASURING WEB USER DATA

Page 31: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

How our Web navigation is recorded…

• Web logs– Record activity between client browser and a specific Web server– Easily available– Can be augmented with cookies (provide notion of “state”)

• Search engine records– Text in queries, which pages were viewed, which snippets were clicked

on, etc

• Client-side browsing records– Automatically recorded by client-side software– Harder to obtain, but much more accurate than server-side logs

• Other sources– Web site registration, purchases, email, etc– ISP recording of Web browsing

Page 32: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Web Server Log Files

• Server Transfer Log: – transactions between a browser and server are logged– IP address, the time of the request– Method of the request (GET, HEAD, POST…)– Status code, a response from the server– Size in byte of the transaction

• Referrer Log: – where the request originated

• Agent Log: – browser software making the request (spider)

• Error Log: – request resulted in errors (404)

Page 33: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

W3C Extended Log File FormatField Date Description

Date date The date that the activity occurredTime time The time that the activity occurredClient IP address c-ip The IP address of the client that accessed your server

User Name cs-usernameThe name of the autheticated user who access your server, anonymous users are represented by -

Servis Name s-sitename The Internet service and instance number that was accessed by a clientServer Name s-computername The name of the server on which the log entry was generatedServer IP Address s-ip The IP address of the server that accessed your serverServer Port s-port The port number the client is connected toMethod cs-method The action the client was trying to performURI Stem cs-uri-stem The resource accessedURI Query cs-uri-query The query, if any, the client was trying to performProtocol Status sc-status The status of the action, in HTTP or FTP termsWin32 Status sc-win32-status The status of the action, in terms used by Microsoft WindowsBytes Sent sc-bytes The number of bytes sent by the serverBytes Received cs-bytes The number of bytes received by the serverTime Taken time-taken The duration of time, in milliseconds, that the action consumedProtocol Version cs-version The protocol (HTTP, FTP) version used by the clientHost cs-host Display the content of the host header

User Agent cs(User Agent) The browser used on the clientCookie cs(Cookie) The content of the cookie sent or received, if any

Referrer cs(Referrer)The previous site visited by the user. This site provided a link to the current site

cs = client-to-server actions

s = server actions

c = client actions

sc = server-to-client actions

Page 34: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Example of Web Log entries

Apache web log:205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET

/~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"

216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)“

202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“

202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)“

Page 35: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Routine Server Log Analysis

• Typical statistics/histograms that are computed– Most and least visited web pages– Entry and exit pages– Referrals from other sites or search engines– What are the searched keywords– How many clicks/page views a page received– Error reports, like broken links

• Many software products that produce standard reports of this type of data– Very useful for Web site managers– But does not provide “deep” insights– e.g., are there clusters/groups of users that use the site in

different ways?

Page 36: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Visualization of Web Log Data over Time

Page 37: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Descriptive Summary Statistics

• Histograms, scatter plots, time-series plots– Very important!– Helps to understand the big picture– Provides “marginal” context for any model-building

• models aggregate behavior, not individuals– Challenging for Web log data

• Examples– Session lengths (e.g., power laws)– Click rates as a function of time, content

Page 38: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Web data measurement issues

• Important to understand how data is collected

• Web data is collected automatically via software logging tools– Advantage:

• No manual supervision required– Disadvantage:

• Data can be skewed (e.g. due to the presence of robot traffic)

• Important to identify robots (also known as crawlers, spiders)

Page 39: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

100

101

102

10-6

10-5

10-4

10-3

10-2

10-1

100

Session Length L

Em

piric

al F

requ

ency

of L

L = number of page requests in a single sessionfrom visitors to www.ics.uci.eduover 1 week in November 2002(robots removed)

Page 40: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

A time-series plot of ICS Website data

Number of page requests per hour as a function of time from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.

Page 41: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Example: Web Traffic from Commercial Site(slide from Ronny Kohavi)

Weekends

Sept-11 Note significant drop in human traffic, not bot

traffic

Registration at Search Engine sites

Internal Perfor-

mance bot

Page 42: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Robot / human identification

• Removal of robot data is important preprocessing step before clickstream analysis

• Robot page-requests often identified using a variety of heuristics– e.g. some robots self-identify themselves in the server logs

• All robots in principle should visit robots.txt on the Web Server• Also, robots should identify themselves via the User Agent field in page

requests • But other robots actively try to disguise that they are robots

– Patterns of access• Robots explore the entire website in breadth first fashion• Humans access web-pages more typically in depth-first fashion• Timing between page-requests can be more regular for robots (e.g., every 5

seconds)• Duration of sessions, number of page-requests per day: often unusually

large (e.g., 1000’s of page-requests per day) for robots.

• Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots

Page 43: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Fractions of Robot Data(from Tan and Kumar, 2002)

Page 44: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

From Tan and Kumar, 2002

Overallaccuracies

of around 90%were obtainedusing decision

tree classifiers, trained on sessions

of lengths 1, 2, 3, 4,..

Page 45: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page requests, caching, and proxy servers

• In theory, requester browser requests a page from a Web server and the request is processed

• In practice, there are– Other users– Browser caching– Dynamic addressing in local network– Proxy Server caching

Page 46: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page requests, caching, and proxy servers

A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.

Page 47: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page requests, caching, and proxy servers

• Web server logs are therefore not so ideal in terms of a complete and faithful representation of individual page views

• There are heuristics to try to infer the true actions of the user: -– Path completion (Cooley et al. 1999)

• e.g. If known B -> F and not C -> F, then session ABCF can be interpreted as ABCBF

• Anderson et al. 2001 for more heuristics

• In general case, it is hard to know what exactly the user viewed

Page 48: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Identifying individual users from Web server logs

• Useful to associate specific page requests to specific individual users

• IP address most frequently used

• Disadvantages– One IP address can belong to several users– Dynamic allocation of IP address

• Better to use cookies (or login ID if available)– Information in the cookie can be accessed by the Web server

to identify an individual user over time– Actions by the same user during different sessions can be

linked together

Page 49: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Identifying individual users from Web server logs

• Commercial websites use cookies extensively

• 97 % of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003)

• However …– There are privacy issues – need implicit user cooperation– Cookies can be deleted / disabled

• Another option is to enforce user registration– High reliability– But can discourage potential visitors– Large portals (such as Yahoo!) have high fraction of logged-in

users

Page 50: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Sessionizing

• Time oriented (robust)– e.g., by gaps between requests

• not more than 20 minutes between successive requests• this is a heuristic – but is a standard “rule” used in practice

• Navigation oriented (good for short sessions and when timestamps unreliable)– Referrer is previous page in session, or– Referrer is undefined but request within 10 secs, or – Link from previous to current page in web site

Page 51: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Client-side data

• Advantages of collecting data at the client side:– Direct recording of page requests (eliminates ‘masking’ due to

caching)– Recording of all browser-related actions by a user (including

visits to multiple websites)– More-reliable identification of individual users (e.g. by login ID

for multiple users on a single computer)

• Preferred mode of data collection for studies of navigation behavior on the Web

• Companies like ComScore and Nielsen use client-side software to track home computer users

Page 52: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Client-side data

• Statistics like ‘Time per session’ and ‘Page-view duration’ are more reliable in client-side data

• Some limitations– Still some statistics like ‘Page-view duration’ cannot be totally

reliable e.g. user might go to fetch coffee– Need explicit user cooperation– Typically recorded on home computers – may not reflect a

complete picture of Web browsing behavior

• Web surfing data can be collected at intermediate points like ISPs, proxy servers– Can be used to create user profile and target advertise

Page 53: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Time (over a 24-hour period)

Example of client-side data from Alexa, each “dot” is a URL request in a browser

Page 54: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

CASE STUDY:CLUSTERS OF MARKOV CHAINS FOR MODELING USER NAVIGATION ON A WEBSITE

Page 55: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Markov Models for Modeling User Navigation

• General approach is to use a finite-state Markov chain– Each state can be a specific Web page or a category of Web

pages– If only interested in the order of visits (and not in time), each

new request can be modeled as a transition of states

• Issues– Self-transition– Time-independence

Page 56: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Modeling Web Page Requests with Markov chain mixtures

• MSNBC Web logs (circa 2000)– Order of 2 million individual users per day– different session lengths per individual– difficult visualization and clustering problem

• WebCanvas– uses mixtures of Markov chains to cluster individuals based on their

observed sequences– software tool: EM mixture modeling + visualization

• Next few slides are based on material in: I. Cadez et al, Model-based clustering and visualization of navigation

patterns on a Web site, Journal of Data Mining and Knowledge Discovery, 2003.

Page 57: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Page 58: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

…5115

1111115151115177777777

1113333333131113332232

…User 5User 4

User 3User 2User 1

From Web logs to sequences

Page 59: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

Graphical Model for Markov Chains

ci

A

Pages

Page 60: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

Multiple Users…One Common Markov Chain

ci

A

Pages

Users

Page 61: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

Multiple Users…One Chain per User

ci

A

Pages

Users

Page 62: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

One Chain per Cluster of Users

ci

ClusterjA

Pages

Users

Cadez, Meek, Heckerman, Smyth, 2003

Page 63: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Likelihood of a Sequence in a Markov Chain Model

P(sequence) = P0(first state) x Pij ( aij )nij

Maximum likelihood estimation: - given the nij’s, find the aij’s that maximize P(sequence)

Number of transitions from i to j

Transition probability from i to j

Page 64: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Probability Estimation in Markov Chains

p(state j | state i) = aij = nij / ni

With smoothing (or priors) aij = (nij + aij ) / (ni + Sm aim )

where aij / Sm aim is the prior probability of transitioning from i to j

and where Sm aim is the strength (equivalent sample size) of the prior

Could set aij = aj (same priors for transitioning out of each state) or

aij = a (same priors for transitioning in and out of each state)

Number of transitions from i to jNumber of times in state i

Page 65: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Probability Estimation with Multiple Sequences

T sequences Assume sequences are independent nijt = number of times going from i to j in sequence t, t =

1,… T

p(state j | state i) = aij = St nijt / St nit

With smoothing:

aij = (St nijt + aij ) / (St nit + Sm aim )

Number of transitions from i to j in sequence t Number of times in state i in sequence t

Page 66: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Clusters of Markov Chains

• Assume our data is generated by K different Markov chains– Each chain has its own parameters A, P0

– This is a a mixture of Markov chains

• T sequences – but we don’t know which sequence came from which cluster

• Chicken-and-egg problem– If we knew which cluster each sequence belonged to, we could

group sequences by cluster, and estimation of cluster parameters is easy

– If we knew the parameters for each Markov chain, we could figure out (by Bayes rule) the most likely cluster for each sequence

Page 67: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

Clusters of Probabilistic State Machines

B

E

C

A

B

E

C

ACluster 1

Motivation:approximate the heterogeneity of Web surfing behavior

Cluster 2

Page 68: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

EM Algorithm for Markov Clusters

• EM = expectation-maximization

• E-step– For each sequence, and given current parameter estimates for each

cluster, estimate p(cluster k | sequence), k = 1, … K

• M-step– Given p(cluster k | sequence), estimate Ak and Pk0 for each cluster

Algorithm:• Start with an initial random guess at parameters or p(cluster k | sequence)• Iteration = pair of EM steps• Halt iterations when parameters or p(cluster k | sequence) are not

changing

Guaranteed to converge to a local maximum of the likelihood (under general conditions)

Page 69: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

E Step of EM Algorithm

T sequences

P(cluster k | sequence t) proportional to P(sequence t | cluster k) P(cluster

k)

P(sequence t | cluster k)

= pkt = Pk0(first state) x Pij ( akij )nijt

Compute these “membership” probabilities for each sequence and each cluster k. Yields a T x K matrix of membership probabilities

Number of times goingfrom state i to state j in sequence t

Current parameter estimates for cluster k

Page 70: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

M Step of EM Algorithm

For each transition probability parameter in each cluster k=1,..K

akij = ( St nijt pkt ) / ( St nit pkt )

Compute Pk0(first state) in a similar fashion

Also estimate P(cluster k) = (1/T) St pkt

Transitions from i to j in sequence tare “fractionally weighted”by pkt, probability that sequence t came from cluster k

Transition probability i->j for cluster k

Number of times in state i in sequence tare “fractionally weighted”by pkt, probability that sequence t came from cluster k

Page 71: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Time Complexity

• E-Step– T sequences, average length L– K clusters– Compute T x K matrix

• Each entry takes O(L) time to compute– O(TKL) overall

• M-step– For each of M2 transition probabilities

• Sum over each sequence T– O(T M2 )– other parameters take less time

Page 72: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Experimental Methodology

• Model Training:– fit 2 types of models

• mixtures of histograms (multinomials)• mixtures of finite state machines

– Train on a full day’s worth of MSNBC Web data

• Model Evaluation:– “one-step-ahead” prediction on unseen test data

• Test sequences from a different day of Web logs– compute log P(user’s next click | previous clicks, model)

– Using equation on the previous slide– logP score:

• Rewarded if next click was given high P by the model• Punished if next click was given low P by the model

– negative average of logP scores ~ “predictive entropy”• Has a natural interpretation • Lower bounded by 0 bits (perfect prediction)• Upper bounded by log M bits, where M is the number of categories

Page 73: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

20 40 60 80 100 120 140 160 180 2002

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4N

eg

ativ

e lo

g-l

ike

liho

od

[bits

/toke

n]

Number of mixture components [K]

Predictive Entropy Out-of-Sample

Mixtures of Multinomials

Mixtures of SFSMs

Page 74: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

0 5 10 15 20

0

5

10

log

coun

t(R) Cluster 1: Category 13

0 5 10 15 20

0

5

10Cluster 1: Category 14

0 10 20 30 40-2

0

2

4 Cluster 1: Category 8

0 5 10 15 20

0

5

10

log

coun

t(R) Cluster 2: Category 1

0 5 10

0

5

10Cluster 2: Category 7

0 10 20 30 40-2

0

2

4 Cluster 2: Category 8

0 5 10 15 20

0

5

10

log

coun

t(R) Cluster 3: Category 12

0 5 10

0

5

10Cluster 3: Category 1

0 5 10 15 20-2

0

2

4 Cluster 3: Category 13

0 10 20 30 40

0

5

10

log

coun

t(R) Cluster 4: Category 2

0 5 10

0

5

10 Cluster 4: Category 1

0 5 10

0

5

10 Cluster 4: Category 3

0 5 10 15 20

0

5

10

R = Run Length

log

coun

t(R) Cluster 5: Category 9

0 1 2 3 4 5

0

5

10

R = Run Length

Cluster 5: Category 12

0 1 2 3 4

0

5

10

R = Run Length

Cluster 5: Category 6

RUN LENGTH DISTRIBUTIONS WITHIN MARKOV CLUSTERS

Page 75: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Timing Results

0 20 40 60 80 100 120 140 160 180 200-500

0

500

1000

1500

2000

2500T

ime

[se

c]

Number of mixture components [K]

N = 70,000

N=110,000

N=150,000

Page 76: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

WebCanvas

• Software tool for Web log visualization– uses Markov mixtures to cluster data for display– extensively used within Microsoft– also applied to non-Web data (e.g., how users navigate in Word,

etc)– Algorithm and visualization are in SQLServer (the “sequence

mining” tool)

• Model-based visualization– random sample of actual sequences– interactive tiled windows displayed for visualization– more effective than

• planar graphs• traffic-flow movie in Microsoft Site Server v3.0

Page 77: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:

Page 78: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Insights from WebCanvas for MSNBC data

• From msnbc.com site adminstrators….– significant heterogeneity of behavior– relatively focused activity of many users

• typically only 1 or 2 categories of pages– many individuals not entering via main page– detected problems with the weather page– missing transitions (e.g., tech <=> business)

Page 79: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Possible Extensions

• Adding time-dependence– adding time-between clicks, time of day effects

• Uncategorized Web pages– coupling page content with sequence models

• Modeling “switching” behaviors– allowing users to switch between behaviors– Could use a topic-style model: users = mixtures of

behaviors– e.g., Girolami M & Kaban A., Sequential Activity Profiling: Latent

Dirichlet Allocation of Markov Chains, Journal of Data Mining and Knowledge Discovery, Vol 10, 175-196.

Page 80: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Related Work

• Mixtures of Markov chains– special case: Poulsen (1990)– general case: Ridgeway (1997), Smyth (1997)

• Clustering of Web page sequences– non-probabilistic approaches (Fu et al, 1999)

• Markov models for prediction– Anderson et al (IJCAI, 2001):

• mixtures of Markov outperform other sequential models for page-request prediction

– Sen and Hansen 2003– Zukerman et al. 1999

Page 81: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Further Reading

• Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth, Wiley, 2003.

• ACM Transactions on Internet Technology (ACM TOIT)

• Annual WebKDD workshops at the ACM SIGKDD conferences.

• Papers on user-navigation models– Selective Markov models for predicting Web page accesses, M.

Deshpande, G. Karypis, ACM Transactions on Internet Technology, May 2004.

– Model-based clustering and visualization of navigation patterns on a Web site, Cadez et al, Journal of Data Mining and Knowledge Discovery, 2003.

Page 82: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

BACKUP SLIDES(CAN BE IGNORED)

Page 83: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Prediction with Markov mixtures

P(st+1 | s[1,t] ) =

Page 84: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Prediction with Markov mixtures

P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] )

Page 85: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Prediction with Markov mixtures

P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] ) = S P(st+1 | k , s[1,t] ) P(k | s[1,t] )

= S P(st+1 | k , st ) P(k | s[1,t] )

Prediction of kth component

Membership, basedon sequence history

=> Predictions are a convex combination of K different component transition matrices,with weights based on sequence history

Page 86: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Analysis of Search Engine Query Logs

# of Queries Source Time Period

Lau & Horvitz 4690 Excite Sep 1997

Silverstein et al 1 Billion AltaVista 6 weeks in Aug & Sep 1998

Spink et al (series of studies)1 Million for each time period

Excite Sep 1997Dec 1999May 2001

Xie & O’Hallaron 110,000 Vivisimo 35 days Jan & Feb 2001

1.9 Million Excite 8 hrs in a day, Dec 1999

Page 87: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Main Results• Average number of terms in a query ranges from a low of 2.2

to a high of 2.6

• The most common number of terms in a query was 2

• The majority of users don’t refine their query – The number of users who viewed only a single page increased

from 29% (1997) to 51% (2001) (Excite)– 85% of users viewed only first page of search results (AltaVista)

• 45% (2001) of queries are about Commerce, Travel, Economy, People (was 20% in 1997)

• All four studies produced a generally consistent set of findings about user behavior in a search engine context– most users view relatively few pages per query– most users don’t use advanced search features

Page 88: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Markov models for page prediction

• For simplicity, consider order-dependent, time-independent finite-state Markov chain with M states

• Let s be a sequence of observed states of length L. e.g. s = ABBCAABBCCBBAA with three states A, B and C. st is state at position t (1<=t<=L). In general,

• first-order Markov assumption

• This provides a simple generative model to produce sequential data

L

ttt sssPsPsP

2111 ),...,|()()(

L

ttt ssPsPsP

211 )|()()(

)|(),...,|( 111 tttt ssPsssP

Page 89: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Markov models for page prediction

• First-order Markov model assumes that the next state is based only on the current state

• Limitations– Doesn’t consider ‘long-term memory’

• We can try to capture more memory with kth-order Markov chain

• Limitations– Inordinate amount of training data O(Mk+1)

),..,|(),..,|( 111 kttttt sssPsssP

Page 90: Notices Homework 2 due in class May 21 Uncollected Homework 1 Coursera class on Machine Learning by Andrew Ng (Stanford): –//class.coursera.org/ml-003/lecture/index

Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

Mixtures of Markov Chains

• Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain:

with a mixture of first-order Markov chains

where c is a discrete-value hidden variable taking K values Sk P(c = k) = 1

and

P(st | st-1, c = k) is the transition matrix for the kth mixture component

• One interpretation of this is user behavior consists of K different navigation behaviors described by the K Markov chains

)(),|(),...,|(1

111 kcPkcssPsssPK

ktttt

)|(),...,|( 111 tttt ssPsssP