web-mining project

Adaptive Web Sites

Devesh Sinha

[email protected]

Introduction and Background

• Organizations use the World Wide Web to conduct business.
• They generate and collect large volumes of data in daily operations.
• Case: www.amazon.com
  - Analysis: the solution is not the same as for a grocery store.
• Case: mati.eas.asu.edu
  - Objective of this website
  - No sales, so no buying information

http://www.amazon.com/

Mati: Architecture

[Architecture diagram; labels: WWW, User, NAT, servers]


Request Profiling

Definition: the process by which information is gathered, organized, and interpreted to create a summary or description of the user.

Approaches:
• web server log
• asking the user (registration and feedback)
• pre-established

Log file types
• Access log
• Referrer log
• Agent log
• Error log

Data Sources:

• Server-level collection: the server stores data about the requests made by clients, so the data generally covers just one source.

• Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.

• Proxy-level collection: information is stored at the proxy, so the data covers several Web sites, but only for users whose Web clients pass through the proxy.

Web Server Access Logs

looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . .

• Typical Data in a Server Access Log

Access log format: IP address, userid, time, method, url, protocol, status, size

Example: mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html

Other server logs: referrer logs, agent logs

Request Profiling

Web Server Log
• client IP address or hostname
• user id ("-" if anonymous)
• access time
• HTTP request method (e.g. GET, POST, HEAD ...)
• path of the resource on the Web server (URL)
• the protocol (e.g. HTTP/1.0, HTTP/1.1 ...)
• the status code (e.g. 404 for Not Found ...)
• the number of bytes transmitted
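As a minimal sketch of how these fields could be extracted from one log entry, the Python below parses a line laid out like the sample access-log entries in this deck. The regular expression, the field names, and the `parse_line` helper are illustrative assumptions, not part of the project.

```python
import re
from datetime import datetime

# Assumed layout, modelled on the sample entries in the deck:
#   host userid - [time] "method url protocol" status [size]
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.fullmatch(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the text fields that have a natural typed form.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = int(rec["size"]) if rec["size"] is not None else None
    return rec

entry = parse_line(
    'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Entries without a byte count (such as the 302 redirect) parse with `size` set to `None`; truncated or malformed entries return `None` and can be set aside during cleaning.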

Figure 4. web usage mining research projects and products

Approaches:
• Concept 1: Prepared Log + Statistical
• Concept 2: Prepared Log + Mining

Preprocessing:
• Integrate logs: logs are only meant for post-mortem analysis
• Clean logs: eliminate outliers

Typical Web Usage Mining Preprocessing

Transaction Identification

Main Questions:
• how to identify unique users
• how to identify/define a user transaction

Problems:
• user ids are often suppressed due to security concerns
• individual IP addresses are sometimes hidden behind proxy servers
• client-side and proxy caching make server log data less reliable

Standard Solutions/Practices:
• user registration: practical?
• client-side cookies: not foolproof
• cache busting: increases network traffic

A Heuristic Approach

Identifying User Sessions
• use IP, agent, and OS fields as key attributes;
• use client-side cookies and unique user ids, if available;
• use session time-outs;
• use synchronized referrer log entries and time stamps to expand user paths belonging to a session;
• use path completion to infer cached references. Example: expanding a session A ==> B ==> C by an access pair (B ==> D) results in: A ==> B ==> C ==> B ==> D;
• to disambiguate paths, sessions are expanded based on page attributes (size, type), reference length, and the number of back references required to complete the path.
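Two of these heuristics, session time-outs and path completion, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's implementation: the 30-minute default time-out, the function names, and the choice to backtrack to the most recent visit of the referrer are all assumptions.

```python
def split_by_timeout(times, timeout=1800):
    """Split sorted request times (in seconds) into sessions.

    A new session starts whenever the gap to the previous request
    exceeds `timeout` (1800 s is an assumed default).
    """
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def complete_path(path, referrer, page):
    """Extend `path` with the access pair (referrer ==> page).

    If the referrer is not the last page of the path, assume the user
    went back through cached pages until reaching it, and insert those
    back references. Returns None if the referrer is not in the path.
    """
    if referrer not in path:
        return None
    last = len(path) - 1 - path[::-1].index(referrer)  # most recent visit
    back_refs = path[last:-1][::-1]  # pages revisited while going back
    return path + back_refs + [page]

print(complete_path(["A", "B", "C"], "B", "D"))
print(split_by_timeout([0, 60, 4000, 4100]))
```

The first call reproduces the slide's example, turning A ==> B ==> C plus the pair (B ==> D) into A ==> B ==> C ==> B ==> D.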

Example: Session Inference with Referrer Log

Ref  IP           Time      URL  Referrer  Agent
1    www.aol.com  08:30:00  A    #         Mozilla/2.0; AIX 4.1.4
2    www.aol.com  08:30:01  B    E         Mozilla/2.0; AIX 4.1.4
3    www.aol.com  08:30:02  C    B         Mozilla/2.0; AIX 4.1.4
4    www.aol.com  08:30:01  B    #         Mozilla/2.0; Win 95
5    www.aol.com  08:30:03  C    B         Mozilla/2.0; Win 95
6    www.aol.com  08:30:04  F    #         Mozilla/2.0; Win 95
7    www.aol.com  08:30:04  B    A         Mozilla/2.0; AIX 4.1.4
8    www.aol.com  08:30:05  G    B         Mozilla/2.0; AIX 4.1.4

Identified Sessions:
S1: # ==> A ==> B ==> G   from references 1, 7, 8
S2: E ==> B ==> C         from references 2, 3
S3: # ==> B ==> C         from references 4, 5
S4: # ==> F               from reference 6
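One way to arrive at these four sessions is sketched below in Python. It is a simplified reading of the heuristic, not the author's code: requests are grouped by (IP, agent), and each request is attached to the most recently started session of that group whose last page equals the request's referrer; otherwise it opens a new session. The `infer_sessions` name and the data layout are assumptions, and times are compared as same-day HH:MM:SS strings.

```python
from collections import defaultdict

# (ref no., IP, time, URL, referrer, agent), copied from the example table.
LOG = [
    (1, "www.aol.com", "08:30:00", "A", "#", "Mozilla/2.0; AIX 4.1.4"),
    (2, "www.aol.com", "08:30:01", "B", "E", "Mozilla/2.0; AIX 4.1.4"),
    (3, "www.aol.com", "08:30:02", "C", "B", "Mozilla/2.0; AIX 4.1.4"),
    (4, "www.aol.com", "08:30:01", "B", "#", "Mozilla/2.0; Win 95"),
    (5, "www.aol.com", "08:30:03", "C", "B", "Mozilla/2.0; Win 95"),
    (6, "www.aol.com", "08:30:04", "F", "#", "Mozilla/2.0; Win 95"),
    (7, "www.aol.com", "08:30:04", "B", "A", "Mozilla/2.0; AIX 4.1.4"),
    (8, "www.aol.com", "08:30:05", "G", "B", "Mozilla/2.0; AIX 4.1.4"),
]

def infer_sessions(log):
    """Return sessions as dicts with the page path and the log references."""
    by_user = defaultdict(list)  # (IP, agent) -> sessions of that user
    sessions = []                # all sessions, in order of creation
    for ref, ip, _time, url, referrer, agent in sorted(log, key=lambda r: r[2]):
        candidates = by_user[(ip, agent)]
        # "#" means no referrer, so such a request always opens a session.
        target = None
        if referrer != "#":
            target = next(
                (s for s in reversed(candidates) if s["pages"][-1] == referrer),
                None,
            )
        if target is None:
            target = {"pages": [referrer], "refs": []}
            candidates.append(target)
            sessions.append(target)
        target["pages"].append(url)
        target["refs"].append(ref)
    return sessions

for i, s in enumerate(infer_sessions(LOG), start=1):
    print(f"S{i}: {' ==> '.join(s['pages'])}  from references {s['refs']}")
```

On the example table this yields the sessions S1 to S4 listed above. A fuller version would also apply session time-outs and path completion.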

Example

[Figure: site link structure over pages A through T]

User paths:
USER1: A B F O G A D
USER2: A B C J
USER3: L R

Concept 1: Binary exponential backoff

[Figure: channel timeline showing frames, a contention interval made up of contention slots, and an idle period]

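Binary exponential backoff algorithm:

• after the 1st collision, wait 0 or 1 slots, chosen at random.

• after the 2nd collision, wait 0, 1, 2, or 3 slots, chosen at random.

• and so on: after the i-th collision, wait between 0 and 2^i − 1 slots, up to a maximum of 1023 slots.

• after 16 collisions, give up and raise an exception.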
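The waiting rule above can be sketched in Python as follows. The function and constant names are illustrative assumptions; the cap at exponent 10 follows from the 1023-slot maximum (2^10 − 1 = 1023).

```python
import random

MAX_EXPONENT = 10    # caps the window at 2**10 - 1 = 1023 slots
MAX_COLLISIONS = 16  # at this point the sender gives up

def backoff_slots(collisions, rng=random):
    """Return how many slots to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions, giving up")
    window = 2 ** min(collisions, MAX_EXPONENT)
    return rng.randrange(window)  # uniform over 0 .. window - 1

for n in (1, 2, 3, 12):
    print(n, backoff_slots(n))
```

Passing a seeded `random.Random` instance as `rng` makes the waits reproducible for testing.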

Concept 1:

Similarly, from the current logs:
• rank the accessed pages
• use binary backoff to change the ranks

Concept 2:
• Use the NAT as level 1 filtering
• Filter the traffic according to the request pattern
• Users can reach the same page, but with the option to choose further
• Rule-based prediction

Rule Induction

Rule induction (rule-based prediction)
• We first generate a set of rules from a data warehouse, then use them to predict values for new data items.
• It works much better on larger (and real) data sets, not just on samples of data.

Two phases:
• Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
• Prediction: apply the rules to a new data set and match the rules to make predictions.

Rule Induction Example

Domain  Month-Zone  Request  Time          Class
edu     mid         http     after-office  a
edu     mid         http     office        a
net     mid         http     after-office  b
com     start       http     after-office  b
com     end         normal   after-office  b
com     end         normal   office        a
net     end         normal   office        b
edu     start       http     after-office  b
edu     end         normal   after-office  a
com     start       normal   after-office  a
edu     start       normal   office        a
net     start       http     office        a
net     mid         normal   after-office  a
com     start       http     office        b

Training Set
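To make the two phases concrete, here is a small Python sketch that runs a one-attribute rule learner (often called 1R or OneR) on this training set. OneR is swapped in here only because it is about the simplest rule inducer; the slides do not say which algorithm the project used. The function names and the new request passed to `predict` are made up for illustration.

```python
from collections import Counter, defaultdict

ATTRS = ["Domain", "Month-Zone", "Request", "Time"]

# Training set from the slide; the last column is the class.
TRAIN = [
    ("edu", "mid", "http", "after-office", "a"),
    ("edu", "mid", "http", "office", "a"),
    ("net", "mid", "http", "after-office", "b"),
    ("com", "start", "http", "after-office", "b"),
    ("com", "end", "normal", "after-office", "b"),
    ("com", "end", "normal", "office", "a"),
    ("net", "end", "normal", "office", "b"),
    ("edu", "start", "http", "after-office", "b"),
    ("edu", "end", "normal", "after-office", "a"),
    ("com", "start", "normal", "after-office", "a"),
    ("edu", "start", "normal", "office", "a"),
    ("net", "start", "http", "office", "a"),
    ("net", "mid", "normal", "after-office", "a"),
    ("com", "start", "http", "office", "b"),
]

def one_r(rows, attrs):
    """Rule discovery: pick the attribute whose value -> class rules
    misclassify the fewest training rows (first attribute wins ties)."""
    best = None
    for i, attr in enumerate(attrs):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[-1]] += 1
        # One rule per attribute value: predict its majority class.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in by_value.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

def predict(model, row, attrs):
    """Prediction: match the new row against the discovered rules."""
    attr, rules, _errors = model
    return rules.get(row[attrs.index(attr)])

model = one_r(TRAIN, ATTRS)
print(model[0], model[2])
print(predict(model, ("edu", "end", "http", "office"), ATTRS))
```

On this data the learner selects Domain, with 5 training errors out of 14; Request does equally well and loses only because ties go to the first attribute. A practical rule inducer would combine several attributes in each rule.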

Results:

Statistical approach performance:
• Slow to conform to changes
• Good performance with general access patterns

NAT rule-based performance:
• High accuracy so far

Future work required:
• Multi-level association
• Better feature selection
• Scalable distributed tool

Project: Comments

Problems faced:
• Data cleaning
• Learning curve

Future applications:
• Network processors
• Intelligent parking slots

Reference
• www.powerfulforces.org.nz/Papers/Lim.pdf
• Towards Adaptive Web Sites: Conceptual Framework & Case Study. Mike Perkowitz, Oren Etzioni
• Web Usage Mining: Discovery and Applications of Usage Patterns. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan
• WUM: A Web Utilization Miner (URL: http://wum.wiwi.hu-berlin.de/index.html)
• WEKA: Machine Learning Algorithms in Java
• Improving the Effectiveness of a Web Site with Web Usage Mining. Myra Spiliopoulou, Carsten Pohle, Lukas C. Faulstich
