web-mining project

Post on 22-Apr-2015

Adaptive Web Sites

Devesh Sinha

devesh@asu.edu

Introduction and Background

• Businesses use the World Wide Web to conduct business, and they generate and collect large volumes of data in daily operations.

• Case: www.amazon.com. Analysis: the solution is not the same as for a grocery store.

• Case: mati.eas.asu.edu. Objective of this website: no sales, so no buying information.

Request Profiling

Definition: the process by which information is gathered, organized, and interpreted to create a summary or description of the user.

Approaches:

• web server log

• ask for (registration & feedback)

• pre-established

Log file types:

• Access log

• Referrer log

• Agent log

• Error log

Data Sources:

• Server-level collection: the server stores data about the requests made by clients, so the data generally concern just one source.

• Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.

• Proxy-level collection: information is stored at the proxy, so the Web data cover several Web sites, but only for users whose Web clients pass through the proxy.

Web Server Access Logs

looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . . . . . . . .

• Typical Data in a Server Access Log

Access log format: IP address, user id, time, method, URL, protocol, status, size

Example: mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html

Other server logs: referrer logs, agent logs

Request Profiling

Web server log fields:

• client IP address or hostname

• user id ("-" if anonymous)

• access time

• HTTP request method (e.g. GET, POST, HEAD)

• path of the resource on the Web server (URL)

• the protocol (e.g. HTTP/1.0, HTTP/1.1)

• the status code (e.g. 404 for Not Found)

• the number of bytes transmitted
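The fields listed above can be pulled out of a raw access log line with a regular expression. The following is a minimal sketch, assuming lines shaped like the samples on the "Web Server Access Logs" slide; the names LOG_PATTERN, LogEntry, and parse_line are illustrative and do not come from the original project.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout (taken from the sample lines in this deck):
# host user - [time zone] "METHOD url PROTOCOL" status [size]
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<user>\S+) \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?$'
)


class LogEntry(NamedTuple):
    host: str
    user: Optional[str]   # None when the log shows "-" (anonymous)
    time: datetime
    method: str
    url: str
    protocol: str
    status: int
    size: Optional[int]   # None when the byte count is missing


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("size")
    user = match.group("user")
    return LogEntry(
        host=match.group("host"),
        user=None if user == "-" else user,
        time=datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        method=match.group("method"),
        url=match.group("url"),
        protocol=match.group("protocol"),
        status=int(match.group("status")),
        size=int(size) if size is not None else None,
    )
```

Truncated or malformed lines return None rather than raising, so a cleaning pass can count and discard them separately.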

Figure 4. web usage mining research projects and products

Approaches:

• Concept 1: Prepared Log + Statistical

• Concept 2: Prepared Log + Mining

Preprocessing:

• Integrate logs: logs are only meant for post-mortem analysis

• Clean logs: eliminate outliers

Typical Web Usage Mining Preprocessing

Transaction Identification

Main Questions:

• how to identify unique users

• how to identify/define a user transaction

Problems:

• user ids are often suppressed due to security concerns

• individual IP addresses are sometimes hidden behind proxy servers

• client-side and proxy caching makes server log data less reliable

Standard Solutions/Practices:

• user registration: practical?

• client-side cookies: not foolproof

• cache busting: increases network traffic

A Heuristic Approach

Identifying User Sessions:

• use IP, agent, and OS fields as key attributes;

• use client-side cookies and unique user ids, if available;

• use session time-outs;

• use synchronized referrer log entries and time stamps to expand user paths belonging to a session;

• path completion to infer cached references. Example: expanding a session A ==> B ==> C by an access pair (B ==> D) results in A ==> B ==> C ==> B ==> D;

• to disambiguate paths, sessions are expanded based on page attributes (size, type), reference length, and the number of back references required to complete the path.
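The path completion step can be sketched in a few lines. This is a simplified illustration, assuming the user returned to the referring page with the browser's back button, so the pages between the end of the session and the referrer are replayed in reverse. The function name complete_path is hypothetical, and the disambiguation by page attributes and reference length is not modelled.

```python
from typing import List


def complete_path(session: List[str], referrer: str, page: str) -> List[str]:
    """Extend a session with (referrer ==> page), inferring cached back references."""
    if not session or referrer not in session:
        raise ValueError("referrer does not occur in this session")
    if session[-1] == referrer:
        # The user was already on the referring page: no back references needed.
        return session + [page]
    # Index of the most recent visit to the referrer.
    last_seen = len(session) - 1 - session[::-1].index(referrer)
    # Pages revisited from the cache while going back, ending at the referrer.
    back_references = session[last_seen:-1][::-1]
    return session + back_references + [page]


# The example from the slide: A ==> B ==> C expanded by the access pair (B ==> D).
print(" ==> ".join(complete_path(["A", "B", "C"], "B", "D")))
```

Only the most recent occurrence of the referrer is considered here, which corresponds to choosing the completion that needs the fewest back references.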

Example: Session Inference with Referrer Log

#  IP           Time      URL  Referrer  Agent
1  www.aol.com  08:30:00  A    #         Mozilla/2.0; AIX 4.1.4
2  www.aol.com  08:30:01  B    E         Mozilla/2.0; AIX 4.1.4
3  www.aol.com  08:30:02  C    B         Mozilla/2.0; AIX 4.1.4
4  www.aol.com  08:30:01  B    #         Mozilla/2.0; Win 95
5  www.aol.com  08:30:03  C    B         Mozilla/2.0; Win 95
6  www.aol.com  08:30:04  F    #         Mozilla/2.0; Win 95
7  www.aol.com  08:30:04  B    A         Mozilla/2.0; AIX 4.1.4
8  www.aol.com  08:30:05  G    B         Mozilla/2.0; AIX 4.1.4

Identified Sessions:

S1: # ==> A ==> B ==> G (from references 1, 7, 8)
S2: E ==> B ==> C (from references 2, 3)
S3: # ==> B ==> C (from references 4, 5)
S4: # ==> F (from reference 6)
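One way to arrive at these four sessions is sketched below. It is an assumption-laden illustration, not the method used in the project: requests are grouped by (IP, agent), a request with no referrer ("#") starts a new session, and any other request is attached to a session of the same user that ends at the referrer (or, failing that, contains it). The function name infer_sessions and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (ip, time, url, referrer, agent), as read from the example table.
Record = Tuple[str, str, str, str, str]

RECORDS: List[Record] = [
    ("www.aol.com", "08:30:00", "A", "#", "Mozilla/2.0; AIX 4.1.4"),
    ("www.aol.com", "08:30:01", "B", "E", "Mozilla/2.0; AIX 4.1.4"),
    ("www.aol.com", "08:30:02", "C", "B", "Mozilla/2.0; AIX 4.1.4"),
    ("www.aol.com", "08:30:01", "B", "#", "Mozilla/2.0; Win 95"),
    ("www.aol.com", "08:30:03", "C", "B", "Mozilla/2.0; Win 95"),
    ("www.aol.com", "08:30:04", "F", "#", "Mozilla/2.0; Win 95"),
    ("www.aol.com", "08:30:04", "B", "A", "Mozilla/2.0; AIX 4.1.4"),
    ("www.aol.com", "08:30:05", "G", "B", "Mozilla/2.0; AIX 4.1.4"),
]


def infer_sessions(records: List[Record]) -> List[List[str]]:
    """Group requests into sessions using IP, agent, and referrer."""
    sessions: List[List[str]] = []                       # in order of creation
    by_user: Dict[Tuple[str, str], List[List[str]]] = defaultdict(list)

    # Stable sort by time stamp (fixed-width HH:MM:SS strings sort correctly).
    for ip, _time, url, referrer, agent in sorted(records, key=lambda r: r[1]):
        candidates = by_user[(ip, agent)]
        target = None
        if referrer != "#":
            # Prefer a session whose last page is the referrer ...
            for path in reversed(candidates):
                if path[-1] == referrer:
                    target = path
                    break
            # ... otherwise any session of this user that contains it.
            if target is None:
                for path in reversed(candidates):
                    if referrer in path:
                        target = path
                        break
        if target is None:
            # No matching session: start a new one at the referrer.
            target = [referrer]
            candidates.append(target)
            sessions.append(target)
        target.append(url)
    return sessions


for number, path in enumerate(infer_sessions(RECORDS), start=1):
    print(f"S{number}: " + " ==> ".join(path))
```

Session time-outs and path completion are left out to keep the sketch short; in practice they would be applied on top of this grouping.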

Example

[Figure: site page hierarchy over pages A through T]

USER1: A B F O G A D

USER2: A B C J

USER3: L R

Concept 1: Binary exponential backoff

[Figure: channel timeline showing frames, contention intervals made up of contention slots, and an idle period]

Binary exponential backoff algorithm:

• after the 1st collision, wait 0 or 1 slots, chosen at random.

• after the 2nd collision, wait 0, 1, 2, or 3 slots, chosen at random.

• and so on, up to a maximum of 1023 slots.

• after 16 collisions, give up and raise an exception.
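The waiting rule above can be written as a short function. This is a minimal sketch of the generic binary exponential backoff rule as stated on the slide; the names backoff_slots, MAX_EXPONENT, and MAX_COLLISIONS are illustrative.

```python
import random

MAX_EXPONENT = 10      # the window stops growing at 2**10 slots, i.e. 0..1023
MAX_COLLISIONS = 16    # after this many collisions the sender gives up


def backoff_slots(collisions: int, rng: random.Random = random.Random()) -> int:
    """Return how many slots to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions, giving up")
    exponent = min(collisions, MAX_EXPONENT)
    # Pick uniformly from 0 .. 2**exponent - 1.
    return rng.randrange(2 ** exponent)
```

After the first collision the result is 0 or 1, after the second it lies in 0..3, and from the tenth collision onward it lies in 0..1023.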

Concept 1:

Similarly, from current logs:

• Rank accessed pages

• Use binary backoff to change the ranks

Concept 2:

• Use the NAT as level 1 filtering

• Filter the traffic as per request pattern

• Users can reach the same page but with option to further choose

• Rule Based Prediction

Rule Induction

Rule induction (rule-based prediction):

• We first generate a set of rules from a data warehouse, then use them to predict values for new data items.

• It works much better on larger (and real) data sets, not just on samples of data.

Two phases:

• Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.

• Prediction: apply the rules to a new data set and match the rules to make predictions.

Rule Induction Example

Domain  Month-Zone  Request  Time          Class
edu     mid         http     after-office  a
edu     mid         http     office        a
net     mid         http     after-office  b
com     start       http     after-office  b
com     end         normal   after-office  b
com     end         normal   office        a
net     end         normal   office        b
edu     start       http     after-office  b
edu     end         normal   after-office  a
com     start       normal   after-office  a
edu     start       normal   office        a
net     start       http     office        a
net     mid         normal   after-office  a
com     start       http     office        b

Training Set
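The two phases (rule discovery, then prediction) can be illustrated on this training set with a very simple learner. The sketch below uses a one-attribute rule learner in the style of OneR as a stand-in; the deck does not say which induction algorithm the project used, and the names discover_rules and predict are hypothetical. The rows are the training set as read from the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

ATTRIBUTES = ("domain", "month_zone", "request", "time")

# (domain, month_zone, request, time, class)
TRAINING_SET: List[Tuple[str, str, str, str, str]] = [
    ("edu", "mid", "http", "after-office", "a"),
    ("edu", "mid", "http", "office", "a"),
    ("net", "mid", "http", "after-office", "b"),
    ("com", "start", "http", "after-office", "b"),
    ("com", "end", "normal", "after-office", "b"),
    ("com", "end", "normal", "office", "a"),
    ("net", "end", "normal", "office", "b"),
    ("edu", "start", "http", "after-office", "b"),
    ("edu", "end", "normal", "after-office", "a"),
    ("com", "start", "normal", "after-office", "a"),
    ("edu", "start", "normal", "office", "a"),
    ("net", "start", "http", "office", "a"),
    ("net", "mid", "normal", "after-office", "a"),
    ("com", "start", "http", "office", "b"),
]


def discover_rules(rows) -> Tuple[str, Dict[str, str], int]:
    """Phase 1: pick the single attribute whose value -> class rules err least."""
    best = None
    for index, attribute in enumerate(ATTRIBUTES):
        counts: Dict[str, Counter] = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # One rule per attribute value: predict its majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attribute, rules, errors)
    return best


def predict(rule_set, record: Dict[str, str]):
    """Phase 2: match a new record against the discovered rules."""
    attribute, rules, _errors = rule_set
    return rules.get(record[attribute])  # None if the value was never seen


rule_set = discover_rules(TRAINING_SET)
print(rule_set[0], rule_set[1])
print(predict(rule_set, {"domain": "edu", "month_zone": "end",
                         "request": "http", "time": "office"}))
```

A single-attribute rule set is deliberately crude; it only shows how discovered rules are stored and then matched against new requests.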

Results:

Statistical approach performance:

• Slow to conform to changes

• Good performance with general access patterns

NAT rule-based performance:

• High accuracy till now

Future work required:

• Multi-level association

• Better feature selection

• Scalable distributed tool

Project: Comments

Problems faced:

• Data cleaning

• Learning curve

Future applications:

• Network processors

• Intelligent parking slots

References

• www.powerfulforces.org.nz/Papers/Lim.pdf

• Mike Perkowitz, Oren Etzioni. Towards Adaptive Web Sites: Conceptual Framework & Case Study.

• Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan. Web Usage Mining: Discovery and Applications of Usage Patterns.

• WUM: A Web Utilization Miner (URL: http://wum.wiwi.hu-berlin.de/index.html)

• WEKA: Machine Learning Algorithms in Java.

• Myra Spiliopoulou, Carsten Pohle, Lukas C. Faulstich. Improving Effectiveness of Web Site with Web Usage Mining.
