Web-mining project
Introduction and Background
Businesses use the World Wide Web to conduct business, and they generate and collect large volumes of data in daily operations.
• Case: www.amazon.com. Analysis: the solution is not the same as for a grocery store.
• Case: mati.eas.asu.edu. Objective of this website: no sales, so no buying information.
Mati: Architecture
[Figure: architecture diagram; labels: www, User, NAT, servers]
Request Profiling
Definition: the process by which information is gathered, organized, and interpreted to create a summary or description of the user.
Approaches:
• web server log
• ask the user (registration & feedback)
• pre-established
Log file types:
• Access log
• Referrer log
• Agent log
• Error log
Data Sources:
• Server-level collection: the server stores data about the requests made by clients, so the data generally concern just one source.
• Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.
• Proxy-level collection: information is stored at the proxy side, so the Web data cover several Websites, but only for users whose Web clients pass through the proxy.
Web Server Access Logs
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . . . . . . . .
• Typical Data in a Server Access Log
Access Log Format: IP address, userid, time, method, url, protocol, status, size
Example: mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html
Other Server Logs: referrer logs, agent logs
Request Profiling
Web Server Log:
• client IP address or hostname
• user id ("-" if anonymous)
• access time
• HTTP request method (e.g. GET, POST, HEAD ..)
• path of the resource on the Web server (URL)
• the protocol (e.g. HTTP/1.0, HTTP/1.1 ..)
• the status code (e.g. 404 for Not Found ..)
• the number of bytes transmitted
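The fields listed above can be pulled out of a raw access-log line with a regular expression. The following is a minimal sketch, assuming the line layout of the sample entries shown earlier (host, user id, a placeholder field, bracketed timestamp, quoted request, status, optional size). The names `LINE_RE` and `parse_line` are illustrative and not part of the project.

```python
import re
from datetime import datetime

# Pattern modelled on the sample log lines; the byte count is optional
# because some entries (e.g. the 302 redirect) do not carry one.
LINE_RE = re.compile(
    r'^(?P<host>\S+) (?P<userid>\S+) \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?\s*$'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # truncated or malformed entries are skipped
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"]) if record["size"] is not None else None
    return record

sample = 'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
record = parse_line(sample)
print(record["host"], record["userid"], record["url"], record["status"], record["size"])
```

Lines that are cut off in the log (such as the truncated Count.cgi request) return None here, which is one simple way to set them aside during cleaning.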
Figure 4. web usage mining research projects and products
Approaches:
• Concept 1: Prepared Log + Statistical
• Concept 2: Prepared Log + Mining
Preprocessing:
• Integrate logs: logs are only meant for post-mortem analysis
• Clean logs: eliminate outliers
Typical Web Usage Mining Preprocessing
Transaction Identification
Main Questions:
• how to identify unique users
• how to identify/define a user transaction
Problems:
• user ids are often suppressed due to security concerns
• individual IP addresses are sometimes hidden behind proxy servers
• client-side & proxy caching makes server log data less reliable
Standard Solutions/Practices:
• user registration: is it practical?
• client-side cookies: not foolproof
• cache busting: increases network traffic
A Heuristic Approach
Identifying User Sessions:
• use IP, agent, and OS fields as key attributes;
• use client-side cookies & unique user ids, if available;
• use session time-outs;
• use synchronized referrer log entries and time stamps to expand user paths belonging to a session;
• path completion to infer cached references. EX: expanding a session A ==> B ==> C by an access pair (B ==> D) results in: A ==> B ==> C ==> B ==> D;
• to disambiguate paths, sessions are expanded based on page attributes (size, type), reference length, and no. of back references required to complete the path.
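Two of the heuristics above, session time-outs and path completion, can be sketched in a few lines. This is an illustrative sketch, not the project's implementation. The function names are hypothetical, and the 30-minute default time-out is an assumption (the slides do not give a value). Path completion here simply backtracks to the most recent occurrence of the referrer and ignores the page attributes mentioned above.

```python
def split_by_timeout(requests, timeout=1800):
    """Split one user's (time_in_seconds, page) requests into sessions.

    A new session starts whenever the gap between consecutive requests
    exceeds `timeout` seconds (1800 is an assumed default).
    """
    sessions = []
    last_time = None
    for t, page in sorted(requests):
        if last_time is None or t - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_time = t
    return sessions

def complete_path(session, referrer, page):
    """Append `page`, inserting back references if the user must have
    navigated back (through cached pages) to reach `referrer`."""
    if session and session[-1] != referrer and referrer in session:
        # Index of the most recent occurrence of the referrer.
        idx = len(session) - 1 - session[::-1].index(referrer)
        back = session[idx:-1][::-1]  # pages revisited on the way back
        return session + back + [page]
    return session + [page]

# The example from the text: A ==> B ==> C expanded by (B ==> D).
print(" ==> ".join(complete_path(["A", "B", "C"], "B", "D")))
print(split_by_timeout([(0, "A"), (60, "B"), (4000, "C")]))
```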
Example: Session Inference with Referrer Log
No.  IP           Time      URL  Referrer  Agent
1    www.aol.com  08:30:00  A    #         Mozilla/2.0; AIX 4.1.4
2    www.aol.com  08:30:01  B    E         Mozilla/2.0; AIX 4.1.4
3    www.aol.com  08:30:02  C    B         Mozilla/2.0; AIX 4.1.4
4    www.aol.com  08:30:01  B    #         Mozilla/2.0; Win 95
5    www.aol.com  08:30:03  C    B         Mozilla/2.0; Win 95
6    www.aol.com  08:30:04  F    #         Mozilla/2.0; Win 95
7    www.aol.com  08:30:04  B    A         Mozilla/2.0; AIX 4.1.4
8    www.aol.com  08:30:05  G    B         Mozilla/2.0; AIX 4.1.4

Identified Sessions:
S1: # ==> A ==> B ==> G from references 1, 7, 8
S2: E ==> B ==> C from references 2, 3
S3: # ==> B ==> C from references 4, 5
S4: # ==> F from reference 6
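One simple rule that reproduces the sessions identified above is to key each entry by (IP, agent) and attach it to the most recent session with the same key whose last page equals the entry's referrer. Otherwise a new session is started. The sketch below assumes that rule. The names `Entry` and `infer_sessions` are hypothetical, and "#" is taken to mean "no referrer", as in the table.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "ref_id ip time url referrer agent")

AIX = "Mozilla/2.0; AIX 4.1.4"
WIN = "Mozilla/2.0; Win 95"
LOG = [
    Entry(1, "www.aol.com", "08:30:00", "A", "#", AIX),
    Entry(2, "www.aol.com", "08:30:01", "B", "E", AIX),
    Entry(3, "www.aol.com", "08:30:02", "C", "B", AIX),
    Entry(4, "www.aol.com", "08:30:01", "B", "#", WIN),
    Entry(5, "www.aol.com", "08:30:03", "C", "B", WIN),
    Entry(6, "www.aol.com", "08:30:04", "F", "#", WIN),
    Entry(7, "www.aol.com", "08:30:04", "B", "A", AIX),
    Entry(8, "www.aol.com", "08:30:05", "G", "B", AIX),
]

def infer_sessions(log):
    """Group log entries into sessions using IP, agent, and referrer."""
    sessions = []
    for entry in sorted(log, key=lambda e: e.time):  # stable sort by time
        key = (entry.ip, entry.agent)
        target = None
        if entry.referrer != "#":
            # Look for the newest session of this user ending at the referrer.
            for session in reversed(sessions):
                if session["key"] == key and session["path"][-1] == entry.referrer:
                    target = session
                    break
        if target is None:
            target = {"key": key, "path": [entry.referrer], "refs": []}
            sessions.append(target)
        target["path"].append(entry.url)
        target["refs"].append(entry.ref_id)
    return [(s["path"], s["refs"]) for s in sessions]

for number, (path, refs) in enumerate(infer_sessions(LOG), start=1):
    print(f"S{number}: {' ==> '.join(path)} from references {refs}")
```

Matching only on the last page keeps the sketch short. A fuller version would fall back to path completion when the referrer appears earlier in a session.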
Example
[Figure: site graph with pages A through T]
USER1: A B F O G A D
USER2: A B C J
USER3: L R
Concept 1: Binary exponential backoff
[Figure: channel timeline; labels: Frame, Contention Interval, Contention Slot, idle]
Binary exponential backoff algorithm:
• after 1st collision, wait 0 or 1 slots, at random.
• after 2nd collision, wait 0, 1, 2, 3 slots at random.
• etc up to 1023 slots.
• after 16 collisions, exception.
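The backoff rule above can be written as a short function. This is a minimal sketch of the rule as the bullets state it: the window doubles with each collision, is capped at 1024 slots (0 to 1023), and the sender gives up at the 16th collision. The function and constant names are illustrative.

```python
import random

MAX_EXPONENT = 10   # window stops growing at 2**10 = 1024 slots (0..1023)
MAX_COLLISIONS = 16  # at this many collisions the sender gives up

def backoff_slots(collisions, rng=random):
    """Return how many slots to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions; giving up")
    exponent = min(collisions, MAX_EXPONENT)
    return rng.randrange(2 ** exponent)  # uniform over 0 .. 2**exponent - 1

rng = random.Random(42)  # seeded only to make the demo repeatable
for n in (1, 2, 3, 10, 15):
    print(f"after collision {n}: wait {backoff_slots(n, rng)} slots")
```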
Concept 1:
• Similarly, from the current logs: rank accessed pages
• Use binary backoff to change the ranks
Concept 2:
• Use the NAT as level 1 filtering
• Filter the traffic as per request pattern
• Users can reach the same page, but with the option to choose further
Rule Based Prediction
Rule Induction
Rule Induction (rule-based prediction)
• We first generate a set of rules from a data warehouse, then use them to predict values for new data items.
• It works much better on larger (and real) data sets, not just on samples of data.
Two phases:
• Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
• Prediction: apply the rules to a new data set and match the rules to make predictions.
Rule Induction Example
Domain  Month-Zone  Request  Time          Class
edu     mid         http     after-office  a
edu     mid         http     office        a
net     mid         http     after-office  b
com     start       http     after-office  b
com     end         normal   after-office  b
com     end         normal   office        a
net     end         normal   office        b
edu     start       http     after-office  b
edu     end         normal   after-office  a
com     start       normal   after-office  a
edu     start       normal   office        a
net     start       http     office        a
net     mid         normal   after-office  a
com     start       http     office        b
Training Set
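To make the two phases concrete, here is a sketch that runs a very simple rule learner, OneR (one rule per value of a single attribute), over the training set above. OneR is swapped in for illustration only. The slides do not say which induction algorithm the project used, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["Domain", "Month-Zone", "Request", "Time"]

# Training set transcribed from the table; the last column is the class.
TRAINING_SET = [
    ("edu", "mid", "http", "after-office", "a"),
    ("edu", "mid", "http", "office", "a"),
    ("net", "mid", "http", "after-office", "b"),
    ("com", "start", "http", "after-office", "b"),
    ("com", "end", "normal", "after-office", "b"),
    ("com", "end", "normal", "office", "a"),
    ("net", "end", "normal", "office", "b"),
    ("edu", "start", "http", "after-office", "b"),
    ("edu", "end", "normal", "after-office", "a"),
    ("com", "start", "normal", "after-office", "a"),
    ("edu", "start", "normal", "office", "a"),
    ("net", "start", "http", "office", "a"),
    ("net", "mid", "normal", "after-office", "a"),
    ("com", "start", "http", "office", "b"),
]

def one_r(rows, attributes):
    """Rule discovery: pick the attribute whose value -> majority-class
    rules make the fewest errors on the training rows."""
    best = None
    for i, name in enumerate(attributes):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[i]][row[-1]] += 1
        rules = {value: max(c, key=c.get) for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:  # first attribute wins ties
            best = (name, rules, errors)
    return best

def predict(model, row, attributes):
    """Prediction: match a new row against the discovered rules."""
    name, rules, _ = model
    return rules.get(row[attributes.index(name)])

model = one_r(TRAINING_SET, ATTRIBUTES)
print("attribute:", model[0], "training errors:", model[2])
print("prediction:", predict(model, ("edu", "end", "http", "office"), ATTRIBUTES))
```

On a training set this small, several attributes tie on training error, so the chosen rule set should be read as an illustration of the discovery and prediction phases rather than as a reliable model.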
Results:
Statistical approach performance:
• Slow to conform to changes
• Good performance with general access patterns
NAT – Rule Based performance:
• High accuracy so far
Future work required:
• Multi Level Association
• Better Feature Selection
• Scalable Distributed tool
Project: Comments
Problems faced:
• Data Cleaning
• Learning Curve
Future Applications:
• Network processors
• Intelligent Parking slots
Reference
• www.powerfulforces.org.nz/Papers/Lim.pdf
• Towards Adaptive Web Sites: Conceptual Framework & Case Study. Mike Perkowitz, Oren Etzioni
• Web Usage Mining: Discovery and Applications of Usage Patterns. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan
• WUM: A Web Utilization Miner (URL: http://wum.wiwi.hu-berlin.de/index.html)
• WEKA: Machine Learning Algorithms in Java
• Improving the Effectiveness of a Web Site with Web Usage Mining. Myra Spiliopoulou, Carsten Pohle, Lukas C. Faulstich