web-mining project
![Page 2: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/2.jpg)
Introduction and Background
• Organizations use the World Wide Web to conduct business, and they generate and collect large volumes of data in daily operations.
• Case: www.amazon.com. Analysis: the solution is not the same as for a grocery store.
• Case: mati.eas.asu.edu. Objective of this website: there are no sales, so there is no buying information.
![Page 3: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/3.jpg)
Mati: Architecture
[Figure: architecture diagram with the labels "www", "User", "NAT", and "servers"]
![Page 4: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/4.jpg)
Request Profiling
Definition: the process by which information is gathered, organized, and interpreted to create a summary or description of the user.
Approaches:
• web server log
• asking the user (registration and feedback)
• pre-established
![Page 5: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/5.jpg)
![Page 6: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/6.jpg)
Log file type
• Access log
• Referrer log
• Agent log
• Error log
![Page 7: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/7.jpg)
Data Sources:
• Server-level collection: the server stores data about the requests made by clients, so the data generally concern just one source.
• Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.
• Proxy-level collection: information is stored at the proxy, so the data cover several websites, but only for users whose web clients pass through the proxy.
![Page 8: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/8.jpg)
![Page 9: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/9.jpg)
Web Server Access Logs
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat&dd=C&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . .
• Typical Data in a Server Access Log
Access log format: IP address, userid, time, method, URL, protocol, status, size
Example: mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html
Other server logs: referrer logs, agent logs
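As a minimal sketch of how the access log format above can be read programmatically, the following Python splits one log line into the listed fields. The regular expression, the `LogEntry` type, and the `parse_line` helper are illustrative assumptions, not part of the project; the field order (host, userid, a third field that is skipped) follows the sample lines on this slide.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """One access log record (hypothetical container for this sketch)."""
    host: str
    userid: str
    time: datetime
    method: str
    url: str
    protocol: str
    status: int
    size: Optional[int]


# host userid <skipped field> [time] "method url protocol" status [size]
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<userid>\S+) \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?'
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None when the line does not match the format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        host=m["host"],
        userid=m["userid"],
        time=datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=m["method"],
        url=m["url"],
        protocol=m["protocol"],
        status=int(m["status"]),
        size=int(m["size"]) if m["size"] is not None else None,
    )


entry = parse_line(
    'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
)
```

Lines that are truncated or malformed return `None` instead of raising, so a caller can count and skip them during preprocessing.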
![Page 10: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/10.jpg)
Request Profiling
Web server log fields:
• client IP address or hostname
• user id ("-" if anonymous)
• access time
• HTTP request method (e.g. GET, POST, HEAD)
• path of the resource on the web server (URL)
• the protocol (e.g. HTTP/1.0, HTTP/1.1)
• the status code (e.g. 404 for Not Found)
• the number of bytes transmitted
![Page 11: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/11.jpg)
![Page 12: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/12.jpg)
Figure 4. web usage mining research projects and products
![Page 13: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/13.jpg)
Approaches:
• Concept 1: Prepared Log + Statistical
• Concept 2: Prepared Log + Mining
![Page 14: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/14.jpg)
Preprocessing:
• Integrate logs: logs are only meant for post-mortem analysis
• Clean logs: eliminate outliers
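The slide does not say which records count as outliers. As one possible reading, the sketch below drops failed requests and requests for embedded resources such as images, which is a common cleaning step in web usage mining. The suffix list, the status range, and the `is_page_view` and `clean` helpers are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of resource types that do not represent a page view.
IGNORED_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}


def is_page_view(url: str, status: int) -> bool:
    """Keep successful or redirected requests that are not embedded resources."""
    path = urlsplit(url).path                      # strip any query string
    suffix = posixpath.splitext(path)[1].lower()   # e.g. ".gif"
    return 200 <= status < 400 and suffix not in IGNORED_SUFFIXES


def clean(requests):
    """Filter a list of (url, status) pairs down to page views."""
    return [(url, status) for url, status in requests if is_page_view(url, status)]


# URLs taken from the sample access log; the 404 entry is made up for the example.
sample = [
    ("/", 200),
    ("/images/backgnds/paper.gif", 200),
    ("advisor", 302),
    ("missing.html", 404),
]
kept = clean(sample)
```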
![Page 15: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/15.jpg)
Typical Web Usage Mining Preprocessing
![Page 16: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/16.jpg)
Transaction Identification
Main questions:
• how to identify unique users
• how to identify/define a user transaction
Problems:
• user ids are often suppressed due to security concerns
• individual IP addresses are sometimes hidden behind proxy servers
• client-side and proxy caching makes server log data less reliable
Standard solutions/practices:
• user registration: is it practical?
• client-side cookies: not foolproof
• cache busting: increases network traffic
![Page 17: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/17.jpg)
A Heuristic Approach
Identifying user sessions:
• use IP, agent, and OS fields as key attributes;
• use client-side cookies and unique user ids, if available;
• use session time-outs;
• use synchronized referrer log entries and time stamps to expand user paths belonging to a session;
• use path completion to infer cached references. Example: expanding a session A ==> B ==> C by an access pair (B ==> D) results in A ==> B ==> C ==> B ==> D;
• to disambiguate paths, sessions are expanded based on page attributes (size, type), reference length, and the number of back references required to complete the path.
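The path completion step can be sketched as follows. This is a simplified illustration, assuming the user stepped back through cached pages to the most recent visit of the referrer; the `complete_path` helper is hypothetical, and it does not implement the disambiguation by page attributes or reference length mentioned above.

```python
def complete_path(path, referrer, page):
    """Append `page` to a session path, inserting inferred back references.

    If the referrer is not the last page of the path, the pages between the
    last page and the most recent visit to the referrer are assumed to have
    been revisited from the cache (so they never reached the server log).
    """
    path = list(path)
    if path and path[-1] != referrer and referrer in path:
        # Index of the most recent visit to the referrer.
        last_seen = len(path) - 1 - path[::-1].index(referrer)
        # Walk back from the page before the current one down to the referrer.
        path.extend(reversed(path[last_seen:-1]))
    path.append(page)
    return path


# The example from the slide: session A, B, C extended by the access pair (B, D).
completed = complete_path(["A", "B", "C"], "B", "D")
```

Here `completed` is `["A", "B", "C", "B", "D"]`, matching the expansion given in the example, with one back reference needed to complete the path.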
![Page 18: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/18.jpg)
Example: Session Inference with Referrer Log
#  IP           Time      URL  Referrer  Agent
1  www.aol.com  08:30:00  A    #         Mozilla/2.0; AIX 4.1.4
2  www.aol.com  08:30:01  B    E         Mozilla/2.0; AIX 4.1.4
3  www.aol.com  08:30:02  C    B         Mozilla/2.0; AIX 4.1.4
4  www.aol.com  08:30:01  B    #         Mozilla/2.0; Win 95
5  www.aol.com  08:30:03  C    B         Mozilla/2.0; Win 95
6  www.aol.com  08:30:04  F    #         Mozilla/2.0; Win 95
7  www.aol.com  08:30:04  B    A         Mozilla/2.0; AIX 4.1.4
8  www.aol.com  08:30:05  G    B         Mozilla/2.0; AIX 4.1.4
Identified sessions:
S1: # ==> A ==> B ==> G from references 1, 7, 8
S2: E ==> B ==> C from references 2, 3
S3: # ==> B ==> C from references 4, 5
S4: # ==> F from reference 6
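One way to arrive at these four sessions is sketched below. It is an illustrative heuristic, not necessarily the procedure used in the project: records are grouped by (IP, agent), and each record is attached to the most recently updated session of the same user that already contains its referrer; otherwise it opens a new session. The `infer_sessions` function and the record layout are assumptions.

```python
# (reference number, IP, time, URL, referrer, agent), as in the table above.
RECORDS = [
    (1, "www.aol.com", "08:30:00", "A", "#", "Mozilla/2.0; AIX 4.1.4"),
    (2, "www.aol.com", "08:30:01", "B", "E", "Mozilla/2.0; AIX 4.1.4"),
    (3, "www.aol.com", "08:30:02", "C", "B", "Mozilla/2.0; AIX 4.1.4"),
    (4, "www.aol.com", "08:30:01", "B", "#", "Mozilla/2.0; Win 95"),
    (5, "www.aol.com", "08:30:03", "C", "B", "Mozilla/2.0; Win 95"),
    (6, "www.aol.com", "08:30:04", "F", "#", "Mozilla/2.0; Win 95"),
    (7, "www.aol.com", "08:30:04", "B", "A", "Mozilla/2.0; AIX 4.1.4"),
    (8, "www.aol.com", "08:30:05", "G", "B", "Mozilla/2.0; AIX 4.1.4"),
]


def infer_sessions(records):
    """Return a list of (path, reference numbers), in order of session creation."""
    sessions = []
    for number, ip, time, url, referrer, agent in records:
        key = (ip, agent)
        # Sessions of the same user that already contain the referring page.
        candidates = [
            s for s in sessions
            if s["key"] == key and referrer != "#" and referrer in s["path"]
        ]
        if candidates:
            # Prefer the session that was updated most recently.
            session = max(candidates, key=lambda s: s["last_time"])
            session["path"].append(url)
            session["refs"].append(number)
            session["last_time"] = time
        else:
            # No matching session: start a new one at the referrer ("#" = none).
            sessions.append(
                {"key": key, "path": [referrer, url], "refs": [number], "last_time": time}
            )
    return [(s["path"], s["refs"]) for s in sessions]


for index, (path, refs) in enumerate(infer_sessions(RECORDS), start=1):
    print(f"S{index}: {' ==> '.join(path)} from references {refs}")
```

Reference 8 (G, referred by B) is the interesting case: two AIX sessions contain B, and the sketch picks the one updated at 08:30:04 rather than the one last updated at 08:30:02.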
![Page 19: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/19.jpg)
Example
[Figure: site link structure over pages A to T]
USER1: A B F O G A D
USER2: A B C J
USER3: L R
![Page 20: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/20.jpg)
Concept 1: Binary exponential backoff
[Figure: channel timeline showing frames, a contention interval made up of contention slots, and an idle period]
Binary exponential backoff algorithm:
• after 1st collision, wait 0 or 1 slots, at random.
• after 2nd collision, wait 0, 1, 2, 3 slots at random.
• and so on: after the i-th collision, wait a random number of slots between 0 and 2^i - 1, up to a maximum of 1023 slots.
• after 16 collisions, exception.
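The waiting rule listed above can be sketched in a few lines of Python. The constant names and the `backoff_slots` function are illustrative assumptions; the limits (1023 slots, failure at 16 collisions) are the ones stated on the slide.

```python
import random

MAX_EXPONENT = 10    # 2**10 - 1 = 1023 slots is the largest waiting window
MAX_COLLISIONS = 16  # give up once this many collisions have occurred


def backoff_slots(collisions: int) -> int:
    """Return how many slots to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions, reporting failure")
    exponent = min(collisions, MAX_EXPONENT)
    # Uniform choice from 0 .. 2**exponent - 1.
    return random.randrange(2 ** exponent)


# After the 1st collision the result is 0 or 1; after the 2nd it is 0..3.
wait_after_first = backoff_slots(1)
wait_after_second = backoff_slots(2)
```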
![Page 21: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/21.jpg)
Concept 1:
Similarly, from the current logs:
• rank accessed pages
• use binary backoff to change the ranks
![Page 22: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/22.jpg)
Concept 2:
• Use the NAT as level 1 filtering
• Filter the traffic as per request pattern
• Users can reach the same page, but with the option to choose further
• Rule-based prediction
![Page 23: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/23.jpg)
Rule Induction
Rule induction (rule-based prediction):
• We first generate a set of rules from a data warehouse, then use them to predict values for new data items.
• It works much better on larger (and real) data sets, not just on samples of data.
Two phases:
• Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
• Prediction: apply the rules to a new data set and match the rules to make predictions.
![Page 24: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/24.jpg)
Rule Induction Example
Training Set:
Domain  Month-Zone  Request  Time          Class
edu     mid         http     after-office  a
edu     mid         http     office        a
net     mid         http     after-office  b
com     start       http     after-office  b
com     end         normal   after-office  b
com     end         normal   office        a
net     end         normal   office        b
edu     start       http     after-office  b
edu     end         normal   after-office  a
com     start       normal   after-office  a
edu     start       normal   office        a
net     start       http     office        a
net     mid         normal   after-office  a
com     start       http     office        b
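To make the two phases concrete, here is a minimal sketch that runs a very simple rule learner over this training set. It uses a 1R-style learner (one rule per value of a single attribute), chosen here only for brevity; the slides do not say which induction algorithm the project used, and the `discover_rules` and `predict` names are assumptions.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["Domain", "Month-Zone", "Request", "Time"]

# Training set transcribed from the slide; the last field is the class.
TRAINING_SET = [
    ("edu", "mid", "http", "after-office", "a"),
    ("edu", "mid", "http", "office", "a"),
    ("net", "mid", "http", "after-office", "b"),
    ("com", "start", "http", "after-office", "b"),
    ("com", "end", "normal", "after-office", "b"),
    ("com", "end", "normal", "office", "a"),
    ("net", "end", "normal", "office", "b"),
    ("edu", "start", "http", "after-office", "b"),
    ("edu", "end", "normal", "after-office", "a"),
    ("com", "start", "normal", "after-office", "a"),
    ("edu", "start", "normal", "office", "a"),
    ("net", "start", "http", "office", "a"),
    ("net", "mid", "normal", "after-office", "a"),
    ("com", "start", "http", "office", "b"),
]


def discover_rules(rows):
    """Phase 1: pick the attribute whose value -> class rules make fewest errors."""
    best = None
    for index in range(len(ATTRIBUTES)):
        class_counts = defaultdict(Counter)
        for row in rows:
            class_counts[row[index]][row[-1]] += 1
        # One rule per attribute value: predict the majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in class_counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in class_counts.values())
        if best is None or errors < best[2]:
            best = (index, rules, errors)
    return best


def predict(index, rules, record, default="a"):
    """Phase 2: match a new record against the discovered rules."""
    return rules.get(record[index], default)


index, rules, errors = discover_rules(TRAINING_SET)
print(ATTRIBUTES[index], rules, errors)
print(predict(index, rules, ("edu", "end", "http", "office")))
```

On this training set, Domain and Request tie at 5 misclassified rows out of 14, and the sketch keeps the first of the two. A single-attribute learner is deliberately crude; it only illustrates the discover-then-predict split.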
![Page 25: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/25.jpg)
Results:
Statistical approach performance:
• slow to conform to changes
• good performance with general access patterns
NAT rule-based performance:
• high accuracy so far
Future work required:
• multi-level association
• better feature selection
• scalable distributed tool
![Page 26: web-mining project](https://reader033.vdocument.in/reader033/viewer/2022061606/5538ea99550346bb318b48cc/html5/thumbnails/26.jpg)
Project: Comments
Problems faced:
• data cleaning
• learning curve
Future applications:
• network processors
• intelligent parking slots