Web Mining: An Overview Of Web Analytics with Examples
Donghui Wu, Ph.D.
Oracle Corporation
April 16th 2003
Agenda
• Web Mining Overview
• Basic Web Analysis Problems
• Data Warehouse Solutions
• Oracle 9iAS Clickstream Intelligence Demo
– Site Configuration Excerpts
– Site Basic Statistics Examples
– Business Scenario Examples
Web Mining
Web Mining, generally speaking, is the application of data mining principles and processes to the Web domain. It may tackle the World Wide Web as a whole, or focus on a particular Web site (server) or group of Web sites (servers).
In this talk, we limit the scope to Web usage and pattern analysis or, more specifically, Web Log Mining at the enterprise (Web site) level. In industry, it is also referred to as Web Analytics.
Web Analytics
• Web Analytics is the monitoring and reporting of Web site usage so that enterprises can better understand the complex interactions between Web visitor actions and Web site offers, and leverage that insight to optimize the site for increased customer loyalty and sales.
– From Web Analytics: Making Business Sense of Online Behavior, Aberdeen Group, June 2002
Web Mining and Privacy
• Privacy is always a concern in data mining projects.
• It is a major concern when analyzing or mining visitors’ online behavior, in particular for visitor/user profiling.
• Usually only aggregated information is analyzed, not that of individual visitors or users.
Web Log Data Sources (1)
• Web Server Log
– This is the log kept at the Web server. It is easy to get and the most widely analyzed.
– It is logged at the destination. The analysis is about a particular Web server or servers.
– One Web server can host many Web sites, and one Web site may be served by multiple Web servers.
• Proxy Server Log
– If the Web connection goes through a proxy, every request is logged at the proxy server as well.
– It is logged at the origin. The analysis is about a group of users, e.g. all users within a company.
Web Log Data Sources (2)
• Client Side Browser Log
– Embedded client-side collection. It requires sending simple JavaScript with the response to the browser, which collects browser information and visitor client-side activity, e.g. mouse movement, and sends it to a collector server for analysis.
• Application Log
– Web applications usually have their own logs, at various levels of detail and for various purposes.
Server Log, Proxy Log, and Browser Log
Web Server Log Analysis and Mining
• From now on, we limit our subject to Web Server Log Analysis and Mining only.
• The emphasis is on Enterprise Web Analytics.
• We will use a fictional site, drugdepo.com, as the sample site, and Oracle 9iAS Clickstream Intelligence to produce the sample analysis.
Web Analytics Task Categories
• Site Activity and Operation
– Site traffic, performance and status
• Usage Mining
– Visitor behavior analysis, referrer analysis, path analysis
• User Profiling/Clustering
– Visitor profiling, visitor segmentation
– User profiling, user segmentation
Web Analytics Tasks for Business Users
• Content effectiveness evaluation
• Online marketing campaign analysis
• Target marketing analysis
• Personalization and recommendation
• Cross-sell and up-sell opportunities
• Many more…
Data Mining Techniques in Web Analytics
The following data mining techniques may be applied to solve those problems:
• Association Rule Mining
• Clustering / Segmentation
– Visitor / User
– Pages
• Visitor/User Profiling
Web Mining Difficulties
• Data size is huge
– For a site with 1 million hits per day, the raw log file can be 500 MB to 1 GB per day, depending on the Web server configuration.
• Bad records
– There are many bad records due to server errors.
• Lack of exact information
– In many cases, heuristics have to be applied.
Web Server Log Format
• NCSA Common Log Format
• NCSA Extended Common Log Format
• W3C Extended Log File Format
For more information, see W3C website
NCSA Common Log Format
The following is a line from an Apache server log. It is in NCSA Extended Common Log Format, i.e. the Common Log Format fields followed by the referrer and browser fields, and has the following fields separated by a space.
Host Ident Authuser Time Request Status BytesSent Referrer Browser
24.69.48.18 - 709697D0CE694757E034080020CB1B7C [01/Nov/2000:23:59:05 -0800] "GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
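As a rough illustration, the sample line above could be parsed with a regular expression along these lines. This is a minimal sketch, not what any particular product does; the pattern, the `parse_line` helper and the output field names are assumptions made for this example.

```python
import re
from datetime import datetime

# One named group per field, in the order listed above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None for a bad record."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    parts = rec["request"].split()
    if len(parts) != 3:  # expect: method, URL, protocol
        return None
    rec["method"], rec["url"], rec["protocol"] = parts
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

SAMPLE = ('24.69.48.18 - 709697D0CE694757E034080020CB1B7C '
          '[01/Nov/2000:23:59:05 -0800] '
          '"GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 '
          '"-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')

record = parse_line(SAMPLE)
print(record["host"], record["status"], record["url"])
```

Lines that do not match the pattern come back as `None`, which is one simple way to set bad records aside.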
Dynamic Page and Parameters
• In the previous example, the requested page is a static page.
• For dynamic pages, e.g. ASP, JSP, etc., the request has two parts: the static URL stem and the query, separated by “?”.
• The query string consists of “parameter=value” pairs.
• Parameters provide detailed information about the request.
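A minimal sketch of splitting a dynamic request into its stem and parameters, using Python's standard `urllib.parse`. The request URL and the `split_request` helper are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def split_request(url):
    """Split a requested URL into (stem, {parameter: value})."""
    parts = urlsplit(url)
    # parse_qs returns a list per parameter; keep the first value only.
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return parts.path, params

# Hypothetical dynamic page request.
stem, params = split_request("/catalog/item.jsp?category=allergy&id=42")
print(stem)    # /catalog/item.jsp
print(params)  # {'category': 'allergy', 'id': '42'}
```

A static page simply yields an empty parameter dict.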
Web Log Mining Task Types
• Web Log Analyzer
– Provides simple statistics, e.g. # of visitors, # of page views, # of sessions, etc. for a given time period
• Web Log Mining
– Web Usage Mining and Pattern Analysis
• E-commerce, Personalization and CRM
– Integrate and mine data across the enterprise
Related Terms
• Hits
– A hit is a URL request in the server log
• Page Views (Page Impressions)
– A page view may require multiple requests, e.g. several .gif or .jpeg requests plus an .html request
• Data Sent
• Visitors (identified and unidentified visitors)
• Users (Authenticated Visitors)
• Sessions
Data Filtering
For data analysis purposes, the following data preparation steps are often applied:
• Remove .gif, .jpeg and other non-essential requests from the raw data
• Apply other filtering as needed, based on the task at hand
• Apply page construction rules to consolidate records
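The first filtering step can be sketched as follows. The extension list beyond .gif and .jpeg, the `is_page_view` helper and the sample requests are assumptions for illustration; a real site would tune the list to its own content.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions treated as non-essential (assumed list, adjust per site).
NON_ESSENTIAL = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}

def is_page_view(url):
    """True if the request should count as a page view rather than a supporting hit."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in NON_ESSENTIAL

# Hypothetical requests made while loading two pages.
requests = [
    "/index.html",
    "/images/logo.gif",
    "/images/banner.jpeg",
    "/products/list.jsp?cat=baby",
    "/style.css",
]
page_views = [url for url in requests if is_page_view(url)]
print(len(requests), "hits,", len(page_views), "page views")
```

This also shows why hits and page views differ: five hits here reduce to two page views.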
Basic Processing
• Parse the log and resolve the following:
– Client IP address
– Visitor ID
– User ID
– Browser and OS
– Request
– Session
Basic Tasks
For any Web Analytics task, you need to resolve the following before any analysis is possible:
• Visitor identification
• User identification / matching
• Session Construction
• Path Completion
Visitor Identification Methods
• Client Hostname or IP Address only
• IP Address + Browser String
• Query String Parameter
• Cookie Value
• Visitor Field
IP Method Limitations
• Single IP / Multiple Users
– A single proxy server can serve many users.
• Multiple IP / Single User
– A single user may use multiple machines over time, or even in one session. For example, AOL dynamically assigns an IP address to every request.
• Always configure your Web server to use a cookie or query string parameter if possible.
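One possible way to combine these methods is to prefer a cookie value and fall back to IP address + browser string only when no cookie is present. This is a sketch under assumptions: the record layout (`host`, `browser`, `cookie` keys), the `visitor_id` helper and the sample values are hypothetical.

```python
import hashlib

def visitor_id(record):
    """Derive a visitor ID: cookie if available, else a hash of IP + browser string."""
    cookie = record.get("cookie")
    if cookie:
        return "cookie:" + cookie
    # Fallback heuristic: it still merges users behind one proxy that share
    # a browser string, and splits one user who changes IP address.
    key = record["host"] + "|" + record.get("browser", "")
    return "ipua:" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

# Hypothetical records: same proxy IP, different browsers, then a cookie.
a = {"host": "24.69.48.18", "browser": "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"}
b = {"host": "24.69.48.18", "browser": "Mozilla/4.7 [en] (WinNT; U)"}
c = {"host": "10.0.0.7", "browser": "Mozilla/4.7 [en] (WinNT; U)", "cookie": "abc123"}

print(visitor_id(a) != visitor_id(b))  # two visitors behind one IP are kept apart
print(visitor_id(c))
```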
Session Identification
• Visitor ID and Timeout Period
– Once the Visitor ID is constructed, the requests with the same Visitor ID are sequenced by timestamp, i.e. the time each request was made. If the time difference between two consecutive requests is more than, say, 30 minutes, the sequence is broken into two sessions.
• Query String Parameter
– In the request query string
• Cookie Value
• Session Field
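The Visitor ID and timeout method can be sketched as below. The `build_sessions` helper, the tuple layout and the sample requests are assumptions for illustration; the 30-minute timeout is the example value mentioned above.

```python
from datetime import datetime, timedelta

def build_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (visitor_id, timestamp, url) requests into sessions per visitor.

    Returns {visitor_id: [session, ...]} where each session is a list of
    (timestamp, url) in time order.
    """
    by_visitor = {}
    for vid, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        sessions = by_visitor.setdefault(vid, [])
        # Continue the current session unless the gap exceeds the timeout.
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return by_visitor

t0 = datetime(2000, 11, 1, 10, 0)
sample = [
    ("v1", t0 + timedelta(minutes=50), "/checkout.jsp"),  # 40 minutes after the previous request
    ("v1", t0, "/index.html"),
    ("v1", t0 + timedelta(minutes=10), "/specials.jsp"),
    ("v2", t0, "/index.html"),
]
sessions = build_sessions(sample)
print({vid: len(s) for vid, s in sessions.items()})
```

Visitor v1 ends up with two sessions because of the 40-minute gap; v2 has one.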
User Identification
• Web Server Authentication
• Query String Parameter
• Cookie Value
– A cookie is a small text file that stores information about a visitor on the user’s PC
Web Analytics Solution Types
• Simple Web Log Analyzer
– Many free ones, simple parsing and counting
– WebTrend Web Log Analyzer
• Data Warehouse Solutions
– WebTrend E-commerce Server
– Oracle 9iAS Clickstream Intelligence
• Hosting Solutions
– Digimine
• Consulting Solutions
– Many companies specialize in customized Web Log and Application Log analysis
Web Log Analyzer
• Web Log Analyzer
– Reports simple site usage measures, e.g. # of hits, # of visitors, page sequence, etc.
• Methodology: simple parsing and counting
• Small and quick, but only produces simple static reports, usually with a big error margin
Data Warehouse Solutions
• Load Server Log into Data Warehouse
• Integrate with other data, e.g. sales
• Support interactive query and OLAP
• More accurate analysis and data mining results
• Expensive
Simplified DW Schema: Dimensions
• Date
• Time of Day
• Browser
• Client Host
• User
• Visitor
• Page
• Server
• Site
• Event
• Referrer
• Search
Simplified DW Schema: Facts
• Impression (page view)
– Browser
– Client Host
– Visitor
– User
– Page
– Time to Serve
– Referrer
– Status
– Event
– Server
– Session ID
• Session Fact
– Session Date
– Session Time
– Session Visitor ID
– Session User ID
– Session Duration
– # of Impressions
– Data Sent
– First Impression ID
– Last Impression ID
– First Referrer
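To make the relationship between the two facts concrete, a session fact row can be derived by aggregating the impressions of one session. This is a simplified sketch: the class and field names are assumptions loosely based on the lists above, not the actual Clickstream Intelligence schema, and duration is taken as the time between the first and last impression.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Impression:
    impression_id: int
    session_id: str
    visitor_id: str
    page: str
    referrer: str
    bytes_sent: int
    timestamp: datetime

@dataclass
class SessionFact:
    session_id: str
    visitor_id: str
    start: datetime
    duration_seconds: float
    impressions: int
    data_sent: int
    first_impression_id: int
    last_impression_id: int
    first_referrer: str

def build_session_fact(impressions):
    """Aggregate the impressions of a single session into one session fact row."""
    ordered = sorted(impressions, key=lambda imp: imp.timestamp)
    first, last = ordered[0], ordered[-1]
    return SessionFact(
        session_id=first.session_id,
        visitor_id=first.visitor_id,
        start=first.timestamp,
        # Time spent on the last page is unknown, so this is a lower bound.
        duration_seconds=(last.timestamp - first.timestamp).total_seconds(),
        impressions=len(ordered),
        data_sent=sum(imp.bytes_sent for imp in ordered),
        first_impression_id=first.impression_id,
        last_impression_id=last.impression_id,
        first_referrer=first.referrer,
    )

# Hypothetical impressions for one session, deliberately out of order.
rows = [
    Impression(2, "s1", "v1", "/specials.jsp", "http://www.drugdepo.com/index.html",
               2048, datetime(2000, 10, 1, 9, 5)),
    Impression(1, "s1", "v1", "/index.html", "http://www.lycos.com/",
               4096, datetime(2000, 10, 1, 9, 0)),
]
fact = build_session_fact(rows)
print(fact.impressions, fact.data_sent, fact.duration_seconds)
```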
Impression Fact
Session Fact
ETL Process and external data
The ETL process can be customized to support business analysis according to:
• Web server log format
• External customer data
• External sales data and marketing data
• Other external data sources
Demo and Scenarios
Oracle 9iAS Clickstream Intelligence
Architecture diagram components: Collector Server, Loader, Oracle Warehouse Builder, Star Schema, Partitioning, Staging, Oracle 9i
Agenda
• Configuration
• Basic Site Statistics
• Business Scenarios
DrugDepo Site Configuration
Site Basic Statistics
Site: DrugDepo.com
Start Date: October 1
End Date: October 10
Business Scenarios Examples
Scenario 1: Determining Content Effectiveness
• Questions: The marketing director of DrugDepo, Shelley Green, would like to know the following:
1. How do visitors find DrugDepo's Web site?
2. Did visitors find what they were looking for?
Discovery
Shelley uses the following Clickstream Intelligence reports:
• Search Analysis: Top Referring Searches
• Search Analysis: Top Local Searches
Top 5 Referring Searches
The top 5 referring searches (searches through search engines such as Google, Yahoo, Lycos, etc.) that bring visitors to DrugDepo are:
• health care products
• ask expert
• pharmacy
• baby care
• promotion
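A report like this can be approximated from the referrer field of the server log by pulling the search phrase out of each search-engine referrer URL and counting. The sketch below is not how Clickstream Intelligence computes it; the query parameter name per engine, the helper names and the sample referrers are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Search phrase parameter per engine host (assumed values for illustration).
SEARCH_PARAMS = {
    "www.google.com": "q",
    "search.yahoo.com": "p",
    "www.lycos.com": "query",
}

def search_phrase(referrer):
    """Return the normalized search phrase of a search-engine referrer, else None."""
    parts = urlsplit(referrer)
    param = SEARCH_PARAMS.get(parts.netloc.lower())
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0].strip().lower() if values else None

def top_referring_searches(referrers, n=5):
    phrases = (search_phrase(r) for r in referrers)
    return Counter(p for p in phrases if p).most_common(n)

# Hypothetical referrer values from the log.
referrers = [
    "http://www.google.com/search?q=health+care+products",
    "http://search.yahoo.com/search?p=ask+expert",
    "http://www.lycos.com/srch/?query=Health+Care+Products",
    "http://www.allergylearninglab.com/links.html",
    "-",
]
print(top_referring_searches(referrers))
```

Referrers that are not known search engines, and the "-" placeholder for a missing referrer, are ignored.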
Top 5 Local Searches
The top 5 local searches are:
• ask expert
• specials
• allergy
• baby food
• heart attack
Possible Actions
Shelley is considering the following:
• Expanding the content of the “Ask Expert” column.
• Positioning it prominently on the DrugDepo home page.
• Offering baby-related articles and items on the site, since there is also quite a high interest in baby care, food and related areas.
Scenario 2: Maximizing Online Marketing Effectiveness
Online Marketing Effectiveness
The marketing director of DrugDepo, Shelley Green, would like to know the following:
• Who are DrugDepo’s top external referrers?
• What are the top searches by search engines?
Discovery
Shelley uses the following Clickstream Intelligence reports:
• Referring URLs: Top External Referrers
• Search Analysis: Top Searches by Search Engine
Top referrers
The following are the top referrers of DrugDepo:
• www.allergylearninglab.com
• www.healthwatchlab.com
• www.altmedicine.com
• www.lycos.com
• www.webclinic.com
• search.yahoo.com
• hotbot.lycos.com
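A ranking of external referrers like the one above can be sketched by counting referrer hosts and excluding the site's own hosts. The `top_external_referrers` helper, the set of own hosts and the sample referrer URLs are assumptions for illustration only.

```python
from collections import Counter
from urllib.parse import urlsplit

def top_external_referrers(referrers, own_hosts, n=10):
    """Count referrer hosts, skipping empty referrers and the site's own hosts."""
    counts = Counter()
    for referrer in referrers:
        host = urlsplit(referrer).netloc.lower()
        if host and host not in own_hosts:
            counts[host] += 1
    return counts.most_common(n)

# Hypothetical referrer values from the log.
sample_referrers = [
    "http://www.allergylearninglab.com/resources.html",
    "http://www.allergylearninglab.com/links.html",
    "http://www.drugdepo.com/index.html",  # internal navigation, not external
    "-",                                   # no referrer recorded
    "http://www.lycos.com/?query=pharmacy",
]
own = {"www.drugdepo.com", "drugdepo.com"}
print(top_external_referrers(sample_referrers, own))
```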
Popular Search Phrases
The popular search phrases by search engines are:
• www.lycos.com – ask expert, health care products, promotion, arnica, pharmacy …
• search.yahoo.com – health care products, ask expert, pharmacy …
• hotbot.lycos.com – health care products, pharmacy
Possible Actions
• Consider making Allergy Learning Lab, Health Watch Lab and Alt Medicine preferred partner Web sites because they are driving a lot of visitors to DrugDepo’s Web site.
• Consider purchasing popular keywords or search phrases from Lycos and Yahoo because they are effective in driving visitors to the site.