An OLAM Framework for Web Usage Mining and Business Intelligence Reporting Xiaohua (Tony) Hu Drexel University Philadelphia, PA, 19104



Outline

• Introduction

• Data Capture

• Data Webhouse Construction

• Mining, OLAP and Business Reporting

• Pattern Evaluation and Development

• Q&A

Benefits of Web Usage Mining

• Targeting customers based on usage behavior or profile (personalization)

• Adjusting web content and structure dynamically based on the page access patterns of users (adaptive web site)

• Enhancing the service quality and delivery to the end user (cross-selling, up-selling)

• Improving web server system performance based on the web traffic analysis

• Identifying hot areas/killer areas of the web site

Web Usage Mining Steps

1. Data capture (clickstream, sales, customers, products, promotions, shipping, etc.)

2. Data processing: ETL from OLTP to DW

3. Pattern discovery, OLAP cubes, and reports

4. Pattern evaluation and deployment

Data Capture

• The web server logs recording the visitors’ clickstream behavior (page template, cookie, transfer log, time stamp, IP address, agent, referrer, etc.)

• Product information (product hierarchy, manufacturer, price, color, size etc.)

• Content information of the web site (image, gif, video clip etc.)

• The customer purchase data (quantity of the products, payment amount and method, shipping address etc.)

• Customer demographics information (age, gender, income, education level, lifestyle etc.)

Issues in Clickstream Capture

• Distinguish sessions

• Use Cookies to track customers

• Tag templates

• Log business events

• Record query strings

• Detect crawlers
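The first two issues, distinguishing sessions and tracking visitors by cookie, can be sketched as follows. This is only an illustration under stated assumptions: the 30-minute inactivity timeout, the `Click` record, and the `sessionize` function are hypothetical names and values introduced here, not details given in the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

# Assumed inactivity threshold; the slides do not state one.
SESSION_TIMEOUT = timedelta(minutes=30)

@dataclass(frozen=True)
class Click:
    cookie_id: str       # visitor identifier taken from the cookie
    timestamp: datetime  # time of the request
    template: str        # page template that served the request

def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group clicks by cookie and split each group at gaps longer than `timeout`."""
    sessions = []
    ordered = sorted(clicks, key=lambda c: (c.cookie_id, c.timestamp))
    for _, group in groupby(ordered, key=lambda c: c.cookie_id):
        current = []
        for click in group:
            # A long silence from the same cookie starts a new session.
            if current and click.timestamp - current[-1].timestamp > timeout:
                sessions.append(current)
                current = []
            current.append(click)
        sessions.append(current)
    return sessions

clicks = [
    Click("c1", datetime(2001, 5, 1, 9, 0), "main"),
    Click("c1", datetime(2001, 5, 1, 9, 5), "pna"),
    Click("c1", datetime(2001, 5, 1, 11, 0), "main"),  # gap > 30 min: new session
    Click("c2", datetime(2001, 5, 1, 9, 1), "login"),
]
sessions = sessionize(clicks)
print([[c.template for c in s] for s in sessions])
```

With the sample clicks, cookie `c1` yields two sessions and cookie `c2` yields one. A production version would also need a fallback identifier for visitors who have cookies turned off.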

What Kind of Clickstream Information Needs to Be Recorded?

– Request (Click) Data:

• Template, Product, Assortment

• Time stamps for each click, compile and execution times

• Query string information, referring page information

• The request sequence number within a session

– Cookie Data:

• The cookie of the visitor (this ID is temporary if the user has cookies turned off)

– Session Data:

• Session length

• Browser (user agent) and IP address information for the client

• User’s Cookie ID

• User ID of the user if he/she logged in

• Whether or not the session timed out

• The total number of requests in the session

• Whether the session belongs to a user who “opts out”

• The total number of sessions that have come from users with this Cookie ID

Web Log Data

• Designed for debugging purposes, not for analysis

Crawler Session

• Crawlers are programs that visit your site automatically (search engines, shopping bots)

• It is very important to filter out crawler sessions (on some of our clients’ sites, crawler sessions account for up to 30%)

Techniques to Identify Crawler Sessions

• Build a model to identify crawler sessions: crawlers commonly turn off images and have empty referrers; friendly bots visit the robots.txt file; the page hit rate is too fast; the access pattern is a depth-first or breadth-first search of the site; bots never purchase

• Create invisible links in the web page (only a crawler will follow them)
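The signals listed above can be combined into a simple rule-based score. The sketch below is a stand-in for the model mentioned in the slide, not the model itself: the `SessionSummary` fields, the rate limit of 2 pages per second, and the threshold of 2 signals are all assumptions made for illustration. The depth-first/breadth-first traversal check is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class SessionSummary:
    requests: int               # total page requests in the session
    duration_seconds: float     # time between first and last request
    image_requests: int         # number of image fetches
    empty_referrer: bool        # True if no request carried a referrer
    fetched_robots_txt: bool    # True if robots.txt was requested
    followed_hidden_link: bool  # True if an invisible trap link was requested
    purchased: bool             # True if the session contains an order

def crawler_score(s, max_pages_per_second=2.0):
    """Count how many crawler signals fire for this session."""
    if s.purchased:
        return 0  # "bots never purchase"
    signals = [
        s.image_requests == 0,
        s.empty_referrer,
        s.fetched_robots_txt,
        s.followed_hidden_link,
        s.requests / max(s.duration_seconds, 1.0) > max_pages_per_second,
    ]
    return sum(signals)

def is_crawler(s, threshold=2):
    """Flag the session when at least `threshold` signals fire (assumed cutoff)."""
    return crawler_score(s) >= threshold

bot = SessionSummary(200, 40.0, 0, True, True, False, False)
human = SessionSummary(12, 600.0, 30, False, False, False, True)
print(is_crawler(bot), is_crawler(human))
```

In practice the threshold and weights would be tuned on sessions already labeled as crawler or human.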

OLTP vs DSS

OLTP                                     DSS
Daily operation                          Analysis
Many small transactions                  Few large transactions
Need quick response                      Very time consuming
Read & write (insert, delete, update)    Mostly read only
Session and product centric              Customer centric

What is OLAM?

• OLAP (On-Line Analytical Processing): pre-calculates summary information to enable drilling, pivoting, slicing/dicing, and filtering, so that the business can be analyzed from multiple angles or views (dimensions)

• OLAM (On-Line Analytical Mining): an integration of data mining, data warehousing, and OLAP technologies
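The OLAP operations named above (roll-up, drill-down, slicing) amount to aggregating a measure over different subsets of dimensions. A minimal sketch, assuming made-up fact rows and hypothetical helper names `aggregate` and `slice_rows`; a real OLAP engine pre-calculates these summaries rather than computing them on demand as done here.

```python
from collections import defaultdict

# Hypothetical fact rows: one row per order line, with three dimensions
# (region, product, month) and one measure (revenue).
facts = [
    {"region": "East", "product": "seat", "month": "2001-01", "revenue": 120.0},
    {"region": "East", "product": "exhaust", "month": "2001-01", "revenue": 300.0},
    {"region": "West", "product": "seat", "month": "2001-01", "revenue": 80.0},
    {"region": "West", "product": "seat", "month": "2001-02", "revenue": 100.0},
]

def aggregate(rows, dimensions, measure="revenue"):
    """Sum `measure` grouped by the given dimensions (one cube cell per key)."""
    cells = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cells[key] += row[measure]
    return dict(cells)

def slice_rows(rows, **fixed):
    """Keep only rows matching the fixed dimension values (a slice or dice)."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]

by_region = aggregate(facts, ["region"])                     # rolled up
by_region_product = aggregate(facts, ["region", "product"])  # drilled down
west_by_month = aggregate(slice_rows(facts, region="West"), ["month"])
print(by_region, by_region_product, west_by_month)
```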

Data Webhouse Construction

• Requirement Analysis of the Data Webhouse

• Data Webhouse Schema Design

Dimensions, Fact Tables, Aggregation/Summary tables

Requirement Analysis of the Data Webhouse

1. Web site activity (hourly, daily, weekly, monthly, quarterly, etc.)

2. Product sales (by region, by brand, by domain, by browser type, by time, etc.)

3. Customers (by type, by age, by gender, by region, buyer vs. visitor, heavy buyer vs. light buyer, etc.)

4. Vendors (by type, by region, by price range, etc.)

5. Referrers (by domain, by sale amount, by visit numbers, etc.)

6. Navigational behavior patterns (top entry page, top exit page, killer page, hot page, etc.)

7. Click conversion ratio

8. Shipments (by regular, by express mail, etc.)

9. Payments (by cash, by credit card, by e-money, etc.)

Data Webhouse Schema Design

• Define the Source Data

• Choose the Grain of the Fact Tables

• Choose the Dimensions Appropriate for the Grain

• Choose the Facts Appropriate for That Grain
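The four design steps above can be illustrated with a tiny star schema in an in-memory SQLite database. Everything here is a sketch under assumptions: the grain (one fact row per page request), the table names `dim_session`, `dim_page`, `dim_date`, `fact_click`, and the sample rows are hypothetical and are not the webhouse tables described in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_session (session_key INTEGER PRIMARY KEY, referrer TEXT, purchase_flag INTEGER);
CREATE TABLE dim_page    (page_key INTEGER PRIMARY KEY, template TEXT, page_type TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, weekend_flag INTEGER);
-- Grain: one fact row per page request (click).
CREATE TABLE fact_click (
    session_key     INTEGER REFERENCES dim_session(session_key),
    page_key        INTEGER REFERENCES dim_page(page_key),
    date_key        INTEGER REFERENCES dim_date(date_key),
    seconds_on_page REAL
);
""")
conn.executemany("INSERT INTO dim_session VALUES (?, ?, ?)",
                 [(1, "search", 1), (2, "", 0)])
conn.executemany("INSERT INTO dim_page VALUES (?, ?, ?)",
                 [(1, "main", "home"), (2, "pna", "catalog")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20010501, "2001-05-01", 0)])
conn.executemany("INSERT INTO fact_click VALUES (?, ?, ?, ?)",
                 [(1, 1, 20010501, 10.0), (1, 2, 20010501, 45.0), (2, 1, 20010501, 5.0)])

# Facts are additive at the chosen grain, so they can be summed by any dimension.
rows = conn.execute("""
    SELECT p.template, COUNT(*) AS clicks, SUM(f.seconds_on_page) AS seconds
    FROM fact_click f JOIN dim_page p ON p.page_key = f.page_key
    GROUP BY p.template ORDER BY p.template
""").fetchall()
print(rows)
```

Choosing the finest grain (the individual click) keeps every dimension listed in the requirement analysis available for later aggregation.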

Appropriate Dimensions

• Session Dimension

• Page Dimension

• Time Dimension

• User Dimension

• Product Dimension

Session Attributes

• Session Length

• Referrer

• Agent

• Host Name

• IP Address

• Cookie_id

• First Request Time

• Last Request Time

• Average Time Per Page

• Purchase Flag

• Time Out Flag

• Many more …

Customer Attributes

• Address: City, State/Province, Country

• Gender, Age, Profession, Education, Marital Status

• Contact Info: Email, Phone

• Repeat Visit Flag

• Frequent Buyer Flag

• Heavy Spender Flag

• Reader/Browser Flag

• Many more …

Page Attributes

• Page Template

• Page Location

• Page Type

• Page Category

• Page Description

• Registration Page Flag

• Shipping Page Flag

• Checkout Page Flag

• Many more …

Promotion Attributes

• Promotion Name

• Price Reduction Percentage

• Adv Type

• Coupon Type

• Begin Date

• End Date

• Promotion Region

• Many more …

Date Attributes

• Day, Week, Month, Quarter, Year

• Day number in Month, Day Number in Quarter, Day Number in Year

• Week number in Month, Week Number in Quarter, Week Number in Year

• Weekday Flag

• Weekend Flag

• Season

• Many more …

Time Attributes

• Second, Minute, Hour

• Early Morning Flag

• Late Afternoon Flag

• Lunch Time Flag

• Dinner Time Flag

• Late Evening Flag

• Many more …
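Most of the date and time attributes listed above can be derived from the click timestamp when the dimension rows are loaded. A minimal sketch: the function names are hypothetical, the week number uses the ISO definition, and the hour ranges behind each time-of-day flag are illustrative guesses, since the slides do not define them.

```python
from datetime import datetime

def date_attributes(ts):
    """Derive a few of the date attributes from a timestamp."""
    return {
        "day": ts.day,
        "week": ts.isocalendar()[1],        # ISO week number (an assumption)
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "year": ts.year,
        "day_number_in_year": ts.timetuple().tm_yday,
        "weekday_flag": ts.weekday() < 5,   # Monday=0 ... Friday=4
        "weekend_flag": ts.weekday() >= 5,
    }

def time_attributes(ts):
    """Derive time-of-day attributes; the hour ranges are assumed, not sourced."""
    h = ts.hour
    return {
        "hour": h,
        "minute": ts.minute,
        "second": ts.second,
        "early_morning_flag": 5 <= h < 8,
        "lunch_time_flag": 11 <= h < 14,
        "late_afternoon_flag": 16 <= h < 18,
        "dinner_time_flag": 18 <= h < 20,
        "late_evening_flag": 22 <= h <= 23,
    }

ts = datetime(2001, 5, 5, 12, 30, 15)  # a Saturday
print(date_attributes(ts))
print(time_attributes(ts))
```

Storing these as precomputed columns lets reports filter on, say, weekend lunch-time traffic without date arithmetic at query time.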

OLAP

• View data from multiple views and angles

• Immediate response to business queries

• Ability to drill down and roll up the multidimensional data in the cube

• Analyze business measures such as profit, revenue, and quantity from different angles, perspectives, and factors

Some Fact Tables

MINE_ORDERS_CLICKS_GIFTS: This table contains a row for each order line, clickstream request, and gift registry entry. It is the union of the MINE_ORDER_LINES, MINE_CLICK_LINES, and MINE_GIFT_LINES tables and is used as the fact table when mining on a combination of order and clickstream data. Since different columns apply to different types of line items, they are marked with the applicable type(s) (order, click, gift, or all).

MINE_ORDERS_ACXIOM: MINE_ORDER_HEADERS joined with MINE_CUSTOMERS, MINE_ACXIOM, MINE_PROMOTION

MINE_LINE_ITEMS: MINE_ORDER_LINES joined with MINE_CUSTOMER, MINE_ORDER_HEADERS, MINE_PRODUCTS, MINE_ASSORTMENT, MINE_PROMOTIONS

Some Dimension and Summary Tables in Webhouse

MINE_CLICK_LINES: a row for each Web page viewed

MINE_ACXIOM: a row for each customer for which the system was able to find Acxiom data

MINE_SESSIONS: a row for each Web session

MINE_ASSORTMENTS: a row for each assortment folder, assortment, and sub-assortment defined in the system

MINE_CUSTOMERS: a row for each customer

MINE_GIFT_HEADERS: a gift row for each customer

MINE_GIFT_LINES: a row for each gift registry item of each customer

MINE_ORDER_LINE: a row for each order line of each order

MINE_ORDER_HEADERS: a row for each order of each customer

MINE_PROMOTIONS: a row for each promotion folder and promotion defined in the system

Search Argument Findings

Records   Percent   Normalized   Value
292,952   99.46%                 NULL
70        0.02%     4.38%        fat boy
64        0.02%     4.00%        chrome
60        0.02%     3.75%        motorclothes
53        0.02%     3.31%        fuel tank
43        0.01%     2.69%        sportster
37        0.01%     2.31%        maintenance
30        0.01%     1.88%        sidecar
28        0.01%     1.75%        sissy bar
27        0.01%     1.69%        seat
26        0.01%     1.63%        touring
25        0.01%     1.56%        fuel tanks
24        0.01%     1.50%        exhaust
23        0.01%     1.44%        accessories
23        0.01%     1.44%        road king
22        0.01%     1.38%        rear fender
22        0.01%     1.38%        backrest
21        0.01%     1.31%        style
20        0.01%     1.25%        fatboy
20        0.01%     1.25%        deuce
961       0.33%                  Other Values
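The two percentage columns in the search argument table can be reproduced from the record counts. The slide does not define "Normalized"; the sketch below assumes it is each value's share of the records that actually carried a search argument (non-NULL), a reading that matches the published figures. The function name `search_term_shares` is hypothetical.

```python
def search_term_shares(counts, null_count):
    """Return {term: (percent of all records, percent of non-NULL records)}."""
    non_null_total = sum(counts.values())
    grand_total = non_null_total + null_count
    return {
        term: (100.0 * n / grand_total, 100.0 * n / non_null_total)
        for term, n in counts.items()
    }

# Record counts copied from the table above.
counts = {
    "fat boy": 70, "chrome": 64, "motorclothes": 60, "fuel tank": 53,
    "sportster": 43, "maintenance": 37, "sidecar": 30, "sissy bar": 28,
    "seat": 27, "touring": 26, "fuel tanks": 25, "exhaust": 24,
    "accessories": 23, "road king": 23, "rear fender": 22, "backrest": 22,
    "style": 21, "fatboy": 20, "deuce": 20, "Other Values": 961,
}
shares = search_term_shares(counts, null_count=292952)
percent, normalized = shares["fat boy"]
print(round(percent, 2), round(normalized, 2))
```

Note that "fat boy" and "fatboy", and "fuel tank" and "fuel tanks", are counted separately, so merging spelling variants before counting would change the ranking.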

Top 20 Paths Leading to Non-Purchase Sessions

Path   Count
main 14622
main->main 3731
main->main->main 790
main->main->login 329
main->main->main->main 303
login 274
main->main->pna->pna 216
pna 212
main->main->pna->pna->pna 192
main->main->eDealer 185
mc 180
main->main->pna 175
main->main->pna->pna->pna->pna->pna 169
main->main->pna->pna->pna->pna->pna->pna 166
main->main->pna->pna->pna->pna->pna->pna->pna 160
main->main->pna->pna->pna->pna 147
main->main->mc->mc->mc->mc 131
main->main->pna->pna->pna->pna->pna->pna->pna->pna 118
main->main->mc->mc->mc 111
main->main->pna->pna->pna->pna->pna->pna->pna->pna->pna 106

Top 20 Paths That Start at OF_Main.jsp and Exit at OF_Main.jsp

Path   Count
OF_Main.jsp->splash.jsp->OF_Main.jsp 154
OF_Main.jsp->OF_Main.jsp 122
OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp 52
OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 28
OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 25
OF_Main.jsp->OF_Main.jsp->splash.jsp->OF_Main.jsp 23
OF_Main.jsp->splash.jsp->pna/pa_main.jsp->OF_Main.jsp 16
OF_Main.jsp->splash.jsp->login/ln_login.jsp->OF_Main.jsp 15
OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 13
OF_Main.jsp->splash.jsp->mc/MC_main.jsp->OF_Main.jsp 13
OF_Main.jsp->splash.jsp->dealer_positioning.jsp->OF_Main.jsp 11
OF_Main.jsp->splash.jsp->pna/pa_main.jsp->pna/pa_family.jsp->OF_Main.jsp 11
OF_Main.jsp->splash.jsp->login/ln_login.jsp->login/ln_loginopp.jsp->login/ln_message.jsp->OF_Main.jsp 10
OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 9
OF_Main.jsp->splash.jsp->cart/sc_listing.jsp->OF_Main.jsp 7
OF_Main.jsp->splash.jsp->login/ln_login.jsp->login/ln_login_step.jsp->OF_Main.jsp 7
OF_Main.jsp->browser_message.jsp->OF_Main.jsp 6
OF_Main.jsp->dealer_positioning.jsp->OF_Main.jsp 5
OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 5
OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp 5
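Path reports like the two above come from counting identical page sequences across sessions, optionally restricted by entry and exit page. A minimal sketch with made-up sessions; the function `top_paths` and its parameters are hypothetical, and the real reports were presumably produced from the webhouse tables rather than in-memory lists.

```python
from collections import Counter

def top_paths(paths, n=20, start=None, end=None):
    """Count identical page sequences, optionally filtered by entry/exit page."""
    counts = Counter(
        "->".join(p)
        for p in paths
        if p
        and (start is None or p[0] == start)
        and (end is None or p[-1] == end)
    )
    return counts.most_common(n)

# Hypothetical sessions: (ordered list of page templates, purchased?)
sessions = [
    (["main"], False),
    (["main", "main"], False),
    (["main"], False),
    (["main", "pna", "cart"], True),
    (["login"], False),
    (["main", "pna", "main"], False),
]
non_purchase = [pages for pages, purchased in sessions if not purchased]
print(top_paths(non_purchase, n=3))
print(top_paths(non_purchase, start="main", end="main"))
```

Repeated entries such as main->main in the real data suggest page reloads or redirects, which may be worth collapsing before counting.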

Single/Multiple visitors/buyers

Type Counts

Single Visit 1823

Multiple Visit 37

Single Visit Buyer 269

Multiple Visit Buyer 58

Unknown 2846

Web Usage Mining Methods

• Construct cubes from the data webhouse: roll up and drill down the OLAP cubes to find the top domains, top products, top hot spots, web activity, most frequently accessed time periods, etc.

• Perform data mining on the data webhouse: find association patterns for cross-sell and up-sell, build links between pages, find sequential patterns and trends in web access, and improve system design through web caching, web page prefetching, and web page swapping

Mining the web data

• Association Rules

• Classification/Prediction

• Clustering

Data Mining - Association

• Path link analysis: explore, understand, and predict browsing patterns

• Shopping cart Analysis: cross-sell, up-sell to increase wallet-share

Gloss Example

No.  Relations  Lift  Support(%)  Confidence(%)  Rule
1    2          1.56  1.89        18.58          Bloom ==> Dirty_Girl
2    2          1.56  1.89        15.91          Dirty_Girl ==> Bloom
3    2          1.13  1.50        11.52          Philosophy ==> Bloom
4    2          1.13  1.50        14.75          Bloom ==> Philosophy
5    2          1.66  1.41        11.87          Dirty_Girl ==> Blue_Q
6    2          1.66  1.41        19.75          Blue_Q ==> Dirty_Girl
7    2          3.12  1.32        18.41          Tony_And_Tina ==> Girl
8    2          1.41  1.32        10.14          Philosophy ==> Tony_And_Tina
9    2          1.41  1.32        18.41          Tony_And_Tina ==> Philosophy
10   2          2.96  1.32        18.88          Demeter_Fragrances ==> Smell_This
11   2          3.12  1.32        22.45          Girl ==> Tony_And_Tina
12   2          2.96  1.32        20.75          Smell_This ==> Demeter_Fragrances
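The three measures in the table follow the usual definitions: support is the share of baskets containing both items, confidence of A ==> B is that count divided by the number of baskets containing A, and lift is confidence divided by the overall share of baskets containing B. The sketch below computes them for item pairs. The baskets are invented (they only reuse brand names from the slide), and `pair_rules` is a hypothetical helper, not the tool used to produce the table.

```python
from itertools import permutations

def pair_rules(baskets, min_support=0.0):
    """Compute support, confidence, and lift for every ordered pair A ==> B."""
    n = len(baskets)
    item_count = {}
    pair_count = {}
    for basket in baskets:
        items = set(basket)
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for a, b in permutations(sorted(items), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = {}
    for (a, b), both in pair_count.items():
        support = both / n
        if support < min_support:
            continue
        confidence = both / item_count[a]
        lift = confidence / (item_count[b] / n)
        rules[(a, b)] = (support, confidence, lift)
    return rules

# Made-up baskets for illustration only.
baskets = [
    {"Bloom", "Dirty_Girl"},
    {"Bloom", "Philosophy"},
    {"Bloom", "Dirty_Girl", "Blue_Q"},
    {"Philosophy"},
    {"Blue_Q"},
]
rules = pair_rules(baskets)
support, confidence, lift = rules[("Bloom", "Dirty_Girl")]
print(support, confidence, lift)
```

A lift above 1 means the two items appear together more often than independence would predict, which is what makes a rule a cross-sell candidate. Lift is symmetric, which is why each pair of mirrored rules in the table shares the same lift but not the same confidence.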

Data Mining - Classification

• Understand customers via rules, trees, etc.

• Prediction model for target-oriented marketing/campaign
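As a small illustration of rule-based classification, the sketch below uses OneR, a deliberately simple technique chosen here for brevity; the slides do not say which classifier was used. It picks the single attribute whose "value implies majority class" rule makes the fewest errors. The attribute names and rows are invented.

```python
from collections import Counter, defaultdict

def one_rule(rows, target):
    """OneR: return (attribute, {value: predicted class}, training errors)."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Tally the target classes seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical customer rows with two flags and the class to predict.
rows = [
    {"repeat_visit": True, "weekend": False, "buyer": True},
    {"repeat_visit": True, "weekend": True, "buyer": True},
    {"repeat_visit": False, "weekend": True, "buyer": False},
    {"repeat_visit": False, "weekend": False, "buyer": False},
    {"repeat_visit": True, "weekend": False, "buyer": True},
]
attribute, rule, errors = one_rule(rows, target="buyer")
print(attribute, rule, errors)
```

On this toy data the chosen rule reads "repeat visitors buy", which is the kind of readable output that makes rules and trees useful for understanding customers.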

Data Mining - Clustering

• Discover groups/segments of customers with similar behaviors/profiles
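Segment discovery can be sketched with k-means, used here only as a familiar example since the slides do not name a clustering algorithm. The two features per customer and the choice of starting centroids are assumptions made so the example is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical features per customer: (sessions per month, average order value)
customers = [(1, 20.0), (2, 25.0), (1, 30.0), (12, 200.0), (10, 220.0), (11, 180.0)]
labels, centers = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels, centers)
```

The two resulting groups separate occasional low spenders from frequent heavy spenders. With real data the features should be scaled first, otherwise the attribute with the largest numeric range dominates the distance.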

Questions?