Go data diving with
Exploratory Web Analytics Project of Sample data from an online retailer
Lily Qian Zhao
*The ‘flower’ was generated from the online data with Tableau
Content
• Intro
• Data Analytics
-Exploratory Analysis
-Cluster Analysis
-Logistic Regression
-Multiple Regression
• Targeted Recommendations
• Moving Forward
• Technical Appendix
Project Overview
Sample Data (2013)
• 9 Months (Except March-May)
• 21061 Rows of Records
• 12 possible Predictor Variables
1.Introduction
Data Cleansing → Exploratory Analysis → Predictive Models → Model Analysis → Recommendations
Business Objective
Find out how to achieve
better performance on
visits, orders and most
importantly, the sales.
Find out what factors
would be related to visits,
orders and sales.
1.Introduction
Technical Supports
• Exploratory Analysis: explore and summarize the data’s characteristics.
• Cluster Analysis: segment customers based on shared/distinct features and conduct the corresponding analytics.
• Logistic Regression: summarize what factors may influence whether a purchase occurs, and predict who would purchase.
• Multiple Regression: summarize what factors may influence how much customers purchase, predict how many purchases would occur, and find out how to achieve better performance on visits, orders and sales.
1.Introduction
Summary of Original Data
Data was recorded per day for each site,
broken down by platform and by kind of customer
3 Categorical Variables
• Site: Acme, Botly, Pinnacle, Sortly, Tabular, Widgetry
• Platform: Android, BlackBerry, iOS, ······, ChromeOS
• New Customers: New, Returning, Neither
1.Introduction
Summary of Original Data (Continued)
Day: From Jan. 1st – Feb. 28th and June. 1st – Dec. 31st in 2013
• All the distributions of variables (except for Month) are heavily right skewed.
Discrete Variables
Variable              Min.   Max.     Median   Mean
Visits                0      136057   24       1935
Distinct Sessions     0      107104   19       1515
Orders                0      4916     0        62.38
Bounces               0      54512    5        743.3
Add to Cart           0      7924     4        166.3
Product Page Views    0      187501   53       4358
Search Page Views     0      506629   82       8584
Gross Sales           1      707642   851      16473
1.Introduction
Exploratory Analysis
2. Data Analytics
• In the exploratory step, we study the basic characteristics of the variables, the interactions between them and the corresponding trends.
• The study focuses mainly on visits, orders and sales.
• Some further data inconsistencies were found, so the data was cleansed further and transformed accordingly. (Please refer to the Technical Appendix for details)
Exploratory Analysis
• For instance, the trends of Visits, Orders and Gross Sales by platform show breakpoints on Feb 9th. Referring to the original data, it is very likely that ‘iPad’ and ‘iPhone’ are recorded as ‘iOS’ after that date and thus disappear from the table. The same goes for ‘Macintosh’ (recorded as ‘MacOSX’ after Feb 9th). So the data is transformed to make it consistent.
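The relabeling described here can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names and the helper `harmonize_platform` are assumptions, and only the three label pairs come from the slide.

```python
# Label pairs taken from the slide: pre-Feb-9th label -> post-Feb-9th label.
PLATFORM_ALIASES = {
    "iPad": "iOS",
    "iPhone": "iOS",
    "Macintosh": "MacOSX",
}

def harmonize_platform(label):
    """Map an old platform label to its later name; other labels pass through."""
    return PLATFORM_ALIASES.get(label, label)

# Hypothetical records, for illustration only.
records = [
    {"day": "2013-02-08", "platform": "iPad", "visits": 120},
    {"day": "2013-02-08", "platform": "Macintosh", "visits": 300},
    {"day": "2013-02-10", "platform": "iOS", "visits": 410},
]
cleaned = [{**r, "platform": harmonize_platform(r["platform"])} for r in records]
platforms = [r["platform"] for r in cleaned]
print(platforms)
```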
2. Data Analytics
Exploratory Analysis
There are five major parts in the Exploratory Analysis. In each part, the numbers of Visits/Orders/Gross_Sales are studied.
• Zero Visits
• Platforms
• Weekdays
• New Customers
• Sites
2. Data Analytics
“Nearly 20% of the records for Site Sortly have 0 visits, and so do those for Pinnacle”
Zero Visit
4.6% 19.3% 18.5%
2. Data Analytics
“‘Unknown’ and ‘WindowsPhone’ have great influence because of their 0 visits”
Zero Visit
Platform        0-visit rate   Influence score
SymbianOS       45.94%         3
Blackberry      29.39%         2
ChromeOS        28.54%         5
Unknown         23.99%         1080
WindowsPhone    22.59%         590
Linux           19.59%         402
Other           13.46%         0.005
MacOSX          4.73%          4
Android         4.35%          23
Windows         2.62%          0.3
iOS             0.85%          1
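\[ \text{0\_VisitRate}(\text{Platform}) = \frac{\#\,\text{records with 0 visits}}{\#\,\text{total records}} \]
\[ \text{Influential\_Score}(\text{Platform}) = \frac{\text{Platform's visits}}{\text{Total visits}} \times \text{0\_VisitRate}(\text{Platform}) \times 10^{4} \]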
A platform with a high 0_VisitRate and a high share of total visits
has greater influence through its 0-visit records.
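The two quantities can be computed as in the sketch below. It assumes that both record counts in 0_VisitRate are taken per platform (the rates in the table sum to more than 100%, so they cannot share one overall denominator). The record layout and the toy data are hypothetical.

```python
from collections import defaultdict

def zero_visit_stats(records):
    """Return {platform: (zero_visit_rate, influential_score)}."""
    n_records = defaultdict(int)   # records per platform
    n_zero = defaultdict(int)      # records with 0 visits per platform
    visits = defaultdict(int)      # total visits per platform
    for r in records:
        p = r["platform"]
        n_records[p] += 1
        visits[p] += r["visits"]
        if r["visits"] == 0:
            n_zero[p] += 1
    total_visits = sum(visits.values())
    stats = {}
    for p in n_records:
        rate = n_zero[p] / n_records[p]
        share = visits[p] / total_visits if total_visits else 0.0
        stats[p] = (rate, share * rate * 10**4)
    return stats

# Toy data, for illustration only.
toy = [
    {"platform": "A", "visits": 0},
    {"platform": "A", "visits": 30},
    {"platform": "B", "visits": 70},
    {"platform": "B", "visits": 0},
    {"platform": "B", "visits": 0},
    {"platform": "B", "visits": 0},
]
stats = zero_visit_stats(toy)
for platform, (rate, score) in stats.items():
    print(platform, round(rate, 4), round(score, 1))
```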
2. Data Analytics
Distributions of Visits/Orders/Sales Based on Different Platforms
[Figure panels: Visits, Gross Sales, Orders]
2. Data Analytics
Trend of Visits/Orders/Sales throughout time Based on Different Platforms
Holiday season witnessed
the fast growth of v/o/s
while during the summer
there was a trough.
The trend lines for the different
platforms and for
visits/orders/sales are similar
over time.
Windows kept taking the
lead of Visits, Orders and
Sales.
2. Data Analytics
Week Days
Mondays perform best on V/O/S,
followed by Tuesdays
Visits, Orders and Sales are almost evenly
distributed through days.
2. Data Analytics
Week Days
Holiday season witnessed
the fast growth of v/o/s.
During the summer, there
was a trough for Wednesday, Thursday, Friday
and the weekend.
The trend lines for the different
platforms and for
visits/orders/sales are similar
over time.
Trends from June to
December zig-zag markedly,
yet some days share
similar trends.
2. Data Analytics
Distributions of Visits/Orders/Sales Based on Different Types of Customers
The majority (83%) of the orders and sales are from Returning Customers
The majority (85%) of the visits are from Visitors
2. Data Analytics
Trends of Visits/Orders/Sales Through Time Based on Different Types of Customers
Holiday season witnessed
the fast growth of v/o/s.
During the summer, there
was a trough for Wed,Thu,Fri
and weekend.
The majority of sales as well
as orders are brought by
returning customers while
the majority of visits are
brought by visitors
Though the highest visits
belong to visitors and the
highest sales and orders belong
to returning customers, they
share a very similar trend
2. Data Analytics
Distributions of Visits/Orders/Sales Based on Different Types of Sites
[Figure panels: Visit, Orders, Sales]
The majority of visits, orders and sales
are all from Site Acme
2. Data Analytics
Distributions of Visits and Sales in SearchPageViews/ProductPageViews/DistinctSessions Based on Different Types of Sites
[Figure panels: Visit, Sales]
• The majority of visits and sales are
all from Site Acme
• Although Widgetry has fewer
visits/sales than the others, its
performance on Product Page
Views is good
2. Data Analytics
Holiday season witnessed
the fast growth of v/o/s.
During the summer, there
was a trough for Acme.
The majority of sales as well
as orders are brought by
returning customers while
the majority of visits are
brought by visitors
This time trend is very similar
to other time trends in
‘customer’ and ‘platform’
Trends of Visits/Orders/Sales Through Time Based on Different Types of Sites
2. Data Analytics
Why Cluster Analysis?
Heterogeneity is the central concept in marketing because in almost all situations customers have different wants, needs and preferences. Whenever such heterogeneity exists, a website that can recognize and accommodate the differences can gain an advantage over competitors in a category.
Clustering is a major approach for addressing heterogeneity. Customers with similar wants and needs are grouped into segments so the website can better meet the different needs.
Cluster Analysis
2. Data Analytics
Aim:
• Cluster the records based on their numeric values, profile them and find out differences between clusters to understand marketing/business performance.
Method: K-Means
Number of Clusters: 3-5
Assumptions
• Segment platform as Personal Computer (0.5) and Others (-0.5)
• Segment days as Weekdays and Weekends
• Define new_customer: Visitors -0.5; New Customers 0; Returning Customers 0.5.
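A minimal sketch of these encodings follows. The slide does not list which platforms count as a personal computer, nor the numeric codes for weekdays and weekends, so `PC_PLATFORMS` and the 0/1 weekend flag are assumptions made for illustration; the customer codes are the ones stated above.

```python
# Assumption: the slide does not say which platforms are "Personal Computer".
PC_PLATFORMS = {"Windows", "MacOSX", "Linux", "ChromeOS"}
# Codes stated on the slide.
CUSTOMER_CODES = {"Visitors": -0.5, "New": 0.0, "Returning": 0.5}
WEEKEND = {"Saturday", "Sunday"}

def encode(record):
    """Turn the categorical fields of one record into numeric features."""
    return {
        "platform_pc": 0.5 if record["platform"] in PC_PLATFORMS else -0.5,
        # Assumption: weekend = 1, weekday = 0 (no codes are given on the slide).
        "is_weekend": 1 if record["weekday"] in WEEKEND else 0,
        "new_customer": CUSTOMER_CODES[record["new_customer"]],
    }

encoded = encode({"platform": "Windows", "weekday": "Sunday", "new_customer": "Returning"})
print(encoded)
```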
Cluster Analysis
2. Data Analytics
Cluster Analysis
• Predictor Unadjusted
• Group 1:
High Visits, High Search_Page_Views,
High Product_Page_Views;
Low Gross_Sales
• Group 2:
Low Visits, Low Search_Page_Views,
Low Product_Page_Views;
Low Gross_Sales
• Group 3:
Low Visits, High Search_Page_Views,
High Product_Page_Views;
High Gross_Sales
So if a new record has Low Visits, High
Search_Page_Views, High Product_Page_Views, it is
very likely that the record also has high Gross_Sales
2. Data Analytics
Cluster Analysis
• Predictor Adjusted
Because the predictors have different scales and deviations, they are
standardized and then re-entered into the cluster analysis. Here is the new
clustering result.
• Group 1:
High Visits, High Search_Page_Views, High Product_Page_Views;
Returning Customer, Use computer. High Gross_Sales
• Group 2:
Low Visits, Low Search_Page_Views, Low Product_Page_Views, New
customer or Visitors. Low Gross_Sales
• Group 3:
High add_to_cart rate, Low Visits, Low Search_Page_Views,
Low Product_Page_Views, Returning Customer. Low Gross_Sales
• Group 4:
High add_to_cart rate, Low Visits, High Conversion_Rates,
Returning or new customers. Low Gross_Sales
• Group 5:
High Visits, High Search_Page_Views, High Product_Page_Views, Use
computer. Low Gross_Sales
So if a new record has High Visits, High Search_Page_Views and High Product_Page_Views, comes from a Returning Customer and uses a computer platform, it is very likely that the record also has high Gross_Sales
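The standardize-then-cluster step can be sketched with a small pure-Python K-Means. This is not the code used in the project (the clustering there was presumably run in one of the tools listed in the appendix). The farthest-point initialization and the toy rows are assumptions chosen to keep the example deterministic.

```python
import math

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def kmeans(rows, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each row joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(r, centers[j])) for r in rows]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            new_centers.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Toy rows (visits, orders), for illustration only.
rows = [[100, 1], [110, 2], [105, 1], [5000, 40], [5200, 42], [5100, 41]]
labels = kmeans(standardize(rows), k=2)
print(labels)
```

Without the `standardize` call, the column with the largest scale (here visits) would dominate the Euclidean distances, which is the reason given above for adjusting the predictors.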
2. Data Analytics
Other Possible Clusters
Clustering on add-to-cart rate, gross sales
and site:
• Majority of gross sales comes from
Acme(site).
• 0.25 and 0.7 are the two add-to-cart
rates that predict possibly good gross
sales;
Interesting findings
2. Data Analytics
Other Possible Clusters
Clustering on conversion rate, gross sales
and new customer:
• Majority of gross sales comes from low
conversion rate parts.
• Returning customers have low
conversion rates; new customers have
medium conversion rates; visitors have
high conversion rates.
• Returning customers tend to spend
more per record than the new
customers.
Interesting findings
2. Data Analytics
Other Possible Clusters
Clustering on bounce rate, gross sales
and new customer:
• Returning customers have higher
bounce rate than the new customers.
• Visitors have a wide range of bounce
rate (0~1).
• The highest gross sales are contributed
by the returning customers within the
bounce rate range of 0.12-0.3.
Interesting findings
2. Data Analytics
Other Possible Clusters
Clustering on product page visits, search
page views and platform:
• product page visits and search page
views are positively related.
• Customers using computers have
relatively more search page views
than the phone users while the former
also have relatively lower product
page views than the phone users.
• On average, product page views are
three times as many as search page views.
Interesting findings
2. Data Analytics
Other Possible Clusters
Interesting findings
Clustering on visits, gross sales and new
customer:
• Majority of gross sales comes from low
conversion rate parts.
• For returning customers, the more they
visit the website, the more they would
purchase.
• If the numbers of visits are the same,
new customers would purchase more
than the returning customers within a
price range (less than 200,000).
2. Data Analytics
Logistic Regression
• Summarize what factors may influence purchase or not
• predict who would purchase
• Why Logistic Regression?
① It predicts binary response variable: purchase or not
② Model quality can be measured: randomly divide the data into a training set and a test set, fit the model on the training set, validate it on the test set, and compare the real outcomes with the predicted ones to get the correct rate.
Purposes of modeling
2. Data Analytics
Logistic Regression
Transformation before modeling
Categorical variables were transformed to be numbers
• Add binomial response variable “sales”:
if gross sales > 0, sales = 1; if gross sales = 0, sales = 0
• Platform: as Personal Computer (0.5) and Others (-0.5)
• Customer: Returning customer (0.5), new customer (0) and visitors (-0.5)
The ratio of training set to test set is 7:3
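The binary response and the 7:3 split can be sketched as follows. The helper names, the fixed seed and the toy records are assumptions made for illustration, not the project's actual code.

```python
import random

def add_sales_flag(record):
    """sales = 1 if gross sales > 0, else 0 (the rule stated on the slide)."""
    return {**record, "sales": 1 if record["gross_sales"] > 0 else 0}

def train_test_split(records, train_fraction=0.7, seed=2013):
    """Shuffle a copy of the records and cut it at the requested fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy records, for illustration only.
raw = [{"id": i, "gross_sales": float(i % 3) * 50} for i in range(10)]
flagged = [add_sales_flag(r) for r in raw]
train, test = train_test_split(flagged)
print(len(train), len(test))
```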
2. Data Analytics
Logistic Regression
Model
• Final model:
Sales (0-1) is related to
① Search page views
② Product page views
③ Bounce rate
④ Platform
⑤ Add to cart rate
⑥ New customer
The model correctly predicts whether there would be a sale in 95% of cases,
given the search page views, product page views, bounce rate, platform, add to cart rate and new customer information
2. Data Analytics
Logistic Regression
Model Interpretation
Please refer to the technical appendix for R’s output
Final model: (Positive or Negative relation between predictors and sales, followed by weight)
Sales (0-1) is related to
• Search page views (Positive, less than 0.1)
• Product page views (Negative, less than -0.1)
• Bounce rate (Negative, -0.67)
• Platform
Blackberry(N,-0.33); ChromeOS(N,-0.82); iOS(N,-0.26); Linux(N,-0.61); MacOSX(P,0.67)
Other(N,-0.75); SymbianOS(N,-1.4); Unknown(P,1.16); Windows(P,0.46); WinPhone(P,-0.1)
• Add to cart rate (Positive, 1.24)
• Returning customer (Positive, the longer the better, 3.5)
So records with high search page views, low product page views, a low bounce rate, a Windows or MacOSX platform, a high add to cart rate and a returning customer are likely to generate sales
2. Data Analytics
Logistic Regression
Interesting findings
Please refer to the technical appendix for R’s output
• A higher bounce rate is a killer for sales (highest negative value)
• Returning customers weigh the most for sales
• For platform, Windows, MacOSX and Unknown perform well, whereas ChromeOS and SymbianOS do not
2. Data Analytics
Logistic Regression
Model Diagnostics
• The area under the ROC curve (AUC) is 0.98, which is a good indicator of
how accurately the model fits the dataset.
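For reference, the area under the ROC curve equals the probability that a randomly chosen record with a sale gets a higher predicted score than a randomly chosen record without one (ties count as one half). The sketch below computes it that way on toy values. It is not the R routine behind the 0.98 figure.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)
```

The pairwise form is quadratic in the number of records, so it suits a small check rather than the full dataset.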
2. Data Analytics
Logistic Regression
Interesting findings
Important Points (statistically, influential points and outliers):
These outliers are extremely important and should be taken very seriously.
Four records have a relatively great influence on the model.
• All of them happened on December 2nd 2013 (Cyber Monday)
• All of them are from Site Acme
2. Data Analytics
Logistic Regression
Interesting findings
Platform       NewCustomer   Visits   Orders   G_Sales   P_P_V   S_P_V    C-rate   B-rate   A-rate
Windows        New           3694     2390     247384    10728   19441    0.65     0.16     0.71
Windows        Returning     26347    4916     707642    78159   175488   0.19     0.17     0.3
MacOSX         Returning     18044    2766     458546    52556   111936   0.15     0.26     0.25
iOS            Returning     8283     1225     184423    21933   44212    0.15     0.23     0.26
Data average   /             1935     62       16473     4358    8584
2. Data Analytics
Multiple Linear Regression
• Summarize what factors may influence how much customers purchase
• Predict how many purchases would occur and how to achieve better performance on visits, orders and sales.
• Why Multiple Linear Regression?
Because it predicts a numeric response variable: gross sales
Model quality can also be measured: besides the statistical criteria of the model (R-squared, P-values etc.), randomly divide the data into a training set and a test set, fit the model on the training set, validate it on the test set, and compare the real outcomes with the predicted ones to get the correct rate.
Purposes of modeling
2. Data Analytics
Multiple Linear Regression
Variable selection
The table indicates the correlations between variables
The darker the color is, the more the two variables are correlated.
-visits and product_page_views, and visits and search_page_views, are highly correlated;
-orders and sales are highly correlated
2. Data Analytics
Multiple Linear Regression
Model
Final model: (Positive or Negative relation between predictors and sales, followed by weight)
Sales amount is related to
• Weekday(Positive to weekend, 30)
• Distinct Sessions(Negative, -3.25)
• Visits(Positive, 2.65)
• Platform
Blackberry(N,-315); ChromeOS(N,-187); iOS(P,522); Linux(N,-113); MacOSX(P,1248)
Other(P,380); SymbianOS(P,26); Unknown(N,-391); Windows(N,-3262); WinPhone(N,-220)
• Add to cart rate (Negative, -1195)
• Orders (Positive, 144)
• Returning customer (Positive, the longer the better, 1235)
So records on weekends, with fewer distinct sessions, more visits, a MacOSX platform, a lower add to cart rate and a returning customer are likely to generate higher sales
2. Data Analytics
Multiple Linear Regression
Model Diagnostics
• R-squared: 0.9857 (excellent!)
• P-value < 2.2*10^(-16) (great!)
• Correct rate: 66% (a prediction counts as correct when the absolute value of (predicted value – actual sales) is less than the actual sales)
• The incorrect cases are very likely related to the platform coefficient: the Windows coefficient is negative.
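The correct-rate criterion can be written out as a short function. This is a sketch of the rule as stated on the slide, using toy numbers, not the project's evaluation code.

```python
def correct_rate(actual, predicted):
    """Share of predictions with |predicted - actual| < actual."""
    hits = sum(abs(p - a) < a for a, p in zip(actual, predicted))
    return hits / len(actual)

# Toy values, for illustration only.
actual_sales = [100.0, 200.0, 50.0, 0.0]
predicted_sales = [150.0, 450.0, 40.0, 10.0]
rate = correct_rate(actual_sales, predicted_sales)
print(rate)
```

Under this rule a record with zero actual sales can never count as correct, because an absolute error cannot be below zero, so records without sales pull the rate down regardless of how close the prediction is.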
2. Data Analytics
Multiple Linear Regression
Used 3-fold (cross-validation) to avoid overfitting. This is how it
Model diagnostics
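As a minimal sketch of how a 3-fold split partitions the records: each record lands in the test fold exactly once and in the training folds the other two times. The function name and the interleaved assignment of indices to folds are assumptions, and a real run would normally shuffle the records first.

```python
def k_fold_indices(n, k=3):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(7, k=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```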
2. Data Analytics
Since returning customers are the backbone…
• Try to turn more new customers into returning customers
- Discounts/Coupons/Loyalty cards for returning customers
- Targeted ads for new customers
- Build email lists
- Regular greeting emails/messages
- Better customer service
… so we are trying to turn everyone into a returning customer!
3. Targeted Recommendations
Since returning customers are the backbone…
• Make returning customers stay longer with the website
- Discounts/Coupons for returning customers who stay longer and spend more
- Targeted ads for returning customers
- Better customer service
- Seasonal greetings and birthday gift/coupons (“We care about you!”)
… and we’ve got them connected!
3. Targeted Recommendations
Since returning customers are the backbone…
• Attract more visitors to become new customers
The customers we have are just like Ping-Pong balls in a basket with a hole in the bottom:
We need to make the hole smaller as well as get more balls into the basket.
So we also need to get more visitors to become new customers.
-Better online shopping experience
-Make it easy for visitors to register as new customers
-First-time shopping discount
… and we need more potential returning customers as well!
3. Targeted Recommendations
4. Moving Forward
Data Collection
[Diagram: a good model, built on Recency, Frequency and Monetary]
A good model for forecasting
customers’ future
value should be related to
the customers’ behaviors:
Recency, Frequency and Monetary.
Based on that, data
collection could be improved to provide
better support for analytics.
Data Collection
• More demographic data on individuals
-Gender
-Geographic information
-Age group
-Family size (single/married/with how many kids)
4. Moving Forward
Data Collection
• More shopping-related data
-Sales category
-Source
-Order history for each customer
-Promotional history (promotional response)
-Length of membership (loyalty)
4. Moving Forward
Data Collection
• Others
- More information about the sites: why does Acme stand out while the others are not performing well?
- Find the March-May data and regress the trends again
- Compare with data from 2012/2014 to see if time series/seasonal problems exist
- Conduct further study on the high-purchase individuals and make targeted marketing suggestions accordingly
- Distinguish lapsed customer groups and make relevant strategies
4. Moving Forward
Data Cleansing
Missing Value
Technical Appendix
1. There are 17835 missing values in the original dataset. 8259 of them are in the new_customer category and the rest of them (9576) are in gross_sales.
2. According to the instructions, NA in new_customer means neither a new customer nor a returning customer. These should be the window-shopping customers, or visitors. So the 8259 records are assigned as “Visitors” in the new_customer category.
3. 8031 rows are missing both fields. Assuming that visitors need to log in to purchase, these rows’ gross_sales should be zero. The other missing gross_sales values are filled with the average gross_sales.
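The missing-value rules can be sketched as below. The record layout, the use of `None` for NA, and taking the average over the observed gross_sales values are assumptions made for illustration.

```python
def impute(records):
    """Apply the missing-value rules to a list of record dicts (None = NA)."""
    observed = [r["gross_sales"] for r in records if r["gross_sales"] is not None]
    mean_sales = sum(observed) / len(observed)  # assumption: mean of observed values
    cleaned = []
    for r in records:
        r = dict(r)  # work on a copy, leave the input untouched
        customer_missing = r["new_customer"] is None
        if r["gross_sales"] is None:
            # Both fields missing -> visitor who cannot purchase -> 0; else the mean.
            r["gross_sales"] = 0.0 if customer_missing else mean_sales
        if customer_missing:
            r["new_customer"] = "Visitors"
        cleaned.append(r)
    return cleaned

# Toy records, for illustration only.
toy_records = [
    {"new_customer": None, "gross_sales": None},
    {"new_customer": "Returning", "gross_sales": None},
    {"new_customer": "New", "gross_sales": 100.0},
    {"new_customer": None, "gross_sales": 300.0},
]
cleaned_records = impute(toy_records)
for row in cleaned_records:
    print(row)
```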
Data Cleansing
Data Inconsistency
Technical Appendix
1. There are 2465 records whose orders are zero but whose gross_sales are not zero. In this case, the orders are set to the mean of all orders.
2. After the first step of filling in missing data, there are still 9 records whose gross_sales are zero but whose orders are not zero. In this case, the gross_sales are set to the mean of all gross_sales.
3. There is a blank category under “platform”; it is regarded as “Unknown”.
Data Cleansing
New Variables
Technical Appendix
1. Three new variables were requested: conversion_rate, bounce_rate and add_to_cart_rate.
2. Because some records have zero visits, which would cause errors in the rates, the three rates are set as follows when visits is zero, based on their definitions:
conversion_rate = 0;
bounce_rate = 1;
add_to_cart_rate = 0.
3. Other new variables are also created for further use (mostly for the time series):
Weekday: the day of the week of the date (Monday, Tuesday etc.);
Month: the month of the date.
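The zero-visit rule for the three rates can be sketched as follows. The slide does not spell out the rate definitions, so dividing orders, bounces and add-to-cart counts by visits is an assumption (for conversion_rate it agrees with the outlier table, where 4916 orders over 26347 visits gives 0.19). The bounces and add_to_cart values in the example are hypothetical.

```python
def add_rates(record):
    """Add conversion, bounce and add-to-cart rates, guarding against 0 visits."""
    visits = record["visits"]
    if visits == 0:
        # Fallback values stated on the slide.
        rates = {"conversion_rate": 0.0, "bounce_rate": 1.0, "add_to_cart_rate": 0.0}
    else:
        # Assumption: each rate is the corresponding count divided by visits.
        rates = {
            "conversion_rate": record["orders"] / visits,
            "bounce_rate": record["bounces"] / visits,
            "add_to_cart_rate": record["add_to_cart"] / visits,
        }
    return {**record, **rates}

# Visits and orders come from the outlier table; bounces and add_to_cart are hypothetical.
busy = add_rates({"visits": 26347, "orders": 4916, "bounces": 4500, "add_to_cart": 7900})
empty = add_rates({"visits": 0, "orders": 0, "bounces": 0, "add_to_cart": 0})
print(round(busy["conversion_rate"], 2), empty["bounce_rate"])
```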
Selections for Clusters
Technical Appendix
• Predictor Unadjusted
For the 4-cluster and 5-cluster models, I found that the high gross_sales groups are not distinct from the other groups.
For example, clusters 2 and 3 in the 4-cluster model both have high values in search_page_views and product_page_views, yet
cluster 3 has high gross sales while cluster 2 has very low gross sales. So when we get a group of records whose
search_page_views and product_page_views are high, we cannot tell whether gross sales would be high or not.
The 3-cluster model does its job: when visits are low and search_page_views and product_page_views are high, gross sales
are high for the group. So I dropped the 4-cluster and 5-cluster models and chose the 3-cluster model.
Selections for Clusters
Technical Appendix
• Why adjusting Predictors?
The values of predictors are standardized before applying the final cluster analysis.
There are four main reasons for standardization:
① Commensurate units
② Skewed distribution
③ Variable weighting
④ Facilitating interpretation
Selections for Clusters
Technical Appendix
• Predictor Adjusted
After standardizing the predictors, all models worked perfectly. So I chose 5-cluster model to make
the high gross sales group more distinctive.
Selections for Logistic Regression Model
Technical Appendix
[Figures: final selection for the logit model; accuracy rate; weights (coefficients) of the predictors]
Selections for Multiple Linear Regression Model
Technical Appendix
Data summary for Multi-Model
Selections for Multiple Linear Regression Model
Technical Appendix
Outlier Detection
Distribution of Residuals
Selections for Multiple Linear Regression Model
Technical Appendix
Accurate rate
Summary of final Multi-Model
Software used
Technical Appendix
Analytics: RStudio | SAS | MySQL | Python
Data Visualization: Tableau | RStudio | Photoshop
Code is available if needed
Extra part I: A little more about Lily
Teehee
I love painting, designing and doing creative innovations.
check here: http://lilyqianz.com/projects/smart-mailbox/
Extra icons for new_customer category
Thanks for reading.
Hope you have enjoyed and wowed :)
Data is beautiful.