


Master Thesis, 30 ECTS

Master of Science in Industrial Engineering and Management, 300 ECTS

Spring 2021

Customer loyalty, return and churn prediction through

machine learning methods

for a Swedish fashion and e-commerce company

Master Thesis Study

Anida Granov


Copyright © 2021 Anida Granov. All rights reserved.

CUSTOMER LOYALTY, RETURN AND CHURN PREDICTION THROUGH MACHINE LEARNING METHODS

Submitted in fulfillment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-907 87 Umeå, Sweden

Supervisors: Mohammad Ghorbani, Umeå University; Cagri Emre Korkmaz, NA-KD
Examiner: Natalya Pya Arnqvist, Umeå University


Abstract

The analysis of gaining, retaining and maintaining customer trust is a highly topical issue in the e-commerce industry, as a means of mitigating the challenges of increased competition and volatile customer relationships that follow from the growing use of the internet to purchase goods. This study is conducted at the Swedish online fashion retailer NA-KD with the aim of gaining better insight into the customer behavior that determines purchases, returns and churn. The objectives of the study are therefore to identify the group of loyal customers and to construct models that predict customer loyalty, frequent returns and customer churn. Two separate approaches are used. First, a clustering model is constructed to divide the data into customer segments that can explain customer behaviour. Then a classification model is constructed to classify customers into the classes of churners, returners and loyal customers, based on the exploratory data analysis and on previous insights and knowledge from the company. Using the unsupervised machine learning method K-prototypes, a clustering algorithm for mixed data, six clusters are identified and defined as churned customers, potential customers, loyal customers, Brand Champions, indecisive shoppers and high-risk churners. The supervised classification method of bias-reduced binary logistic regression is used to classify customers into the classes of loyal customers, customers with frequent returns and churners. The final models had accuracies of 0.68, 0.75 and 0.98 for the three separate binary classification models classifying churners, returners and loyalists, respectively.

The disposition of the report. The report is divided into seven chapters. Chapter 1 contains a general overview of the e-commerce industry. Chapter 2 presents the problem statement, followed by the company description and the objectives of the project. After the literature review in Chapter 3, Chapter 4 states the methodology used in the project, including the theory, data explanation and models. Chapter 5 presents the data analysis and modelling. Chapter 6 presents the results of the study. Finally, the thesis ends with discussion and conclusions in Chapter 7.


Sammanfattning

Analysen kring att öka, behålla och upprätthålla kundtillit är en mycket aktuell fråga inom e-handelsbranschen för att möta utmaningarna med ökad konkurrens och instabila kundrelationer, som en effekt av den ökande användningen av internet för att köpa varor. Denna studie genomförs hos det svenska mode- och e-handelsbolaget NA-KD i syfte att få en bättre insikt i kundbeteende som påverkar köp, returer och beslut att lämna företaget (churn). Målet med denna studie är således att identifiera den lojala kundgruppen hos företaget, samt konstruera modeller för att prediktera kundlojalitet, frekventa returer och kundchurn. Studien innefattar två separata tillvägagångssätt där en klustermodell är konstruerad för att separera data i olika kundsegment för att förklara kundbeteenden. En klassificeringsmodell konstrueras sedan för att klassificera och prediktera kunderna i klasserna 'churner', 'returner' och 'loyal' baserat på en förklarande dataanalys och tidigare insikter och kunskap från företaget. Genom att använda den oövervakade klustermaskininlärningsmetoden 'K-prototypes' för blandad data, identifieras och definieras följande sex olika kluster: churnade, potentiella och lojala kunder samt Brand Champions, obeslutsamma kunder och kunder med hög risk för churn. Den övervakade klassificeringsmetoden 'bias reduced Logistic Regression' används för att klassificera kunder i klasserna lojala kunder, kunder som gör frekventa returer och kunder som lämnat företaget. De slutliga modellerna har en noggrannhet på 0.68, 0.75 och 0.98 för de tre separata binära klassificeringsmodellerna som klassificerar kunderna i grupperna 'churner', 'returner' respektive 'loyal'.

Rapportens disposition. Rapporten delas upp i sju kapitel. Kapitel 1 innehåller en generell överblick på e-handelsbranschen. I Kapitel 2 presenteras problemformuleringen uppföljt av en företagsbeskrivning och projektets mål. Nästföljande kapitel, Kapitel 3, innefattar en presentation av liknande studier uppföljt av Kapitel 4 bestående av studiens metodik, inkluderande teori, databeskrivning och modeller. Kapitel 5 innehåller en beskrivande analys av data och modellering. Kapitel 6 presenterar resultatet av studien och följs upp av den sista delen, Kapitel 7, innehållande diskussion och slutsatser.


Acknowledgements

The completion of this thesis study would not have been possible without the support and extensive knowledge of several people, whose names are mentioned in this section.

I cannot begin without expressing my thanks to my supervisor from Umeå University, Mohammad Ghorbani, for the valuable experience and insight in conducting and writing scientific reports. Thank you for always asking the right questions and giving me helpful practical suggestions.

I would also like to extend my deepest gratitude to my supervisor from NA-KD, Cagri Emre Korkmaz, for believing in me and giving me the opportunity to be the first one in history to perform a master's thesis project at NA-KD.

Someone whose help cannot be overestimated is my advisor from NA-KD, Burak Arca. Thank you for always taking your time to support and guide me through the project. Without your extensive knowledge and insightful suggestions, the final result would not be what it is today.

Finally, I would like to express my sincere gratitude to my family and partner for their constant encouragement and support throughout my time at university and this thesis.


Contents

1 Introduction . . . 1
  1.1 Motivation . . . 1

2 Project description . . . 3
  2.1 Problem definition . . . 3
  2.2 Company description . . . 3
  2.3 Limitations . . . 4
  2.4 The objectives . . . 4
  2.5 Project structure . . . 4

3 Literature review . . . 6
  3.1 CLV estimation based on RFM analysis . . . 6
  3.2 An extended RFM ranking by K-means clustering . . . 7
  3.3 Customer segmentation on behavioural data . . . 7
  3.4 Using SOM, K-means and the LRFM model for customer segmentation . . . 8
  3.5 Return prediction within e-retail . . . 8
  3.6 Customer churn prediction . . . 8
  3.7 Churn prediction through hybrid classification . . . 9
  3.8 Customer purchase behavior analysis through a two staged approach . . . 9
  3.9 Comparison between K-prototypes and K-means . . . 10

4 Methodology . . . 11
  4.1 Data description . . . 11
  4.2 Data mining . . . 12
  4.3 Theory . . . 14
    4.3.1 Non-parametric Statistics . . . 14
    4.3.2 Dimensionality Reduction . . . 16
    4.3.3 Clustering techniques . . . 19
    4.3.4 Classification . . . 21
  4.4 Model setup . . . 22

5 Data Analysis and Modelling . . . 24
  5.1 Exploratory Data Analysis . . . 24
  5.2 Modelling . . . 38
    5.2.1 Dimensionality reduction . . . 38
    5.2.2 Constraints of the study . . . 39
    5.2.3 The Google BigQuery Machine Learning tool . . . 39

6 Result . . . 42
  6.1 Clustering . . . 42
  6.2 Classification . . . 52
    6.2.1 Definition of classes . . . 52
    6.2.2 Feature selection . . . 52
    6.2.3 Logistic Regression . . . 53

7 Discussion and conclusions . . . 56
  7.1 Discussion . . . 56
  7.2 Conclusion . . . 57
    7.2.1 Future recommendations . . . 57

Bibliography . . . 59

A A/B testing . . . 61

B Contingency tables . . . 64

C Correlation triangle of numerical features . . . 66

D Bias Reduced Logistic Regression . . . 68
  D.1 Classifying churned customers . . . 68
  D.2 Classifying customers of frequent returns . . . 70
  D.3 Classifying loyal customers . . . 71


Chapter 1

Introduction

1.1 Motivation

By definition, e-commerce refers to the sale of goods or services over the internet. The easily accessible e-commerce of today's modern world is a win-win situation for both businesses and customers. The sheer amount of information available to customers online can be all that is needed to make a purchase decision, and for businesses, e-retail can be used to increase brand awareness and reduce the cost of physical stores. However, the constant increase in online shopping is also accompanied by challenges for businesses. With online stores only a click away, competition in the market increases and customer relationships become more volatile, since customer preferences can no longer be observed as transparently as in physical stores, where personal service is available. This is one factor behind the lower proportion of loyal customers within e-commerce in comparison to physical stores (Seippel; 2018).

In addition, online marketing techniques such as websites and social media are both cheaper and more effective in the long run than traditional marketing techniques. Website visitors contribute to a large collection of data, including web traffic data and personal information recorded through cookies. This immense collection of data opens up the possibility of studying customer preferences and purchasing behavior. As Coppola (2021) states in her article on the history and future of e-commerce, the development of e-commerce is highly dependent on the development of technology and will continue to be so in the future. According to the discussion by Coppola (2021) of Statista's statistics on global e-commerce, the internet and digitalization of today have widely increased the use of e-commerce. An estimated 1.92 billion people used e-retail to purchase goods or services in 2019, accounting for 14.1% of global retail sales. The share of retail sales made online is also expected to increase every year, reaching an estimated 21.8% in 2024, due to the increasing accessibility of the internet around the world and the variety of online platforms available for quick purchases.

99Firms' e-commerce statistics for the year 2020 (99Firms; 2020) predict that 95% of all purchases will be made via e-commerce by 2040. A 2019 study by the media and research firm DigitalCommerce 360 also shows that 61% of respondents stop comparing companies on other websites after finding a product they like. This highlights the importance of providing a website with easily accessible information about products. The results of the survey also showed the importance of factors such as free shipping, ease of returns and low cost of returns when customers compare different companies.


In line with this, a study on the influence of e-commerce on customer behavior by Mittal (2013) found that the most important factors for online shoppers are search features that enable customers to find the products they are looking for. It is also stated that providing third-party verification on the website, as well as information about the company such as customer service, location, a phone number and a help button, can increase customers' trust in shopping from e-retailers. The study highlights the importance of website reputation, payment security and post-purchase services such as shipping and returns for increased customer satisfaction.

A customer who makes multiple purchases spends on average four times as much money as a customer who makes only one purchase (Blevins; 2020). This shows how important it is for companies to build and maintain a loyal customer base. The process of winning customers is called customer acquisition and lays the foundation for growing the customer base. The next step is to keep the customers gained through acquisition, which defines customer retention. Customer retention is usually less expensive than acquiring new customers, so evaluating customer retention is critical for businesses. A study on customer segmentation highlights how the importance of customer retention has increased in recent years, while new customer acquisition has decreased. The study presents the Pareto principle and concludes that most of a company's revenue is generated by only 20% of its customers (Joy Christy et al.; 2018). By targeting this customer segment, a business can tailor marketing plans and advertising campaigns to reduce marketing costs while generating the same revenue as when marketing to everyone.
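The Pareto principle cited above can be made concrete with a small calculation: sort customers by revenue, take the top 20%, and measure their share of the total. The figures and the `top_share` helper below are invented for illustration (the thesis itself works in R and BigQuery; Python is used here purely as compact notation):

```python
def top_share(revenues, top_fraction=0.2):
    """Fraction of total revenue generated by the top `top_fraction`
    of customers, ranked by revenue."""
    ranked = sorted(revenues, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Synthetic customer base: a few heavy spenders, many small ones.
revenues = [5000, 4200, 3900, 3600] + [100] * 16

share = top_share(revenues)
print(f"Top 20% of customers generate {share:.0%} of revenue")
```

On this toy base the top fifth of the customers account for roughly 90% of revenue, which is exactly the kind of concentration the Pareto principle describes.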

Another important challenge for e-commerce businesses is the frequent returns that occur among customers. The average return rate for purchases from online stores is 25%, while the return rate for purchases from physical stores is only 8% (Charlton; 2020). To retain satisfied customers, returns must be easily accessible and not a barrier to purchasing an item. At the same time, returns usually have a negative impact on the business due to the cost of staff and resources, as well as carrying the risk of not being able to resell the returned items. The trade-off between satisfied customers and the increased costs associated with returns is highly topical in fashion industries where online sales take place.

Customer satisfaction is not the only metric to measure. Customer attrition, also known as customer churn, denotes the percentage of customers who no longer use a company's service, and is very important to businesses. By identifying potential churners in a timely manner, actions can be taken to prevent customers from leaving the client base.
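As a sketch of how such a churn definition can be operationalized, the snippet below labels a customer as churned when their last order lies more than 180 days before the analysis date. The cutoff and the dates are assumptions made for illustration, not the company's actual definition:

```python
from datetime import date

def churn_rate(last_order_dates, analysis_date, cutoff_days=180):
    """Share of customers whose last order is older than `cutoff_days`."""
    churned = sum(
        1 for d in last_order_dates if (analysis_date - d).days > cutoff_days
    )
    return churned / len(last_order_dates)

# Hypothetical last-order dates for four customers.
last_orders = [
    date(2020, 12, 15),  # recent buyer
    date(2020, 3, 1),    # inactive for roughly ten months
    date(2019, 11, 20),  # inactive for over a year
    date(2020, 10, 5),   # recent buyer
]
rate = churn_rate(last_orders, analysis_date=date(2021, 1, 1))
print(f"Churn rate: {rate:.0%}")
```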

Therefore, the analytics of gaining, retaining and maintaining customer trust is a highly topical issue in the e-commerce industry today, as a way of mitigating the above challenges in the long run.


Chapter 2

Project description

2.1 Problem definition

This study is conducted at the Swedish online fashion retailer NA-KD with the aim of gaining better insight into the customer behavior that determines purchases, returns and churn. The company has seen a surge in popularity since its inception, resulting in a user base of 8 million monthly visitors to its website (NA-KD; n.d.). Today, marketing takes place with manual methods and third-party email marketing. There are some previous studies in the area of customer segmentation at the company, where the so-called RFM analysis has been performed, but it has not yet been applied. RFM segmentation is based solely on customer transactions, using the constructed metrics Recency, Frequency and Monetary. By extending the customer segmentation to include other entities such as demographics and web traffic, the company can uncover additional customer behaviors related to purchases, returns and churn, and find marketing strategies tailored to the specific target groups.

2.2 Company description

NA-KD was founded in 2015 and has been listed among the top 20 fastest growing companies in Europe (NA-KD; 2020). Today, the company consists of just over 250 employees and is headquartered in Gothenburg. The company has five locations with offices, warehouses and factories across Europe: in addition to Gothenburg, it can be found in Stockholm and Landskrona in Sweden, Krakow in Poland, and Istanbul in Turkey. The company is also globally represented by more than 600 retailers worldwide and delivers to more than 100 countries every month (NA-KD; n.d.).

The Business Intelligence department at the head office in Gothenburg is responsible for supporting data-driven decision making across the organisation. The department was established in 2019 and works with an overall view of the organisation and its departments, covering data collection, measurements, reports, analysis and insights, as well as tracking actions within the organisation. The Business Intelligence department at NA-KD also acts as product managers performing tests related to specific actions (Korkmaz; 2021). During the thesis work, the Business Intelligence team supervises the technical aspects of the study and provides detailed facts and insights about the company.

The Performance Marketing and Sales department of NA-KD takes care of various customer relationship management services. Focusing on the continuous development of marketing strategies, its main tools are email, web push, SMS notifications and referral programs. The latter is active in 5 different countries and contributes 1% of the new customers per year (Iyigun; 2021). The main emphasis is put on email marketing as an effort towards customer retention. The department works strategically with other departments such as Business Intelligence. The Performance Marketing and Sales department provides insight into the current marketing strategy and focus areas, and will draw relevant marketing strategies from the results of the thesis study.

2.3 Limitations

The analysis is based on customer transactions and website traffic, limited to customers who made at least one purchase during the investigated period and who have an account on the company's website. This is due to the lack of stored historical data for customers without accounts, since a customer needs to consent to the collection of data based on their purchasing behavior. In addition, the data is limited to real customers and orders, and hence does not include, for example, wholesalers that purchase significantly large amounts of goods. The investigated period runs from June 2018 until 1 January 2021, as no earlier data is available from Google Analytics.

2.4 The objectives

Considering the challenges faced by the online e-retailer NA-KD in customer acquisition, retention and churn, as well as the frequent returns and the lack of previously conducted customer segmentation, the objectives of this thesis study are

- to find the group of loyal customers;

- to create a model for predicting customer loyalty;

- to create a model for predicting whether a customer will make frequent returns;

- to create a model for the prediction of customer churn

where the classification of loyal customers, customers with frequent returns and churned customers is to be done with three separate classification models.

2.5 Project structure

The main goal of this thesis is to build and deliver a model that segments NA-KD customers into similar groups and predicts customer loyalty, churn and returns. The model should be based on both transactional data and the existing segmentation model RFM, but extended to include other entities that may be relevant in explaining customer behavior. The final model should provide a set of customer segments (clusters) such that customers within each cluster are similar to each other and different from customers in the other clusters. The exploratory data analysis, together with the information gathered from the clustering analysis and definitions from the company, is used to define the classes describing customer behavior in terms of loyalty, churn and return activities. The thesis work is split into the following phases:

- Literature review to investigate algorithms and methods that have been used in other studies to solve similar kinds of problems.

- Implementation of chosen clustering algorithm.

- Evaluation of results and further exploratory analysis of the detected customer segments to find purchase patterns within these segments.

- Implementation of suitable classification models.

- Model evaluation and analysis.

The modelling uses the web analytics service Google Analytics to access website traffic data, and the Google Cloud data warehouse BigQuery to access transactional data and perform data mining. Statistical analysis, including the initial exploratory analysis and modelling, is performed using the R programming language.


Chapter 3

Literature review

The subject of customer segmentation to discover information about customer behaviour is highly topical and is used by some of the largest companies in the world, such as Netflix, Google and Amazon. Below, different approaches to analyzing customer behavior, similar to the objectives of this thesis, are reviewed. Starting with the most basic and commonly used RFM segmentation method, the literature review moves on to more computationally expensive segmentation techniques, such as hybrid approaches combining machine learning algorithms. These are some of the methods previously used for customer segmentation and classification. The final methods used in this work are presented in the Methodology section in Chapter 4.

3.1 CLV estimation based on RFM analysis

Customer lifetime value (CLV) is used in a case study by Khajvand et al. (2011) for customer segmentation through two different approaches: an RFM analysis, and an extended RFM with an additional feature representing the total number of items purchased by a customer in addition to the number of orders. According to Khajvand et al. (2011), the RFM model is the simplest and most powerful model for studying customer behavior in the context of customer relationship management. The RFM model is defined by the metrics Recency, the time elapsed since the last purchase; Frequency, the total number of orders made during a specific time period; and Monetary, the total amount of money spent by the customer during the given time period. For a customer to be considered most loyal and a profit driver, Recency should be low, indicating a higher likelihood of repeated purchases; Frequency should be high, indicating greater loyalty to the company; and the Monetary value should be high, indicating the importance of the customer. Based on the RFM metrics, the authors used the K-means algorithm to cluster the customers into segments with similar values. The clustering analysis revealed no significant difference between the two approaches, RFM and extended RFM. The traditional RFM model was hence used for further analysis, where the CLV value was calculated for each cluster using the weighted RFM method (see Khajvand et al. (2011) for details). This method uses expert information from the sales department to determine which metrics carry the greatest weight and importance for the CLV. The CLV values were later used to assign CLV rankings to each segment, providing a final financial overview of the customer segments that can be used in future marketing strategies.
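The three RFM metrics defined above translate directly into code. A minimal sketch, computed from raw transactions; the records and the field layout below are invented for illustration and are not NA-KD's schema:

```python
from collections import defaultdict
from datetime import date

# Illustrative transaction log: (customer_id, order_date, order_value).
transactions = [
    ("a", date(2020, 11, 2), 40.0),
    ("a", date(2020, 12, 20), 55.0),
    ("b", date(2020, 6, 11), 90.0),
]
analysis_date = date(2021, 1, 1)

# Group the orders per customer.
orders = defaultdict(list)
for cust, day, value in transactions:
    orders[cust].append((day, value))

# Recency: days since last order; Frequency: order count;
# Monetary: total spend over the period.
rfm = {}
for cust, rows in orders.items():
    last_purchase = max(day for day, _ in rows)
    rfm[cust] = {
        "recency": (analysis_date - last_purchase).days,
        "frequency": len(rows),
        "monetary": sum(value for _, value in rows),
    }

print(rfm)
```

These per-customer triples are exactly the input that the K-means step of the case study clusters on.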


3.2 An extended RFM ranking by K-means clustering

RFM analysis is a well-known technique for evaluating customer value based on transactional data that reveals the customers' purchase behavior. Joy Christy et al. (2018) examined this technique and extended it by running three different machine-learning-based clustering algorithms, K-means, Fuzzy C-means and Repetitive Median based K-means, on the resulting RFM values. The aim of the study was to find the clustering method with the best results in terms of iterations, cluster compactness and execution time. The available dataset was the transaction data of customers from an online retail store over a study period of one year. The data consisted of eight features, including customer ID, product code, product name, price, and date and time of purchase. The RFM metrics were calculated from the given attributes and ranked in a specific order before being added to the final dataset. The Repetitive Median based K-means (RM K-means) algorithm performed best, with a silhouette width of 0.49 compared to 0.33 for K-means and 0.43 for Fuzzy C-means. The RM K-means algorithm also outperformed the other algorithms in terms of execution time, with 1.49 seconds compared to 2 seconds for K-means and 24 seconds for Fuzzy C-means, and in terms of the number of iterations, with 2 iterations compared to 4 and 193 for traditional K-means and the fuzzy variant, respectively.
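All three compared algorithms share the iterative assign/update loop of plain K-means. The sketch below shows that loop on toy 2D points (say, scaled Recency and Frequency values); it initializes centroids naively from the first and last points, a simplification of the random or K-means++ initialization real implementations use:

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    # Naive deterministic initialization: first and last points.
    centroids = [points[0], points[-1]] if k == 2 else points[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster mean.
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of customers in (recency, frequency) space.
pts = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (8.3, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

The compared variants differ mainly in how the update step summarizes a cluster (mean, fuzzy-weighted mean, or repetitive median), not in this overall loop.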

3.3 Customer segmentation on behavioural data

To gain insight into the customer behavior of an e-marketplace application for second-hand vintage clothing, Aziz (2017) investigated customer preferences in a master's thesis at Uppsala University. The study explored the research questions of whether the company's available data could be used to segment customers into groups with similar preferences, what size of segments might be reasonable, and whether the segments could be used to target customers. Aziz (2017) performed data pre-processing to create a ranking matrix based on users and brands available on the website. The dimensionality of the data matrix was reduced using Principal Component Analysis, and customers without sufficient activity were removed from the data. He then used the reduced data matrix to perform a clustering analysis based on the K-means algorithm and the cosine similarity measure. The appropriate number of clusters was chosen with the Silhouette and Elbow methods, resulting in a set of three clusters. However, with a Silhouette score of 0.32, this was still not an optimal clustering proposal (for an acceptable clustering proposal the Silhouette value should be larger than 0.5). In addition, a website was constructed with easily accessible visualization tools that the company could use for further analysis. The author highlights the importance of exploring multiple clustering algorithms to determine the most accurate model, as well as modifying the pre-processing phase of the analysis with different thresholds.
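The silhouette width used as the quality criterion above follows directly from its definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the nearest other cluster, and s = (b - a) / max(a, b); the overall score is the mean of s over all points. A minimal sketch on invented toy clusters:

```python
def euclid(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(clusters):
    """Mean silhouette width of a clustering, given as lists of points."""
    scores = []
    for i, cluster in enumerate(clusters):
        for n, p in enumerate(cluster):
            same = [euclid(p, q) for m, q in enumerate(cluster) if m != n]
            a = sum(same) / len(same) if same else 0.0
            # Mean distance to the nearest other cluster.
            b = min(
                sum(euclid(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
tight = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]
print(round(silhouette(tight), 2))
```

Against this scale, the 0.32 reported in the study sits well below the 0.5 rule of thumb, which is why the three-cluster proposal was judged suboptimal.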


3.4 Using SOM, K-means and the LRFM model for customer segmentation

Ait daoud (2015) extended the traditional RFM model for segmenting customer behavior by including the length of the customer's relationship with the company, L, and called the result the LRFM model. The LRFM model is used to perform an initial segmentation of the customers of a Moroccan online store, which is then refined by two clustering techniques: the Self-Organizing Map (SOM) method and the K-means clustering method. The unsupervised SOM method provided the best number of proposed clusters, which was then used in the K-means algorithm. The study resulted in nine different clusters, whose LRFM metrics were compared to examine the behavior of each customer segment. The customers with the highest LRFM values are treated as the most loyal customers: they contribute highly frequent purchases of high monetary value, have a long-term relationship with the company, and have a recency implying that they have recently been active with purchases.

3.5 Return prediction within e-retail

Al Imran and Amin (2020) examined the disproportionate number of return events occurring among online shoppers in comparison to traditional offline shoppers. Returns are one of many reasons for decreased profitability among e-retailers. There are several potential explanations for the high percentage of returns, such as flexible return policies, damages, delays and mismatched expectations, to name a few. In this study, Al Imran and Amin (2020) used a state-of-the-art (SOTA) predictive modelling approach to find the most accurate classification model among XGBoost, LightGBM, CatBoost and TabNet, with the traditional Decision Tree algorithm as a baseline. The available dataset included 12 features based on transactional data, among them a feature indicating whether an order was returned or not. The resulting model with the highest performance measure was the deep-learning-based TabNet algorithm. The analysis further investigated the most influential features of the data in explaining the return events, finding that order location and payment method have the greatest impact on returns, followed by promotional orders and shopping-cart orders that indicate whether multiple items or a single item were purchased.

3.6 Customer churn prediction

The challenge of customer churn in telecommunications has been investigated by Tsai and Lu (2009), who considered hybrid models that combine two different neural network techniques for churn prediction. Customer churn prediction is addressed by many data mining methods, with the aim of describing the data and predicting unknown or future values of features. The data mining techniques investigated in this study are the supervised classification method Back-Propagation Artificial Neural Networks (ANN) and the unsupervised clustering method Self-Organizing Maps (SOM). The study investigates the performance differences of two hybrid approaches combined serially: ANN followed by ANN, and SOM followed by ANN. The latter hybrid approach, which combines clustering and classification, makes it possible to preprocess the data and identify patterns within groups that can later be used to classify or predict future values; the training set used to construct the classification model is hence based on the clustering result. The authors also compared the hybrid models with the baseline model of a single ANN. The study found that the ANN+ANN combination outperformed both the baseline model and SOM+ANN in terms of prediction accuracy and Type I and II errors. The SOM+ANN model also outperformed the baseline model. However, when fuzzy testing data was introduced, SOM+ANN was outperformed by both ANN+ANN and the single-ANN baseline. This suggests that the ANN+ANN combination had greater performance and stability than the compared models.

3.7 Churn prediction through hybrid classification

Caigny et al. (2018) present a new classification algorithm for customer churn prediction based on the hybrid approach of logistic regression combined with decision trees. The new method, called the logit leaf model (LLM), is compared against ordinary decision trees (DT), logistic regression (LR), random forest (RF) and logistic model trees (LMT). The authors emphasize the two main performance areas of customer churn models: predictive performance and model comprehensibility. The trade-off between these performance measures is the main decision point when modelling with classification algorithms. The DT algorithm is more comprehensible, while LR has higher predictive performance; the LLM algorithm offers both high predictive performance and sufficient comprehensibility. The segmentation produced by the LLM algorithm has also proven valuable in churn research for investigating churn drivers within specific segments. The LLM algorithm uses decision trees to find subsets of customers, where each segment is represented by a terminal node of the tree, the leaf. A logistic regression based on forward-selected variables is then fitted to each customer segment separately, providing probabilities for each instance in the segments. The study, conducted on several datasets, compared performance measures such as the area under the receiver operating characteristic curve (AUC) and top decile lift (TDL), both derived from the confusion matrix, with the algorithms ranked so that lower average AUC and TDL ranks indicate better performance. The average AUC and TDL ranks of DT and LR were 4.857, 4.929 and 3.286, 3.357, respectively. Comparing the LLM classification algorithm with its building blocks DT and LR, the results of the study show that LLM performs better than both, with average AUC and TDL ranks equal to 1.786.
The authors showed that the LLM performed at least as well as the random forest procedure RF, but with greater tractability and actionability due to the resulting segments and their associated regression models.

3.8 Customer purchase behavior analysis through a two-staged approach

The analysis of customers' purchase behavior in online stores using a two-stage approach is conducted by Piskunova and Klochko (2020), who examined customers' purchases in a Ukrainian online store over a period of two years. The first stage of the approach consists of segmenting customers into similar groups using machine learning clustering techniques, followed by the second stage of constructing a classification algorithm that is used for continuously updating the customer segments as well as assigning segments to new clients. The purchasing activity of the customers was evaluated by the classic RFM model, extracting the recency, frequency, and monetary value of each customer. The RFM metrics were then used as the main features for the remaining study. The K-means algorithm was chosen as the clustering algorithm, and the Elbow and Silhouette methods were used to determine the number of clusters. Since these methods gave differing results of 3, 6, 8 and 2 clusters, the R package NbClust was used to test the number of clusters using 26 additional criteria, resulting in the proposal of 3 clusters, with 6 clusters as the next best option. By validating the results based on 3 or 6 segments against business insights, the final model was set to use 6 clusters. The classification algorithm was selected by comparing the accuracy of five different classification models; the final classification model was a Random Forest model with an accuracy of 0.99.

3.9 Comparison between K-prototypes and K-means

The K-means and K-prototypes clustering algorithms for mixed data sets are compared by Ruberts (2020). The author presents a study on already preprocessed and cleaned Google Analytics data from the online community Kaggle. By using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) as a comparison method, the groups in the data can be represented visually in two dimensions for the different clustering techniques. The UMAP embedding requires a Yeo-Johnson transformation of the numerical features and one-hot encoding of the categorical features, which are embedded separately and then combined by conditionally embedding the numerical features on the categorical ones. The result of the UMAP embedding is a scatterplot in which the data are visualized in two dimensions. The author starts by performing the K-means clustering method, which requires a numerical dataset. After one-hot encoding the categorical features and applying the Yeo-Johnson transformation, the now more Gaussian-like data are used to fit K-means with an initial number of 15 clusters based on the UMAP visualisation. The K-prototypes method uses the mixed data directly, applying the transformation to the numerical features and leaving the categorical features unprocessed. By colouring the UMAP scatterplot according to the two different clustering solutions, the differences among the groups become visible: the K-prototypes algorithm results in clearer boundaries between the groups as well as more evenly distributed groups. By building a LightGBM classification model on top of the clustering model, the author evaluates the distinctiveness of the clusters with the cross-validated F1 score and their informativeness using SHAP feature importance. The resulting cross-validated F1 scores for the K-means and K-prototypes methods were 0.986 and 0.97 respectively, indicating that the K-prototypes clusters are adequate and discriminative despite the slightly lower score. The SHAP values of the classifier reveal four dominant numerical features for the K-means method, while the K-prototypes method locates 10 important features out of a total of 14, with the categorical features being of higher importance. The study concludes that the clusters based on K-prototypes are more informative due to the higher importance of the categorical features, and hence should be preferred by marketers for customer segmentation.


Chapter 4

Methodology

This chapter contains a presentation of the dataset, followed by the data processing, a description of the features, the theory, and the model setup.

4.1 Data description

The present data are based on a proportion of the customer transactions and web traffic data during the investigated period from 2018-06-01 to 2021-01-01. Cleaning the data, to exclude irrelevant and incomplete features as well as duplicates and inappropriate formats, takes a great part of the pre-processing stage needed to obtain a proper dataset for analysis. The customers in this dataset are limited to users of NA-KD, that is, customers who have created an account on the company's website. The data are extracted from a database in the BigQuery data warehouse of Google Cloud, and the web traffic data from Google Analytics. This results in a data set with a sample size of 1.4 million customers. Due to confidentiality, some metrics will be encrypted and some will not be visualized at all in this report.

Figure 4.1: The sources used to create the final dataset in Google BigQuery.


4.2 Data mining

Data cleansing takes place in the Google BigQuery data warehouse by extracting relevant features, eliminating duplicates, and putting the data into the desired format. New features are extracted from the raw data using data mining techniques to explain customer behavior; return ratios and the total length of the customer's relationship with the company are some examples of newly constructed variables. The final sample data consist of 1,355,533 customers and 25 associated variables that potentially explain customer purchasing behavior. Figure 4.1 shows the process applied in the data warehouse to create the final dataset in Google BigQuery. The features of the dataset are presented and briefly explained in Table 4.1.
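As an illustration of the kind of feature construction described above, the sketch below derives relationLength, Recency and returnRate from transaction-level rows with pandas. The actual work was done in BigQuery; the column names and toy data here are invented, not the NA-KD schema.

```python
# Hypothetical sketch of the customer-level feature construction;
# column names (customer, order_date, quantity, returned) are invented.
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "order_date": pd.to_datetime(
        ["2019-01-01", "2019-06-01", "2020-12-01", "2018-07-15", "2018-08-01"]),
    "quantity": [2, 1, 3, 4, 1],   # items per order
    "returned": [1, 0, 0, 2, 0],   # items returned per order
})

end_of_period = pd.Timestamp("2021-01-01")   # end of the investigated period

g = orders.groupby("customer")["order_date"]
features = pd.DataFrame({
    "relationLength": (g.max() - g.min()).dt.days,   # days between first and last order
    "Recency": (end_of_period - g.max()).dt.days,    # days since the last order
    "totalOrders": g.count(),
})
qty = orders.groupby("customer")[["quantity", "returned"]].sum()
features["salesQuantity"] = qty["quantity"]
features["returnQuantity"] = qty["returned"]
features["returnRate"] = features["returnQuantity"] / features["salesQuantity"]
print(features)
```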

To reduce the number of distinct values of the categorical variables in the data, subgroups were constructed based on the most frequently occurring values. Since the company ships to over 100 countries, it is necessary to divide the countries into smaller subgroups. Based on the customer base during the investigated period, the top five countries, accounting for 87% of the customers, were retained. The remaining countries were grouped together, resulting in a country variable with six levels: A, B, C, D, E and F.

The categorical variable describing the most commonly used payment method is grouped into KLARNA, Other and Both, which represent whether the customer uses the KLARNA payment method, any other method, or both KLARNA and another method equally often. This division of payment methods is based on the possibility of invoice purchases through KLARNA, which clearly separates this payment method from the others.

To examine the length of the relationship between the customer and the company, the date of the first order on the company's website is extracted and compared to the date of the most recent order. This is the only metric that contains information from beyond the investigated period.

NA-KD uses 95 different delivery methods with different performance characteristics. Some, called premium, deliver the packages directly to the customers' homes, and others, called express, deliver the packages faster than the rest; the third delivery option is called standard. Using only these three subsets risks excluding some important information, so the variable deliveryMode is included to give further detail on the delivery method. This feature describes whether the package has been delivered through a mailbox, directly to the customer's home or office, or via traditional parcel shops where the customers pick up their items themselves. The feature is relevant due to the differences in defining premium and standard shipment delivery among the countries, where e.g. home delivery can be defined as standard delivery in one country while being a premium option in another. The metric is coded as A, B, C and D.

In Table 4.1, various e-commerce-specific terms are used to explain the metrics. A more detailed explanation of these terms can be found in Table 4.2. The channels, divided into first, last and only channel fractions, are also split into lower, middle and upper funnels. This division is based on the reach of the channel, that is, the share of traffic generated to the website. The upper funnel reaches customers who have no incentive to buy anything; examples of such channels would be Facebook ads. The middle funnel reaches customers who have an incentive to buy an item, but no incentive for it to be from the specific company NA-KD. Examples of these channels could be Google Shopping, where the customer can search for a specific item and be directed to the company's website via Google. The upper and middle funnels are both generated through paid advertisements. The lower funnel reaches customers who have an incentive to purchase an item from a particular company. These customers are reached through channels in a natural way, and these are thus called "organic" channels. They are created through unpaid advertising and can arise through word of mouth that brings the company name to the customer's attention. Customers can then reach the website either directly through the correct web address or by searching for the company through various search engines.

Variable name            Description
hashedEmail              Encrypted customer emails
country                  The country related to the delivery location
mostFreqPayMethod        Most frequently used payment method
relationLength           The number of days between the first and last purchase
Recency                  The number of days between the end of the investigated period and the last purchase date
totalOrders              The total number of orders during the investigated period
salesQuantity            The total number of items purchased
returnQuantity           The total number of items returned
returnRate               The proportion of returned items
netRevenue               The revenue of purchased items minus the value of returned items; the total net revenue of the customer
mostFreqDeliveryMethod   The most frequently used delivery method
deliveryMode             The mode of delivery
mostFreqReturnReason     The most frequently stated return reason
discountedSalesRatio     The proportion of items purchased on discount, by voucher code or on sale
avgReturnTime            The average time between a purchase and the associated return
mostFreqFirstChannel     The most frequently used channel to first visit the website
mostFreqLastChannel      The most frequently used channel to make a purchase
mostFreqOnlyChannel      If a customer uses only one channel to enter and make a purchase, this channel is referred to as the only channel; this metric presents the most frequently used only channel
mostFreqDevice           The most frequently used device by the customer, desktop or mobile
mostFreqItem1            The most frequently purchased item category
mostFreqItem2            The second most frequently purchased item category
mostFreqItem3            The third most frequently purchased item category
pageviewPerSession       The average number of pageviews per session
conversionRate           The total number of orders divided by the total number of sessions
oneSizeItems             The number of items purchased with only one size per item
multiSizeItems           The number of items purchased with several sizes of the same item

Table 4.1: Features in the dataset with associated description.


Term          Explanation
Session       A session is a collection of actions performed on the website during a specified period of time, for NA-KD set to 30 minutes. A single session can include multiple interactions such as pageviews and transactions. The duration of a session can be determined both by measuring time in seconds and by the number of pageviews during the session.
Pageviews     A pageview is the instance of a new page loaded in a browser. The total number of pages visited, the pageviews, increases every time a customer enters a new page on the website, where each returning page is counted.
Conversion    Conversion describes the process when a website visitor completes the action of making a purchase. The customer-level conversion rate used in this thesis represents the percentage of sessions that end in a purchase.
Voucher Code  A voucher code is used at checkout to receive a discount specified by the company, often only valid for a certain period of time.
Channel       Sales channels describe the ways the company enters the market to increase sales. At the most detailed level, NA-KD's channels consist of 24 different segments. In this study, the channels are divided into three subgroups: lower, middle and upper funnel channels.

Table 4.2: Commonly used terms and their associated explanation.

4.3 Theory

This section includes theory on nonparametric statistics and on machine learning methods such as dimensionality reduction, clustering and classification.

4.3.1 Non-parametric Statistics

From the exploratory data analysis in Chapter 5, one can conclude that the data do not follow the normal distribution; hence nonparametric statistical tests need to be used to examine the significant effects within the data. The theory of the nonparametric tests used, the Kruskal-Wallis H-test and the Wilcoxon rank sum test, is briefly explained in the following section. For further reading, see Montgomery (2017) and Corder and Foreman (2014).

Kruskal-Wallis H-Test Statistic

When comparing the means of more than two independent samples, the nonparametric Kruskal-Wallis H-test is a good fit; it is the nonparametric counterpart of the parametric one-way ANOVA. The Kruskal-Wallis H-test compares the sample medians θ_i under the null hypothesis

H_0 : θ_1 = θ_2 = · · · = θ_k,

where k ≥ 2. In symmetric distributions, the mean and the median coincide, so under the symmetry assumption comparing medians is the same as comparing means.

A significant result of the Kruskal-Wallis H-test states that at least one of the population means differs from the others, but not where the difference occurs. To conduct the test, we first rank the observations y_ij in ascending order and replace each observation with its rank R_ij, where the smallest observation receives rank 1. In the case of ties (observations with the same value), the average rank is assigned to each of the tied observations. The Kruskal-Wallis H-test statistic is given by

H = (1/S²) [ Σ_{i=1}^{k} R_{i·}²/n_i − N(N+1)²/4 ],   (4.1)

where N is the total number of observations, n_i denotes the number of observations in the i-th treatment, R_{i·} is the sum of the ranks in the i-th treatment, and S² is the variance of the ranks, expressed as

S² = (1/(N−1)) [ Σ_{i=1}^{k} Σ_{j=1}^{n_i} R_{ij}² − N(N+1)²/4 ].

In case of no ties, or when the number of ties is moderate, the test statistic H takes the simpler form

H = (12/(N(N+1))) Σ_{i=1}^{k} R_{i·}²/n_i − 3(N+1),   (4.2)

which is obtained by substituting S² = N(N+1)/12, the variance of the ranks when there are no ties, into (4.1). When n_i ≥ 5, under the null hypothesis, the test statistic H approximately follows the Chi-square distribution with k − 1 degrees of freedom, and hence the null hypothesis is rejected if H > χ²_{α, k−1}.

If there are ties in the ranking of the values, a correction must be made: the H-statistic in (4.2) is divided by the correction factor

C_H = 1 − Σ (T³ − T) / (N³ − N),

where T represents the number of values in each set of ties and N is the total number of values from all samples.
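The computation above can be cross-checked against an off-the-shelf implementation. The sketch below, on invented continuous data (so no ties occur), compares scipy.stats.kruskal, which applies the tie correction automatically, with a direct evaluation of formula (4.2):

```python
# Illustrative Kruskal-Wallis H-test on invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.exponential(1.0, 50)
g2 = rng.exponential(1.5, 50)
g3 = rng.exponential(1.0, 50)

H, p = stats.kruskal(g1, g2, g3)

# Manual evaluation of (4.2); continuous data, so no tie correction is needed.
ranks = stats.rankdata(np.concatenate([g1, g2, g3]))
N = len(ranks)
groups = [ranks[:50], ranks[50:100], ranks[100:]]
H_manual = 12 / (N * (N + 1)) * sum(r.sum() ** 2 / len(r) for r in groups) - 3 * (N + 1)
print(H, H_manual)
```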

The Wilcoxon-Mann-Whitney test

The Mann-Whitney U-test, also known as the Wilcoxon-Mann-Whitney test or Wilcoxon rank sum test, is a nonparametric statistical test for comparing the means of two independent continuous populations X_1 and X_2 that are assumed not to be normally distributed. However, the distributions of the populations X_1 and X_2 can be assumed to be continuous and to have the same shape and variance, differing only (possibly) in their locations.


Formally, the Wilcoxon rank sum test is used to test the null hypothesis H_0 : μ_1 = μ_2. The corresponding parametric test is the two-sample independent t-test.

Assume that two independent samples x_1 and x_2 with sizes n_1 and n_2, n_1 ≤ n_2, have been drawn from the populations X_1 and X_2. The Mann-Whitney U-test pools the values of the two samples and ranks them in ascending order to determine whether the values of the two samples are randomly mixed in the ranking or clustered at opposite ends. If two or more observations are tied (identical), the mean of the ranks that would have been assigned had the observations differed is used. If the two samples do not differ, the rank order will be random, while a clustering of the values of one sample at one end represents a difference between the two samples.

Let U_1 be the sum of the ranks in the smaller sample, and U_2 the sum of the ranks in the larger one. Then

U_2 = (n_1 + n_2)(n_1 + n_2 + 1)/2 − U_1.

When the sample means do not differ, we expect the sums of the ranks for the two samples to be nearly equal after adjusting for the difference in sample size. Consequently, if the sums of the ranks differ greatly, we conclude that the means are not equal.

The U statistic is examined for significance by comparison with the values in a table of critical values. When the sample sizes exceed those in the table, a large-sample approximation may be used (see details in Montgomery and Runger (2018)): when n_1 and n_2 are moderately large, say more than eight, the distribution of U_1 can be well approximated by the normal distribution with mean

μ_{U_1} = n_1(n_1 + n_2 + 1)/2

and variance

σ²_{U_1} = n_1 n_2 (n_1 + n_2 + 1)/12.

Hence

z* = (U_1 − μ_{U_1}) / σ_{U_1}

can be used as a test statistic for the Wilcoxon rank sum test, and the appropriate critical region is |z*| > z_{α/2}, z* > z_α, or z* < −z_α, depending on whether the test is two-tailed, upper-tailed, or lower-tailed.
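The large-sample test can be computed directly from the formulas above; scipy.stats.ranksums implements the same normal approximation and serves as a cross-check. The sample data below are invented.

```python
# Wilcoxon rank sum test: manual large-sample z statistic vs scipy.stats.ranksums.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 30)   # smaller sample, n1 <= n2
x2 = rng.normal(0.5, 1.0, 40)

n1, n2 = len(x1), len(x2)
ranks = stats.rankdata(np.concatenate([x1, x2]))
U1 = ranks[:n1].sum()                              # rank sum of the smaller sample
mu = n1 * (n1 + n2 + 1) / 2                        # mean of U1 under H0
sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)      # std of U1 under H0
z_star = (U1 - mu) / sigma

z_scipy, p_scipy = stats.ranksums(x1, x2)
print(z_star, z_scipy)
```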

4.3.2 Dimensionality Reduction

The data used in the study contain more than 20 features, for which the well-known method of Principal Component Analysis (PCA) could possibly provide dimension reduction. To analyse mixed data containing both numerical and categorical variables, such as the data used in this study, an extended version of standard multivariate principal component analysis, PCA Mix, from the R package PCAmixdata can be used. The PCA Mix algorithm is a fusion of ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA), applied to the numerical and categorical variables, respectively (Chavent et al.; 2017).


Multiple Correspondence Analysis is used to analyze the n × p qualitative data matrix X, where n denotes the number of observations and p the number of categorical variables. Each categorical feature has m_j levels, and the total number of levels over all features is denoted by m. Coding each level as a binary variable yields the indicator matrix G of size n × m. The observations are all weighted by 1/n, giving the row-weight matrix N = (1/n) I_n, where I_n is the identity matrix of size n, while the m levels of the categorical features are weighted by n/n_s, where n_s is the total number of observations belonging to the s-th level. The metric M, which defines the distance between two observations and gives greater weight (importance) to rare levels, is

M = diag(n/n_s, s = 1, ..., m).

The centered version of G is denoted by Z, with total inertia m − p. From the generalized singular value decomposition of Z, the factor coordinates of the levels are obtained as

A* = M V Λ = M A,

where V is the matrix of eigenvectors, Λ is the diagonal matrix of the singular values √λ_i, and A = V Λ is the factor loadings matrix of standard PCA.

MCA satisfies property (4.3), where each element a*_{si} of A* is the mean of the standardized factor loadings of the observations belonging to level s:

a*_{si} = (n/n_s) a_{si} = (n/n_s) z_s^T N u_i = ū_i^s,   (4.3)

where z_s denotes the s-th column of Z, u_i = f_i/√λ_i is the i-th standardized principal component, and ū_i^s is the mean of u_i over the observations in level s. The eigenvalue λ_i equals the sum of the correlation ratios η²(f_i | x_j), each measuring the proportion of the variance of the i-th principal component f_i explained by the categorical feature x_j:

λ_i = ‖a_i‖²_M = ‖a*_i‖²_{M⁻¹} = Σ_{j=1}^{p} η²(f_i | x_j).

The PCA Mix algorithm for mixed data takes n observations described by p_1 quantitative variables and p_2 qualitative variables, arranged in the n × p_1 quantitative matrix X_1 and the n × p_2 qualitative matrix X_2, where the total number of levels of the p_2 qualitative variables is denoted by m.

The algorithm consists of three steps. The first step is the pre-processing of the numerical data matrix X_1 and of the indicator matrix of the qualitative data X_2; the second step is the factor coordinates processing; and the third step is the squared loading processing, where the resulting loadings are the squared correlations for the quantitative variables and the correlation ratios for the qualitative variables.

Step 1: pre-processing

i. Build the real matrix Z = [Z_1, Z_2] of dimension n × (p_1 + m), where Z_1 is the standardized version of X_1 and Z_2 is the centered indicator matrix G of X_2, following the same procedures as in standard PCA and standard MCA, respectively.

ii. Build the diagonal matrix N of the weights of the rows of Z. The weight 1/n is usually applied to the rows of Z, so that N = (1/n) I_n.

iii. Build the diagonal matrix M of the weights of the columns of Z, where the first p_1 numerical columns are weighted by 1, as in standard PCA, and the last m columns of categorical levels are weighted by n/n_s, where n_s, s = 1, ..., m, is the number of observations in the s-th level, as in standard MCA.

The resulting matrix

M = diag(1, ..., 1, n/n_1, ..., n/n_m)

indicates that the distance between two rows of Z is a mixture of the distance measure used in standard PCA, the Euclidean distance, and the weighted χ² distance used in standard MCA.

Step 2: factor coordinates processing

i. The generalized singular value decomposition of Z with the metrics N and M gives the decomposition

Z = U Λ V^T.

The rank of Z is denoted by r.

ii. The factor coordinate matrix of dimension n × r is defined as in (4.4) and can be computed directly from the GSVD as in (4.5):

F = Z M V,   (4.4)

F = U Λ.   (4.5)

iii. The matrix of the factor coordinates of the p_1 quantitative variables and the m levels of the qualitative variables is

A* = M V Λ,

where A* is split into A*_1, containing the factor coordinates of the numerical variables, and A*_2, containing the factor coordinates of the m levels of the categorical variables.

Step 3: squared loading processing

The contributions of the variables to the variances of the principal components are defined as the squared loadings. Since Var(f_i) = λ_i and λ_i = ‖a_i‖²_M = ‖a*_i‖²_{M⁻¹}, the contributions can be calculated directly from A. Formula (4.6) presents the contribution c_ji of the variable x_j to the variance of the principal component f_i:

c_ji = a²_ji = a*²_ji,   if the variable x_j is numerical,
c_ji = Σ_{s ∈ I_j} (n/n_s) a²_si = Σ_{s ∈ I_j} (n_s/n) a*²_si,   if the variable x_j is categorical,   (4.6)

where I_j is the set of levels of the qualitative variable x_j.
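Steps 1 and 2 can be sketched in a few lines of numpy: the GSVD of Z with metrics N and M reduces to an ordinary SVD of N^{1/2} Z M^{1/2}. This is a minimal sketch on an invented toy dataset; the actual analysis used the R package PCAmixdata. As a sanity check, the eigenvalues should sum to the total inertia p_1 + m − p_2.

```python
# Minimal numpy sketch of PCA Mix (Steps 1-2) on invented data.
import numpy as np

rng = np.random.default_rng(2)
n, p1 = 100, 2
X1 = rng.normal(size=(n, p1))                # quantitative block
X2 = rng.choice(["u", "v", "w"], size=n)     # one qualitative variable (p2 = 1)

# Step 1: build Z = [Z1, Z2] and the weight matrices N and M (stored as diagonals)
Z1 = (X1 - X1.mean(0)) / X1.std(0)           # standardized numerical block
levels, codes = np.unique(X2, return_inverse=True)
G = np.eye(len(levels))[codes]               # indicator matrix of the levels
ns = G.sum(0)                                # level counts n_s
Z2 = G - ns / n                              # centered indicator matrix
Z = np.hstack([Z1, Z2])
Nw = np.full(n, 1.0 / n)                     # row weights, N = (1/n) I_n
Mw = np.concatenate([np.ones(p1), n / ns])   # column weights, M = diag(1,...,n/n_s)

# Step 2: GSVD of Z with metrics N and M via an ordinary SVD of N^(1/2) Z M^(1/2)
Zt = np.sqrt(Nw)[:, None] * Z * np.sqrt(Mw)
P, lam, QT = np.linalg.svd(Zt, full_matrices=False)
F = (P / np.sqrt(Nw)[:, None]) * lam         # factor coordinates F = U Lambda

eigenvalues = lam ** 2                       # variances of the principal components
print(eigenvalues[:3])
```

The total inertia check: the two standardized numerical columns contribute 1 each, and the three-level categorical variable contributes m − p_2 = 3 − 1 = 2, so the eigenvalues sum to 4.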


4.3.3 Clustering techniques

The clustering techniques used in this study are the K-means clustering algorithm for numerical data and the K-prototypes algorithm for mixed data sets. A brief explanation of the algorithms is given below; for more details, see Gan (2011) and Huang (1998).

The K-means algorithm

The K-means algorithm takes a numerical data set X = {x_0, x_1, ..., x_{n−1}} of n records and an integer k in {1, 2, ..., n} representing the number of clusters. The K-means algorithm partitions the dataset into the k clusters, denoted C_0, C_1, ..., C_{k−1}, by minimizing the objective function

E = Σ_{i=0}^{k−1} Σ_{x ∈ C_i} D(x, μ_i),   (4.7)

where D(·,·) denotes the distance measure and μ_i, the mean of cluster C_i, is

μ_i = (1/|C_i|) Σ_{x ∈ C_i} x.

Equation (4.7) can be rewritten as

E = Σ_{i=0}^{n−1} D(x_i, μ_{γ_i}),   (4.8)

where γ_i denotes the cluster membership of x_i, equal to j if the observation x_i belongs to cluster C_j. The K-means algorithm minimizes the objective function by an iterative process in which the first k records of X are set as initial cluster centers. Based on the initial cluster centers μ_0^(0), μ_1^(0), ..., μ_{k−1}^(0), the cluster memberships γ_i^(0) are updated by

γ_i^(0) = arg min_{0 ≤ j ≤ k−1} D(x_i, μ_j^(0)),   i = 0, ..., n−1,   (4.9)

that is, γ_i^(0) is set to the index of the cluster whose center has the smallest distance to x_i. The K-means algorithm then updates the cluster centers based on the cluster memberships:

μ_j^(1) = (1/|{i : γ_i^(0) = j}|) Σ_{i : γ_i^(0) = j} x_i,   j = 0, 1, ..., k−1.   (4.10)

These two steps, updating cluster memberships and cluster centers, are repeated until either no change in cluster memberships occurs or the maximum number of iterations is reached.
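The iteration described above can be sketched directly in numpy. The toy data below are invented and arranged so that the first two records, which serve as the initial centers as in the description, fall in different clusters:

```python
# Direct numpy sketch of the K-means iteration (equations 4.9-4.10).
import numpy as np

def kmeans(X, k, max_iter=100):
    centers = X[:k].copy()                   # first k records as initial centers
    labels = None
    for _ in range(max_iter):
        # membership update: nearest center in squared Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        new_labels = d.argmin(1)
        if labels is not None and (new_labels == labels).all():
            break                            # no change in memberships: converged
        labels = new_labels
        # center update: mean of the assigned records
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.3, (50, 2))
blob_b = rng.normal(3.0, 0.3, (50, 2))
X = np.vstack([blob_a[:1], blob_b[:1], blob_a[1:], blob_b[1:]])
labels, centers = kmeans(X, 2)
print(centers)
```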


The K-prototypes algorithm

To cluster a dataset consisting of a mixture of quantitative and qualitative variables, the K-prototypes algorithm of Huang (1998) can be used. The algorithm takes a mixed-type dataset X = {x_0, x_1, ..., x_{n−1}} with n observations and d attributes. The first p variables are assumed to be quantitative (numeric), whilst the remaining d − p are assumed to be qualitative (categorical). The distance between two observations x and y in the dataset X is defined as

D_mix(x, y, λ) = Σ_{h=0}^{p−1} (x_h − y_h)² + λ Σ_{h=p}^{d−1} δ(x_h, y_h),   (4.11)

where λ is a balancing weight used to avoid putting a heavier weight on either type of attribute, x_h and y_h are the h-th components of x and y respectively, and δ(·,·) is the simple matching function

δ(x_h, y_h) = 0 if x_h = y_h, and 1 if x_h ≠ y_h.

The K-prototypes algorithm minimizes the objective function

P_λ = Σ_{j=0}^{k−1} Σ_{x ∈ C_j} D_mix(x, μ_j, λ),   (4.12)

where D_mix(·,·,λ) is defined in (4.11), k denotes the number of clusters, C_j is the j-th cluster, and μ_j denotes the center, or prototype, of the j-th cluster.

The K-prototypes algorithm iterates to minimize the objective function in (4.12) until a stopping condition is reached. The algorithm initializes the k cluster centers μ_0^(0), μ_1^(0), ..., μ_{k−1}^(0) randomly from the dataset. The cluster memberships γ_0, γ_1, ..., γ_{n−1} are then updated by

γ_i^(0) = arg min_{0 ≤ j ≤ k−1} D_mix(x_i, μ_j^(0), λ).   (4.13)

When the cluster memberships have been updated by (4.13), the K-prototypes algorithm continues by updating the prototypes of the clusters using

μ_jh^(1) = (1/|C_j|) Σ_{x ∈ C_j} x_h,   for h = 0, 1, ..., p−1,

and

μ_jh^(1) = mode_h(C_j),   for h = p, p+1, ..., d−1,

where mode_h(C_j) is the most common categorical value of the h-th variable in cluster C_j, and

C_j = {x_i ∈ X : γ_i^(0) = j}.

Let A_{h0}, A_{h1}, ..., A_{h,m_h−1} be the distinct values that the h-th attribute can take, where m_h is the total number of such values, and define the number of records in cluster C_j taking value A_{ht} as

f_ht(C_j) = |{x ∈ C_j : x_h = A_{ht}}|,   t = 0, 1, ..., m_h − 1.   (4.14)

Then mode_h(C_j) in (4.15) is obtained as

mode_h(C_j) = arg max_{0 ≤ t ≤ m_h−1} f_ht(C_j),   h = p, p+1, ..., d−1.   (4.15)

These steps are repeated by the K-prototypes algorithm until either the maximum number of iterations is reached or no further changes in cluster memberships occur.

4.3.4 Classification

To perform binary classification within the data, the logistic regression model with bias reduction is used to avoid overfitting problems. The following is a brief explanation of the techniques; more information can be found in Ratner (2012), Kosmidis and Firth (2010), and Firth (1993).

Logistic regression model

The logistic regression model (LRM) classifies individuals into two distinct classes, for example a buyer and a non-buyer. In the LRM, we assume that the response variable Y_i is a Bernoulli random variable taking the value 1 with probability π_i and 0 with probability 1 − π_i. Based on the independent variables X_1, X_2, ..., X_p of each individual, the logistic regression model classifies the individual into one of the two classes through the logit response function

E(Y) = exp(β_0 + β_1 X_1 + ... + β_p X_p) / (1 + exp(β_0 + β_1 X_1 + ... + β_p X_p)),

where β_i, i = 0, 1, ..., p, are the logistic regression coefficients. In the LRM, we thus assume that E(Y) is related to X_1, X_2, ..., X_p by the logit function. It is easy to show that

π_i / (1 − π_i) = E(Y) / (1 − E(Y)) = exp(β_0 + β_1 X_1 + ... + β_p X_p).

The above quantity is called the odds, which has a straightforward interpretation: if the odds are 2 at a particular value of the regressor x, a success is twice as likely as a failure at that value.

Parameter estimation and bias reduction in logistic regression The logisticmodel when yi ∼ Bernoull(πi), can be fitted by estimating the parameters using themaximum likelihood method. The first step is to construct the likelihood function whichis a function of the unknown parameters. Let l(β) be the log-likelihood function for thevector of parameters β = (β1, β2, ..., βp) of the model. An MLE estimate of β is obtainedby solving

S(β) = ∇βl(β) = 0,

with respect to β, provided that the observed information matrix I(β) = −∇β∇βT l(β) is positive definite when evaluated at β, where S(β) is the score function. For logistic regression, the score equations ∇βl(β) = 0 are non-linear in the parameters β and have no closed-form solution, so a numerical method such as Newton-Raphson is needed to obtain the solution.

Under regularity conditions, the maximum likelihood estimator has a bias of asymptotic order O(n−1), which vanishes as the sample size n → ∞. Reducing the bias of the estimated parameters by using an adjusted score function was introduced by Firth (1993), where the general form of the adjusted score equation is as follows:

S∗(β) = S(β) + A(β) = 0,

where A(β) is the overall bias-reducing adjustment, which can depend on the data and is Op(1) as n → ∞.

Firth (1993) presents two different adjustments, A(E) and A(O), based on the expected and the observed information matrix, respectively. The bias-reducing adjustment A(E) is

A(E)t(β) = (1/2) tr[F(β)−1{Pt(β) + Qt(β)}], (4.16)

for t = 1, ..., q. The adjustment A(O) is given by

A(O)(β) = I(β)F(β)−1A(E)(β), (4.17)

where F(β) = Eβ I(β) is the expected information matrix and Pt(β) and Qt(β) are the higher-order joint null moments of log-likelihood derivatives, which are given by

Pt(β) = Eβ{S(β)S(β)T St(β)},

Qt(β) = Eβ{−I(β)St(β)}.
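For the canonical logit link, Firth's adjustment has a well-known working form in which each response yi is replaced by yi + hi(1/2 − πi), with hi the leverage of observation i. The following one-parameter sketch (no intercept; a hypothetical illustration, not the thesis code) shows how this keeps the estimate finite even under complete separation, where the ordinary MLE diverges:

```python
import math

def fit_logistic_firth_1d(xs, ys, iters=200):
    """Firth-type bias-reduced fit for a one-parameter logistic model
    (no intercept). The adjusted score replaces y_i by
    y_i + h_i * (1/2 - pi_i), where h_i is the leverage of the
    weighted one-column design. Illustrative sketch only."""
    beta = 0.0
    for _ in range(iters):
        pis = [1.0 / (1.0 + math.exp(-beta * x)) for x in xs]
        ws = [p * (1.0 - p) for p in pis]
        info = sum(w * x * x for w, x in zip(ws, xs))   # Fisher information
        # Leverages: for a single column, h_i = w_i x_i^2 / sum_j w_j x_j^2.
        hs = [w * x * x / info for w, x in zip(ws, xs)]
        adj_score = sum((y - p + h * (0.5 - p)) * x
                        for x, y, p, h in zip(xs, ys, pis, hs))
        beta += adj_score / info   # Newton-type step on the adjusted score
    return beta
```

With the separated data used in the test below (y = 1 exactly when x > 0), the unadjusted score equation has no finite root, while the adjusted score yields a moderate finite estimate.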

4.4 Model setup

The model used in this thesis is a two-step approach of clustering and classification. First, a clustering model is constructed to divide the data into different customer segments that explain customer behavior. Then a classification model is constructed to classify the customers into churners, returners and loyal customers based on the exploratory data analysis and the previous insights and assumptions of the company. The final delivery to the company is a list of customers belonging to each customer segment with the associated probability of becoming a loyal customer, frequent returner and churner. The results of the study will be used by the company to find marketing strategies that target the different customer segments. The model setup is shown in Figure 4.2; due to time limitations, the A/B testing phase could not be carried out during this study period. The theory of A/B testing can be found in Appendix A.

Figure 4.2: A flowchart representing the model setup of the project.

Chapter 5

Data Analysis and Modelling

In this chapter, we present an exploratory data analysis, followed by the modelling procedure.

5.1 Exploratory Data Analysis

Exploratory data analysis involves examining the structure of the data. Figure 5.1 represents the pie-chart of the market share at country level, divided into the coded countries A-F, where countries A, D and F belong to the Scandinavian countries. The remaining countries B, C and E belong to the rest of the world, where one of them represents data from merged countries. It can be seen that the majority of the customers derive from country B, accounting for 41% of the examined customer base.

Figure 5.1: Market share of customers on country level.

The most common payment method used by customers during the investigated period 2018-06-01 to 2021-01-01 is KLARNA, with 79%. Only 18% of customers use another payment method, and a small percentage of 3% use KLARNA and another payment method equally often. This is shown in Figure 5.2.

Figure 5.2: Most frequently used payment methods.

Figure 5.3: Most frequently used devices.

The pie-chart in Figure 5.3, which explains the usage of different devices, shows a large majority of mobile users, accounting for 66% of the customer base. The remaining 34% use either a desktop device or a tablet, which are assumed to have a significantly larger screen size than the mobile device. A cross tabulation between the return rate and the different device types is shown in Table 5.1, and the statistical chi-squared test for independence is presented in Table 5.2. The chi-squared value of 7342 with three degrees of freedom and the associated p-value of 0.00, below the significance level of 0.05, indicates a significant dependence between the most frequently used device and the return rate. This can be seen in Table 5.1, where customers with mobile devices appear to have a higher percentage of returns than desktop users.
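The chi-squared statistic reported in Table 5.2 is computed from observed and expected cell counts. A plain-Python sketch of the calculation (with small made-up counts, since the thesis tables report row percentages rather than raw counts) is:

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    r x c contingency table of observed counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

The statistic is then compared with the chi-squared distribution on (r − 1)(c − 1) degrees of freedom to obtain the p-value.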

The percentage of customers using the first, last, and middle channels is shown in Figure 5.4. For the first visit to the website, the lower funnel channel accounts for the majority usage by the customers with 56%, followed by the upper funnel channels with 19%. 12% use a middle funnel channel to visit the company for the first time. This pattern holds true for all ordered channels (first, last, and only), as seen in Figure 5.4, which means that the majority of customers enter the company's website directly through unpaid marketing.

Figure 5.4: Percentage of lower, middle and upper funnels used by the customers as: the first channel (top panel), the last channel (middle panel) and the only channel (bottom panel).

Table 5.1: Cross tabulation between the most frequently used devices and the return rates with intervals of 0-25%, 25-50%, 50-75% and 75-100%.

                     Return rate
Device     [0-25%]   (25-50%]   (50-75%]   (75-100%]
Desktop      18.69      31.24      21.80       28.27
Mobile       14.66      28.35      22.95       34.04

Table 5.2: Chi-squared test of independence between return rate and type of used device.

chi-squared value   7342
p-value             0.00

The most frequently purchased first item category by the customers is presented in the top panel of Figure 5.5, where the item categories are coded as A-R. The majority of customers purchase item category R, followed by P and E. The second and third most purchased item categories are shown in the middle and bottom panels of Figure 5.5, which follow the same pattern as the top panel: the same item categories still represent the top 3 most purchased item categories during the investigated period. The second and third most frequently purchased item categories consist mostly of NA's, indicating that no item categories other than the ones stated in the top panel have been purchased during the investigated period.

A cross tabulation dependence analysis between the most frequently purchased item category and the return rate is presented in Tables 5.3 and 5.4. These tables show the result of the cross tabulation analysis and the chi-squared dependence test. The p-value of 0.0 in Table 5.4 indicates that there is a significant relationship between the most frequently purchased item category and the return rate. A closer look at Table 5.3 reveals that some item categories have a lower return rate of 25-50% on average, while some other item categories, highlighted by boldface, have return rates of 75-100%.

Figure 5.6 explains the percentage of the most commonly used shipping method, where standard shipping accounts for the majority of shipping methods chosen at 93.9%. Premium and express shipping account for 4.5% and 1.6%, respectively. The cross-tabulation analysis between the most frequently used shipping method and the return rate is shown in Table 5.5, with the corresponding result of the chi-squared analysis of independence in Table 5.6. The highest return rate, between 75-100%, is found among customers using the premium or standard delivery method. The most common return rate for express delivery is between 25-50%. Based on the chi-squared independence test, there is a significant difference in the return rate depending on the delivery method.

The delivery mode is shown in Figure 5.7, where 70.4% of customers most frequently use delivery mode A and 24.2% most frequently use delivery mode D. Delivery modes B and C are used by only 5.2% and 0.1% of the customers, respectively. The cross-tabulation analysis is presented in Table 5.7, where it can be seen that delivery mode A has the most frequent return rate between 75-100%, while the other delivery modes, i.e., B, C and D, have a more frequent return rate of 25-50%. The independence test is shown in Table 5.8, indicating a significant effect of the delivery mode on the return rate.

Figure 5.5: Proportion of the most frequently purchased item category (top panel), the second most frequently purchased item category (middle panel) and the third most frequently purchased item category (bottom panel).

Figure 5.6: Percentage of the most frequently used delivery methods.

Figure 5.7: Percentage of the most frequently used delivery modes.

Table 5.3: Cross tabulation between the most frequently purchased item category and the return rates with intervals of 0-25%, 25-50%, 50-75% and 75-100%.

                          Return rate
Item category   [0-25%]   (25-50%]   (50-75%]   (75-100%]
A                 19.64      33.40      24.70       22.26
B                 17.44      37.74      23.68       21.14
C                  7.52      25.67      16.77       50.04
D                  5.11      22.53      12.71       59.66
E                  8.78      23.78      19.87       47.57
F                 11.68      28.23      20.08       40.01
G                  4.43      21.95      15.81       57.82
H                 19.28      30.80      22.87       27.06
I                 14.01      29.14      18.98       37.86
J                  5.16      24.57      22.11       48.16
K                 15.00      29.84      22.04       33.12
L                  7.07      26.92      18.11       47.90
M                  7.24      25.44      22.12       45.20
N                  7.96      26.01      21.95       44.08
O                 14.94      31.90      25.00       28.16
P                 20.65      32.44      24.32       22.59
Q                 17.90      29.71      22.91       29.48
R                 21.75      31.95      26.19       20.11

Table 5.4: Chi-squared test of independence between the most frequently purchased item category and the return rate.

chi-squared value   104025
p-value             0.0

Table 5.5: Cross tabulation between the most frequently used shipping method and the return rates with intervals of 0-25%, 25-50%, 50-75% and 75-100%.

                           Return rate
DeliveryMethod   [0-25%]   (25-50%]   (50-75%]   (75-100%]
Express            24.43      33.39      18.22       23.96
Premium            12.85      30.04      23.35       33.76
Standard           16.05      29.23      22.59       32.13

Table 5.6: Chi-squared test of independence between the most frequently used shipping method and the return rate.

chi-squared value   2186
p-value             0.00

The distribution of the most frequently stated reason for return by customers is shown in Figure 5.8, where it can be seen that the majority of customers have not stated any reason, which is coded as 0.

Table 5.7: Cross tabulation between the most frequently used delivery mode and the return rates with intervals of 0-25%, 25-50%, 50-75% and 75-100%.

                         Return rate
DeliveryMode   [0-25%]   (25-50%]   (50-75%]   (75-100%]
A                11.55      27.00      24.50       36.95
B                26.35      34.44      11.95       27.26
C                22.34      35.62      25.16       16.88
D                27.07      35.13      19.08       18.72

Table 5.8: Chi-squared test of independence between the most frequently used delivery mode and the return rates.

chi-squared value   79906
p-value             0.00

Figure 5.8: Distribution of the most frequently stated return reasons.

Table 5.9 represents the proportional contingency table between the most frequently reported return reason and the return rate. The chi-squared independence test in Table 5.10, with a p-value of 0.002 below the significance level α = 0.05, shows a significant difference between customers' most frequently stated return reasons and the return rate. Between 30-40% of customers with the most frequently reported return reasons coded as 0, 1, 3, 4, 8 and 10 have a return rate between 75-100%.

The metric used to explain customer churn is Recency, which cannot be visualized due to confidentiality reasons. The distribution of the metric presents an average recency of 200 days with a slightly right-skewed distribution, indicating that there might be some customers who have already churned.

Figure 5.9 shows the density plot of the customers' average return time after limiting the return time with an upper boundary of 90 days, due to the return policy of the company offering free returns during 90 days. It is assumed that items returned beyond

Table 5.9: Cross tabulation between the most frequently reported return reason and the return rates with intervals of 0-25%, 25-50%, 50-75% and 75-100%.

                                 Return rate
mostFreqReturnReason   [0-25%]   (25-50%]   (50-75%]   (75-100%]
0                        13.98      27.54      24.24       34.24
1                        18.28      30.23      20.16       31.33
2                        17.83      37.65      19.80       24.72
3                        17.59      28.26      19.46       34.69
4                        14.08      25.35      18.32       42.25
5                        20.39      32.40      18.90       28.32
6                        18.57      31.75      20.31       29.37
7                        31.76      30.40      10.65       27.19
8                        13.56      28.25      25.57       32.62
9                        36.66      33.01      10.58       19.75
10                       27.54      27.05      14.98       30.43
99                       40.00      28.24       7.06       24.71
NA                       27.49      33.75      10.57       28.20

Table 5.10: Chi-squared test of independence between the most frequently reported return reason and the return rates.

chi-squared value   81.46
p-value             0.001831

Figure 5.9: Estimated density of the customers' average return time when eliminating outliers of return times above 90 days.

this time frame are returned due to complaints. The average return time of the customer base is approximately 20 days and follows an approximately normal distribution with some skewness due to the unusual observations of return times above 90 days.

Figure 5.10 presents the distribution of the ratio of purchases at discount. It can be

Figure 5.10: Density plot of the customers’ ratio of red price sales.

seen that there are two peaks where the most common ratios exist, at 50% and 100%, indicating that the majority of the customers in the data set purchase items at discount.

The distribution of the conversion rate shows that the majority of customers have a conversion rate of approximately 5%, although larger values such as 50% and 100% still occur frequently.

The distribution of the metric explaining the return rate of the customers presents several peaks, where the largest accounts for the return rate of 100%, followed by the second largest peak at 50%. Overall, there seems to be a broad variation in the return rate, and the low density at a return rate of 0% indicates that the majority of the customers have made at least one return during the investigated period.

Figure 5.11: Estimated density of the customers’ relation length.

The relation length of the customers in Figure 5.11 shows an average length of 333 days, with some extreme values producing a right-skewed density plot.

As seen in Figure 5.12, the one-size item feature appears to be right-skewed, with an average value of 13 items and some extreme values that cause the skewness of the distribution. The distribution of the multi-size feature shows that the majority of customers do not buy identical items in multiple sizes, although it still occurs: the maximum number of multi-sized identical items purchased in an order within the customer base is 7, i.e., one item of every available size.

Figure 5.12: Estimated distributions of the customers' purchased items, one size (left) and multiple sizes (right).

Exploratory data analysis on country level

To examine differences across countries, metrics explaining relationship length, recency, total orders, return rate, conversion rate, pageviews per session, and net revenue were examined. Table 5.11 presents the average values of selected characteristics for each country. The highest average values are shown in bold in the table. On average, the longest relationship length and highest total number of orders come from customers in country F, and the highest recency from the country coded as E. The highest average return rate comes from customers in country B, as does the highest average conversion rate. Country A has the highest average number of pageviews per session, and country D represents the highest average net revenue.

In order to investigate whether there is a significant difference between countries, some statistical tests need to be performed. According to preliminary analyses, normality is rejected and hence the variables are not normally distributed. Therefore, the non-parametric Kruskal-Wallis test is used to investigate whether there is a significant difference between countries in purchasing behavior. The result of the Kruskal-Wallis H-test is presented in Table 5.12 and shows that all features have a significant difference at the country level, based on the significance level of 0.05.
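The Kruskal-Wallis H statistic ranks the pooled observations and compares mean ranks across groups. A sketch of the computation (without the tie-correction factor that statistical packages also apply; illustrative, not the thesis code) is:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of sample groups,
    using mid-ranks for tied values; compared against a chi-squared
    distribution with k-1 degrees of freedom."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        # Find the run of tied values and assign each the mid-rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mid
        i = j
    # H = 12/(n(n+1)) * sum(R_g^2 / n_g) - 3(n+1)
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
```

Large H values indicate that at least one group's rank distribution differs, which then motivates the pairwise post-hoc comparisons described next.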

A pairwise comparison is performed using the Wilcoxon rank sum test with Bonferroni-adjusted p-values. With a p-value below the significance level of 0.05, one can conclude a significant difference between the countries within the compared metric. The Wilcoxon

Table 5.11: The average relationLength (RL), Recency (R), totalOrders (TO), returnRate (RR), conversionRate (CR), pageviewPerSession (PPS) and netRevenue (NR) at country level.

Country    RL    R    TO    RR      CR      PPS    NR
A         418   164  4.61  0.466   0.154   40.8   2226
B         287   214  3.50  0.675   0.248   38.6   1695
C         227   195  3.11  0.640   0.221   30.5   1331
D         322   151  4.36  0.466   0.159   35.8   2664
E         227   229  2.95  0.586   0.211   35.5   1577
F         529   188  5.17  0.508   0.165   36.2   2253

Table 5.12: Kruskal Wallis test

Feature               chi-squared   p-value
relationLength              80327   < 2.2e-16
Recency                     29062   < 2.2e-16
totalOrders                 55502   < 2.2e-16
returnRate                 108773   < 2.2e-16
conversionRate              55691   < 2.2e-16
pageviewsPerSession         11347   < 2.2e-16
netRevenue                  35242   < 2.2e-16

rank sum test showed a significant difference between all countries for relationship length, recency, total orders, return rate, conversion rate, and net revenue. For the pageviews per session, the Wilcoxon rank sum test found no significant difference between the country pairs D/E, D/F and E/F, as seen in Table 5.13.

Table 5.13: Wilcoxon Rank Sum Test between pageviews per session and countries

     A        B        C        D      E
B    <2e-16   -        -        -      -
C    <2e-16   <2e-16   -        -      -
D    <2e-16   <2e-16   <2e-16   -      -
E    <2e-16   <2e-16   <2e-16   0.90   -
F    <2e-16   <2e-16   <2e-16   0.36   0.18

The distribution of net revenue at the country level is shown in Figure 5.13. In the left plot, including the identified outliers, it can be seen that the average net revenues of the different countries do not seem to differ widely, but the number of outliers and their values differ between the countries. The right plot limits the value of net revenues to 3000 SEK and shows a more intuitive view of the distributions. Here, we can see that the mean values of net revenue for each country agree well with each other. The result of the Kruskal-Wallis test in Table 5.12 nevertheless implies a significant difference between the countries. Even though the difference is not immense, it can be concluded that there is still a difference between the countries, which may come from the fact that each country has several identified outliers that may affect the Kruskal-Wallis result.

Figure 5.13: Customers' net revenue at the country level. Net revenues in the range [0, 100,000) SEK (left) and [0, 3000] SEK (right).

Figure 5.14: Estimated density of the customers' net revenue on country level after eliminating outliers (left) and after transforming by Log-plus-1 (right).

The density of net revenue for each country after eliminating the outliers with net revenue above 3000 is shown in the left panel of Figure 5.14. The figure shows that the distribution of net revenue for all countries follows a highly right-skewed non-normal distribution. The first peak of net revenue, at a value of 0, indicates customers who have

returned all purchased items, while the second peak indicates the average net revenue for active customers. Transforming net revenue using the log-plus-1 method reduces the skewness of the data from 7.2 to −1.2; the result is shown in the rightmost plot in Figure 5.14. The transformation still shows that there is some skewness in the data, from which we can conclude that the data are not normally distributed and cannot be transformed into normality. The remaining analysis is performed on the original, un-transformed data.
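The log-plus-1 transform and the sample skewness used above can be sketched as follows (with made-up illustrative values, since the thesis data are confidential):

```python
import math

def skewness(xs):
    """Sample skewness m3 / m2^(3/2) from the central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def log_plus_one(xs):
    """The log-plus-1 transform used to damp right skew;
    math.log1p is numerically stable near zero."""
    return [math.log1p(x) for x in xs]
```

Applying the transform to a right-skewed sample pulls the long upper tail inward, which lowers the skewness, exactly the effect reported for the net revenue variable.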

Figure 5.15: Distribution of discounted sales ratio (left) and return rate (right) at the country level.

Figure 5.16: Box plot of the customers’ conversion rate at the country level.

The ratio of items purchased at a discount and the conversion rate at the country level are shown in Figure 5.15. In this plot it can be seen that countries F and A have the highest proportion of purchases with discount. Overall, there does not seem to be a major

difference between the countries, although there is a significant difference according to the Kruskal-Wallis test. The right plot in Figure 5.15 presents the return rate at country level, where the differences between the countries are slightly more visible. The return rate is on average larger for countries B and C. The plot presents a similar distribution for countries A and D, indicating that these countries have similar behavior in their returns. Overall, there is a large variation between all countries in terms of the return rate.

In Figure 5.16 the difference in conversion rate between the different countries is presented. The conversion rate seems to vary more for countries B, C and E, but the average conversion rates between the countries do not differ greatly. The box-plot of the conversion rates identifies many outliers for each country, implying a right-skewed distribution for this metric.

5.2 Modelling

5.2.1 Dimensionality reduction

The proportion of customers used in this study creates a dataset of 1,355,533 customers and 26 variables. The data consist of 11 numerical and 14 categorical variables, and thus yield a mixed-type data set. The principal component analysis for mixed data presented in Section 4.3.2 is applied to the dataset to explore possibilities for dimensionality reduction in the data.

Figure 5.17: Scree plot (left) and cumulative scree plot (right) for the dimensionality reduction performed by PCA Mix.

Figure 5.17 presents the resulting scree plot and the cumulative scree plot from the PCA Mix. As shown in the figure, there seems to be no way to reduce the dimensionality of the data. Using 15 dimensions, the cumulative scree plot shows a percentage of explained variance below 30%, with the first dimension contributing about 7% of the explained

variance. It can be concluded that no dimensionality reduction should take place prior to modelling, and the remaining analysis needs to be conducted on the full data set of 26 variables.

5.2.2 Constraints of the study

The clustering analysis performed on a proportion of the full customer base of approximately 1.4 million customers and 26 variables entails a large memory requirement for the implementation and evaluation of the clustering analysis. The memory constraints in R resulted in an incomplete analysis and therefore required that the project be further analyzed using cloud computing on the Google Cloud Platform. Cloud computing increased the RAM by 112 GB and extended the number of cores from 4 to 16. Despite this increased memory, some internal clustering evaluation schemes, such as the average silhouette width and the McClain index, still required immense amounts of memory and were therefore not considered in this study. The recurring memory problem was the basis for many decisions and constraints in the study, resulting in the number of clusters being validated using only the scree plot to minimize the sum of squared distances.

5.2.3 The Google BigQuery Machine Learning tool

The data warehouse system Google BigQuery enables machine learning creation and execution by using SQL queries and therefore eliminates the time and memory required for data exportation. The K-means clustering algorithm in BigQuery ML is used to cluster the data set, for the reduced-memory purpose explained in the previous limitations. The mixed data set is automatically pre-processed by BigQuery to transform the categorical features into dummy variables. The result is later benchmarked against the K-prototypes clustering algorithm executed in R. The main differences between the clustering techniques are presented in Section 4.3.3, as well as their advantages and disadvantages in Section 3.9.
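The dummy-variable pre-processing that BigQuery ML applies to categorical features corresponds to one-hot encoding. A plain-Python sketch of this transformation (with hypothetical column names, not taken from the thesis data) is:

```python
def one_hot(rows, column):
    """Expand one categorical column into 0/1 dummy columns, the kind
    of pre-processing applied before K-means can treat the data as
    purely numeric."""
    levels = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        enc = dict(r)                 # copy so the input is untouched
        val = enc.pop(column)         # drop the original categorical column
        for lev in levels:
            enc[f"{column}_{lev}"] = 1 if val == lev else 0
        out.append(enc)
    return out
```

After this expansion every feature is numeric, so plain squared Euclidean distance applies, which is what distinguishes dummified K-means from K-prototypes, where categorical mismatches are weighted separately by λ.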

Figure 5.18 presents the elbow plot from using the Google BigQuery ML tool for performing the K-means clustering algorithm on the data, with the categorical features dummified into binary variables. From the figure, one can assume that 6 or 8 clusters could be appropriate for clustering the data set.

Tables 5.14 and 5.15 present the output from performing K-means clustering with 6 clusters through Google BigQuery ML. The tables present the centroids of the numerical features of the clusters as well as their proportional sizes, where some variables have been eliminated from the model, possibly due to correlations.

Tables 5.16 and 5.17 present the resulting cluster centers after performing the K-prototypes clustering analysis on the data set. The same variables as for the K-means performed in Google BigQuery are presented in these tables. As one can see, the K-means centroids and the K-prototypes centers differ within all variables and clusters. One can also see that the BigQuery ML K-means clustering has two variables with equal centroid values in all clusters: the relation length and the conversion rate. This is not the case when using the K-prototypes algorithm, which distinguishes different values of the cluster centers. This is due to the randomness used for initializing the cluster centroids for both the K-means and K-prototypes techniques; hence, we can conclude that comparisons between them are hard to perform.

Figure 5.18: Elbow plot of the mean squared distance for the K-means clustering algorithm by using the Google BigQuery Machine Learning tool.

Table 5.14: Cluster centroids after applying the K-means clustering algorithm by Google BigQuery ML. The table represents the cluster centroids of the following metrics: recency, relation length and total orders, as well as the proportional sizes of the clusters.

Cluster   Size (%)   Recency   relationLength   totalOrders
1             0.05   1086.44             0.00         35.11
2            46.11    104.22             0.00        256.86
3             0.79    955.82             0.00         63.70
4            10.37    100.13             0.00        272.98
5             8.31    791.38             0.00         98.07
6            34.37    573.43             0.00        131.90

Table 5.15: Cluster centroids after applying the K-means clustering algorithm by Google BigQuery ML. The table represents the following metrics: conversion rate, one size items, pageviews per session and discounted sales ratio.

Cluster   conversionRate   oneSizeItems   pageviewPerSes.   discountedSalesR.
1                   0.00          50.97              0.70                0.78
2                   0.00          19.82              0.68                0.73
3                   0.00          31.92              0.68                0.70
4                   0.00          19.85              0.63                0.74
5                   0.00          24.76              0.68                0.56
6                   0.00          22.32              0.64                0.38

Table 5.16: Cluster centers and their sizes obtained by K-prototypes algorithm with R.

Cluster   Size (%)   Recency   relationLength   totalOrders
1            60.41   240.925          170.253         1.820
2             3.52    80.178          806.543        14.236
3             0.83    61.053          912.930        24.133
4            10.12   111.861          671.281         8.445
5             0.08    44.053          972.731        47.789
6            25.03   161.936          486.165         4.526

Table 5.17: Cluster centers obtained by the K-prototypes algorithm with R.

Cluster   conversionRate   oneSizeItems   pageviewPerSes.   discountedSalesR.
1                   0.26           5.33             27.53                0.68
2                   0.11          55.80             81.02                0.62
3                   0.11         101.96            121.58                0.61
4                   0.13          30.83             57.77                0.63
5                   0.11         210.86            211.74                0.56
6                   0.16          15.29             40.91                0.64

Chapter 6

Result

To perform clustering, the entire dataset is used. The dataset is also divided into training and testing subsets of 70% and 30% of the full data, respectively, where the training data is used to build the classification models and the testing data to validate them.
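The 70/30 split can be sketched as a seeded random partition (the seed value here is an assumption for reproducibility; the thesis does not state one):

```python
import random

def train_test_split(rows, test_frac=0.3, seed=42):
    """Randomly partition the rows into training and testing subsets.
    The seed is a hypothetical choice so the split is reproducible."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]
```

Every customer lands in exactly one of the two subsets, so the test set provides an unbiased estimate of classification accuracy on unseen customers.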

6.1 Clustering

Figure 6.1: Scree plot of the total within sum of squares for the K-prototypes clustering algorithm with different initial numbers of clusters, k, with the hyper-parameter λ set to the variance of the variables.

The scree plot in Figure 6.1 represents the total within sum of squares for the associated number of clusters, k, of the K-prototypes clustering algorithm, with the hyper-parameter λ set to the variance of the numerical variables of the dataset. As seen in the plot, five or six clusters can be proposed to explain the dataset. Setting the value of the hyper-parameter to the standard deviation of the numerical features yields the result shown in Figure 6.2. Defining the hyper-parameter as the standard deviation gives

Figure 6.2: Scree plot of the total within sum of squares for K-prototypes clustering with different initial numbers of clusters, k, considering λ as the standard deviation of the variables.

Table 6.1: Within sum of squares (wss) for K-prototypes with the optimal number of clusters k when the hyper-parameter λ is set to the variance or the standard deviation of the numerical features.

λ                    k   wss
Variance             6   6.5e+13
Standard deviation   6   1.4e+13

a suggested number of clusters of 6, with a within sum of squares distance of 1.4e+13.

Table 6.1 shows the results of the different clustering alternatives. To minimize the within sum of squared distances, the variant of the K-prototypes algorithm was chosen in which the hyper-parameter λ was set to the standard deviation of the numerical features and the number of clusters, k, was set to 6. Table 6.2 presents the proportional sizes of the resulting clusters, with the largest proportions of customers belonging to the first and sixth clusters at 60.41% and 25.03%, respectively.

Table 6.2: Size of the clusters from the K-prototypes clustering algorithm with hyper-parameter value considered as standard deviation.

Cluster    1       2      3      4       5      6
Size (%)   60.41   3.52   0.83   10.12   0.08   25.03

Figure 6.3 shows the distribution of the clusters across the characteristics that explain the relationship length between the customers and the company and the average number of pageviews per session. In the left panel, the first cluster has the lowest relationship length of about 0 days on average, representing customers who have zero days between their first and last purchase and thus have made only one purchase in the

Figure 6.3: Distribution of the customer segments based on the metrics explaining relation length and pageviews per session.

investigated period. The fifth cluster has the highest relationship length of about 1000 days. The right panel, representing pageviews per session, shows that the first cluster has the lowest average number of pageviews per session, while the fifth cluster has the highest. A high average number of pageviews per session could indicate that these customers have higher activity on the website compared to the customer segments with a low average number of pageviews per session.

Figure 6.4 presents the metrics that explain the RFM, given as Recency, totalOrders, and netRevenue. The first and sixth clusters both have relatively high recency values along with low net revenue and total orders. The identified group of customers with the lowest recency value as well as the highest monetary value of net revenue and frequency of orders is the fifth cluster, with an average recency of 44 days compared to the 241 days of the first cluster, followed by clusters three, two and four. One can also observe some differences in the variance of the recency metric between the different clusters, where the first cluster has the highest variance and the fifth cluster the lowest. This pattern of variation does not hold for the metrics explaining total orders and net revenue, where the pattern is exactly reversed.

The distribution of the return rate and average return time is presented in Figure 6.5. The first cluster has on average the greatest return rate of 0.75 as well as, on average, the lowest return time. For the remaining clusters, there is no major difference in return rate. The rightmost panel, representing the average return time, shows that the fifth cluster has an average return time that is moderately longer than for the other clusters.

The distribution of customers purchasing one-sized items and multi-sized identical items is presented in Figure 6.6, where the majority of extreme outliers within all clusters are excluded in the rightmost panel. One can see that extreme cases of purchases including several multi-sized items occur within all clusters, but the fifth cluster purchases on average more items of multiple sizes. This is clarified in the contingency table in Table 6.3, where there is a greater variation in multi-sized items for the fifth cluster. The first and sixth clusters consist entirely of customers that have purchased between 0 and 77 identical total items of multiple sizes.

Figure 6.4: Distribution of the customer segments on the RFM related metrics Recency, totalOrders and netRevenue.

Figure 6.5: Distribution of the customer segments on the metrics explaining return rate and average return time.

Figure 6.6: Distribution of the customer segments on the metrics oneSizeItems and multiSizeItems.

Table 6.3: Cross-tabulation between the cluster groups and the number of purchased identical items of different sizes.

MultiSizeItems     1       2      3      4      5       6
[0-77]           100.00  99.89  99.25  99.99  95.28  100.00
(77-153]           0.00   0.10   0.72   0.01   3.45    0.00
(153-230]          0.00   0.01   0.02   0.00   0.91    0.00
(230-306]          0.00   0.00   0.02   0.00   0.18    0.00
(306-383]          0.00   0.00   0.00   0.00   0.18    0.00

Figure 6.7 presents the distribution of the ratio of purchased items on sale, discountedSalesRatio, and the conversion rate within the clusters. There seems to be no major difference in the ratio of discounted purchases between the customer groups. This holds for the conversion rate as well, even though one can note a greater variance and average conversion rate for the first cluster.


Figure 6.7: Distribution of the customer segments on the metrics explaining the ratio of purchased items on discount and the conversion rate.

Table 6.4: Contingency table between the cluster groups and countries.

Country    1      2      3      4      5      6
A         7.74  12.31   9.00  13.17   4.90  11.61
B        43.50  36.54  37.92  36.39  41.65  38.48
C        15.28   7.23   5.74   8.95   6.72  11.43
D         4.31   9.86  11.87   8.17  11.25   6.26
E        13.84   9.27  10.39   9.88  14.43  11.37
F        15.32  24.79  25.08  23.43  21.05  20.84

Table 6.4 presents the contingency table between the clusters and the countries, where all clusters consist of an approximate majority of customers from outside Scandinavia. Overall, one can conclude that the customers' nationalities are evenly distributed across the clusters.

Table 6.5: Contingency table between the cluster groups and most frequently used first channel.

mostFreqFirstChannel     1      2      3      4      5      6
lowerFunnel            47.53  74.92  73.53  72.45  68.97  65.94
middleFunnel           14.28   3.62   2.59   6.03   1.45   9.59
upperFunnel            17.48  21.02  23.74  20.18  29.58  19.82
NA                     20.71   0.43   0.13   1.33   0.00   4.64

The distribution of the most frequently used first, last and only channel among the clusters is presented in Tables 6.5, 6.6 and 6.7. The tables present an even distribution of channel usage between the clusters, except for the first cluster, which uses a larger proportion of middle funnel channels as its first and last channel. As seen in Table 6.7, around 61% of the customers in the first cluster have no registered only channel, implying that they use several channels between entering the website and making a purchase.

Table 6.6: Contingency table between the cluster groups and most frequently used last channel.

mostFreqLastChannel     1      2      3      4      5       6
lowerFunnel           53.50  75.76  74.17  74.47  69.33  69.584
middleFunnel           7.76   2.26   1.63   3.40   1.45    5.15
upperFunnel           18.04  21.54  24.06  20.80  29.22   20.63
NA                    20.71   0.43   0.13   1.33   0.00    4.64

Table 6.7: Contingency table between the cluster groups and most frequently used only channel.

mostFreqOnlyChannel     1      2      3      4      5      6
lowerFunnel           26.58  57.30  64.07  49.31  68.06  38.98
middleFunnel           3.76   2.33   2.10   2.32   2.63   2.64
upperFunnel            8.70  15.71  18.51  12.98  22.69  10.78
NA                    60.96  24.66  15.31  35.38   6.62  47.60

Table 6.8 presents the contingency table between the clusters and the return rate, where one can clearly see that the first and sixth clusters stand out from the rest with their higher most common return rates. 47% of the customers within the first cluster have a return rate of around 75-100%, and 34.7% of the sixth cluster's customers have a return rate of around 25-50%.

Table 6.8: Contingency table between the cluster groups and the return rate.

returnRate     1      2      3      4      5      6
(0,0.25]      6.26  36.02  37.50  34.62  38.61  28.58
(0.25,0.5]   26.34  32.02  30.45  32.94  30.21  34.72
(0.5,0.75]   20.34  25.60  25.73  25.08  24.84  26.35
(0.75,1]     47.06   6.36   6.32   7.36   6.34  10.36

Tables 6.9 and 6.10 present the distribution of the most frequently and second most frequently purchased item categories within the different clusters. The contingency table of the third most frequently purchased item category can be found in Appendix B, Table B.1. The item categories of highest proportion among all clusters are tops, sweaters and dresses, where the clusters seem to follow a similar distribution. Table 6.10 shows that a high proportion of the first cluster, 33.89%, has no second most purchased item category, implying that these customers have only purchased one item category from NA-KD during the investigated period. The same conclusion is drawn for the third most frequently purchased item category.

Table 6.9: Contingency table between the cluster groups and the most frequently purchased item category.

mostFreqItem1    1      2      3      4      5      6
A               6.45   6.56   7.64   6.51   5.35   6.42
B               0.09   0.00   0.00   0.02   0.00   0.05
C               0.78   0.10   0.06   0.21   0.00   0.45
D               4.86   0.43   0.23   0.90   0.27   2.03
E              20.61   8.88   7.94  10.28   9.53  12.54
F               6.99   2.21   0.77   3.95   0.64   5.78
G               1.60   0.08   0.01   0.15   0.00   0.50
H               2.62   1.03   0.72   1.69   0.45   2.39
I               6.02   3.91   3.00   4.50   1.72   4.88
J               0.04   0.00   0.00   0.00   0.00   0.01
K               3.84   2.09   1.54   2.82   1.36   3.47
L               1.86   0.19   0.15   0.34   0.18   0.84
M               1.14   0.10   0.01   0.28   0.00   0.69
N               3.50   0.43   0.25   0.87   0.18   1.84
O               0.03   0.00   0.00   0.01   0.00   0.02
P              14.34  23.02  20.71  23.20  16.70  21.60
Q               7.31   3.35   1.69   4.93   0.73   6.66
R              17.93  47.62  55.28  39.33  62.89  29.82


Table 6.10: Contingency table between the cluster groups and the second most frequently purchased item category.

mostFreqItem2    1      2      3      4      5      6
A               4.49   9.91  11.62   9.03  14.07   8.44
B               0.10   0.03   0.03   0.05   0.00   0.13
C               0.83   0.37   0.20   0.61   0.00   1.11
D               2.72   1.24   0.77   2.13   1.09   3.79
E              10.23  11.64  11.79  11.64  15.15  11.93
F               5.39   5.20   3.29   6.72   1.63   7.31
G               1.32   0.21   0.04   0.46   0.00   1.23
H               2.23   2.29   1.80   2.97   1.36   3.33
I               5.93   8.18   7.84   7.79   7.71   7.84
J               0.05   0.01   0.00   0.01   0.00   0.04
K               4.09   4.76   4.06   5.44   2.99   5.90
L               1.34   0.56   0.33   0.94   0.09   1.74
M               1.23   0.47   0.22   0.97   0.00   1.63
N               3.58   1.75   0.91   2.91   0.54   4.26
O               0.04   0.01   0.01   0.03   0.00   0.06
P               8.93  25.76  30.37  21.50  33.21  16.49
Q               2.72   4.96   3.77   5.70   3.90   5.07
R              10.88  22.60  22.91  20.90  18.06  18.11
NA             33.89   0.04   0.02   0.21   0.18   1.58


The proportional contingency tables of the metrics explaining the most frequently used payment method, delivery method, delivery mode and return reason can be found in Appendix B; within all these metrics, the clusters follow the same distribution.

The above results of the customer groups' distributions within the features of the dataset can be summarized in the following descriptions of the clusters:

Cluster 1, with the highest average recency value and a relation length of nearly zero, could potentially represent churned customers. This cluster also yields the lowest net revenue, and its customers do not seem to purchase as many identical items of different sizes as the others, but still have a relatively high average return rate of about 75%. This cluster represents the group where the majority of the customers have only purchased one item category and where 61% use several channels to complete a purchase at the company.

Cluster 2 can be described as potential loyalists, with similar characteristics as the most loyal customers but with smaller values. The second cluster has the third lowest value in terms of recency and the third highest values in relation length, total number of orders and net revenue, indicating that these customers may become loyalists in the future.

Cluster 3 includes the loyal customers that follow the pattern of the Brand Champions, with the second highest net revenue and relation length and the second lowest recency. They also have a greater average number of pageviews per session and a higher tendency to purchase identical items of multiple sizes. The greatest difference between this cluster and the Brand Champions is the monetary value, where the Brand Champions purchase items of almost twice as high value.

Cluster 4, with metric values somewhere in-between the loyalists and the churners, can be described as indecisive shoppers, who may have a relatively long relation to the company and a relatively low recency compared to the groups of churned customers and high-risk churners, but a significantly lower number of total orders and net revenue in comparison to the loyal customers.

Cluster 5 can be described as the Brand Champions, with the highest monetary value in terms of net revenue. They are also the customers with the longest relation length; hence they have been active customers of the company over a longer time period. Their lower average recency indicates that they are still active users. The cluster also contains customers that frequently purchase identical items of multiple sizes, which not surprisingly can explain the greater return rate. It could also indicate that this customer group contains wholesalers, which could contribute to the large number of outliers in terms of total orders, net revenue and multi-sized identical items, as well as the high average number of pageviews per session.

Cluster 6 can be described as high-risk churners: customers that have a relatively high average relation length but share the characteristics of the churned customers, with relatively high average recency, low net revenue and fewer total orders. These are customers that have been active previously, as indicated by their longer relation length, but now show indicators of churning. 47.6% of these customers use several channels to enter and complete a purchase at the website.
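The six segments above were obtained with the K-prototypes algorithm for mixed data. As an illustration of the mixed dissimilarity measure that K-prototypes minimizes (Huang, 1998), here is a minimal pure-Python sketch; the feature values and the weight gamma are invented for the example, and the actual clustering in this study was done on the full feature set:

```python
# Sketch of the mixed dissimilarity used by K-prototypes (Huang, 1998):
# squared Euclidean distance on numeric features plus a weighted count
# of mismatches on categorical features. All values below are illustrative.

def kproto_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Distance between a customer and a cluster prototype."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# Hypothetical customer: (recency, totalOrders) plus (country, firstChannel).
customer_num, customer_cat = [241.0, 1.0], ["B", "middleFunnel"]
proto_num, proto_cat = [44.0, 12.0], ["B", "lowerFunnel"]

d = kproto_dissimilarity(customer_num, customer_cat, proto_num, proto_cat)
```

This snippet only shows how the numeric and categorical contributions are combined into one distance; in practice the numeric features are standardized first and gamma is tuned or estimated by the implementation.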


6.2 Classification

6.2.1 Definition of classes

Table 6.11 presents the definitions of the classes of returner, churned and loyal customers, based on previous knowledge and findings from the company together with the results from the exploratory data analysis in this study.

Previous RFM studies at the company have defined churned customers as those with recency above 200 days, which will be used for this study as well. Loyal "brand champions" have previously been defined as customers with recency below 75 days and a monetary value of 1000 SEK and above. The dataset in this study has an average recency of 200 days and net revenue of 1800 SEK, so the definition is changed to customers with recency below 200 days and more than three purchases annually instead. The average return rate of 0.6 and the result from the contingency table of item categories and return rate in this dataset lay the foundation for the definition of returners.

Table 6.11: Definition and description of the classes of churners, returners and loyal customers.

Class: Churned
Description: A customer classified as churned is no longer a user of the company's services.
Definition: Recency above 200 days.

Class: Returner
Description: A customer classified as a returner makes frequent returns of her purchased goods. The return rate is taken into consideration as well as the most frequently purchased item category, which has previously been shown to have a significant effect on returns.
Definition: Return rate above 0.6 and a most frequently purchased item category among the categories of highest return rate.

Class: Loyal
Description: A customer classified as loyal is an active user of the company's services and makes frequent purchases.
Definition: Recency below 200 days and more than 3 purchases annually.
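As a sketch, the class definitions in Table 6.11 can be expressed as simple labeling rules. The snippet below is illustrative pure Python; the set of high-return item categories is a hypothetical placeholder (the thesis identifies the actual categories in Chapter 5):

```python
# Labeling rules following Table 6.11. The high-return category set is an
# invented example, not the categories identified in the thesis.
HIGH_RETURN_CATEGORIES = {"E", "P", "R"}  # hypothetical

def label_customer(recency_days, annual_orders, return_rate, most_freq_item):
    """Return the set of classes a customer falls into. Classes may overlap,
    which is one reason three separate binary models are fitted."""
    labels = set()
    if recency_days > 200:
        labels.add("churned")
    if return_rate > 0.6 and most_freq_item in HIGH_RETURN_CATEGORIES:
        labels.add("returner")
    if recency_days < 200 and annual_orders > 3:
        labels.add("loyal")
    return labels
```

For example, a customer with recency 44 days, 12 annual orders, return rate 0.7 and a high-return favorite category would be labeled both a returner and loyal, illustrating why three binary responses are modeled rather than one multiclass response.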

6.2.2 Feature selection

The lower triangular correlation matrix of the numerical features in the dataset is presented in Table C.1 in Appendix C. To simplify the classification model, feature selection is applied: of each pair of highly correlated variables, one is eliminated, as are categorical features of uneven distribution. The following numerical features with a correlation above 0.5 to others are removed: salesQuantity, returnQuantity, netRevenue and oneSizeItems. The categorical features of uneven distribution mostFreqDeliveryMethod, mostFreqDevice and mostFreqPayMethod are removed from the dataset as well. The lastChannel and onlyChannel features are excluded from the model due to their relation to, and equal distribution with, the kept firstChannel feature. The same reasoning applies to the excluded mostFreqItem2 and mostFreqItem3, which both depend on mostFreqItem1, which is kept for modelling the classification. The mostFreqReturnReason did in fact have a significant impact on the return rate, as seen in the previous exploratory data analysis, but its uneven distribution, where the majority of the customers have no stated return reason, makes this feature insufficient to use; it is therefore excluded from the dataset as well.
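A minimal sketch of this correlation-based elimination, in pure Python; the feature names match the thesis, but the toy values are invented and the greedy drop order is a simplification of the manual selection described above:

```python
# Greedy elimination: for every pair of numeric features whose absolute
# Pearson correlation exceeds the threshold, drop the later feature.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.5):
    """features: dict name -> list of values. Returns the names kept."""
    names = list(features)
    dropped = set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b not in dropped and abs(pearson(features[a], features[b])) > threshold:
                dropped.add(b)
    return [n for n in names if n not in dropped]

# Toy data: salesQuantity and returnQuantity move together, recency does not.
toy = {
    "salesQuantity":  [1, 2, 3, 4, 5],
    "returnQuantity": [1, 2, 3, 4, 6],
    "recency":        [5, 1, 4, 2, 3],
}
kept = drop_correlated(toy)
```

On the toy data, returnQuantity is dropped for its near-perfect correlation with salesQuantity, mirroring the eliminations listed above.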

6.2.3 Logistic Regression

By performing three separate binary logistic regression models on the remaining updated dataset to classify the churned customers, returners and loyal customers, a consistent problem occurs with complete or quasi-complete separation. The issue implies that the model perfectly (complete) or almost perfectly (quasi-complete) separates the data. To deal with this overfitting problem, a bias reduced logistic regression is applied that penalizes the maximum likelihood, as explained in Section 4.3.4. The three separate binary classification models were considered a basic first step for solving this issue, where the three binary classes of churners, returners and loyalists were formed. These classes follow the definitions previously explained and resulted in three models classifying whether or not a customer is a churned customer, a customer of frequent returns or a loyal customer.

Classifying the churned customers

The result of the classification model for the class of churned customers is presented in Appendix D, where one can conclude that there are plenty of significant features, both numerical and categorical. Looking at the estimated coefficients, the greatest influence on the churn model comes from the numerical features returnRate, discountedSalesRatio and totalOrders, where the only feature of positive influence is returnRate, with a +0.63 increase in the log odds per unit increase in returnRate. The categorical features are split into dummy variables with the references country A, delivery mode A, most frequent first channel upperFunnel and most frequently purchased item category A. Based on the reference model, Table D.1 presents a positive influence on the log-odds of churning if a customer originates from any country except country D, where the greatest influence of 1.047 derives from country B. The delivery modes B, C and D all contribute a positive effect on the churn log-odds relative to delivery mode A. Continuing to the most frequently used first channel, where lowerFunnel is used as reference, both the middle and upper funnel channels have a negative impact on the log odds, while the NAs (i.e. the customers using only one channel for both entering and exiting the website with a purchase) have a negligible impact on the log odds of churning, with an estimated coefficient close to zero. Continuing with the most frequently purchased item category, one can conclude an overall significant impact on the model, where the majority of the item categories significantly affect the model relative to the reference item category A. The item categories of no significant impact are L and P, which both have a p-value above the significance level of α = 0.05. The greatest impacts on the model come from the item categories J (-1.6), G (+0.83) and O (+0.73).
The variables of negligible impact, with estimated coefficients approaching 0, are the relation length, average return time, pageviews per session, conversion rate and the number of multiple sized items.
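The log-odds coefficients above translate into multiplicative odds ratios via exponentiation. A small worked example using the reported returnRate coefficient; the coefficient value comes from the text, while the helper itself is just an illustration:

```python
# A coefficient b in logistic regression multiplies the odds by exp(b)
# per unit increase in the feature, holding the other features fixed.
from math import exp

def odds_ratio(coefficient):
    return exp(coefficient)

# returnRate coefficient reported for the churn model: +0.63.
ratio = odds_ratio(0.63)
```

Since exp(0.63) ≈ 1.88, a one-unit increase in returnRate (i.e. going from returning nothing to returning everything) roughly doubles the odds of being classified as churned.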


Classifying the returners

The classification model constructed to classify the returners can be found in Appendix D. Following the same discussion as above, the numerical features of greatest importance for the response are conversionRate and totalOrders, with estimated coefficients of +0.48 and −0.14 respectively. This indicates that the log odds of being classified as a returner increase by 0.48 per unit increase in the conversion rate, and decrease by 0.14 per unit increase in total orders. Hence, if a customer makes a purchase with fewer sessions, i.e. a faster purchase, she is more likely to make a return. Continuing with the categorical features, one can observe a positive impact on the return response if the customer originates from country B (+0.43), C (+0.35), E (+0.27) or F (+0.17), while there is a negative impact on the log odds of being classified as a returner if a customer originates from country D (−0.11), with country A as reference. The reference delivery mode is A, with the impacts of the other delivery modes being B (+0.17) and C (−0.62), while the impact of D is negligible (−8.2e-02). The most frequently used first channels all have a positive effect on the model with upperFunnel as reference: +0.27 from middleFunnel, +0.19 from upperFunnel and +0.33 for the case where only one channel is used for entering the website and making a purchase. The negligible numerical variables of this model are the relation length, recency, average return time, pageviews per session and the number of multiple sized items.
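To see how such coefficients combine into a class probability, the linear predictor is passed through the logistic (sigmoid) function. The sketch below reuses the two reported returner-model coefficients; the intercept and the customer's feature values are hypothetical placeholders:

```python
# Turning a linear predictor into a class probability with the logistic
# (sigmoid) function. Only the two coefficients come from the text; the
# intercept -0.5 and the feature values are invented for illustration.
from math import exp

def predict_proba(intercept, coefs, features):
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + exp(-z))

# conversionRate coefficient +0.48, totalOrders coefficient -0.14,
# for a hypothetical customer with conversionRate 1.0 and 2 total orders.
p = predict_proba(-0.5, [0.48, -0.14], [1.0, 2.0])
```

Here z = −0.5 + 0.48·1.0 − 0.14·2.0 = −0.30, giving a return probability of about 0.43 for this hypothetical customer.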

Classifying the loyal customers

Appendix D presents the classification model of loyal customers, where the numerical features with the greatest estimated coefficients in the model are totalOrders (+1.15), redPriceSalesRatio (−0.76), returnRate (+0.56) and conversionRate (−0.50). The categorical feature country has a slightly different impact on the loyal class in comparison to the churner and returner classes: there is a significant impact on the model if a customer originates from country F (+0.19) or B (+0.10), but no significant impact from country E, using A as the reference country. The delivery mode C also has no significant impact on the model, with delivery mode A used as reference. The remaining delivery modes both have a positive effect on the response, with estimated coefficients of +0.20 for delivery mode B and +0.12 for delivery mode D. Continuing with the most frequently used first channel, with upperFunnel as reference, all the remaining channels have a negative impact on the response, with estimated model coefficients of −0.26, −0.11 and −0.46 respectively for middleFunnel, upperFunnel and NA, i.e. the customers that use only one channel to make a purchase. Some variables have coefficients approaching 0, which makes them less important for the model; these negligible variables are relationLength, Recency and multiSizeItems.

Model Evaluation

The remaining 30% of the data is used as testing data to validate the constructed classification models of returners, churned and loyal customers.

Table 6.12 presents the validation result in terms of accuracy, which is defined as

Accuracy = (total number of correct predictions) / (total number of predictions),
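As a sketch, this accuracy computation amounts to the following (pure Python; the label vectors are invented toy data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: 3 of 4 predictions correct.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 0.75
```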


The model of greatest accuracy is the classification model predicting customer loyalty, with an accuracy of 98%. It is, however, important to keep in mind that the data may be imbalanced due to the low proportion of loyal customers in the sample dataset. Hence, this measurement of model accuracy may not be optimal and other performance measurements should be taken into consideration. The model predicting the returners is the second most accurate model based on this performance measurement, with a prediction accuracy of 75%, followed by the churn model with an accuracy of 68%. Due to the possibly imbalanced data, this result may not be reliable, and other accuracy measurements should be considered before drawing any conclusions about the performance of the models. The predictive accuracy used here is a first step of the model performance evaluation and needs to be supplemented with other measurements for higher reliability.
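For the imbalance caveat above, per-class metrics derived from the confusion matrix are a natural supplement to plain accuracy. A minimal sketch in pure Python, with invented toy labels:

```python
# Precision, recall and F1 from a binary confusion matrix; unlike plain
# accuracy, these expose poor minority-class performance on imbalanced data.
def confusion_metrics(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy imbalanced example: 1 positive in 10, classifier predicts all negative.
# Plain accuracy would be 0.9 even though recall on the positive class is 0.
metrics = confusion_metrics([0] * 9 + [1], [0] * 10)
```

The toy example illustrates exactly the risk discussed above: a 98% accuracy on a class as rare as the loyal customers could coexist with very poor recall on that class.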

Table 6.12: The accuracy of each bias reduced logistic classification model predicting the churned customers, the returners and the loyal customers.

Model          Accuracy
brglm churn    0.6849
brglm return   0.7497
brglm loyal    0.9843


Chapter 7

Discussion and conclusions

7.1 Discussion

Regarding the result of the clustering model, claiming that six customer segmentation groups could be sufficient to explain the customer purchase behaviour of the e-commerce company NA-KD's customer base, one can discuss several aspects. The definition of the clusters shows that the largest proportion of customers belongs to the segment that can be assumed to have already churned from the company's services, followed by the customers at high risk of churning in the future. The great proportion of churned customers is not unexpected and follows the previous findings at the company as well as the trends in fashion e-commerce, where customer retention is the great focus area. The other defined customer groups besides the churned and high-risk customers are the indecisive shoppers, the loyalists and the Brand Champions. The clusters have similarities within some metrics and distinctive differences in others. The metric explaining the customers' monetary value, the net revenue, covers the greatest difference, where not surprisingly the Brand Champions account for the largest values and the churned customers for the lowest. By investigating the channels most frequently used by the different clusters, one can observe that the cluster containing the churned customers has a larger fraction of customers that use several channels between entering the website and making a purchase. This implies that this customer group takes more time to complete a purchase and may not be fully convinced by the company. Perhaps this customer group frequently purchases items from competitors and uses NA-KD as a benchmark against other companies offering similar items. The Brand Champions, on the other hand, use a greater proportion of only-channels, which shows that they are confident in their shopping at NA-KD and need fewer intermediate channels to complete the purchase.
The most common return rates within the clusters show that the churned customers, followed by the high-risk customers, have the greatest return rates of 75-100% and 25-50% respectively, which is not unexpected behaviour from churned customers. The churners' most frequently purchased item category also explains the higher return rate, since item category E is most frequently related to greater return rates, as seen in Chapter 5.

Continuing with the classification models constructed, one can conclude that all models have a good accuracy and predict the customers well into the classes of returners, churned and loyal customers. The classifier for predicting churned customers presents a highly significant and increasing effect on the likelihood of churning as the return rate of the customers increases, while there is a decreased likelihood of churning as the number of total orders per customer and the ratio of purchased items on discount increase. This indicates that as soon as a customer places several orders and buys more items on discount, there is a lower chance of this customer churning from the company.

The classifier for predicting the returners identifies a decrease in the likelihood of becoming a returner as the number of total orders increases, and the opposite for an increase in the conversion rate. This can be interpreted as follows: if a customer has a higher conversion rate, i.e. a low number of sessions per order, the customer may not take her time on the website reading about the details of the item before check-out. The low activity could possibly end in a return if the items do not suit the customer in terms of size or style.

The final classification model, predicting customer loyalty, presents an increasing effect on the likelihood of becoming a loyal customer as the number of total orders and the return rate increase, while there is a decrease in the loyalty likelihood as the ratio of items purchased on discount and the conversion rate increase. This implies that a loyal customer has a larger number of orders but also a larger return rate, which could align with the greater fraction of multi-sized purchased items that was identified in the loyal customer group from the clustering analysis. The loyal customers also have a lower tendency to purchase items on discount and a smaller conversion rate, implying that more activity in terms of website sessions takes place before a purchase is made.

7.2 Conclusion

Regarding the objectives stated in Section 2.4, one can conclude that the loyal customer group has been investigated with two different approaches, where the unsupervised clustering technique has been used to find the loyal customers as well as the Brand Champions of the company's customer base during the investigated period. Using the company's definition of a loyal customer, the supervised machine learning technique of classification has been used to construct a classification model of 98% accuracy to classify and predict the probability of becoming a loyal customer. The same methodology has been used to classify and predict the probability of a customer churning from the company and of becoming a frequent returner of purchased goods, with accuracies of 68% and 75% respectively. These performance measurements are based on the predictive accuracy measure, where imbalanced data is not taken into consideration, which needless to say has an impact on the reliability of the models.

7.2.1 Future recommendations

Due to the limited time period of four months for this project, several limitations have been stated regarding the data collection as well as the possibility of model variation. The accuracy of the classification models needs to be investigated further, using other performance measurements and taking imbalanced data into consideration, before drawing any stronger conclusions about the models' performance. Another future recommendation for this work would be to investigate further variables of potential interest for the company's objectives, to find more patterns and enable dimensionality reduction. Investigating several different models within clustering and classification could also be of great interest to optimize the result, where the discussed methods from Chapter 3 are potential models to use and where the multinomial logistic regression classification model should be the next model to investigate. The initial ambition of performing an A/B test on the result of the study was dropped due to the lack of time and is therefore suggested to be performed in the future. A theoretical explanation of A/B testing can be found in Appendix A. An additional future suggestion is to validate the predictions of the classification models on the more recent data available between 2021-01-01 and 2021-06-01. Investigating the accuracy of the prediction models on the updated data is of high interest for the company and could provide a good measure of the models' practical usability.


Bibliography

99Firms (2020). 99 firms' ecommerce statistics for 2020. (accessed: 03.03.21).
URL: https://99firms.com/blog/ecommerce-statistics/gref

Ait daoud, R. (2015). Customer segmentation model in e-commerce using clustering techniques and LRFM model: The case of online stores in Morocco, International Journal of Computer, Electrical, Automation, Control and Information Engineering 9: 1795–1805.

Al Imran, A. and Amin, M. N. (2020). Predicting the return of orders in the e-tail industry accompanying with model interpretation, Procedia Computer Science 176: 1170–1179.

Aziz, A. (2017). Customer segmentation based on behavioural data in e-marketplace, M.Sc. Thesis, Uppsala University, Department of Information Technology.

Blevins, M. (2020). What is customer churn in ecommerce? (accessed: 15.03.2021).
URL: https://www.returnlogic.com/blog/what-is-customer-churn-in-ecommerce

Caigny, A. D., Coussement, K. and De Bock, K. W. (2018). A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees, European Journal of Operational Research 269: 760–772.

Charlton, G. (2020). Ecommerce returns: 2020 stats and trends. (accessed: 10.03.2021).
URL: https://www.salecycle.com/blog/featured/ecommerce-returns-2018-stats-trends/

Chavent, M., Kuentz-Simonet, V., Labenne, A. and Saracco, J. (2017). Multivariate analysis of mixed data: The R package PCAmixdata.

Coppola, D. (2021). E-commerce worldwide - statistics and facts. (accessed: 03.04.21).
URL: https://www.statista.com/topics/871/online-shopping/

Corder, G. W. and Foreman, D. I. (2014). Nonparametric Statistics: A Step-by-Step Approach, 2nd edn, John Wiley & Sons, Inc.

Firth, D. (1993). Bias reduction of maximum likelihood estimates, Biometrika 80(1): 27–38.

Gan, G. (2011). Data Clustering in C++: An Object-Oriented Approach, 1st edn, CRC Press.

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3): 283–304.

Iyigun, D. (2021). CRM Manager, NA-KD.

Joy Christy, A., Umamakeswari, A., Priyatharsini, L. and Neyaa, A. (2018). RFM ranking – an effective approach to customer segmentation, Journal of King Saud University – Computer and Information Sciences.

Khajvand, M., Zolfaghar, K., Ashoori, S. and Alizadeh, S. (2011). Estimating customer lifetime value based on RFM analysis of customer purchase behavior: case study, Procedia Computer Science 3: 57–63.

Korkmaz, C. (2021). Business Intelligence Director, NA-KD.

Kosmidis, I. and Firth, D. (2010). A generic algorithm for reducing bias in parametric estimation, Electronic Journal of Statistics 4: 1097–1112.

Mittal, A. (2013). E-commerce: It's impact on consumer behavior, Global Journal of Management and Business Studies 3: 131–138.

Montgomery, D. C. and Runger, G. C. (2018). Applied Statistics and Probability for Engineers, 7th edn, John Wiley & Sons, Inc.

Montgomery, D. C. (2017). Design and Analysis of Experiments, 9th edn, John Wiley & Sons, Inc.

NA-KD (2020). Career, NA-KD.com. (accessed: 05.03.2021).
URL: https://career.na-kd.com/locations

NA-KD (n.d.). Pressmaterial NA-KD, NA-KD.com. (accessed: 10.03.2021).
URL: https://www.na-kd.com/en/press-release-image-bank

Piskunova, O. and Klochko, R. (2020). Classification of e-commerce customers based on data science techniques, Kyiv National Economic University.

Ratner, B. (2012). Statistical and machine-learning data mining: techniques for better predictive modeling and analysis of big data, 2nd edn, CRC Press.

Ruberts, A. (2020). K-prototypes - customer clustering with mixed data types. (accessed: 25.03.21).
URL: https://antonsruberts.github.io/kproto-audience/

Seippel, H. S. (2018). Customer purchase prediction through machine learning, M.Sc. Thesis, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

Siroker, D., Koomen, P. and Harshman, C. (2013). A/B testing: the most powerful way to turn clicks into customers, 1st edn, John Wiley & Sons, Inc.

Tsai, C.-F. and Lu, Y.-H. (2009). Customer churn prediction by hybrid neural networks, Expert Systems with Applications 36: 12547–12553.


Appendix A

A/B testing

The concept of A/B testing is to present different variations of a website to different subsets of people in order to measure which variant most effectively increases the conversion of visitors into customers. The authors of "A/B testing: the most powerful way to turn clicks into customers" (Siroker et al.; 2013) have studied A/B testing and proposed the following five-step process for implementing the test.

Step 1: Success definition
To define success, one first needs to identify the purpose of the website in order to decide which metrics should be used to quantify success. These success metrics are the ones to be improved by performing the A/B test.

Step 2: Bottleneck identification
Once the success metrics are set, one needs to identify the bottlenecks of the business, i.e. the places where visitors drop off and hence decrease the conversion into customers of the website.

Step 3: Hypothesis construction
Once the bottlenecks are identified, the test hypothesis is constructed. The hypothesis is based on the company's understanding and knowledge of the visitors' intentions, which can be investigated through feedback forms, user accounts, interviews and other forms of qualitative research.

Step 4: Make prioritizations
Before performing the actual test, one needs to prioritize which tests to perform. Important metrics derived from the previous steps in the process need to be taken into consideration.

Step 5: Perform the test
By randomly selecting visitors and presenting one or several new website variations to them, visitor behavior can be tracked and compared with the original website version. The stated quantifiable success metrics are then used to test for statistically significant differences between the website variations.

A/B testing of proportions  The A/B test consists of a two-sample comparison where the metric can be a proportion or percentage, which leads to the Z test statistic defined by

Z = \frac{p_B - p_A}{\sqrt{p(1 - p)\left(\frac{1}{n_B} + \frac{1}{n_A}\right)}}

where Z represents the Z-test statistic, p_B and p_A are the estimated proportions of B and A respectively as presented in Formula (A.1), and the combined proportion p is given by Formula (A.2):

p_i = \frac{X_i}{n_i}    (A.1)

where X_i represents the number of people that e.g. converted in the i-th sample and n_i denotes the total number of visitors reached for the i-th version, and

p = \frac{X_B + X_A}{n_B + n_A}    (A.2)

where p denotes the combined proportion for a two-sample A/B test.

The Z-test statistic is evaluated with a p-value at a significance level of α = 0.05: if the p-value of the difference between the observed and theoretical Z-value is below 0.05, the difference is statistically significant at the 95% confidence level, supporting that the new B version increases the proportion compared with the original A version. The Z-test statistic for metrics in percentages or proportions is a good measure as long as the sample size is sufficiently large and the sampling is performed randomly.
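As an illustration, the proportion test above can be carried out with only the Python standard library. In this sketch the conversion counts (120/2000 for version A, 160/2000 for version B) are hypothetical, and the two-sided p-value is obtained from the standard normal CDF via math.erf:

```python
import math

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Z test for comparing the conversion proportions of versions A and B."""
    p_a, p_b = x_a / n_a, x_b / n_b      # per-version proportions, Formula (A.1)
    p = (x_a + x_b) / (n_a + n_b)        # combined proportion, Formula (A.2)
    se = math.sqrt(p * (1 - p) * (1 / n_b + 1 / n_a))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/2000 conversions on A vs. 160/2000 on B.
z, p_value = two_proportion_z(120, 2000, 160, 2000)
print(round(z, 2), round(p_value, 3))  # → 2.48 0.013
```

With these hypothetical counts, z ≈ 2.48 and p ≈ 0.013 < 0.05, so the B variant would be judged significantly better at the α = 0.05 level.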

A/B testing of averages  If the metric of the A/B test is a mean value, the t statistic is used, defined by

t = \frac{\bar{x}_B - \bar{x}_A}{s_p \sqrt{\frac{1}{n_B} + \frac{1}{n_A}}}

where \bar{x}_A and \bar{x}_B represent the sample averages of the respective versions A and B, n_A and n_B denote the respective sample sizes of A and B, and s_p denotes the pooled sample standard deviation of both A and B, defined by

s_p = \sqrt{\frac{(n_B - 1)s_B^2 + (n_A - 1)s_A^2}{n_B + n_A - 2}}

where s_A^2 and s_B^2 are the sample variances given by

s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}

The degrees of freedom of the t test are given by

df = n_B + n_A - 2

where the integer 2 represents the number of parameters that must be estimated from the data, in this case the sample means of both the A and B versions. The t-test relies on the following assumptions:


1. Random sampling must be used to determine which user gets which website version.

2. The variances of the two user samples must be equal.

3. The sample size must be sufficiently large, i.e. above 50 visitors for each version.

4. The underlying distribution of the samples should be approximately normal.
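The pooled t statistic above can be sketched as follows; the order-value samples are hypothetical and, being far smaller than the 50 visitors per version required by assumption 3, serve only to illustrate the arithmetic:

```python
import math

def two_sample_t(sample_a, sample_b):
    """Pooled two-sample t statistic for comparing the means of A and B."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Sample variances with n - 1 in the denominator.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    # Pooled standard deviation, assuming equal variances (assumption 2).
    s_p = math.sqrt(((n_b - 1) * var_b + (n_a - 1) * var_a) / (n_b + n_a - 2))
    t = (mean_b - mean_a) / (s_p * math.sqrt(1 / n_b + 1 / n_a))
    df = n_b + n_a - 2
    return t, df

# Hypothetical average order values (EUR) under versions A and B.
a = [52.0, 48.5, 55.0, 50.0, 47.5, 53.0]
b = [58.0, 54.5, 60.0, 56.0, 59.5, 57.0]
t, df = two_sample_t(a, b)
print(round(t, 2), df)  # → 4.5 10
```

The resulting t value is compared against the t distribution with df degrees of freedom to obtain the p-value of the test.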


Appendix B

Contingency tables

Table B.1: The proportional contingency table between the cluster groups and the third most frequently purchased item category.

                      Cluster
mostFreqItem3      1      2      3      4      5      6
A               2.39  11.87  14.01  10.43  17.24   8.41
B               0.07   0.07   0.05   0.14   0.00   0.21
C               0.64   0.80   0.43   1.21   0.36   1.58
D               1.88   2.82   2.07   3.87   0.73   4.73
E               5.99  13.22  14.43  11.82  16.15  11.24
F               3.74   7.91   7.15   8.29   3.81   7.63
G               0.96   0.45   0.17   1.05   0.27   1.73
H               1.55   3.73   3.27   4.19   2.45   3.90
I               4.18  11.45  12.42  10.00  12.25   9.03
J               0.03   0.01   0.00   0.04   0.00   0.07
K               3.13   7.62   7.26   7.74   8.08   7.26
L               0.87   1.17   0.83   1.70   0.82   2.26
M               0.97   1.28   0.70   1.92   0.18   2.27
N               2.54   3.87   2.95   5.02   2.09   5.69
O               0.04   0.01   0.00   0.06   0.00   0.10
P               4.78  16.33  18.29  14.02  21.42  11.73
Q               1.63   5.96   6.31   5.80   4.45   3.77
R               5.74  11.36   9.61  12.08   9.44  11.95
NA             58.89   0.09   0.04   0.65   0.27   6.44


Table B.2: The proportional contingency table between the cluster groups and the most frequently used payment method.

                          Cluster
mostFreqPayMethod      1      2      3      4      5      6
both                2.83   0.90   0.44   1.73   0.45   3.22
Klarna             77.68  86.07  86.80  84.05  81.59  80.40
other              19.48  13.02  12.75  14.22  17.96  16.39

Table B.3: The proportional contingency table between the cluster groups and the most frequently used delivery method.

                              Cluster
mostFreqDeliveryMethod      1      2      3      4      5      6
Express                  1.52   1.90   2.53   1.64   5.54   1.71
Premium                  5.05   2.66   2.10   3.32   1.61   4.11
Standard                93.43  95.43  95.37  95.04  92.85  94.19

Table B.4: The proportional contingency table between the cluster groups and the most frequently used delivery mode.

                  Cluster
deliveryMode      1      2      3      4      5      6
A             75.40  55.93  56.07  59.15  64.34  65.60
B              5.44   5.38   5.33   4.93   4.56   4.73
C              0.12   0.02   0.00   0.05   0.00   0.15
D             19.04  38.67  38.60  35.87  31.10  29.51

Table B.5: The proportional contingency table between the cluster groups and the most frequently stated return reason.

                           Cluster
mostFreqReturnReason      1      2      3      4      5      6
0                     33.43  44.14  48.58  39.59  55.67  35.73
1                      6.36   4.43   3.31   5.29   2.06   5.96
2                      2.26   0.84   0.58   1.27   0.71   1.84
3                      3.29   1.79   1.34   2.28   0.63   2.84
4                      1.56   0.41   0.33   0.61   0.36   0.93
5                     11.86   8.55   6.59   9.89   4.56  11.08
6                     14.75  11.38  10.06  12.85   7.69  14.17
7                      0.47   0.45   0.35   0.47   0.63   0.51
8                     24.02  27.30  28.43  26.73  27.35  25.54
9                      0.19   0.09   0.06   0.15   0.09   0.22
10                     0.02   0.01   0.03   0.02   0.09   0.02
99                     0.05   0.03   0.05   0.03   0.09   0.04
NA                     1.73   0.58   0.28   0.84   0.09   1.13


Appendix C

Correlation triangle of numerical features


Table C.1: Lower correlation triangle of the numerical features.

Variable                    (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)   (13)   (14)
(1)  relationLength        1.00
(2)  Recency              -0.27   1.00
(3)  totalOrders           0.52  -0.28   1.00
(4)  salesAmount           0.40  -0.22   0.78   1.00
(5)  salesQuantity         0.43  -0.23   0.79   0.97   1.00
(6)  returnQuantity        0.31  -0.15   0.63   0.91   0.92   1.00
(7)  returnRate           -0.28   0.20  -0.23  -0.05  -0.08   0.15   1.00
(8)  netRevenue            0.43  -0.27   0.76   0.77   0.77   0.49  -0.38   1.00
(9)  discountedSalesRatio  0.07  -0.05   0.03  -0.03   0.04   0.05   0.11  -0.07   1.00
(10) avgReturnTime         0.12   0.13   0.16   0.15   0.15   0.12  -0.07   0.15   0.01   1.00
(11) pageviewPerSession    0.18  -0.07   0.44   0.42   0.44   0.36  -0.13   0.41   0.02   0.08   1.00
(12) conversionRate       -0.21   0.15  -0.19  -0.14  -0.15  -0.10   0.20  -0.17  -0.06  -0.06   0.10   1.00
(13) oneSizeItems          0.44  -0.23   0.80   0.96   0.99   0.89  -0.11   0.79   0.04   0.16   0.46  -0.16   1.00
(14) multiSizeItems        0.18  -0.09   0.40   0.62   0.62   0.69   0.11   0.31   0.02   0.07   0.18  -0.04   0.52   1.00


Appendix D

Bias Reduced Logistic Regression

D.1 Classifying churned customers


Table D.1: The resulting logistic classification model for classifying churned customers. The table presents the estimated coefficients together with their associated standard deviations and significance levels.

Coefficient                         Estimate     Standard Deviation   p-value
(Intercept)                        -1.689        1.809e-02            < 2e-16 ***
countryB                            1.048        1.238e-02            < 2e-16 ***
countryC                            0.8495       1.314e-02            < 2e-16 ***
countryD                           -0.7017       1.561e-02            < 2e-16 ***
countryE                            0.6719       1.274e-02            < 2e-16 ***
countryF                            0.7061       1.016e-02            < 2e-16 ***
relationLength                     -5.992e-04    7.868e-06            < 2e-16 ***
totalOrders                        -0.2055       1.260e-03            < 2e-16 ***
returnRate                          0.6432       9.400e-03            < 2e-16 ***
deliveryModeB                       0.7065       1.514e-02            < 2e-16 ***
deliveryModeC                       1.462        7.000e-02            < 2e-16 ***
deliveryModeD                       0.8555       9.317e-03            < 2e-16 ***
discountedSalesRatio               -0.3132       1.026e-02            < 2e-16 ***
avgReturnTime                       4.479e-02    2.518e-04            < 2e-16 ***
mostFreqFirstChannelmiddleFunnel   -0.5768       7.543e-03            < 2e-16 ***
mostFreqFirstChannelupperFunnel    -0.1756       6.279e-03            < 2e-16 ***
mostFreqFirstChannelNA              2.898e-02    8.926e-03            0.00117 **
mostFreqItem1B                      0.3770       8.323e-02            5.92e-06 ***
mostFreqItem1C                      0.1519       2.973e-02            3.25e-07 ***
mostFreqItem1D                     -0.3641       1.523e-02            < 2e-16 ***
mostFreqItem1E                      0.2250       1.071e-02            < 2e-16 ***
mostFreqItem1F                      0.1107       1.285e-02            < 2e-16 ***
mostFreqItem1G                      0.8319       2.375e-02            < 2e-16 ***
mostFreqItem1H                      0.3476       1.712e-02            < 2e-16 ***
mostFreqItem1I                     -0.1731       1.353e-02            < 2e-16 ***
mostFreqItem1J                     -1.581        1.676e-01            < 2e-16 ***
mostFreqItem1K                      0.3738       1.501e-02            < 2e-16 ***
mostFreqItem1L                     -3.099e-02    2.103e-02            0.14070
mostFreqItem1M                      0.3833       2.498e-02            < 2e-16 ***
mostFreqItem1N                      0.2728       1.631e-02            < 2e-16 ***
mostFreqItem1O                      0.7259       1.368e-01            1.11e-07 ***
mostFreqItem1P                     -5.769e-03    1.074e-02            0.59124
mostFreqItem1Q                      0.3600       1.262e-02            < 2e-16 ***
mostFreqItem1R                      0.2221       1.033e-02            < 2e-16 ***
pageviewPerSession                  2.523e-03    7.670e-05            < 2e-16 ***
conversionRate                      9.232e-02    1.262e-02            2.59e-13 ***
multiSizeItems                     -2.260e-02    1.324e-03            < 2e-16 ***
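As an illustration of how the log-odds in Table D.1 translate into a churn probability through the logistic link, the sketch below uses a hand-picked subset of the coefficients above; the customer's feature values are hypothetical, and all omitted features are implicitly held at zero (i.e. at their reference level):

```python
import math

# A few coefficients copied from Table D.1 (bias-reduced logistic regression).
coef = {
    "(Intercept)": -1.689,
    "countryB": 1.048,
    "totalOrders": -0.2055,
    "returnRate": 0.6432,
    "avgReturnTime": 4.479e-02,
}

# Hypothetical customer: lives in country B, has placed 3 orders,
# returns 40% of purchases, averages 12 days per return.
x = {"countryB": 1, "totalOrders": 3, "returnRate": 0.40, "avgReturnTime": 12}

log_odds = coef["(Intercept)"] + sum(coef[k] * v for k, v in x.items())
p_churn = 1 / (1 + math.exp(-log_odds))  # logistic (inverse-logit) link
print(round(p_churn, 2))  # → 0.39
```

In the full model, every feature in Table D.1 contributes to the log-odds; customers with a predicted probability above the classification threshold are labeled as churners.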


D.2 Classifying customers of frequent returns

Table D.2: The resulting logistic classification model for classifying returners. The table presents the estimated coefficients together with their associated standard deviations and significance levels.

Coefficient                         Estimate     Standard Deviation   p-value
(Intercept)                        -0.9530       1.502e-02            < 2e-16 ***
countryB                            0.4265       1.396e-02            < 2e-16 ***
countryC                            0.3503       1.464e-02            < 2e-16 ***
countryD                           -0.1057       1.806e-02            4.78e-09 ***
countryE                            0.2660       1.442e-02            < 2e-16 ***
countryF                            0.1684       1.223e-02            < 2e-16 ***
relationLength                     -6.124e-04    9.371e-06            < 2e-16 ***
Recency                             7.836e-04    1.567e-05            < 2e-16 ***
totalOrders                        -0.1371       1.529e-03            < 2e-16 ***
deliveryModeB                       0.1702       1.598e-02            < 2e-16 ***
deliveryModeC                      -0.6233       8.738e-02            9.86e-13 ***
deliveryModeD                      -8.166e-02    1.022e-02            1.34e-15 ***
avgReturnTime                      -2.083e-03    2.336e-04            < 2e-16 ***
mostFreqFirstChannelmiddleFunnel    0.2741       7.775e-03            < 2e-16 ***
mostFreqFirstChannelupperFunnel     0.1929       6.876e-03            < 2e-16 ***
mostFreqFirstChannelNA              0.3276       9.066e-03            < 2e-16 ***
pageviewPerSession                 -8.751e-03    1.015e-04            < 2e-16 ***
conversionRate                      0.4816       1.290e-02            < 2e-16 ***
multiSizeItems                      8.556e-02    1.068e-03            < 2e-16 ***


D.3 Classifying loyal customers

Table D.3: The resulting logistic classification model for classifying loyal customers. The table presents the estimated coefficients together with their associated standard deviations and significance levels.

Coefficient                         Estimate     Standard Deviation   p-value
(Intercept)                        -7.9518       0.0572               < 2e-16 ***
countryB                            0.0992       0.0386               0.0103 *
countryC                            0.0217       0.0424               0.6082
countryD                            0.0178       0.0435               0.6823
countryE                            0.0589       0.0417               0.158059
countryF                            0.1895       0.0273               4.25e-12 ***
relationLength                      0.0004       0.0000               < 2e-16 ***
Recency                            -0.0220       0.0001               < 2e-16 ***
totalOrders                         1.1485       0.0048               < 2e-16 ***
returnRate                          0.5612       0.0383               < 2e-16 ***
deliveryModeB                       0.2026       0.0496               4.53e-05 ***
deliveryModeC                       0.2522       0.3216               0.4329
deliveryModeD                       0.1160       0.0289               6.14e-05 ***
discountedSalesRatio               -0.7637       0.0440               < 2e-16 ***
mostFreqFirstChannelmiddleFunnel   -0.2630       0.0305               < 2e-16 ***
mostFreqFirstChannelupperFunnel    -0.1079       0.0192               2.21e-08 ***
mostFreqFirstChannelNA             -0.4593       0.1193               0.0001 ***
conversionRate                     -0.4987       0.0719               4.29e-12 ***
multiSizeItems                     -0.0092       0.0024               0.0002 ***
