user payment prediction in free-to-play
Post on 29-Jan-2018
USER PAYMENT PREDICTION IN F2P GAMES
Master Thesis
Ahmed Hassan
Overview
Introduction
Methodology
Experiments
Results and Findings
Conclusion and Future Work
INTRODUCTION
Who is Bigpoint GmbH?
The company
The BI team
The project
What is Predictive Analytics?
Predict future behaviour based on past and current data
Process:
Problem Definition
Player Lifetime Value
LTV_t = t \cdot p_t \cdot n_t \cdot c
where:
t: timeframe of calculation
p_t: average payment within timeframe
n_t: number of payments within timeframe
c: other factors such as profit margin, discount rate, etc…
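As a worked illustration of the formula above, the following minimal sketch multiplies the four terms exactly as written. The function name and all input values are hypothetical and chosen only to show the arithmetic; they are not taken from the thesis.

```python
def lifetime_value(t, avg_payment, n_payments, c=1.0):
    """LTV_t = t * p_t * n_t * c, term by term as in the formula above."""
    return t * avg_payment * n_payments * c


# Hypothetical inputs: one timeframe unit, 5.0 average payment,
# 3 payments, and c = 0.7 standing in for margin/discount factors.
ltv = lifetime_value(t=1, avg_payment=5.0, n_payments=3, c=0.7)
print(ltv)
```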
Problem Definition
Normally, LTV is predicted by simple extrapolation from current and past data
This ignores the factors underlying the variables in the equation and usually yields inaccurate forecasts
Problem Statement
“Given the huge amount of data collected about the players in a free-to-play game, which includes personal information, geographical information, game experience information, temporal information, etc., can we predict if a player who registered within a certain period will pay real currency inside the game within a specified timeframe?”
Reviewing Literature
Sifa et al. use classification and regression to predict the purchase decision and the number of payments for an F2P mobile game. They use Decision Trees, SVM and Random Forests for classification, and Poisson Regression Trees for the payment count.
Xie et al. use a simple approach to obtain generic features that are independent of the game. They use only the frequency of different game events to predict player churn and first payment.
Kim et al. use combined classifiers to predict the user purchase decision in an e-commerce application. The combination is found with a Genetic Algorithm that models the classifiers as individuals and bases the fitness on the classifiers' hit ratio.
METHODOLOGY
METHODOLOGY: DATA COLLECTION
Big Data Environment
Data Collection
The dataset contains around 300,000 players who registered within a 3-month period
The dataset contains dimensions covering players' personal information, character information, game activity and interaction, as well as payment information
METHODOLOGY: DATA ANALYSIS
Payuser Distribution
Data Analysis
Dataset Visualization
Cluster analysis
METHODOLOGY: DATA MODELLING
Feature Selection
Spearman’s Coefficient
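r_s = \frac{\operatorname{cov}(\operatorname{rank}(x), \operatorname{rank}(y))}{\sigma_{\operatorname{rank}(x)} \cdot \sigma_{\operatorname{rank}(y)}}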
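The coefficient can be sketched in a few lines of NumPy: rank both variables, then divide the covariance of the ranks by the product of their standard deviations. The helper names and the toy data are assumptions for illustration; the slides do not say which implementation was used.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks inside each group of ties with the group mean.
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Covariance of the ranks over the product of the ranks' standard deviations."""
    rx, ry = average_ranks(x), average_ranks(y)
    cov = np.mean((rx - rx.mean()) * (ry - ry.mean()))
    return cov / (rx.std() * ry.std())


# Hypothetical feature/target pair: monotone but nonlinear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman(x, y)
```

Because only the ranks enter the computation, any strictly increasing relationship scores close to 1 even when it is far from linear.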
Mutual Information
MI(X, Y) = \sum_{x,y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x) \cdot P_Y(y)}
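For discrete features, the sum above can be estimated directly from empirical frequencies. This is a minimal sketch, assuming paired categorical samples and natural logarithms; the function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate MI(X, Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts for P_XY
    px = Counter(xs)              # counts for P_X
    py = Counter(ys)              # counts for P_Y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical examples: a feature identical to the label, and an independent one.
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Only value pairs that actually occur contribute to the sum, so no zero-probability term is ever passed to the logarithm.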
Class Imbalance
Class imbalance occurs when one of the predicted classes has far fewer samples than the others
It is bad for classifiers because they learn to predict everything as the majority class, which still gives high accuracy
Solutions?
Use a different performance measure
Balance the dataset by sampling
Undersampling
Oversampling
Combined
Weighted cost functions
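Two of the remedies listed above can be sketched with NumPy alone: class weights inversely proportional to class frequency, and random undersampling of the larger class. The weighting rule `n_samples / (n_classes * class_count)` is one common convention and is assumed here; the slides do not state which weighting was used. The data is synthetic.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


# Hypothetical toy data: 95 non-payers (0) and 5 payers (1).
rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))
weights = balanced_class_weights(y)
X_bal, y_bal = random_undersample(X, y, rng)
```

Undersampling discards most of the majority class, which is why it is often combined with oversampling of the minority class rather than used alone.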
Class Imbalance
Suitable performance measures
True Positive Rate (Sensitivity, Recall)
TPR = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
True Negative Rate (Specificity)
TNR = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}}
False Negative Rate
FNR = \frac{\text{False Negative}}{\text{False Negative} + \text{True Positive}}
False Positive Rate
FPR = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}}
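The four rates follow directly from the confusion-matrix counts. The sketch below computes them together with accuracy and applies them to a hypothetical classifier that always predicts the majority class, which shows why accuracy alone is misleading on imbalanced data. It assumes binary labels with 1 as the positive (paying) class and both classes present.

```python
def confusion_rates(y_true, y_pred):
    """Return ACC, TPR, TNR, FNR and FPR for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumes at least one positive and one negative sample in y_true.
    return {
        "ACC": (tp + tn) / len(pairs),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (fn + tp),
        "FPR": fp / (fp + tn),
    }


# Hypothetical data: 95 non-payers, 5 payers, and a model that predicts 0 for everyone.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
rates = confusion_rates(y_true, y_pred)
```

Here the accuracy is 0.95 while the true positive rate is 0, because not a single paying player is identified.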
Class Imbalance
Synthetic Minority Oversampling TEchnique (SMOTE)
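The core idea of SMOTE is to create new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified sketch of that idea in NumPy, not the implementation used in the thesis; the function signature, `k`, and the toy data are assumptions, and it requires `k` to be smaller than the number of minority samples.

```python
import numpy as np


def smote(X_min, n_new, k, rng):
    """Create n_new synthetic samples from the minority-class matrix X_min."""
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


# Hypothetical minority class: 20 points in 2 dimensions.
rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))
X_new = smote(X_min, n_new=30, k=5, rng=rng)
```

Every synthetic point lies on the line segment between two existing minority samples, so oversampling this way fills in the minority region instead of duplicating points.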
The Classification Problem
Which classifiers to use? We use criteria to help:
1. Offers a good true positive rate
2. Handles a nonlinear feature space
3. Generalizes well and does not overfit when some of the class balancing techniques are used
4. Allows adjusting the class weights or optimizing the cost function of the classifier
Classifiers
Support Vector Machines
Weighted Random Forests
Gradient Boosting
EXPERIMENTS
Experiment 1
Goal: To test the performance of the classifiers without application of SMOTE
Settings:
Weighted Random Forests
number of trees = 500
number of random features to use = 10
SVM
kernel = RBF
gamma = 1
C = 200
Gradient Boosting
number of trees = 150
depth of tree = 3
Experiment 2
Goal: To test the performance of the classifiers after application of SMOTE
Settings:
Weighted Random Forests
number of trees = 100
number of random features to use = 10
SVM
kernel = RBF
gamma = 0.5
C = 300
Gradient Boosting
number of trees = 150
depth of tree = 3
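The settings of both experiments could be expressed as follows with scikit-learn. This is an assumption made for illustration: the slides do not name the library that was used, and `class_weight="balanced"` is only one possible reading of "weighted" Random Forests. The function name is hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC


def build_classifiers(experiment):
    """Return the three classifiers configured as listed for experiment 1 or 2."""
    if experiment == 1:
        rf_trees, gamma, c = 500, 1.0, 200.0
    elif experiment == 2:
        rf_trees, gamma, c = 100, 0.5, 300.0
    else:
        raise ValueError("experiment must be 1 or 2")
    return {
        # Assumption: class weighting via scikit-learn's "balanced" mode.
        "wRF": RandomForestClassifier(
            n_estimators=rf_trees, max_features=10, class_weight="balanced"
        ),
        "SVM": SVC(kernel="rbf", gamma=gamma, C=c),
        "GBM": GradientBoostingClassifier(n_estimators=150, max_depth=3),
    }


models_exp1 = build_classifiers(1)
models_exp2 = build_classifiers(2)
```

Note that `max_features=10` requires the training data to have at least 10 features when the forest is fitted.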
RESULTS AND FINDINGS
SVM Results
AUC without SMOTE = 0.8639
AUC with SMOTE = 0.8969
Random Forests Results
AUC without SMOTE = 0.9537
AUC with SMOTE = 0.9607
Gradient Boosting Results
AUC without SMOTE = 0.8831
AUC with SMOTE = 0.8953
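The AUC values above summarize each ROC curve in one number. One way to compute it, sketched here under the assumption of binary labels and real-valued classifier scores, is as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores below are made up for illustration.

```python
import numpy as np


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise comparison needs memory proportional to the number of positives times the number of negatives, so this form suits small samples; for a dataset of the size described here a rank-based formulation would be preferable.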
Classifiers Performance
Experiment 1
Algorithm  ACC    TPR   TNR   FPR   FNR   AUC
SVM        0.950  0.25  0.99  0.01  0.55  0.8639
wRF        0.96   0.62  0.97  0.03  0.38  0.9537
GBM        0.89   0.19  0.97  0.03  0.81  0.8831
Classifiers Performance
Experiment 2
Algorithm  ACC    TPR   TNR   FPR   FNR   AUC
SVM        0.95   0.39  0.99  0.01  0.61  0.8969
wRF        0.97   0.66  0.97  0.03  0.34  0.9607
GBM        0.94   0.36  0.97  0.03  0.64  0.8953
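By definition TPR + FNR = 1 and TNR + FPR = 1, so the reported rates can be sanity-checked row by row. The sketch below copies the (TPR, TNR, FPR, FNR) values from the two tables and lists the rows that violate either identity beyond rounding; the tolerance and the helper name are assumptions.

```python
# (TPR, TNR, FPR, FNR) as reported in the Experiment 1 and Experiment 2 tables.
results = {
    ("Experiment 1", "SVM"): (0.25, 0.99, 0.01, 0.55),
    ("Experiment 1", "wRF"): (0.62, 0.97, 0.03, 0.38),
    ("Experiment 1", "GBM"): (0.19, 0.97, 0.03, 0.81),
    ("Experiment 2", "SVM"): (0.39, 0.99, 0.01, 0.61),
    ("Experiment 2", "wRF"): (0.66, 0.97, 0.03, 0.34),
    ("Experiment 2", "GBM"): (0.36, 0.97, 0.03, 0.64),
}


def inconsistent_rows(rows, tol=0.005):
    """Return the keys whose rates break TPR + FNR = 1 or TNR + FPR = 1."""
    flagged = []
    for key, (tpr, tnr, fpr, fnr) in rows.items():
        if abs(tpr + fnr - 1.0) > tol or abs(tnr + fpr - 1.0) > tol:
            flagged.append(key)
    return flagged


flagged = inconsistent_rows(results)
print(flagged)
```

With the values as transcribed, only the Experiment 1 SVM row fails the check (0.25 + 0.55 = 0.80), which suggests that one of those two figures was mistyped on the slide.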
Findings
Using SMOTE improves the classifiers' performance
The TPR remains low, which could be attributed to the selected features
Gradient Boosting seems to overfit due to the large number of sequential trees
Although Random Forests grows deeper, more fully developed trees, it is highly parallelizable, whereas Gradient Boosting is sequential by nature. Random Forests is therefore faster and preferable in our case with a big dataset, while SVM was the worst in terms of computation time
The results confirm our suspicion of class overlap
CONCLUSION AND FUTURE WORK
Summing Up
The goal was to create a framework or process that helps BI predict user payments using machine learning, so that they can optimize their output analysis and improve targeting
We have followed the predictive analytics procedure from collecting data, to analysis, to modelling
We have shown that the methodology has potential and delivers acceptable performance; however, we need to address the open issues we found before starting the last step, deployment
Future Work
To make the prediction more useful, we also want to predict
Number of payments
Value of payments
Add more features like in-game activities, and game technical performance
Address the class overlapping problem, using more data from different time windows, as well as the newly introduced features
Integrate the final framework into the current running systems used by BI
Questions?