user payment prediction in free-to-play

USER PAYMENT PREDICTION IN F2P GAMES

Master Thesis

Ahmed Hassan

Overview

Introduction

Methodology

Experiments

Results and Findings

Conclusion and Future Work

INTRODUCTION

Who is Bigpoint GmbH?

The company

The BI team

The project

What is Predictive Analytics?

Predict future behaviour based on past and current data

Process:

Problem Definition

Player Lifetime Value

$LTV_t = t \cdot p_t \cdot n_t \cdot c$

where:

$t$: timeframe of calculation

$p_t$: average payment within the timeframe

$n_t$: number of payments within the timeframe

$c$: other factors such as profit margin, discount rate, etc.
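As a quick worked example, the LTV formula can be evaluated directly. The input values below are invented for illustration and do not come from the thesis data; the function name is also hypothetical.

```python
def lifetime_value(t: float, p_t: float, n_t: float, c: float = 1.0) -> float:
    """Evaluate LTV_t = t * p_t * n_t * c exactly as the formula is written."""
    return t * p_t * n_t * c


# Hypothetical inputs: timeframe t = 3, average payment p_t = 5.0,
# number of payments n_t = 2, combined adjustment factor c = 0.5.
example_ltv = lifetime_value(t=3, p_t=5.0, n_t=2, c=0.5)
print(example_ltv)  # 15.0
```

Every factor here is a single averaged number, which is why simple extrapolation from it ignores the player-level behaviour behind each variable.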

Problem Definition

Normally, LTV is predicted by very simple extrapolation from current and past data

This ignores all the factors underlying the variables in the equation and usually yields inaccurate forecasts!

Problem Statement

“Given the huge amount of data collected about the players in a free-to-play game, which includes player personal information, geographical information, game experience information, temporal information, etc., can we predict if a player who registered within a certain period will pay real currency inside the game within a specified timeframe?”

Reviewing Literature

Sifa et al. use classification and regression to predict the purchase decision and the number of payments for an F2P mobile game. They use Decision Trees, SVMs and Random Forests for classification, and Poisson Regression Trees for the count

Xie et al. use a simple approach to obtain generic features that are independent of the game. They use only the frequency of different game events to predict player churn and first payment

Kim et al. use combined classifiers to predict user purchase decisions in an e-commerce application. The combination is done via a Genetic Algorithm that models the classifiers as individuals, with fitness based on the hit ratio of the classifiers

METHODOLOGY

METHODOLOGY: DATA COLLECTION

Big Data Environment

Data Collection

The dataset contains around 300,000 players who registered within a 3-month period

The dataset contains dimensions covering players' personal information, character information, game activity and interaction, in addition to payment information

METHODOLOGY: DATA ANALYSIS

Payuser Distribution

Data Analysis

Dataset Visualization

Cluster analysis

METHODOLOGY: DATA MODELLING

Feature Selection

Spearman’s Coefficient

$r_s = \frac{\operatorname{cov}(\operatorname{rank}(x), \operatorname{rank}(y))}{\sigma_x \cdot \sigma_y}$
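A minimal sketch of Spearman's coefficient in plain Python is given here, assuming the standard deviations in the denominator are those of the rank variables (the usual definition). The function names are illustrative and not taken from the thesis code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so unnormalised sums are sufficient.
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks are used, any strictly increasing relationship, linear or not, gives a coefficient of 1.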

Mutual Information

$MI(X, Y) = \sum_{x,y} P_{XY}(x, y) \log\left(\frac{P_{XY}(x, y)}{P_X(x) \cdot P_Y(y)}\right)$
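The mutual information formula can be sketched for discrete variables by estimating the probabilities from observed frequencies. This is an illustrative estimator, not the thesis implementation; it uses the natural logarithm, and continuous features would first have to be binned.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate MI(X, Y) in nats from two equally long sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

A feature that is independent of the target scores 0, while a balanced binary feature that fully determines the target scores log 2.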

Class Imbalance

Class imbalance occurs when one of the predicted classes has far fewer samples than the others

It is bad for classifiers because they learn to predict everything as the majority class, which still gives high accuracy

Solutions?

Use a different performance measure

Balance the dataset by sampling

Undersampling

Oversampling

Combined

Weighted cost functions
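The sampling and weighting options listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the majority class is assumed to be larger than the minority class, and the weighting rule (total samples divided by number of classes times class count) is one common heuristic rather than the scheme confirmed by the thesis.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Drop majority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Duplicate minority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    n = sum(class_counts.values())
    return {label: n / (len(class_counts) * count)
            for label, count in class_counts.items()}
```

With 95 non-payers and 5 payers, for example, `balanced_class_weights({0: 95, 1: 5})` gives the payer class a weight of 10, so a misclassified payer costs as much as roughly nineteen misclassified non-payers.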

Class Imbalance

Suitable performance measures

True Positive Rate (Sensitivity, Recall)

$TPR = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$

True Negative Rate (Specificity)

$TNR = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}}$

False Negative Rate

$FNR = \frac{\text{False Negative}}{\text{False Negative} + \text{True Positive}}$

False Positive Rate

$FPR = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}}$
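The four rates follow directly from the confusion-matrix counts. The sketch below also computes accuracy to show why it is misleading under class imbalance; the counts are invented for illustration and are not results from the experiments.

```python
def confusion_rates(tp, fn, tn, fp):
    """Compute the rates defined above, plus accuracy, from raw counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (fn + tp),
        "FPR": fp / (fp + tn),
        "ACC": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical test set: 50 payers and 950 non-payers.
rates = confusion_rates(tp=10, fn=40, tn=940, fp=10)
print(rates["ACC"], rates["TPR"])  # 0.95 0.2
```

Accuracy is 95% even though only one payer in five is found. Note also that TPR + FNR = 1 and TNR + FPR = 1 by construction.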

Class Imbalance

Synthetic Minority Oversampling TEchnique (SMOTE)
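The core idea of SMOTE is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified variant for illustration: it picks the base sample at random instead of iterating over all minority samples, uses a brute-force neighbour search, and its function name and parameters are assumptions rather than the thesis implementation.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature tuples.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance),
        # excluding the base point itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic
```

Unlike plain oversampling, the synthetic points are not duplicates: each lies on a line segment between two real minority samples. If the classes overlap in feature space, such points can also fall inside majority regions.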

The Classification Problem

Which classifiers to use? We use criteria to help:

1. Offer a good true positive rate

2. Handle a nonlinear feature space

3. Generalize well and do not overfit when class balancing techniques are applied

4. Allow the class weights to be adjusted or the classifier's cost function to be optimized

Classifiers

Support Vector Machines

Weighted Random Forests

Gradient Boosting

EXPERIMENTS

Experiment 1

Goal: To test the performance of the classifiers without application of SMOTE

Settings:

Weighted Random Forests

number of trees = 500

number of random features to use = 10

SVM

kernel = RBF

gamma = 1

C = 200

Gradient Boosting


depth of tree = 3

Experiment 2

Goal: To test the performance of the classifiers after application of SMOTE

Settings:

Weighted Random Forests


number of random features to use = 10

SVM

kernel = RBF

gamma = 0.5

C = 300

Gradient Boosting


depth of tree = 3

RESULTS AND FINDINGS

SVM Results

AUC without SMOTE = 0.8639

AUC with SMOTE = 0.8969

Random Forests Results



Gradient Boosting Results



Classifiers Performance

Experiment 1

Algorithm  ACC   TPR   TNR   FPR   FNR   AUC
SVM        0.95  0.25  0.99  0.01  0.55  0.8639
wRF        0.96  0.62  0.97  0.03  0.38  0.9537
GBM        0.89  0.19  0.97  0.03  0.81  0.8831

Classifiers Performance

Experiment 2

Algorithm  ACC   TPR   TNR   FPR   FNR   AUC
SVM        0.95  0.39  0.99  0.01  0.61  0.8969
wRF        0.97  0.66  0.97  0.03  0.34  0.9607
GBM        0.94  0.36  0.97  0.03  0.64  0.8953

Findings

Using SMOTE improves the classifiers' performance

The TPR remains low, which could be attributed to the selected features

Gradient Boosting seems to overfit due to the large number of sequential trees

Although Random Forests grows larger and deeper trees, it is highly parallelizable, whereas Gradient Boosting is sequential by nature. Random Forests is therefore faster and preferable in our case with a big dataset, while SVM was the worst in terms of computation time

The results confirm our suspicion of class overlap

CONCLUSION AND FUTURE WORK

Summing Up

The goal was to create a framework or process that helps BI predict user payments using machine learning, so that they can optimize their analysis output and improve targeting

We have followed the predictive analytics procedure from collecting data, to analysis, to modelling

We have shown that the methodology we follow has potential, with acceptable performance; however, we need to address the open issues we found before starting the last step, deployment

Future Work

To make the predictions more useful, we also want to predict:

Number of payments

Value of payments

Add more features like in-game activities, and game technical performance

Address the class overlap problem using more data from different time windows, as well as the newly introduced features

Integrate the final framework into the current running systems used by BI

Questions?
