Santander Customer Satisfaction
Derek van den Elsen
2580100
October 30, 2017
Research Paper Business Analytics
Dr. Mark Hoogendoorn (supervisor)
Contents

1 Introduction
2 Related Work
3 Data Exploration
  3.1 Feature Groups
  3.2 Individual Features
    3.2.1 var3 (Nationality)
    3.2.2 var15 (Age)
    3.2.3 var21
    3.2.4 var36
    3.2.5 var38 (Mortgage Value)
    3.2.6 num_var4 (Number of Bank Products)
4 Preprocessing
  4.1 Data Cleaning
  4.2 Correlation
  4.3 Feature Engineering
5 Modeling
  5.1 Performance Measure
  5.2 Solution Methods
    5.2.1 Logistic Regression
    5.2.2 Decision Tree
    5.2.3 Random Forest
    5.2.4 XGBoost
6 Results
  6.1 Tuning RF
  6.2 Tuning XGBoost
  6.3 Results
7 Conclusion
A Appendix Feature Sets
1 Introduction
Having adequate customer relations is paramount to success in any service industry. Identifying and analyzing your customers' contentment to improve customer retention can yield many benefits. The longer a client stays with an organisation, the more value they create. There are higher costs attached to introducing and attracting new customers, and long-standing clients also have a better understanding of the organisation and can give positive word-of-mouth promotion (Colgate et al., 1996). Data mining is essential in this process and the practice is widely applied across industries, for instance FMCG retail (Buckinx and van den Poel, 2005), telecommunications (Mozer et al., 2000) and banking (Clemes et al., 2010; Xie et al., 2009).
This paper focuses on Santander Bank, a large corporation focusing principally on the market in the northeast United States. Through a Kaggle competition (Santander, 2015), the objective is to find an appropriate model to predict whether a client will be dissatisfied in the future based on certain characteristics. With this model in place, Santander can take proactive steps to improve a customer's happiness before they take their business elsewhere.
First the paper discusses related work done on this, by now concluded, Kaggle case. Secondly it delves into the data we work with, analyzing groups of variables and individual features to gain insight into what is relevant. Thirdly it outlines several cleaning procedures that were employed to improve results. Fourthly we explain the performance measure of this competition and the three models we utilize to tackle the problem: Logistic Regression, Random Forest and XGBoost. Lastly the tuning process and results are discussed. We reach an AUC score of 0.823152.
2 Related Work
This section looks into the work of various Kaggle competitors in different sections of the leaderboard. The private leaderboard score is mentioned first, after which a short description details the work done. For reference, the top position had a score of 0.829072.
0.828530: Silva et al. (2016) each seemingly independently do their own preprocessing, feature engineering and model selection, and then combine all predictions of the models in an ensemble using the R package optim. Preprocessing steps taken include replacing certain values by NA, dropping sparse, constant and duplicated features, normalization, log-transforming features and one-hot encoding categorical features. Sophisticated feature engineering methods employed include t-Distributed Stochastic Neighbour Embedding, Principal Component Analysis and K-means. Models explored are: Follow the Proximally Regularized Leader, Regularized Greedy Forest, Adaboost, XGBoost, Neural Networks, Least Absolute Shrinkage and Selection Operator, Support Vector Machine, Random Forest and an Extra Trees Classifier.
0.826826: Yooyen and Ma (2016) are one of the few that find and handle duplicated observations in the train set. Furthermore they apply standard preprocessing steps like removing duplicated and constant features, normalizing, rescaling and handling missing values. They select features based on Pearson's correlation coefficient with the target and crossvalidation. Attempts at specifically handling the class imbalance and principal component analysis were fruitless. With success they added heuristic rules, like var15 < 23 implying the target is 0. Their main models are based on decision trees.
0.8249: Wang (2016) applies hardly any preprocessing, as he intends to use only decision trees, which are relatively insensitive to it. He does extensively select features, with importance in the Gradient Boosting Classifier as the criterion. After that he adds percentile change from feature to feature, and applies selection again. Parameters are trained iteratively one at a time with a coarse-to-fine approach. Adaboost, Bagging Classifier, Extra Trees Classifier, Gradient Boosting, Random Forest and XGBoost are ensembled to produce the final scores.
0.824332: Kumar (2015) starts with linear regression on the top ten features and a support vector machine in combination with principal component analysis, with limited success. Neural Networks, a tuned Random Forest and XGBoost yield him better results. XGBoost is seen as the obvious best candidate and is used with the top N features, which makes it apparent that anything beyond 5 features only adds very minor improvements to the crossvalidation score.
3 Data Exploration
The train set consists of 76020 observations and 370 features plus 1 binary target. The test set has a roughly equal number of 75818 observations with the same features. There is a large class imbalance, with 96.04% of targets being 0, meaning the customer was not dissatisfied, and 3.96% being 1, signifying that the customer was dissatisfied. This is in line with the expected customer satisfaction of a successful bank. There are no missing values in the train or test set, but some values might encode 'missing'. All features are numeric or possibly categorical. No features seem to have substantial outliers.
The dataset is semi-anonymized, so it is unclear what a feature represents. The only clue we have is a header with a name for each feature that is clearly not randomly determined. For illustrative purposes, the first seven names are given in order from left to right: ID, var3, var15, imp_ent_var16_ult1, imp_op_var39_comer_ult1, imp_op_var39_comer_ult3 and imp_op_var40_comer_ult1. Some of these words appear to be abbreviations of Spanish words, like 'imp' for importe or amount. A non-comprehensive dictionary is discussed on the Kaggle forum (Andreu, 2015). At first glance, one can also infer that imp_op_var39_comer_ult1 and imp_op_var39_comer_ult3 are probably related, and that they are likely not related to var3. Looking at the distribution of the data can confirm these suspicions. We first distinguish some groups based on their names and broadly research each group in turn. Then we look into individual features that do not fit into a group in more depth. Aside from the clearly irrelevant ID variable, this covers all the features. This is more practical than discussing all 370 features thoroughly.
3.1 Feature Groups
The sheer number of features in this dataset makes it hard to individually analyze and discuss each feature, so instead similar groups have been identified in the dataset, assisted by their corresponding names. Table 1 shows which substrings have been filtered on. There is some overlap between the groups: all 'Meses' features are also 'Num' features, for example, and all 'Delta' features are also either 'Imp' or 'Num' features.
Table 1: Variable Groupings

Substring   Example                       Number (Raw)
'Num'       num_var37_med_ult2            87 (155)
'Ind'       ind_var13_0                   46 (75)
'Saldo'     saldo_medio_var5_hace3        43 (71)
'Imp'       imp_op_var39_comer_ult1       21 (63)
'Delta'     delta_imp_amort_var18_1y3      3 (26)
'Meses'     num_meses_var5_ult3            8 (11)
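The grouping by substring can be reproduced with a simple filter over the column names. A minimal sketch, using only the six example names from Table 1 rather than the full 370 columns:

```python
# Hypothetical subset of the column names, one example per group in Table 1.
cols = [
    "num_var37_med_ult2", "ind_var13_0", "saldo_medio_var5_hace3",
    "imp_op_var39_comer_ult1", "delta_imp_amort_var18_1y3",
    "num_meses_var5_ult3",
]

# Map each substring to the columns containing it. Groups may overlap:
# the 'meses' feature is also a 'num' feature, and the 'delta' feature
# also contains 'imp'.
groups = {key: [c for c in cols if key in c]
          for key in ["num", "ind", "saldo", "imp", "delta", "meses"]}
```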
The substring 'Num' likely stands for numeric variables. Typically, excluding the 'Delta' and 'Meses' subsets, these have the values 0, 3, 6, 9 and further multiples of 3 as their most common observations. These could indicate quarters, for example. 0 and 3 tend to be most common and the distribution is usually unbalanced. The substring 'Ind' likely stands for indicator variables, as all of them are 0 or 1. Their distribution is usually unbalanced, but not consistently towards 1 or 0.
The substring 'Saldo' suggests the current actual amount on balance for certain financial products. A lot of these variables have an overwhelming number of zeros, possibly representing finished financial products or products that are not utilized in the first place. Other values are typically numeric and of large scale, like 6119500. Some values are also negative, further providing evidence that this is a 'balance' type variable. Very similarly, 'Imp' can stand for importe (Spanish for amount) and the distributions match this conjecture. Notably, the scale tends to be smaller, so perhaps 'Saldo' is a sum over consecutive periods.
The substring 'Delta' signifies a difference of some kind, but the variables' distributions match ratios, so it is possibly the ratio of an amount between two time periods. Here too there is a massive imbalance towards the value 0. All 'Meses' variables are 'Num' variables, but they are specifically taken as a subgroup because they have a wildly different distribution. 'Meses' is Spanish for months, and in the data these variables only take the values 0, 1, 2 and 3. Some of the 'Meses' variables are even fairly balanced, which is quite unique within this dataset.
3.2 Individual Features
These features all have a very short name and could be considered a group of their own. However, upon closer inspection of at minimum two of them, it becomes apparent that they are clearly not related in the same manner as the previous groups of features. This individuality also means we need to consider each feature separately.
3.2.1 var3 (Nationality)
Table 2: var3

Maximum Value               238
Minimum Value               -999999
Unique Observation Count    208
Most common (count)         2 (74165)
2nd most common (count)     8 (138)
3rd most common (count)     -999999 (116)
4th most common (count)     9 (110)
5th most common (count)     3 (108)
Suspected Category          Categorical
var3 is suspected to be nationality or country of residence. 208 unique countries sounds like a plausible number for the bank to supply. 74165 observations are 2, which probably stands for the United States, the main market for this particular bank. A binary feature var3_most_common is made to put emphasis on this; it is 1 if var3 is equal to 2 and 0 otherwise. -999999 likely encodes missing values and another binary feature var3_missing accounts for this. After this feature is made, the -999999 values are replaced by the most commonly occurring value 2.
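In pandas, the two indicator features and the imputation described above could look as follows. This is a sketch on a toy five-row column; the real train set has 76020 rows:

```python
import pandas as pd

# Toy stand-in for the var3 column; -999999 is assumed to encode 'missing'.
df = pd.DataFrame({"var3": [2, 2, 8, -999999, 3]})

df["var3_most_common"] = (df["var3"] == 2).astype(int)
df["var3_missing"] = (df["var3"] == -999999).astype(int)
# Replace the missing-value code with the modal value 2.
df.loc[df["var3"] == -999999, "var3"] = 2
```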
3.2.2 var15 (Age)
Table 3: var15

Maximum Value               105
Minimum Value               5
Mean                        33.212865
Unique Observation Count    100
Most common (count)         23 (20170)
2nd most common (count)     24 (6232)
3rd most common (count)     25 (4217)
4th most common (count)     26 (3270)
5th most common (count)     27 (2861)
Suspected Category          Numeric
var15 is suspected to represent age, as the minimum and maximum values are 5 and 105 respectively, but the majority of the data is over 21. The data seems very biased towards younger people, and perhaps 23 is filled in if the age is unknown. For this reason we make another binary feature var15_most_common, which is 1 if var15 is equal to 23 and 0 otherwise.
4
[Figure: histogram of var15 (x-axis: value 0-100, y-axis: count), for all observations and for Target = 1]

Figure 1: Histogram Age
3.2.3 var21
Table 4: var21

Maximum Value               30000
Minimum Value               0
Unique Observation Count    25
Most common (count)         0 (75152)
2nd most common (count)     900 (236)
3rd most common (count)     1800 (206)
4th most common (count)     4500 (96)
5th most common (count)     3000 (84)
Suspected Category          Numeric
Not every variable has an easy interpretation in this anonymized dataset, and var21 is a prime example. It is highly imbalanced and the non-zero values do not suggest a likely meaning. Nevertheless it could still be important, as it is clearly distinct from the other variables.
3.2.4 var36
Table 5: var36

Unique Observation Count    5
Most common (count)         99 (30064)
2nd most common (count)     3 (22177)
3rd most common (count)     1 (14664)
4th most common (count)     2 (8704)
5th most common (count)     0 (411)
Suspected Category          Categorical
There is not much to be said about this particular variable, except that it is very likely categorical. One-hot encoding is applied so that it can be better understood by our classifiers. In the same vein, 99, which likely stands for missing values, is encoded properly.
3.2.5 var38 (Mortgage Value)
Table 6: var38

Maximum Value               22034740
Minimum Value               5163.75
Mean                        117235
Unique Observation Count    57736
Most common (count)         117310.979016494 (14868)
2nd most common (count)     451931.22 (16)
3rd most common (count)     463625.16 (12)
4th most common (count)     288997.44 (11)
5th most common (count)     104563.80 (11)
Suspected Category          Numeric
The distribution ranges from low to high positive numbers, with a very large number of observations sharing the same value, 117310. It is our conjecture that this represents the mortgage value of a customer, or at least some kind of value-indicating variable, where the country average is filled in if the true value is unknown for whatever reason. For this reason we create a dummy variable var38_most_common that remembers this information; it is 1 if var38 is 117310 and 0 otherwise. We visualize the distribution of the known values in Figure 2 by making a histogram, excluding 117310 and cutting off the range at 350000, which excludes 1559 more observations.
6
[Figure: histogram of var38 (x-axis: value 0-350000, y-axis: count), for all observations and for Target = 1]

Figure 2: Histogram Mortgage
3.2.6 num_var4 (Number of Bank Products)
Table 7: num_var4

Maximum Value               7
Minimum Value               0
Unique Observation Count    8
Most common (count)         1 (38147)
2nd most common (count)     0 (19528)
3rd most common (count)     2 (12692)
4th most common (count)     3 (4377)
5th most common (count)     4 (1031)
Suspected Category          Numeric
According to dmi3kno (2015) this variable represents the number of bank products the client currently has with the bank. The distribution suggests that fewer people have multiple products with the bank, and those that do tend not to be dissatisfied, giving this feature explanatory value. This is also intuitive, as clients investing multiple times probably have a good relationship with the bank, and conversely the bank has more information to satisfy those clients appropriately.
[Figure: histogram of num_var4 (x-axis: value 0-7, y-axis: count), for all observations and for Target = 1]

Figure 3: Histogram Bank Products
4 Preprocessing
4.1 Data Cleaning
At first glance the data is fairly clean; however, there are definite problems with running certain features and observations through a machine learning algorithm. Figure 4 demonstrates the order in which the cleaning steps were taken. The left number represents the number of observations and the right the number of features.
76020 | 371
  → Removed Constant Features:       76020 | 337
  → Removed Duplicated Features:     76020 | 308
  → Removed Uninformative Features:  76020 | 204
  → Removed Index:                   76020 | 203
  → Removed Duplicated Observations: 71108 | 203
  → Removed Delta Features:          71108 | 200
  → Removed Correlated Features:     71108 | 171

Figure 4: Cleaning Process
There are 34 features that take one constant value (zero in all checked cases). Features that do not vary at all cannot teach our classifier anything meaningful, so all features with a standard deviation of zero are omitted. 29 features are exact duplicates of one another as well. We remove the redundancy by removing all but the first feature of each duplicate group. Note that the order of these cleaning procedures influences the number of removed features.
Some features have as few as 2 non-zero values. We categorize features with very few non-zero values as uninformative for our classifier. By arbitrarily drawing the line at 100, effectively removing all features with fewer than 100 non-zero observations, we remove 104 features from the data. Several of these were checked by eye, and having a non-zero value was not very discriminatory regarding the target value. For context, Figure 5 shows what different choices would have meant for the dimension of the feature set. Finally, we remove the index, and in doing so assume the training data appears to us in a random rather than a particular order.
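The first three removal steps can be sketched with pandas as below. The helper name and the toy frame are illustrative; `min_nonzero` plays the role of the information parameter (100 in our case):

```python
import pandas as pd

def clean_features(df, min_nonzero=100):
    """Drop constant, duplicated and near-empty feature columns."""
    # Constant features have zero standard deviation.
    df = df.loc[:, df.std() != 0]
    # Exact duplicate columns: keep the first of each group.
    df = df.T.drop_duplicates().T
    # Uninformative features: fewer than min_nonzero non-zero values.
    return df.loc[:, (df != 0).sum() >= min_nonzero]

# Toy frame: 'b' is constant, 'c' duplicates 'a'.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0],
                    "c": [1, 2, 3], "d": [5, 0, 0]})
cleaned = clean_features(toy, min_nonzero=1)
```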
[Figure: feature set dimension (y-axis: 160-300) versus information parameter (x-axis: 0-500)]

Figure 5: Information Parameter Range
There are also 4912 observations among the 76020 in the training data that are exact duplicates of another observation in all features. Even worse, 109 of those duplicates have a differing target, which can only 'confuse' a classifier. Although it can be argued that the classifier will simply attach less confidence to these observations, to avoid the issue altogether they are all excluded from the dataset. Of the duplicates that share the same target, the first is kept; 71108 observations remain after this procedure.
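A pandas sketch of this deduplication, on a toy frame where `f1` and `f2` stand in for the real features:

```python
import pandas as pd

df = pd.DataFrame({"f1": [1, 1, 2, 2, 3],
                   "f2": [0, 0, 5, 5, 6],
                   "TARGET": [0, 0, 0, 1, 1]})
feats = ["f1", "f2"]

# Drop every observation whose feature-duplicates disagree on the target...
conflicting = df.groupby(feats)["TARGET"].transform("nunique") > 1
df = df[~conflicting]
# ...then keep only the first of each group of identical observations.
df = df.drop_duplicates(subset=feats + ["TARGET"])
```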
The three 'Delta' variables that are left over are still relatively uninformative. They also do not appear to have great predictive power in any of the related literature studied. For these reasons they are removed as well. They are the only variable group for which this is deemed appropriate.
4.2 Correlation
Correlation can be a powerful tool in machine learning. One of a pair of heavily correlated predictors can be removed without harming the predictive power of a model. Also, a very high absolute correlation with the target can indicate an important variable. We apply the former, and we look at the top features in terms of correlation with the target. We first apply
normalization, where appropriate, to scale all variables between 0 and 1. We note that several of the features are hugely imbalanced, and this type of scaling does not disturb that. A feature X becomes the normalized feature Z with the following formula:

z_i = (x_i − min(X)) / (max(X) − min(X)),   for i = 1, ..., length(X)

After applying this to var15, var38, the 'Num's, 'Imp's and 'Saldo's, we plot the correlation matrix in Figure 6 and zoom in in Figure 7.
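As a minimal illustration of the formula (constant features were already removed, so max(X) > min(X) is guaranteed to hold):

```python
def min_max(xs):
    """Scale a feature to [0, 1]: z_i = (x_i - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

For example, ages ranging from 5 to 105 map onto the interval [0, 1].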
Figure 6: Correlation Matrix
[Figure: zoomed correlation matrix over the first features, var3 through ind_var1_0, with a color scale from -1.00 to 1.00]
Figure 7: Correlation Matrix Zoom in
Here it is clearly visible that, for instance, imp_op_var39_comer_ult1 is heavily correlated with imp_op_var39_comer_ult3, further confirming that similarly named variables are related and justifying the grouping of variables done earlier. Some features even have a correlation extremely close to 1 with each other, like num_var1_0 and ind_var1_0 with 0.9988. All variable pairs with an absolute correlation higher than a conservative 0.99 are filtered out and the first of each pair is deleted. An extension of this could instead delete the member of the pair with the lower absolute correlation with the target. This removes 29 features and brings the cleaned dataset to 171 features total. Figure 8 visualizes what different correlation thresholds would mean for the dimension of the feature set. We have been relatively conservative, as the tree-based algorithms we use are adept at handling correlated features and the dataset is already comparatively small.
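This filtering step could be sketched as follows; as in the text, the first member of each highly correlated pair is the one dropped. The function name and toy frame are illustrative:

```python
import pandas as pd

def drop_correlated(df, threshold=0.99):
    """Drop the first of every pair with |correlation| above threshold."""
    corr = df.corr().abs()
    cols, dropped = list(df.columns), set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in dropped and b not in dropped and corr.loc[a, b] > threshold:
                dropped.add(a)  # delete the first of the pair
    return df.drop(columns=sorted(dropped))

# Toy frame: x and y are perfectly correlated, z is not.
toy = pd.DataFrame({"x": [1, 2, 3, 4],
                    "y": [2, 4, 6, 8],
                    "z": [1, 0, 2, 1]})
reduced = drop_correlated(toy)
```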
[Figure: feature set dimension (y-axis: 80-200) versus correlation threshold (x-axis: 1.00 down to 0.70)]

Figure 8: Correlation Thresholds
Furthermore, some of the top correlations are showcased in Table 8. It is likely that these will be the important predictors later on.
Table 8: Top Correlations with Target

Negative                          Positive
ind_var30 (-0.149756)             var36 (0.102401)
num_meses_var5_ult3 (-0.147362)   var15 (0.097341)
num_var30 (-0.137623)             ind_var8_0 (0.048493)
num_var42 (-0.134246)             imp_op_var41_efect_ult1 (0.030599)
ind_var5 (-0.133128)              ind_var8 (0.029038)
4.3 Feature Engineering
This dataset mostly lent itself to decoding what the existing data meant and making sure it is in an appropriate format for machine learning algorithms. Nevertheless, this led to the creation of some features, mentioned before, to accommodate expected oddities in the data. For instance, the feature var38_most_common was created to ensure that the algorithm can recognize that this is likely a unique value outside of the numerical order that var38 has. Beyond the ones mentioned before, the 'Meses' group was completely one-hot encoded, because these features were small enough to warrant such an approach and seemed at first glance categorical.
Furthermore, it is noticeable that the data contains a lot of zeros. A zero usually stands for a lack of information or interaction, and precisely that lack of knowing thy customer could hold predictive value for determining whether the customer will possibly be dissatisfied in the future. For this reason n0 was created, which simply counts the number of zeros that appear in the row (dmi3kno, 2015). Special care is taken not to include the target variable, as that would be a form of information leakage which disturbs the learning process of the models. Conversely, an n1 was created that should signify that the bank does know their customer and has a lot of interaction with them. This is different, and not redundant in that way, from creating a variable that simply counts all spots in the row that are not zero. This brings the total dimensions of our training set to 71108 observations and 215 features to start the modeling process.
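The two count features can be built row-wise with pandas. A sketch on a toy frame; the real counts run over all remaining features except the target:

```python
import pandas as pd

df = pd.DataFrame({"var15": [23, 40, 23],
                   "saldo_var30": [0.0, 120.5, 0.0],
                   "num_var4": [1, 3, 0],
                   "TARGET": [0, 0, 1]})

feats = df.drop(columns=["TARGET"])      # exclude the target: no leakage
df["n0"] = (feats == 0).sum(axis=1)      # zeros: lack of interaction
df["n1"] = (feats == 1).sum(axis=1)      # ones: indicator-style activity
```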
5 Modeling
5.1 Performance Measure
The performance measure of this Kaggle competition is the area under the receiver operating characteristic curve, AUROC or AUC for short. This metric deals well with the imbalance that is typical in churn prediction (Burez and van den Poel, 2009). We build up to this concept by considering some simpler notions first (Dernoncourt, 2015). If we consider a binary classifier, we have four possible outcomes when we use it to make a binary prediction; we call the collection of these a confusion matrix, and an example is shown in Figure 9.
True negative: we predict 0 and the class is actually 0.
False negative: we predict 0 and the class is actually 1.
True positive: we predict 1 and the class is actually 1.
False positive: we predict 1 and the class is actually 0.
Figure 9: Confusion Matrix Example
We define the following ratios. The True Positive Rate corresponds to the proportion of positive data points that are correctly classified as positive, with respect to all positive data points. The higher the better, all else equal. The False Positive Rate corresponds to the proportion of negative data points that are incorrectly classified as positive, with respect to all negative data points. The lower the better, all else equal.
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Binary classifiers usually predict the probability with which they expect a 1 to occur. The threshold above which a probability leads to predicting a 1 is an arbitrary decision. It is possible to consider every single possible threshold and plot the corresponding pairs of TPR and FPR of the resulting predictions in a ROC curve plot. An example is given in Figure 10. To obtain a single performance metric we can take the area under the ROC curve, effectively taking into account both TPR and FPR. The baseline for this metric lies at 0.5, where we predict completely randomly. Anything performing worse can simply be inverted to do better.
Figure 10: AUC Example
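The construction above can be written out directly: sweep every threshold, collect the (FPR, TPR) pairs, and take the trapezoidal area under them. A minimal sketch:

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via an explicit threshold sweep."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # (FPR, TPR) at an infinitely high threshold
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```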
5.2 Solution Methods
We use 3 different methods to come to a solution: Logistic Regression and Random Forest from Scikit-Learn (Pedregosa et al., 2011), and XGBoost (Chen and Guestrin, 2016). We apply 10-fold stratified crossvalidation on the train set to compare models internally, consequently fit the best model on the entirety of the train data, then predict labels for the test set and validate these
on the Kaggle site for the final score. Stratification in this context refers to ensuring that each fold of the crossvalidation has roughly the same class distribution as the original whole dataset. Stratification tends to improve the accuracy of crossvalidation as a means of testing when dealing with imbalanced data (Kohavi, 1995).
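With Scikit-Learn, this evaluation loop looks roughly as follows; synthetic imbalanced data stands in for the train set, and logistic regression stands in for whichever model is being compared:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the train set (~10% positives).
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
```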
5.2.1 Logistic Regression
Logistic regression is a relatively 'simple' machine learning algorithm and we expect fast, but not great, results. Based on the train set, it fits constant coefficients describing how the features relate to the target, minimizing an error term. It then applies this same formula to the test set. It is pursued here as a baseline against which to compare the more sophisticated models. For more details see Bishop (2006).
5.2.2 Decision Tree
The following two algorithms rely on multiple decision trees. This section roughly describes what a decision tree is; see Breiman et al. (1984) for more details. To illustrate the concept of a decision tree we run a basic tree classifier implementation on our data and obtain Figure 11.
var15 <= 0.215 (gini = 0.5, samples = 100.0%, value = [0.5, 0.5], class = y[0])
├─ True:  n1 <= 4.5 (gini = 0.327, samples = 44.6%, value = [0.794, 0.206], class = y[0])
│   ├─ leaf (gini = 0.467, samples = 10.5%, value = [0.628, 0.372], class = y[0])
│   └─ leaf (gini = 0.236, samples = 34.1%, value = [0.863, 0.137], class = y[0])
└─ False: saldo_var30 <= 0.001 (gini = 0.471, samples = 55.4%, value = [0.38, 0.62], class = y[1])
    ├─ leaf (gini = 0.382, samples = 27.0%, value = [0.257, 0.743], class = y[1])
    └─ leaf (gini = 0.453, samples = 28.4%, value = [0.653, 0.347], class = y[0])
Figure 11: Decision Tree Example
The concept is fairly simple. If a prediction needs to be made, go from the top of the tree to the bottom. Go left at the top if var15 is lower than 0.215, right otherwise. Repeat this process at every node until a leaf is reached; the prediction made corresponds to the class label of that leaf. The tree is constructed by, at each depth level, greedily finding the best feature to split on, maximizing information gain via the Gini impurity. Let p0 be the proportion of observations that belong to class 0 and p1 the proportion of observations that belong to class 1, out of all observations.
Gini = 1 − (p0² + p1²)
This metric is one way of measuring how pure a split is: the lower it is, the purer, with minimum 0 and maximum 0.5. A split is purer if in each branch the ratio of positive to negative examples is close to 0 or 1, or in other words the split is highly discriminatory. The maximum depth of this tree is set at 2 to keep it tractable, and this effectively also keeps it from overfitting. A decision tree can otherwise perfectly match a training set, which does not generalize well.
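The impurity of a candidate split is the sample-weighted average of the Gini values of its branches; a minimal sketch of both formulas:

```python
def gini(labels):
    """Gini impurity of a set of binary labels: 1 - (p0^2 + p1^2)."""
    p1 = sum(labels) / len(labels)
    p0 = 1 - p1
    return 1 - (p0 ** 2 + p1 ** 2)

def split_impurity(left, right):
    """Sample-weighted impurity of a split into two branches."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```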
5.2.3 Random Forest
The Random Forest model generates multiple decision trees (Breiman, 2001). Only a randomly selected subset of predictive features is now considered at each split. Together the decision trees lead to one single prediction by averaging the predictions they give individually.
The parameters of a Random Forest model are very important and need to be tuned appropriately. N_ESTIMATORS determines the number of trees in the forest. MAX_FEATURES controls the maximum number of randomly drawn features to consider when determining the best split during the algorithm. MAX_DEPTH limits the depth of each tree in the forest. MIN_SAMPLES_SPLIT determines the minimum number of observations that need to be in a node for it to be considered for splitting. MIN_SAMPLES_LEAF constrains leaf nodes to have at least this number of observations. N_JOBS controls the number of processors; trees in a random forest can be built in parallel, so the more cores working, the less computation time needed. CLASS_WEIGHT can be set to 'balanced' to deal with imbalanced datasets.
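In Scikit-Learn these parameters map directly onto the RandomForestClassifier constructor. The values below are illustrative placeholders, not our tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,         # number of trees in the forest
    max_features="sqrt",      # random feature subset considered per split
    max_depth=20,             # cap on tree depth
    min_samples_split=10,     # node size required to attempt a split
    min_samples_leaf=5,       # minimum observations per leaf
    n_jobs=-1,                # build trees on all available cores
    class_weight="balanced",  # reweight classes for the imbalance
    random_state=0,
)
```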
5.2.4 XGBoost
XGBoost is a method where the outcome is formed by a combination of multiple trees. The trees are built iteratively: every time a new tree is built it focuses on the parts where the previous ones make mistakes, by assigning higher weights to those instances. This stands in contrast to Random Forest, where trees are built independently from each other.
The parameters are again of great importance, and there are even more of them. N_ESTIMATORS and MAX_DEPTH are the same as with the Random Forest model. LEARNING_RATE is a parameter that controls the speed and precision of the model: the lower it is, the more accurate the model becomes, but the more rounds it needs to converge. It represents a constant scaling how much the next tree built affects the current predictions. SUBSAMPLE stands for the fraction of observations to be sampled randomly. COLSAMPLE_BYTREE denotes the fraction of features to take into consideration during the random sampling for each tree. MIN_CHILD_WEIGHT defines the minimum sum of weights of all observations required in a child. SCALE_POS_WEIGHT is a parameter to combat class imbalance. XGBoost directly avoids overfitting by promoting simplicity of models in the objective via regularization, unlike Random Forest, which only limits the way trees can grow by imposing restraints (Chen, 2014).
min Obj(Ω) = Training Loss Function + Regularization

min Obj(Ω) = Σ_{i=1}^{n} l(y_i, ŷ_i) + α Σ_{i=1}^{k} |w_i| + λ Σ_{i=1}^{k} w_i² + γT
REG_ALPHA is the parameter for the l1 regularization. LAMBDA is the parameter for the l2 regularization. GAMMA is a regularization parameter that is multiplied by the number of leaves T. This regularization is in place to penalize the objective for building overly complex trees, to avoid overfitting.
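The regularization term of the objective can be spelled out as a small function. A sketch, assuming one weight per leaf so that T equals the number of weights:

```python
def regularization(weights, alpha, lam, gamma):
    """alpha * sum|w| + lambda * sum w^2 + gamma * T, with T = len(weights)."""
    l1 = alpha * sum(abs(w) for w in weights)
    l2 = lam * sum(w * w for w in weights)
    return l1 + l2 + gamma * len(weights)
```

Larger leaf weights and more leaves both raise the penalty, which is exactly how the objective discourages overly complex trees.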
6 Results
6.1 Tuning RF
It is difficult to determine which set of parameters is optimal for a specific problem. We utilize the GridSearchCV() function (Buitinck et al., 2013) with 5-fold stratified crossvalidation and the area under the ROC curve as the scoring option to come to an answer. This basically boils down to trying all possible combinations of a set of predetermined parameter spaces. The parameter spaces and results of the grid search are shown in Table 9. These parameter spaces were seen as the optimal tradeoff between computation time and accurately tuning the model.
Table 9: GridSearchCV()

Parameter         Range                          Best Value Combination
N_ESTIMATORS      [100, 500, 1000, 2000, 3000]   2000
MAX_FEATURES      ['sqrt', 'log2']               'sqrt'
MAX_DEPTH         [None, 5, 20]                  20
MIN_SAMPLES_LEAF  [10, 30, 50, 100]              50
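The grid search described above can be sketched as follows with scikit-learn; the grid here is trimmed to a small subset of Table 9's ranges so the example runs quickly:

```python
# Sketch of the Random Forest grid search; the real parameter grid is the
# one in Table 9, and the data here is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50],               # Table 9: [100, 500, 1000, 2000, 3000]
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5],             # Table 9: [None, 5, 20]
    "min_samples_leaf": [10],           # Table 9: [10, 30, 50, 100]
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",                  # area under the ROC curve
    cv=StratifiedKFold(n_splits=5),     # 5-fold stratified cross-validation
)
search.fit(X, y)
print(search.best_params_)
```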
With the aforementioned best parameter set, we fit the model on the training data and plot how relevant each feature is in Figure 12. These importances represent the total Gini impurity decrease, weighted by the probability of reaching a node that splits on the feature, averaged over all trees. We are pleased to see that several of the features we created, such as var15_most_common, appear to be significant.
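These importances are exposed by scikit-learn as the fitted forest's feature_importances_ attribute (the impurity-based measure described above, normalized to sum to one). A minimal sketch on synthetic data:

```python
# Extracting and ranking impurity-based feature importances from a
# fitted Random Forest; data is synthetic for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in Gini impurity per feature, averaged over all trees.
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:
    print(f"feature_{i}: {rf.feature_importances_[i]:.3f}")
```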
[Figure 12 is a bar chart of feature importances (F score per feature); var15 and the engineered var15_most_common rank among the most important, followed by balance features such as saldo_var30, saldo_medio_var5_ult3, and var38.]

Figure 12: Feature Importance
We also plot one of the many trees the forest model creates, namely the very first one, in Figure 13; it is considerable in size.
saldo_var5 <= 0.005gini = 0.5
samples = 100.0%value = [0.49, 0.51]
class = y[1]
saldo_var13 <= 0.0gini = 0.488
samples = 64.5%value = [0.421, 0.579]
class = y[1]
True
ind_var10_ult1 <= 0.5gini = 0.429
samples = 35.5%value = [0.688, 0.312]
class = y[0]
False
ind_var24_0 <= 0.5gini = 0.483
samples = 61.0%value = [0.408, 0.592]
class = y[1]
var36_3 <= 0.5gini = 0.135
samples = 3.5%value = [0.927, 0.073]
class = y[0]
num_op_var41_efect_ult3 <= 0.01gini = 0.479
samples = 58.3%value = [0.399, 0.601]
class = y[1]
saldo_medio_var5_ult1 <= 0.002gini = 0.336
samples = 2.7%value = [0.786, 0.214]
class = y[0]
saldo_medio_var5_ult1 <= 0.002gini = 0.484
samples = 54.3%value = [0.411, 0.589]
class = y[1]
saldo_medio_var8_ult1 <= 0.023gini = 0.4
samples = 4.0%value = [0.277, 0.723]
class = y[1]
ind_var30 <= 0.5gini = 0.409
samples = 25.6%value = [0.287, 0.713]
class = y[1]
num_ent_var16_ult1 <= 0.025gini = 0.456
samples = 28.7%value = [0.648, 0.352]
class = y[0]
num_var30_0 <= 0.013gini = 0.403
samples = 23.7%value = [0.28, 0.72]
class = y[1]
var38 <= 0.01gini = 0.479
samples = 2.0%value = [0.399, 0.601]
class = y[1]
ind_var41_0 <= 0.5gini = 0.263
samples = 0.4%value = [0.844, 0.156]
class = y[0]
num_var45_hace3 <= 0.004gini = 0.4
samples = 23.3%value = [0.277, 0.723]
class = y[1]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.409samples = 0.2%
value = [0.713, 0.287]class = y[0]
num_meses_var5_ult3 <= 0.5gini = 0.422
samples = 19.9%value = [0.302, 0.698]
class = y[1]
gini = 0.3samples = 3.4%
value = [0.183, 0.817]class = y[1]
gini = 0.421samples = 19.7%
value = [0.301, 0.699]class = y[1]
gini = 0.499samples = 0.2%
value = [0.526, 0.474]class = y[0]
num_meses_var12_ult3 <= 1.5gini = 0.472
samples = 1.8%value = [0.382, 0.618]
class = y[1]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
num_var4 <= 0.214gini = 0.455
samples = 1.6%value = [0.35, 0.65]
class = y[1]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
var36_1 <= 0.5gini = 0.428
samples = 0.7%value = [0.31, 0.69]
class = y[1]
num_op_var39_comer_ult3 <= 0.007gini = 0.478
samples = 0.9%value = [0.395, 0.605]
class = y[1]
n1 <= 6.5gini = 0.474
samples = 0.5%value = [0.386, 0.614]
class = y[1]
gini = 0.328samples = 0.2%
value = [0.207, 0.793]class = y[1]
gini = 0.497samples = 0.1%
value = [0.538, 0.462]class = y[0]
gini = 0.458samples = 0.4%
value = [0.356, 0.644]class = y[1]
var15 <= 0.245gini = 0.498
samples = 0.6%value = [0.468, 0.532]
class = y[1]
saldo_var26 <= 0.001gini = 0.413
samples = 0.3%value = [0.291, 0.709]
class = y[1]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
num_meses_var5_ult3_0 <= 0.5gini = 0.476
samples = 0.4%value = [0.39, 0.61]
class = y[1]
n0 <= 152.5gini = 0.455
samples = 0.3%value = [0.35, 0.65]
class = y[1]
gini = 0.497samples = 0.1%
value = [0.541, 0.459]class = y[0]
gini = 0.341samples = 0.1%
value = [0.218, 0.782]class = y[1]
gini = 0.419samples = 0.2%
value = [0.701, 0.299]class = y[0]
gini = 0.334samples = 0.1%
value = [0.212, 0.788]class = y[1]
gini = 0.492samples = 0.1%
value = [0.439, 0.561]class = y[1]
num_var22_hace2 <= 0.012gini = 0.451
samples = 28.2%value = [0.656, 0.344]
class = y[0]
num_op_var41_ult1 <= 0.01gini = 0.458
samples = 0.5%value = [0.356, 0.644]
class = y[1]
num_var45_hace3 <= 0.013gini = 0.429
samples = 25.9%value = [0.688, 0.312]
class = y[0]
num_var45_ult1 <= 0.003gini = 0.488
samples = 2.3%value = [0.421, 0.579]
class = y[1]
num_var35 <= 0.042gini = 0.413
samples = 24.3%value = [0.709, 0.291]
class = y[0]
num_var45_hace2 <= 0.031gini = 0.499
samples = 1.6%value = [0.476, 0.524]
class = y[1]
gini = 0.459samples = 0.1%
value = [0.357, 0.643]class = y[1]
saldo_medio_var5_ult3 <= 0.001gini = 0.41
samples = 24.2%value = [0.712, 0.288]
class = y[0]
var38 <= 0.005gini = 0.413
samples = 23.9%value = [0.709, 0.291]
class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
num_var22_ult1 <= 0.016gini = 0.39
samples = 15.6%value = [0.734, 0.266]
class = y[0]
num_var45_hace3 <= 0.004gini = 0.445
samples = 8.3%value = [0.666, 0.334]
class = y[0]
ind_var39_0 <= 0.5gini = 0.385
samples = 15.3%value = [0.74, 0.26]
class = y[0]
gini = 0.499samples = 0.3%
value = [0.519, 0.481]class = y[0]
gini = 0.451samples = 1.8%
value = [0.657, 0.343]class = y[0]
var15_most_common <= 0.5gini = 0.372
samples = 13.5%value = [0.753, 0.247]
class = y[0]
var36 <= 51.0gini = 0.45
samples = 8.4%value = [0.659, 0.341]
class = y[0]
gini = 0.051samples = 5.1%
value = [0.974, 0.026]class = y[0]
num_var22_hace3 <= 0.014gini = 0.47
samples = 6.0%value = [0.622, 0.378]
class = y[0]
num_meses_var5_ult3_3 <= 0.5gini = 0.346
samples = 2.3%value = [0.777, 0.223]
class = y[0]
saldo_var30 <= 0.001gini = 0.472
samples = 5.9%value = [0.618, 0.382]
class = y[0]
gini = 0.309samples = 0.2%
value = [0.809, 0.191]class = y[0]
gini = 0.478samples = 5.5%
value = [0.606, 0.394]class = y[0]
gini = 0.162samples = 0.4%
value = [0.911, 0.089]class = y[0]
gini = 0.5samples = 0.2%
value = [0.493, 0.507]class = y[1]
gini = 0.291samples = 2.1%
value = [0.823, 0.177]class = y[0]
gini = 0.44samples = 7.4%
value = [0.673, 0.327]class = y[0]
gini = 0.476samples = 0.8%
value = [0.609, 0.391]class = y[0]
saldo_var5 <= 0.005gini = 0.494
samples = 1.4%value = [0.447, 0.553]
class = y[1]
gini = 0.355samples = 0.2%
value = [0.769, 0.231]class = y[0]
var36 <= 51.0gini = 0.481
samples = 0.9%value = [0.599, 0.401]
class = y[0]
gini = 0.416samples = 0.5%
value = [0.296, 0.704]class = y[1]
gini = 0.485samples = 0.7%
value = [0.587, 0.413]class = y[0]
gini = 0.465samples = 0.2%
value = [0.632, 0.368]class = y[0]
num_var35 <= 0.125gini = 0.5
samples = 1.4%value = [0.497, 0.503]
class = y[1]
var38_most_common <= 0.5gini = 0.448
samples = 0.9%value = [0.338, 0.662]
class = y[1]
num_meses_var5_ult3_2 <= 0.5gini = 0.499
samples = 1.2%value = [0.524, 0.476]
class = y[0]
gini = 0.46samples = 0.2%
value = [0.359, 0.641]class = y[1]
num_var22_hace3 <= 0.014gini = 0.482
samples = 0.9%value = [0.595, 0.405]
class = y[0]
var38 <= 0.004gini = 0.466
samples = 0.3%value = [0.371, 0.629]
class = y[1]
var15_most_common <= 0.5gini = 0.473
samples = 0.7%value = [0.617, 0.383]
class = y[0]
gini = 0.499samples = 0.2%
value = [0.524, 0.476]class = y[0]
gini = 0.499samples = 0.5%
value = [0.518, 0.482]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.495samples = 0.1%
value = [0.448, 0.552]class = y[1]
gini = 0.441samples = 0.2%
value = [0.328, 0.672]class = y[1]
num_var22_hace2 <= 0.061gini = 0.457
samples = 0.7%value = [0.353, 0.647]
class = y[1]
gini = 0.412samples = 0.2%
value = [0.29, 0.71]class = y[1]
num_op_var39_comer_ult3 <= 0.007gini = 0.483
samples = 0.6%value = [0.409, 0.591]
class = y[1]
gini = 0.355samples = 0.2%
value = [0.231, 0.769]class = y[1]
saldo_medio_var5_hace2 <= 0.0gini = 0.496
samples = 0.4%value = [0.455, 0.545]
class = y[1]
gini = 0.425samples = 0.1%
value = [0.307, 0.693]class = y[1]
gini = 0.474samples = 0.3%
value = [0.386, 0.614]class = y[1]
gini = 0.439samples = 0.1%
value = [0.675, 0.325]class = y[0]
saldo_var30 <= 0.001gini = 0.41
samples = 0.3%value = [0.288, 0.712]
class = y[1]
gini = 0.443samples = 0.1%
value = [0.668, 0.332]class = y[0]
gini = 0.482samples = 0.2%
value = [0.404, 0.596]class = y[1]
gini = 0.347samples = 0.2%
value = [0.224, 0.776]class = y[1]
saldo_medio_var5_ult3 <= 0.001gini = 0.377
samples = 3.3%value = [0.252, 0.748]
class = y[1]
saldo_var25 <= 0.06gini = 0.496
samples = 0.7%value = [0.544, 0.456]
class = y[0]
saldo_var42 <= 0.002gini = 0.322
samples = 1.6%value = [0.202, 0.798]
class = y[1]
imp_op_var41_comer_ult1 <= 0.001gini = 0.436
samples = 1.7%value = [0.321, 0.679]
class = y[1]
var15 <= 0.215gini = 0.288
samples = 1.3%value = [0.175, 0.825]
class = y[1]
saldo_var30 <= 0.002gini = 0.497
samples = 0.3%value = [0.539, 0.461]
class = y[0]
num_op_var41_ult3 <= 0.016gini = 0.404
samples = 0.3%value = [0.719, 0.281]
class = y[0]
num_var35 <= 0.125gini = 0.244
samples = 1.0%value = [0.142, 0.858]
class = y[1]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.5samples = 0.1%
value = [0.51, 0.49]class = y[0]
gini = 0.329samples = 0.2%
value = [0.208, 0.792]class = y[1]
imp_op_var41_efect_ult1 <= 0.037gini = 0.229
samples = 0.8%value = [0.132, 0.868]
class = y[1]
saldo_var8 <= 0.02gini = 0.275
samples = 0.7%value = [0.165, 0.835]
class = y[1]
gini = 0.114samples = 0.1%
value = [0.061, 0.939]class = y[1]
num_op_var39_ult1 <= 0.022gini = 0.157
samples = 0.3%value = [0.086, 0.914]
class = y[1]
saldo_medio_var8_hace2 <= 0.001gini = 0.479
samples = 0.4%value = [0.398, 0.602]
class = y[1]
gini = 0.136samples = 0.2%
value = [0.073, 0.927]class = y[1]
gini = 0.205samples = 0.1%
value = [0.116, 0.884]class = y[1]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.365samples = 0.2%
value = [0.24, 0.76]class = y[1]
gini = 0.29samples = 0.2%
value = [0.824, 0.176]class = y[0]
gini = 0.465samples = 0.1%
value = [0.368, 0.632]class = y[1]
imp_op_var41_efect_ult1 <= 0.0gini = 0.391
samples = 0.7%value = [0.267, 0.733]
class = y[1]
num_op_var41_comer_ult3 <= 0.01gini = 0.47
samples = 1.0%value = [0.377, 0.623]
class = y[1]
num_op_var39_comer_ult3 <= 0.007gini = 0.346
samples = 0.3%value = [0.223, 0.777]
class = y[1]
num_var45_hace2 <= 0.031gini = 0.43
samples = 0.4%value = [0.312, 0.688]
class = y[1]
gini = 0.303samples = 0.2%
value = [0.186, 0.814]class = y[1]
gini = 0.435samples = 0.1%
value = [0.32, 0.68]class = y[1]
var15 <= 0.205gini = 0.483
samples = 0.3%value = [0.407, 0.593]
class = y[1]
gini = 0.313samples = 0.1%
value = [0.194, 0.806]class = y[1]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
gini = 0.403samples = 0.2%
value = [0.279, 0.721]class = y[1]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
imp_op_var41_efect_ult3 <= 0.002gini = 0.448
samples = 0.8%value = [0.338, 0.662]
class = y[1]
imp_op_var41_comer_ult3 <= 0.009gini = 0.451
samples = 0.3%value = [0.656, 0.344]
class = y[0]
saldo_var30 <= 0.001gini = 0.377
samples = 0.5%value = [0.252, 0.748]
class = y[1]
gini = 0.499samples = 0.1%
value = [0.523, 0.477]class = y[0]
gini = 0.367samples = 0.2%
value = [0.758, 0.242]class = y[0]
gini = 0.256samples = 0.2%
value = [0.151, 0.849]class = y[1]
num_var45_hace3 <= 0.013gini = 0.453
samples = 0.3%value = [0.347, 0.653]
class = y[1]
gini = 0.492samples = 0.2%
value = [0.564, 0.436]class = y[0]
gini = 0.362samples = 0.2%
value = [0.238, 0.762]class = y[1]
imp_op_var39_comer_ult1 <= 0.037gini = 0.494
samples = 0.4%value = [0.445, 0.555]
class = y[1]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.398samples = 0.2%
value = [0.726, 0.274]class = y[0]
num_op_var41_comer_ult3 <= 0.099gini = 0.449
samples = 0.3%value = [0.34, 0.66]
class = y[1]
gini = 0.415samples = 0.1%
value = [0.294, 0.706]class = y[1]
gini = 0.481samples = 0.1%
value = [0.403, 0.597]class = y[1]
saldo_var30 <= 0.025gini = 0.224
samples = 2.1%value = [0.872, 0.128]
class = y[0]
saldo_medio_var12_ult1 <= 0.012gini = 0.49
samples = 0.6%value = [0.571, 0.429]
class = y[0]
saldo_medio_var12_ult1 <= 0.021gini = 0.397
samples = 0.7%value = [0.727, 0.273]
class = y[0]
saldo_medio_var12_hace2 <= 0.099gini = 0.048
samples = 1.4%value = [0.975, 0.025]
class = y[0]
saldo_medio_var5_hace2 <= 0.0gini = 0.336
samples = 0.6%value = [0.787, 0.213]
class = y[0]
gini = 0.5samples = 0.1%
value = [0.493, 0.507]class = y[1]
saldo_var30 <= 0.007gini = 0.442
samples = 0.3%value = [0.671, 0.329]
class = y[0]
saldo_medio_var5_hace3 <= 0.0gini = 0.162
samples = 0.4%value = [0.911, 0.089]
class = y[0]
gini = 0.5samples = 0.1%
value = [0.497, 0.503]class = y[1]
gini = 0.271samples = 0.2%
value = [0.838, 0.162]class = y[0]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.365samples = 0.1%
value = [0.76, 0.24]class = y[0]
gini = -0.0samples = 1.2%value = [1.0, 0.0]
class = y[0]
gini = 0.244samples = 0.2%
value = [0.858, 0.142]class = y[0]
gini = 0.458samples = 0.2%
value = [0.355, 0.645]class = y[1]
num_var45_hace2 <= 0.013gini = 0.166
samples = 0.3%value = [0.908, 0.092]
class = y[0]
gini = 0.344samples = 0.1%
value = [0.78, 0.22]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
saldo_medio_var5_ult1 <= 0.002gini = 0.199
samples = 2.2%value = [0.888, 0.112]
class = y[0]
gini = 0.0samples = 1.3%value = [1.0, 0.0]
class = y[0]
num_var45_hace2 <= 0.004gini = 0.254
samples = 1.6%value = [0.85, 0.15]
class = y[0]
gini = 0.0samples = 0.6%value = [1.0, 0.0]
class = y[0]
gini = 0.455samples = 0.2%
value = [0.649, 0.351]class = y[0]
saldo_var13_corto <= 0.497gini = 0.198
samples = 1.4%value = [0.889, 0.111]
class = y[0]
num_var45_hace2 <= 0.048gini = 0.147
samples = 1.2%value = [0.92, 0.08]
class = y[0]
gini = 0.393samples = 0.2%
value = [0.731, 0.269]class = y[0]
gini = 0.0samples = 0.7%value = [1.0, 0.0]
class = y[0]
num_var45_hace3 <= 0.04gini = 0.294
samples = 0.5%value = [0.821, 0.179]
class = y[0]
saldo_var13 <= 0.058gini = 0.185
samples = 0.3%value = [0.897, 0.103]
class = y[0]
gini = 0.406samples = 0.2%
value = [0.717, 0.283]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.326samples = 0.1%
value = [0.795, 0.205]class = y[0]
saldo_medio_var13_corto_hace2 <= 0.0gini = 0.407
samples = 29.9%value = [0.715, 0.285]
class = y[0]
saldo_var5 <= 0.007gini = 0.49
samples = 5.7%value = [0.572, 0.428]
class = y[0]
num_op_var41_ult3 <= 0.061gini = 0.413
samples = 28.6%value = [0.709, 0.291]
class = y[0]
saldo_medio_var13_corto_hace2 <= 0.465gini = 0.209
samples = 1.3%value = [0.882, 0.118]
class = y[0]
var15_most_common <= 0.5gini = 0.394
samples = 26.5%value = [0.731, 0.269]
class = y[0]
saldo_var5 <= 0.005gini = 0.5
samples = 2.0%value = [0.51, 0.49]
class = y[0]
var15 <= 0.225gini = 0.44
samples = 19.5%value = [0.673, 0.327]
class = y[0]
num_meses_var5_ult3 <= 2.5gini = 0.081
samples = 7.0%value = [0.958, 0.042]
class = y[0]
var36 <= 51.0gini = 0.173
samples = 5.2%value = [0.905, 0.095]
class = y[0]
num_var42 <= 0.25gini = 0.474
samples = 14.4%value = [0.615, 0.385]
class = y[0]
num_med_var45_ult3 <= 0.062gini = 0.091
samples = 4.0%value = [0.952, 0.048]
class = y[0]
num_var45_hace2 <= 0.013gini = 0.362
samples = 1.1%value = [0.762, 0.238]
class = y[0]
num_var45_ult1 <= 0.003gini = 0.064
samples = 3.9%value = [0.967, 0.033]
class = y[0]
gini = 0.46samples = 0.1%
value = [0.642, 0.358]class = y[0]
num_var22_ult3 <= 0.019gini = 0.089
samples = 2.7%value = [0.953, 0.047]
class = y[0]
gini = 0.0samples = 1.2%value = [1.0, 0.0]
class = y[0]
saldo_medio_var5_ult3 <= 0.001gini = 0.052
samples = 2.4%value = [0.974, 0.026]
class = y[0]
gini = 0.319samples = 0.3%value = [0.8, 0.2]
class = y[0]
saldo_medio_var5_hace2 <= 0.0gini = 0.124
samples = 0.9%value = [0.934, 0.066]
class = y[0]
gini = 0.0samples = 1.5%value = [1.0, 0.0]
class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
n1 <= 6.5gini = 0.181
samples = 0.6%value = [0.899, 0.101]
class = y[0]
saldo_medio_var5_ult3 <= 0.001gini = 0.171
samples = 0.3%value = [0.906, 0.094]
class = y[0]
gini = 0.192samples = 0.3%
value = [0.893, 0.107]class = y[0]
gini = 0.298samples = 0.2%
value = [0.818, 0.182]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
saldo_medio_var5_hace3 <= 0.0gini = 0.22
samples = 1.0%value = [0.874, 0.126]
class = y[0]
gini = 0.487samples = 0.2%
value = [0.421, 0.579]class = y[1]
gini = 0.458samples = 0.1%
value = [0.644, 0.356]class = y[0]
gini = 0.141samples = 0.8%
value = [0.923, 0.077]class = y[0]
num_meses_var5_ult3 <= 2.5gini = 0.479
samples = 13.3%value = [0.602, 0.398]
class = y[0]
num_var39_0 <= 0.136gini = 0.272
samples = 1.1%value = [0.838, 0.162]
class = y[0]
ind_var43_recib_ult1 <= 0.5gini = 0.5
samples = 3.9%value = [0.51, 0.49]
class = y[0]
saldo_var5 <= 0.007gini = 0.454
samples = 9.4%value = [0.652, 0.348]
class = y[0]
var15 <= 0.315gini = 0.499
samples = 2.6%value = [0.478, 0.522]
class = y[1]
ind_var43_emit_ult1 <= 0.5gini = 0.484
samples = 1.3%value = [0.588, 0.412]
class = y[0]
num_meses_var39_vig_ult3_1 <= 0.5gini = 0.476
samples = 0.9%value = [0.61, 0.39]
class = y[0]
num_op_var39_ult3 <= 0.003gini = 0.488
samples = 1.7%value = [0.423, 0.577]
class = y[1]
var15 <= 0.295gini = 0.495
samples = 0.7%value = [0.551, 0.449]
class = y[0]
gini = 0.308samples = 0.3%
value = [0.81, 0.19]class = y[0]
gini = 0.5samples = 0.5%
value = [0.492, 0.508]class = y[1]
gini = 0.274samples = 0.2%
value = [0.836, 0.164]class = y[0]
n1 <= 7.5gini = 0.495
samples = 1.4%value = [0.451, 0.549]
class = y[1]
gini = 0.434samples = 0.3%
value = [0.319, 0.681]class = y[1]
num_var45_ult1 <= 0.003gini = 0.477
samples = 0.8%value = [0.392, 0.608]
class = y[1]
var38 <= 0.004gini = 0.496
samples = 0.7%value = [0.547, 0.453]
class = y[0]
var36_2 <= 0.5gini = 0.447
samples = 0.4%value = [0.337, 0.663]
class = y[1]
saldo_medio_var5_hace2 <= 0.0gini = 0.5
samples = 0.3%value = [0.503, 0.497]
class = y[0]
gini = 0.498samples = 0.1%
value = [0.532, 0.468]class = y[0]
var15 <= 0.455gini = 0.416
samples = 0.3%value = [0.295, 0.705]
class = y[1]
gini = 0.335samples = 0.2%
value = [0.213, 0.787]class = y[1]
gini = 0.457samples = 0.1%
value = [0.647, 0.353]class = y[0]gini = 0.419
samples = 0.1%value = [0.299, 0.701]
class = y[1]
gini = 0.255samples = 0.2%
value = [0.85, 0.15]class = y[0]
num_var22_ult3 <= 0.019gini = 0.485
samples = 0.3%value = [0.413, 0.587]
class = y[1]
num_var22_hace2 <= 0.012gini = 0.361
samples = 0.3%value = [0.763, 0.237]
class = y[0]
gini = 0.496samples = 0.1%
value = [0.454, 0.546]class = y[1]
gini = 0.471samples = 0.2%
value = [0.379, 0.621]class = y[1]
gini = 0.412samples = 0.2%
value = [0.71, 0.29]class = y[0]
gini = 0.286samples = 0.2%
value = [0.827, 0.173]class = y[0]
imp_op_var41_efect_ult3 <= 0.0gini = 0.456
samples = 1.1%value = [0.649, 0.351]
class = y[0]
num_var43_recib_ult1 <= 0.017gini = 0.486
samples = 0.2%value = [0.417, 0.583]
class = y[1]
num_var45_ult1 <= 0.026gini = 0.43
samples = 0.9%value = [0.687, 0.313]
class = y[0]
gini = 0.5samples = 0.2%
value = [0.505, 0.495]class = y[0]
gini = 0.354samples = 0.7%
value = [0.77, 0.23]class = y[0]
gini = 0.5samples = 0.2%
value = [0.509, 0.491]class = y[0]
gini = 0.5samples = 0.1%
value = [0.486, 0.514]class = y[1]
gini = 0.461samples = 0.1%
value = [0.36, 0.64]class = y[1]
imp_op_var41_comer_ult3 <= 0.0gini = 0.476
samples = 6.7%value = [0.609, 0.391]
class = y[0]
var36 <= 1.5gini = 0.331
samples = 2.7%value = [0.79, 0.21]
class = y[0]
var15 <= 0.265gini = 0.49
samples = 5.0%value = [0.571, 0.429]
class = y[0]
saldo_var42 <= 0.002gini = 0.355
samples = 1.7%value = [0.769, 0.231]
class = y[0]
saldo_medio_var5_hace3 <= 0.0gini = 0.417
samples = 1.1%value = [0.704, 0.296]
class = y[0]
imp_var43_emit_ult1 <= 0.0gini = 0.496
samples = 3.9%value = [0.542, 0.458]
class = y[0]
gini = 0.477samples = 0.7%
value = [0.607, 0.393]class = y[0]
gini = 0.138samples = 0.4%
value = [0.925, 0.075]class = y[0]
saldo_medio_var5_hace3 <= 0.0gini = 0.498
samples = 3.7%value = [0.528, 0.472]
class = y[0]
num_var45_hace3 <= 0.022gini = 0.212
samples = 0.3%value = [0.879, 0.121]
class = y[0]
num_meses_var39_vig_ult3_2 <= 0.5gini = 0.48
samples = 0.7%value = [0.401, 0.599]
class = y[1]
num_var43_recib_ult1 <= 0.006gini = 0.49
samples = 3.0%value = [0.569, 0.431]
class = y[0]
gini = 0.45samples = 0.3%
value = [0.343, 0.657]class = y[1]
num_var45_ult1 <= 0.003gini = 0.495
samples = 0.4%value = [0.452, 0.548]
class = y[1]
gini = 0.476samples = 0.2%
value = [0.391, 0.609]class = y[1]
gini = 0.492samples = 0.2%
value = [0.563, 0.437]class = y[0]
var38 <= 0.007gini = 0.484
samples = 2.6%value = [0.59, 0.41]
class = y[0]
saldo_medio_var5_ult3 <= 0.001gini = 0.495
samples = 0.4%value = [0.448, 0.552]
class = y[1]
var15 <= 0.345gini = 0.471
samples = 2.3%value = [0.621, 0.379]
class = y[0]
gini = 0.49samples = 0.3%
value = [0.429, 0.571]class = y[1]
num_med_var22_ult3 <= 0.019gini = 0.403
samples = 0.9%value = [0.72, 0.28]
class = y[0]
num_med_var45_ult3 <= 0.039gini = 0.49
samples = 1.4%value = [0.572, 0.428]
class = y[0]
saldo_medio_var5_hace2 <= 0.001gini = 0.379
samples = 0.7%value = [0.746, 0.254]
class = y[0]
gini = 0.463samples = 0.2%
value = [0.635, 0.365]class = y[0]
gini = 0.181samples = 0.6%
value = [0.899, 0.101]class = y[0]
gini = 0.454samples = 0.1%
value = [0.348, 0.652]class = y[1]
var36_2 <= 0.5gini = 0.482
samples = 1.3%value = [0.596, 0.404]
class = y[0]
gini = 0.469samples = 0.1%
value = [0.375, 0.625]class = y[1]
gini = 0.489samples = 1.1%
value = [0.574, 0.426]class = y[0]
gini = 0.293samples = 0.2%
value = [0.822, 0.178]class = y[0]
gini = 0.494samples = 0.2%
value = [0.557, 0.443]class = y[0]
gini = 0.467samples = 0.2%
value = [0.372, 0.628]class = y[1]
gini = 0.355samples = 0.1%
value = [0.769, 0.231]class = y[0]
gini = -0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
num_op_var41_comer_ult1 <= 0.024gini = 0.453
samples = 0.8%value = [0.653, 0.347]
class = y[0]
num_var22_ult3 <= 0.019gini = 0.186
samples = 0.9%value = [0.896, 0.104]
class = y[0]
imp_op_var41_comer_ult1 <= 0.007gini = 0.43
samples = 0.6%value = [0.688, 0.312]
class = y[0]
gini = 0.5samples = 0.1%
value = [0.507, 0.493]class = y[0]
num_op_var39_ult3 <= 0.01gini = 0.481
samples = 0.4%value = [0.599, 0.401]
class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
saldo_medio_var5_hace2 <= 0.0gini = 0.492
samples = 0.2%value = [0.439, 0.561]
class = y[1]
gini = 0.469samples = 0.1%
value = [0.625, 0.375]class = y[0]
gini = 0.44samples = 0.1%
value = [0.327, 0.673]class = y[1]
var15 <= 0.285gini = 0.299
samples = 0.5%value = [0.817, 0.183]
class = y[0]
gini = -0.0samples = 0.4%value = [1.0, 0.0]
class = y[0]
gini = 0.452samples = 0.1%
value = [0.654, 0.346]class = y[0]
imp_op_var41_comer_ult1 <= 0.002gini = 0.17
samples = 0.4%value = [0.906, 0.094]
class = y[0]
gini = 0.399samples = 0.1%
value = [0.725, 0.275]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
saldo_var42 <= 0.004gini = 0.384
samples = 1.6%value = [0.741, 0.259]
class = y[0]
num_op_var39_ult1 <= 0.01gini = 0.209
samples = 1.0%value = [0.881, 0.119]
class = y[0]
var15 <= 0.515gini = 0.429
samples = 1.2%value = [0.688, 0.312]
class = y[0]
num_med_var45_ult3 <= 0.017gini = 0.136
samples = 0.4%value = [0.927, 0.073]
class = y[0]
saldo_var42 <= 0.003gini = 0.367
samples = 1.0%value = [0.758, 0.242]
class = y[0]
gini = 0.494samples = 0.2%
value = [0.447, 0.553]class = y[1]
num_var35 <= 0.125gini = 0.464
samples = 0.5%value = [0.635, 0.365]
class = y[0]
num_op_var41_hace2 <= 0.006gini = 0.122
samples = 0.5%value = [0.935, 0.065]
class = y[0]
saldo_var5 <= 0.008gini = 0.496
samples = 0.2%value = [0.546, 0.454]
class = y[0]
imp_trans_var37_ult1 <= 0.0gini = 0.399
samples = 0.3%value = [0.725, 0.275]
class = y[0]
gini = 0.385samples = 0.1%
value = [0.74, 0.26]class = y[0]
gini = 0.493samples = 0.1%
value = [0.442, 0.558]class = y[1]
gini = 0.499samples = 0.1%
value = [0.475, 0.525]class = y[1]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
var15 <= 0.395gini = 0.148
samples = 0.4%value = [0.919, 0.081]
class = y[0]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
gini = -0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
gini = 0.36samples = 0.1%
value = [0.765, 0.235]class = y[0]
gini = 0.365samples = 0.1%
value = [0.76, 0.24]class = y[0]
gini = -0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
var15 <= 0.395gini = 0.07
samples = 0.9%value = [0.963, 0.037]
class = y[0]
gini = 0.499samples = 0.1%
value = [0.526, 0.474]class = y[0]
n0 <= 165.5gini = 0.139
samples = 0.4%value = [0.925, 0.075]
class = y[0]
gini = -0.0samples = 0.5%value = [1.0, 0.0]
class = y[0]
gini = 0.279samples = 0.2%
value = [0.833, 0.167]class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
n0 <= 144.5gini = 0.222
samples = 0.9%value = [0.873, 0.127]
class = y[0]
gini = 0.458samples = 0.1%
value = [0.644, 0.356]class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
saldo_var30 <= 0.029gini = 0.308
samples = 0.6%value = [0.81, 0.19]
class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
gini = 0.418samples = 0.3%
value = [0.702, 0.298]class = y[0]
saldo_medio_var5_ult1 <= 0.002gini = 0.174
samples = 1.0%value = [0.904, 0.096]
class = y[0]
saldo_medio_var5_hace2 <= 0.0gini = 0.064
samples = 6.0%value = [0.967, 0.033]
class = y[0]
gini = 0.465samples = 0.2%
value = [0.632, 0.368]class = y[0]
gini = -0.0samples = 0.8%value = [1.0, 0.0]
class = y[0]
num_var4 <= 0.214gini = 0.226
samples = 1.2%value = [0.87, 0.13]
class = y[0]
num_op_var41_ult3 <= 0.022gini = 0.014
samples = 4.8%value = [0.993, 0.007]
class = y[0]
saldo_var42 <= 0.002gini = 0.286
samples = 0.9%value = [0.827, 0.173]
class = y[0]
gini = 0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
saldo_medio_var5_ult1 <= 0.002gini = 0.219
samples = 0.8%value = [0.875, 0.125]
class = y[0]
gini = 0.482samples = 0.1%
value = [0.594, 0.406]class = y[0]
var38 <= 0.003gini = 0.29
samples = 0.5%value = [0.824, 0.176]
class = y[0]
gini = -0.0samples = 0.3%value = [1.0, 0.0]
class = y[0]
gini = 0.5samples = 0.1%
value = [0.49, 0.51]class = y[1]
gini = -0.0samples = 0.4%value = [1.0, 0.0]
class = y[0]
gini = 0.0samples = 4.5%value = [1.0, 0.0]
class = y[0]
imp_op_var39_comer_ult3 <= 0.015gini = 0.187
samples = 0.3%value = [0.896, 0.104]
class = y[0]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.365samples = 0.1%
value = [0.76, 0.24]class = y[0]
gini = 0.294samples = 0.1%
value = [0.179, 0.821]class = y[1]
n0 <= 141.5gini = 0.492
samples = 1.9%value = [0.563, 0.437]
class = y[0]
num_var22_ult3 <= 0.045gini = 0.498
samples = 1.0%value = [0.466, 0.534]
class = y[1]
num_var43_recib_ult1 <= 0.017gini = 0.408
samples = 0.9%value = [0.715, 0.285]
class = y[0]
imp_var43_emit_ult1 <= 0.0gini = 0.43
samples = 0.7%value = [0.686, 0.314]
class = y[0]
saldo_var5 <= 0.006gini = 0.388
samples = 0.3%value = [0.263, 0.737]
class = y[1]
num_var45_ult1 <= 0.021gini = 0.46
samples = 0.5%value = [0.642, 0.358]
class = y[0]
gini = -0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
num_op_var41_efect_ult3 <= 0.087gini = 0.499
samples = 0.3%value = [0.526, 0.474]
class = y[0]
imp_op_var41_comer_ult1 <= 0.038gini = 0.312
samples = 0.3%value = [0.807, 0.193]
class = y[0]
gini = 0.333samples = 0.1%
value = [0.789, 0.211]class = y[0]
gini = 0.481samples = 0.1%
value = [0.402, 0.598]class = y[1]
gini = 0.472samples = 0.1%
value = [0.619, 0.381]class = y[0]
gini = -0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
gini = 0.5samples = 0.2%
value = [0.486, 0.514]class = y[1]
gini = 0.272samples = 0.2%
value = [0.163, 0.837]class = y[1]saldo_var30 <= 0.002
gini = 0.437samples = 0.7%
value = [0.678, 0.322]class = y[0]
gini = 0.231samples = 0.2%
value = [0.867, 0.133]class = y[0]
gini = 0.0samples = 0.2%value = [1.0, 0.0]
class = y[0]
num_op_var41_comer_ult3 <= 0.065gini = 0.483
samples = 0.5%value = [0.592, 0.408]
class = y[0]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
num_op_var39_comer_ult3 <= 0.078gini = 0.5
samples = 0.3%value = [0.512, 0.488]
class = y[0]
gini = 0.486samples = 0.2%
value = [0.417, 0.583]class = y[1]
gini = 0.416samples = 0.1%
value = [0.705, 0.295]class = y[0]
saldo_var30 <= 0.06gini = 0.107
samples = 1.2%value = [0.943, 0.057]
class = y[0]
gini = 0.488samples = 0.1%
value = [0.578, 0.422]class = y[0]
num_var43_recib_ult1 <= 0.006gini = 0.063
samples = 1.0%value = [0.967, 0.033]
class = y[0]
gini = 0.339samples = 0.1%
value = [0.784, 0.216]class = y[0]
gini = 0.0samples = 0.8%value = [1.0, 0.0]
class = y[0]
gini = 0.239samples = 0.2%
value = [0.861, 0.139]class = y[0]
saldo_medio_var5_hace2 <= 0.0gini = 0.5
samples = 3.4%value = [0.503, 0.497]
class = y[0]
saldo_var42 <= 0.006gini = 0.412
samples = 2.3%value = [0.709, 0.291]
class = y[0]
saldo_medio_var5_hace3 <= 0.0gini = 0.442
samples = 0.6%value = [0.329, 0.671]
class = y[1]
num_var43_recib_ult1 <= 0.051gini = 0.492
samples = 2.8%value = [0.564, 0.436]
class = y[0]
num_var45_ult1 <= 0.009gini = 0.413
samples = 0.4%value = [0.291, 0.709]
class = y[1]
gini = 0.499samples = 0.2%
value = [0.48, 0.52]class = y[1]
gini = 0.295samples = 0.1%
value = [0.18, 0.82]class = y[1]
num_var45_ult1 <= 0.032gini = 0.478
samples = 0.3%value = [0.394, 0.606]
class = y[1]
gini = 0.0samples = 0.1%value = [1.0, 0.0]
class = y[0]
gini = 0.399samples = 0.2%
value = [0.276, 0.724]class = y[1]
var15 <= 0.365gini = 0.482
samples = 2.6%value = [0.595, 0.405]
class = y[0]
gini = 0.404samples = 0.2%
value = [0.281, 0.719]class = y[1]
num_var22_hace3 <= 0.042gini = 0.38
samples = 1.8%value = [0.745, 0.255]
class = y[0]
imp_op_var39_comer_ult1 <= 0.042gini = 0.487
samples = 0.9%value = [0.419, 0.581]
class = y[1]
num_op_var39_ult3 <= 0.119gini = 0.42
samples = 1.4%value = [0.701, 0.299]
[Figure omitted from extraction: a rendering of the first tree in the tuned Random Forest, with each node showing its split condition (e.g. var38 <= 0.004), Gini impurity, sample percentage, class distribution (e.g. value = [0.925, 0.075]) and predicted class.]
Figure 13: Tree 1
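A rendering in the style of Figure 13 can be produced with scikit-learn's tree plotting helper. The sketch below is illustrative rather than the paper's exact code: the forest and data are synthetic stand-ins, and on the real model one would pass the tuned forest together with the Santander feature names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

# Synthetic stand-in for the Santander training data and the tuned forest
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0).fit(X, y)

# proportion=True reproduces the percentage-style "samples" and fractional
# "value" annotations visible in Figure 13
plt.figure(figsize=(20, 10))
plot_tree(forest.estimators_[0], proportion=True, filled=True,
          class_names=["y[0]", "y[1]"])
plt.savefig("tree1.png")
```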
6.2 Tuning XGBoost
The numerous parameters of XGBoost make it intractable to simply apply a grid search, so we utilize RandomizedSearchCV() instead. Rather than trying all possible combinations, this function samples a predetermined number of parameter sets from the specified parameter spaces, which here are typically distributions rather than sets of fixed values. We run it once with 75 samples and, using the results of this first try, narrow the parameter spaces and run it again with 50 samples. The first run is compared using 5-fold stratified cross-validation; the second is made more precise by using 8-fold stratified cross-validation. The results are shown in Table 10. The starting ranges were chosen as broad intervals around the starting parameters recommended in related work and in Jain (2016).
Table 10: RandomizedSearchCV()

Parameter         | Range 1       | Best Value 1 | Range 2        | Best Value 2
N_ESTIMATORS      | [100, 2000]   | 245          | [100, 1000]    | 917
MAX_DEPTH         | [3, 9]        | 6            | [4, 6]         | 5
LEARNING_RATE     | [0.01, 0.21]  | 0.04864      | [0.001, 0.076] | 0.01632
SUBSAMPLE         | [0.6, 0.9]    | 0.85635      | [0.75, 0.85]   | 0.75183
COLSAMPLE_BYTREE  | [0.6, 0.9]    | 0.79057      | [0.75, 0.85]   | 0.82017
MIN_CHILD_WEIGHT  | [1, 5]        | 1            | 1              | 1
REG_ALPHA         | [0, 0.1]      | 0.02276      | [0, 0.05]      | 0.04970
GAMMA             | [0, 0.2]      | 0.10552      | [0, 0.15]      | 0.038668
With the aforementioned best parameter set we fit the model on the training data and plot how relevant each feature is in Figure 14. As with the Random Forest model, we can see that some engineered features are relevant. There is also significant overlap between which features the two models deem important, which is encouraging.
[Figure omitted from extraction: a bar chart titled "Feature Importances", plotting the F score (ranging from roughly 0.12 down to near 0.00) per feature. The thirty features shown are var15, var38, saldo_var30, saldo_medio_var5_hace3, saldo_medio_var5_ult3, saldo_medio_var5_hace2, n0, n1, num_var45_hace3, num_var22_ult3, num_var22_ult1, saldo_medio_var5_ult1, imp_op_var41_efect_ult3, imp_op_var41_ult1, num_var45_hace2, var38_most_common, imp_op_var41_efect_ult1, saldo_var42, num_med_var45_ult3, num_var22_hace3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_op_var41_comer_ult3, num_var45_ult1, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3 and imp_trans_var37_ult1.]
Figure 14: Feature Importance
We also show one of the many trees the XGBoost model creates in Figure 15: a randomly selected tree, which happens to be the 300th.
[Figure omitted from extraction: a rendering of tree 300, with internal nodes showing split conditions (e.g. n1 < 5, var38 < 0.00358573), branches labeled "yes, missing" and "no" (missing values follow the "yes" branch), and leaves showing their output values (e.g. leaf = -0.0128957).]
Figure 15: Tree 300
6.3 Results
There are a number of different performance indicators. Our local stratified 10-fold cross-validation procedure was used to tune all parameters and select features; note that a single fold takes at most 10 minutes and several operations can be run in parallel. Kaggle also provides a private and a public leaderboard score: approximately half of the submitted test set is used for the private leaderboard score and the other half for the public one. Since the competition had already completed, the distinction between these scores is irrelevant for us. Nevertheless, the complete results are showcased in Table 11.
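The local evaluation procedure can be sketched as follows: stratified 10-fold cross-validation scored with AUC, reporting a mean and spread as in Table 11. The data here is a synthetic imbalanced stand-in for the Santander training set, and whether the reported spread is one or two standard deviations is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in (~4% positives, like the Santander target)
X, y = make_classification(n_samples=1000, weights=[0.96, 0.04], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"{scores.mean():.5f} (+/- {scores.std() * 2:.5f})")
```

Each of the 10 folds preserves the class ratio, which matters with so few dissatisfied customers; the folds are independent and can be evaluated in parallel (n_jobs in cross_val_score).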
Feature set 1 refers to only the top 30 features with the highest importance scores according to the XGBoost model that uses all features. Feature set 2 is feature set 1 with some variables that seemed obviously correlated, like 'saldo_medio_var5_hace3' and 'saldo_medio_var5_hace2', manually removed; of each such pair, the one deemed most important by the model was kept. Feature set 3 is feature set 2 extended with some of the variables that were highly correlated with the target, as seen in Section 4.2. For a complete list of all involved features see Appendix A.
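The kind of correlation check behind feature set 2 can be sketched with pandas. The data and threshold below are illustrative assumptions; in the paper the flagged pairs were inspected and pruned by hand, keeping the member the model ranked higher.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: feat_a and feat_b are near-duplicates, feat_c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + rng.normal(scale=0.05, size=200),
    "feat_c": rng.normal(size=200),
})

corr = df.corr().abs()
# report each pair of features above the (assumed) threshold exactly once
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > 0.95]
print(pairs)  # [('feat_a', 'feat_b')]
```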
Table 11: Results

Model               | Features | Local                  | Public   | Private
Logistic Regression | All      | 0.80561 (+/- 0.02337)  | 0.804893 | 0.786949
Random Forest       | All      | 0.82739 (+/- 0.01672)  | 0.822562 | 0.803286
XGBoost             | All      | 0.84061 (+/- 0.01672)  | 0.837524 | 0.823152
XGBoost             | Set 1    | 0.84035 (+/- 0.01763)  | 0.836194 | 0.822334
XGBoost             | Set 2    | 0.83931 (+/- 0.01954)  | 0.836109 | 0.821644
XGBoost             | Set 3    | 0.83898 (+/- 0.02624)  | 0.836397 | 0.821777
It is very peculiar that the private leaderboard score is consistently lower than the public one. This indicates a leaderboard shakeup, where the train set is not representative enough of the test set. For reference, the top public leaderboard score is 0.845325 and the top private leaderboard score is 0.829072. Our scores fall somewhat short of these; however, we did not endeavor to reach the top of the leaderboard and treated the test set as unseen until the end. All data exploration was done solely on the training data, and normalization, for instance, used only the observations in the training data. We consider this more realistic, as conclusions can then truly be validated on completely new data, as opposed to the information spillover that would otherwise occur. In a business sense, a single new client tends to appear, rather than an entire population that can, for instance, be appropriately normalized.
7 Conclusion
This paper researched how to preemptively determine, using machine learning, whether customers of Santander will be dissatisfied. The dataset was semi-anonymized to protect the privacy of the customers, which made it difficult to assert what could be relevant, especially in light of the huge feature set. However, a thorough data analysis discerned the meaning and interpretation of several features. A Python implementation applied the Logistic Regression, Random Forest and XGBoost algorithms, carefully tuned, to produce predictions. Further research could, for example, employ different solution methods, apply more feature engineering, or combine several models instead of trying single models. More specifically, it could increase the computation time spent on tuning and, for example, make the correlation filtering dependent on the correlation with the target.
A Appendix Feature Sets
Set 1: var15, var38, saldo_var30, saldo_medio_var5_hace3, saldo_medio_var5_ult3, saldo_medio_var5_hace2, n0, n1, num_var45_hace3, num_var22_ult3, num_var22_ult1, saldo_medio_var5_ult1, imp_op_var41_efect_ult3, imp_op_var41_ult1, num_var45_hace2, var38_most_common, imp_op_var41_efect_ult1, saldo_var42, num_med_var45_ult3, num_var22_hace3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_op_var41_comer_ult3, num_var45_ult1, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3 and imp_trans_var37_ult1

Set 2: var15, var38, saldo_var30, saldo_medio_var5_hace3, n0, n1, num_var45_hace3, num_var22_ult3, imp_op_var41_efect_ult3, var38_most_common, saldo_var42, num_med_var45_ult3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3, imp_trans_var37_ult1, ind_var8_0, num_meses_var5_ult3, num_meses_var39_vig_ult3_1, num_var4 and var36

Set 3: var15, var38, saldo_var30, saldo_medio_var5_hace3, n0, n1, num_var45_hace3, num_var22_ult3, imp_op_var41_efect_ult3, var38_most_common, saldo_var42, num_med_var45_ult3, var15_most_common, saldo_var5, var3_most_common, saldo_var37, imp_ent_var16_ult1, imp_op_var39_comer_ult1, var3, imp_trans_var37_ult1, ind_var8_0, num_meses_var5_ult3, num_meses_var39_vig_ult3_1, num_var4, var36, ind_var30, num_var42 and ind_var5
References
Andreu (2015). Predicting banking customer satisfaction. https://www.kaggle.com/c/santander-customer-satisfaction/discussion/19291#110414. Accessed: 10-15-2017.

Bishop, C. (2006). Logistic regression. In M. Jordan, J. Kleinberg, and B. Scholkopf (Eds.), Pattern Recognition and Machine Learning, pp. 205–207. Springer-Verlag New York.

Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.

Breiman, L., J. Friedman, C. J. Stone, and R. Olshen (1984). Classification and Regression Trees. Wadsworth International Group.

Buckinx, W. and D. van den Poel (2005). Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research 164(1), 252–268.

Buitinck, L., G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux (2013). API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.

Burez, J. and D. van den Poel (2009). Handling class imbalance in customer churn prediction. Expert Systems with Applications 36(3), 4626–4636.

Chen, T. (2014). Introduction to boosted trees. https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf. Accessed: 10-25-2017.

Chen, T. and C. Guestrin (2016). XGBoost: a scalable tree boosting system. arXiv:1603.02754v3 [cs.LG].

Clemes, M. D., C. Gan, and D. Zhang (2010). Customer switching behaviour in the Chinese retail banking industry. International Journal of Bank Marketing 28(7), 519–546.

Colgate, M., K. Stewart, and R. Kinsella (1996). Customer defection: a study of the student market in Ireland. International Journal of Bank Marketing 14(3), 23–29.

Dernoncourt, F. (2015). What does AUC stand for and what is it? https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it. Accessed: 10-15-2017.

dmi3kno (2015). Exploring features. https://www.kaggle.com/cast42/exploring-features. Accessed: 10-15-2017.

Jain, A. (2016). Complete guide to parameter tuning in XGBoost. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/. Accessed: 10-25-2017.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of IJCAI 1995.
Kumar, A. (2015). Boosting customer satisfaction with gradient boosting. https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a079.pdf. Accessed: 10-16-2017.

Mozer, M., R. Wolniewicz, D. Grimes, E. Johnson, and H. Kaushansky (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks 11(3), 690–696.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.

Santander (2015). Santander customer satisfaction. https://www.kaggle.com/c/santander-customer-satisfaction. Accessed: 10-15-2017.

Silva, L., G. Titericz, D. Efimov, I. Tanaka, D. Barusauskas, M. Michailidis, M. Muller, D. Polat, S. Semenov, and D. Altukhov (2016). Solution for Santander customer satisfaction competition, 3rd place. https://github.com/diefimov/santander_2016/blob/master/README.pdf. Accessed: 10-16-2017.

Wang, S. (2016). Predicting banking customer satisfaction. https://shuaiw.github.io/assets/data-science-project-workflow/santander-customer-satisfaction.pdf. Accessed: 10-16-2017.

Xie, Y., X. Li, E. Ngai, and W. Ying (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications 36(3), 5445–5449.

Yooyen, T. and K.-C. Ma (2016). CSCI 567 Spring 2016 mini-project. https://markcsie.github.io/documents/CSCI567Spring2016Project.pdf. Accessed: 10-16-2017.