DEPARTMENT OF ENGINEERING MANAGEMENT
An exploratory study towards applying and demystifying deep learning classification on behavioral big data
Sofie De Cnudde, David Martens & Foster Provost
UNIVERSITY OF ANTWERP Faculty of Applied Economics City Campus
Prinsstraat 13, B.226
B-2000 Antwerp
Tel. +32 (0)3 265 40 32
Fax +32 (0)3 265 47 99
www.uantwerpen.be
FACULTY OF APPLIED ECONOMICS
RESEARCH PAPER 2018-002 JANUARY 2018
University of Antwerp, City Campus, Prinsstraat 13, B-2000 Antwerp, Belgium
Research Administration – room B.226
phone: (32) 3 265 40 32
fax: (32) 3 265 47 99
e-mail: [email protected]
The research papers from the Faculty of Applied Economics
are also available at www.repec.org
(Research Papers in Economics - RePEc)
D/2018/1169/002
Abstract
The superior performance of deep learning algorithms
in fields such as computer vision and natural language
processing has fueled an increased interest towards
these algorithms in both research and in practice.
Ever since, many studies have applied these algo-
rithms to other machine learning contexts with other
types of data in the hope of achieving comparable
superior performance. This study departs from the
latter motivation and investigates the application of
deep learning classification techniques on big behav-
ioral data while comparing its predictive performance
with 11 widely-used shallow classifiers. In addition to
the application on a new type of data and a struc-
tured comparison of its performance with commonly-
used classifiers, this study attempts to shed light onto
when and why deep learning techniques perform bet-
ter. Regarding the specific characteristics of apply-
ing deep learning on this unique class of data, we
demonstrate that an unsupervised pretraining step
does not improve classification performance and that
a tanh nonlinearity achieves the best predictive
performance. Applying deep learning to 15 big
behavioral data sets yields results that are as good
as or better than those of traditionally used shallow
classifiers. However, no significant performance
improvement can be recorded. Investigating
when deep learning performs better, we find that
worse performance is obtained for data sets with low
signal-from-noise separability. In order to gain in-
sight into why deep learning generally performs well
on this type of data, we investigate the value of the
distributed, hierarchical characteristic of the learning
process. The neurons in the distributed representa-
tion seem to identify more nuances in the many be-
havioral features as compared to shallow classifiers.
We demonstrate these nuances in an intuitive manner
and validate them through comparison with feature
engineering techniques. This is the first study to ap-
ply and validate the use of nonlinear deep learning
classification on fine-grained, human-generated data
while proposing efficient configuration settings for its
practical implementation. As deep learning classi-
fication is often characterized by being a black-box
approach, we also provide a first attempt towards
the disentanglement regarding when and why these
techniques perform well.
Introduction
Over the last decade, the machine learning field has
experienced increased attention towards deep learn-
ing techniques. Experimental research in this area
has demonstrated significant improvements over con-
ventional machine learning techniques in fields such
as object recognition [4, 29, 5] and natural language
processing [45, 22]. Deep learning techniques origi-
nate from representation learning, where raw data is
inputted and models automatically detect efficient,
distributed representations for either pattern detec-
tion or classification [30]. Through stacking non-
linear functions, hidden and complex data patterns
can be detected without human intervention. Deep
learning has received increased attention which can
be attributed to (1) an immense increase in avail-
able data, (2) increased chip processing capabilities
and the use of GPUs for faster parallel computations,
(3) decreasing costs of hardware, and (4) advances in
machine learning research regarding these deep net-
works [11]. Both the inspiration from neural brain
activity and its theoretical aspects contribute to the
attractiveness of deep neural networks.
The class of data which is the subject of our study
is behavioral big data. As more and more aspects of
people’s lives are migrating online, people leave mas-
sive trails of both active and passive footprints which
are increasingly being recorded and quantified. Be-
havioral big data is thus becoming omnipresent, har-
boring major potential for predictive analysis [49].
We define behavioral big data following Shmueli [44]:
very large and rich high-dimensional data on human
actions and/or interactions. Such data form a testimony
of an entity’s behavior captured through fine-grained,
modular features. Customers’ transactions with a bank,
web surfers’ web-visiting behavior, mobile phone users’
visited locations and Facebook like data are just a few
examples where each unique account number, webpage,
location or Facebook page corresponds to a feature.
In previous research, this
data has been shown to be very informative in a pre-
dictive setting and can reveal a person’s personality
traits [28], interest in banking products [37], interest
in a news article [33], interest in a mobile ad [32],
tendency to churn [47] or tendency to commit fraudulent
activities [26].
Behavioral data, however, originates from complex
and largely unknown underlying processes, which
complicate its analysis [25, 44]. Moreover, this data
is characterized by high-dimensionality and sparsity.
When modeling web surfers’ web-visiting behavior,
the collection of all possible webpages one can visit is
huge, resulting in very large, high-dimensional data
sets. Also, the limit on so-called behavioral capi-
tal [24] implies that among all possible webpages, a
person can only visit a limited number due to re-
strictions such as time or money. The latter results
in highly sparse data.
Traditionally, machine learning research for very
high dimensional data, such as textual data, mostly
employs shallow models [11], which contain one layer
transforming the raw input features in a linear or
nonlinear fashion onto a specific feature space. Examples
of such techniques are linear or kernelized
support vector machines [7]. These shallow classifiers
have proven successful on machine learning problems
originating from a wide variety of applications and
for both small and large predictive problems. How-
ever, when confronted with data from complex real-
world problems such as human behavior, many re-
searchers have put forward the question whether the
shallow design of these techniques suffices [1, 11] and
whether a deep architecture can provide significant
performance improvement. Moreover, these tech-
niques consider the data features from a local per-
spective throughout the learning process [20]. Each
feature is represented by a one-hot encoded vector,
implying that each is equally different from all other
features. In many real-world high-dimensional set-
tings, this local view does not hold and exploiting the
distributed nature of the features allows for much bet-
ter generalization towards unseen, complex instances.
From social science research, behavioral data is sus-
pected to contain complex, distributed and hierar-
chical relations between its features [20, 42]. When
capturing people’s movie preferences in a behavioral
data set, each movie is represented by exactly one
feature in a one-to-one relation (local representation).
The main disadvantage of this approach
is that two movies targeted towards a younger audience
(for example Toy Story and Indiana Jones) are
considered exactly as similar to (or different from)
one another as either of them is to R-rated movies
such as Saw or The Blair Witch Project. A distributed
representation transforms the raw features through
many-to-many relations onto a new representation which
is able to capture more fine-grained similarities between
the independent features. Regarding hierarchical relations
between features, on the lowest level, a user can be
represented by each movie he or she rates. On a
higher level, a subset of movies can be telling of a per-
son’s low-level interests. One group of movies (which
is naturally represented by a neuron in the framework
of deep learning) can for example detect whether a
user has an affinity with LGBT movies, whether he
has an interest in sports, or whether his favorite di-
rector is Alfred Hitchcock. On an even higher level,
these interests can reveal political preferences or reli-
gious beliefs. For example, a person watching movies
such as Brokeback Mountain or Cowspiracy can be
considered to have a more liberal mindset. This
line of thinking follows the value–attitude–behavior
model [42] used in behavioral science, which states
that people’s social cognitions resulting in actions are
organized in a compositional structure. Values are
one’s stable beliefs on the highest level of abstraction
and are constructed of basic beliefs, which in turn
give rise to value orientations. The latter influences
a person’s attitude, which finally leads to concrete
human behavior. The parallel between this cogni-
tion hierarchy and the hierarchical nature of repre-
sentation learning could ideally help provide insight
into general human behavior or strengthen hypothe-
ses formulated in social science research. Moreover,
interpreting this hierarchy provides us with means
to frame deep learning’s performance on this type of
data.
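The difference between a local and a distributed representation can be illustrated with a minimal Python sketch. It one-hot encodes the four movies named above and compares their pairwise Euclidean distances with those under a small distributed representation. The two-dimensional vectors are hypothetical values chosen for illustration only; they are not learned from any data set in this study.

```python
import math
from itertools import combinations

movies = ["Toy Story", "Indiana Jones", "Saw", "The Blair Witch Project"]

# Local (one-hot) representation: exactly one feature per movie.
one_hot = {m: [1.0 if i == j else 0.0 for j in range(len(movies))]
           for i, m in enumerate(movies)}

# Hypothetical distributed representation (illustrative values only):
# dimension 0 ~ "family-friendly", dimension 1 ~ "horror".
distributed = {
    "Toy Story": [0.9, 0.0],
    "Indiana Jones": [0.7, 0.1],
    "Saw": [0.0, 0.9],
    "The Blair Witch Project": [0.1, 0.8],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise(rep):
    """Distance between every pair of movies under a representation."""
    return {(a, b): euclidean(rep[a], rep[b]) for a, b in combinations(movies, 2)}

local_d = pairwise(one_hot)
dist_d = pairwise(distributed)

for pair in local_d:
    print(pair, round(local_d[pair], 3), round(dist_d[pair], 3))
```

Under the one-hot encoding every pair of movies is equally far apart, while the illustrative distributed vectors place the two family movies closer to each other than to either horror movie.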
Since the initial successes of deep learning in com-
puter vision and natural language processing, deep
learning has also been successfully applied to other
types of high-dimensional data such as toxicological
data [8, 41], biomedical data [3] and recommender
systems [43, 46]. Due to these many successes, one
might wonder whether this superior performance also
extends to other fields in predictive machine learning
research [14]. In traditional research fields, theory
precedes experiments and experimental results can
often be clearly linked to theoretical principles. The
main drawback of deep learning research, however, is
the lack of full theoretical knowledge regarding why
these models work so well [11]. The goal of this work
is twofold. First, we investigate whether superior re-
sults can be reached for behavioral big data, a class
of data becoming increasingly common in the ma-
chine learning field. Secondly, this work attempts to
get a high-level, preliminary insight into the link be-
tween classifier performance and the reasons behind
this performance for human behavioral data.
In previous research specifically aimed towards
high-dimensional behavioral data, deep learning has
been applied to movie preference data [43] and
e-commerce activity data [48]. However, in both studies,
the inherent high-dimensionality of these data
sets is reduced. The first study considers absent be-
havior as missing data and subsequently omits the
corresponding missing features. In the second study,
the fine-grained data is preprocessed with dimension-
ality reduction techniques.
This work attempts to contribute in four ways. (1)
To the best of our knowledge, this study presents the
first demonstration and validation of the usefulness of
deep learning for a novel type of data, namely
behavioral data. (2) We analyze in a structured manner
the performance of deep learning techniques on large
behavioral data and compare it with 11 conventional
shallow methods to assess whether the former can
provide significant performance improvements for these
data sets. (3) We provide guidance for researchers and
practitioners regarding the practical implementation
details that characterize these architectures. (4)
Finally, the study validates additional insights related
to the hierarchical learning process of these techniques.
The rest of this paper is structured as follows.
First, the deep-learning approach is set out, followed
by a description of the experimental set-up in which
attention is given to the hyper-parameter selection
step. We subsequently present and discuss the re-
sults. Finally, we conclude our work and present av-
enues for further research.
Deep Learning
When using deep learning for classification purposes,
we have a data set $D$ which is an indexed set of $n$ data
points $x_i$ with $n$ corresponding labels $y_i$, with
$x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$ for $i = 1, \ldots, n$.
A classification algorithm $A$ takes a subset of the data
set $D$, called a training set $D_{train} = (X_{train}, Y_{train})$,
as input and learns a hypothesis function $f$ as an
approximation of the true data distribution:

$$f = A_\lambda(X_{train}, Y_{train}) \tag{1}$$

with model hyperparameters $\lambda$. In the specific context
of deep learning, the number of layers $L$ is an
example of a hyperparameter. The hyperparameters
$\lambda$ are learned on a separate subset of $D$ referred to as
a validation set $D_{val} = (X_{val}, Y_{val})$. The classification
model learns the hypothesis $f$ by minimizing a loss
function $\mathcal{L}$ over a set of parameters $\theta$:

$$\min_\theta \mathcal{L}(f(X_{train}), Y_{train}). \tag{2}$$
The final classification performance is tested on a sep-
arate test set Dtest = (Xtest, Ytest).
The deep learning model (of which a concrete example
is visualized in Figure 1) minimizes the loss
function $\mathcal{L}$ over its parameters $\theta$ (consisting of weight
parameters $W$ and bias parameters $b$) through
back-propagation used in conjunction with stochastic
gradient descent [19]. The back-propagation training
algorithm works in two stages: (1) propagation and (2)
weight update. Starting with randomly chosen small
initial values for the weights $W$ and the biases $b$ for
each hidden layer $h$, a batch of training samples is fed
to the model (forward pass). The model computes
the corresponding activations and the output
values and compares them to the real labels. Next,
the gradient of the error is used in a backward pass
to update the model’s parameters $W$ and $b$ for each
layer at iteration $t$:

$$W_h(t+1) = W_h(t) + \alpha \frac{\partial \mathcal{L}}{\partial W_h} + \xi(t) \tag{3}$$

$$b_h(t+1) = b_h(t) + \alpha \frac{\partial \mathcal{L}}{\partial b_h} + \xi(t), \tag{4}$$

with $\alpha$ a learning rate and $\xi$ an error component. Using
stochastic gradient descent implies that the loss
function $\mathcal{L}$ must be differentiable. The negative log
likelihood, whose minimization maximizes the likelihood
of the correct class, is defined as

$$NLL = -\sum_{i=1}^{n} \log P(Y = y_i \mid x_i, \theta). \tag{5}$$
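As a worked instance of Equation (5), the sketch below sums the negative log probability that a model assigns to the correct class. It assumes that the model outputs, for each instance, a probability that the label is +1; the function name and the example probabilities are hypothetical.

```python
import math

def negative_log_likelihood(labels, p_positive):
    """Sum of -log P(Y = y_i | x_i) for labels in {-1, +1}.

    p_positive[i] is the (assumed) predicted probability that y_i = +1.
    """
    total = 0.0
    for y, p in zip(labels, p_positive):
        p_correct = p if y == 1 else 1.0 - p
        total -= math.log(p_correct)
    return total

labels = [1, -1, 1]
confident = [0.9, 0.1, 0.8]  # high probability on the correct class
unsure = [0.5, 0.5, 0.5]     # uninformative predictions

nll_confident = negative_log_likelihood(labels, confident)
nll_unsure = negative_log_likelihood(labels, unsure)
print(nll_confident, nll_unsure)
```

Predictions that put more probability on the correct class yield a lower loss, which is why minimizing this quantity is a suitable training objective.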
Figure 1: Example of a deep learning network with 2 hidden layers and their associated bias vectors b and weight matrices W.
The objective function (2) of the deep learning
problem formulation comes with some difficulties. It
defines a non-convex optimization space character-
ized by multiple local optima, contributing to an in-
herently complex learning process. Using a local op-
timization technique such as gradient descent may
result in the model being stuck in local optima and
resulting in poor generalization performance. More-
over, with a large number of layers and many hidden
units, the number of model parameters can be mas-
sive which can result in overfitting.
A second issue faced by deep learning algorithms,
despite their considerable modeling capability, is that
they are often criticized as black-box approaches
due to the many layers and the nonlinearities
in the model.
Materials and Methods
Deep learning architectures are characterized by a
large number of hyper-parameters (HP) [19]. The
hyper-parameters associated with a machine learning
model A are selected based on their performance on
the separate out-of-sample validation set (Xval, Yval).
Two types of HPs can be distinguished: (1) hyper-
parameters associated with the model such as the
number of hidden units per layer, and (2) hyper-
parameters associated with the optimization algo-
rithm such as the learning rate of stochastic gradient
descent. The hyper-parameter optimization problem
can be modeled as follows:
$$\lambda_{opt} = \arg\min_{\lambda_{hp}} \mathcal{L}(f(X_{val}), Y_{val}), \tag{6}$$

with $\lambda_{opt}$ the optimal HP set which is found through
cross-validation [2] over a separate validation set
$X_{val}$. Each $\lambda_{hp}$ is a collection of values for each of
the $K$ parameters: $(HP_1, \ldots, HP_K)$.
In order to select appropriate values for the HPs in
an empirical setting, a structured method is needed
to search the hyper-parameter space [2]. When deep
learning models are trained in literature, HPs are of-
ten either chosen based on researchers’ experience in
the specific field or through a structured or heuristic
exploration of the HP space such as manual search,
grid search or random search. Since we are training
deep neural networks on a new and specific type
of data, adopting parameter values that originate
from experimental research in other fields is not
advisable. Moreover, several different data sets
are evaluated, each of which clearly requires
separately optimized HPs [2].
Bergstra and Bengio [2] state that, of the large
collection of all hyper-parameters, usually only a
small subset is relevant (referred to as low effective
dimensionality). Their study empirically and
theoretically demonstrates that non-adaptive random
hyper-parameter search is more efficient than manual
search and/or grid search in high-dimensional search
spaces. While a high-dimensional grid may give even
coverage inside the grid, the coverage in the separate
subspaces is less evenly distributed. In contrast,
random search explores a wider range of values in each
of the subspaces.
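The coverage argument can be sketched in a few lines of Python. The example assumes two hyper-parameters rescaled to the unit interval and a budget of nine trials; both are illustrative assumptions rather than settings from this study. It counts how many distinct values each strategy tries per hyper-parameter.

```python
import random

def distinct_per_axis(points):
    """Number of distinct values tried along each hyper-parameter axis."""
    dims = len(points[0])
    return [len({p[d] for p in points}) for d in range(dims)]

# 3 x 3 grid over two hyper-parameters: 9 trials in total.
levels = [0.0, 0.5, 1.0]
grid = [(a, b) for a in levels for b in levels]

# 9 random trials over the same unit square (fixed seed for repeatability).
rng = random.Random(0)
rand = [(rng.random(), rng.random()) for _ in range(9)]

print(distinct_per_axis(grid))  # [3, 3]
print(distinct_per_axis(rand))
```

With the same budget, the grid only probes three values of each hyper-parameter, whereas every random trial contributes a new value along each axis. If only one of the two hyper-parameters matters, random search therefore explores it far more densely.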
We use a random hyper-parameter search procedure [2]
to evaluate 100 hyper-parameter configurations
on a separate part of the data set (Xval, Yval). Since
these configurations are optimized for negative log
likelihood, the 10 best configurations are chosen and
their AUC is evaluated [13]. The specific values and
ranges for each hyper-parameter are given in the next
subsections. For each of the data sets, subsequently,
five-fold cross validation is used on the remaining part
of the data. During learning, the progress is moni-
tored by calculating the negative log likelihood on a
separate part of the data set which is used to prevent
overfitting by early stopping. The AUC reported in
our study is calculated on a separate test set after-
wards.
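A minimal sketch of early stopping on the monitored negative log likelihood is given below. The patience-based stopping rule, the function name and the loss values are assumptions made for illustration; the exact stopping criterion is not specified here.

```python
def early_stopping_epoch(val_losses, patience=3, min_improvement=0.0):
    """Return the index of the epoch whose parameters would be kept.

    Training stops once the validation loss has failed to improve by more
    than `min_improvement` for `patience` consecutive epochs (assumed rule).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_improvement:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation negative log likelihood per epoch.
val_nll = [0.70, 0.62, 0.58, 0.57, 0.59, 0.60, 0.61, 0.55]
kept = early_stopping_epoch(val_nll)
print(kept)  # 3
```

In this example training halts after epoch 6, so the later value of 0.55 is never observed. This is the usual price of early stopping: a larger patience reduces the risk of stopping too soon at the cost of more training time.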
We run the random hyper-parameter search pro-
cedure and the actual learning algorithm on AWS
g2.2xlarge instances with the Theano framework for
deep learning [23]. These instances have 8 virtual
CPUs, 15 GB RAM and 1 NVIDIA GRID GPU.
Unsupervised pre-training
Unsupervised pre-training was one of the main rea-
sons for the increased attention towards deep learning
in 2006 due to significant classification performance
improvements [21]. This pre-training phase uses a
stack of representation-learning algorithms (such as
RBMs [19]) and separately optimizes the learned rep-
resentation of each layer in an unsupervised fashion.
The goal is to use the learned representation of the
unsupervised stage for a classification task with the
same input data distribution. The idea behind greedy
layer-wise pre-training originates from the following
two beliefs: (1) the initial parameters can have an ef-
fect on the quality of the parameter optimization pro-
cess and subsequently influence model performance
during model finetuning, and (2) the representation
learned by a generative model on unlabeled input
data can also be a good representation for a sub-
sequent classification task [1].
Previous research has demonstrated the immense
value in pre-training a neural network [21, 12]. Er-
han et al. [12] empirically analyze the effect of pre-
training and conclude that it mainly acts as a regular-
izer, influencing the starting point of the supervised
training step. However, several studies also find the
opposite, e.g. in speech recognition and in chemical
activity prediction [1, 34]. Also, the study by Erhan
et al. [12] was performed before the widespread use
of many modern deep learning-related concepts such
as ReLUs [1] and only takes into account the relation
with the number of layers. Consequently, unsupervised
pre-training is currently used only in the field of
NLP.
In the random HP search, we randomly choose
whether or not to pre-train the network with
a restricted Boltzmann machine (RBM) [19].
Number of layers and number of hidden units
Both the number of hidden layers and their number
of units have an impact on the generalization capability
of the network. A trade-off is needed between
sufficient modeling capacity on the one hand and
overfitting on the other.
We randomly select a number of hidden layers be-
tween 1 and 8. For each layer, a number of hidden
units is drawn log-uniformly between 18 and 2000.
Both ranges are based on values described in Bergstra
and Bengio [2].
Minibatch size
Choosing the size of the minibatch for stochastic gra-
dient descent is a two-criterion trade-off influencing
convergence and training time. When increasing the
size of the batch, more efficient use can be made of the
parallel matrix-matrix multiplications on the GPU.
Although the gradient estimate becomes more reliable
with larger minibatch sizes, fewer batches per epoch
mean fewer weight updates, thereby slowing
convergence.
Moreover, closely approximating the true gradient
by increasing the size of the minibatch might not
be the best way to spend computation time in a
non-convex optimization space. Instead, exploring
the search space with frequent updates is a more
appropriate strategy in this setting. Furthermore, a small
minibatch size also acts as a regularizer, introduc-
ing noise when approximating the gradient through
a lower number of training set instances [19].
Following [2, 19], the size of the minibatch is ran-
domly chosen with equal probability between 20 and
100. Each epoch, each fold is randomly shuffled be-
fore minibatches are selected in order to speed up
convergence [19].
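The per-epoch shuffling and the effect of the minibatch size on the number of weight updates can be sketched as follows. The generator name, the sample count of 1000 and the seed are hypothetical; the two batch sizes are the bounds of the range given above.

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices; the order is reshuffled on every call,
    so calling this once per epoch gives a fresh ordering each epoch."""
    order = list(range(n_samples))
    rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(42)
n_samples = 1000
for batch_size in (20, 100):
    for epoch in range(2):
        batches = list(minibatches(n_samples, batch_size, rng))
        # One weight update per minibatch: 50 updates vs. 10 updates per epoch.
        print(batch_size, epoch, len(batches), batches[0][:3])
```

A batch size of 20 gives five times as many weight updates per epoch as a batch size of 100, at the cost of noisier gradient estimates and less efficient use of the GPU.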
Nonlinear activation function
The activation function of a neural network brings
non-linearity into the learning procedure, empower-
ing the model to learn distributed feature represen-
tations. When choosing an activation function, care
must be taken regarding the saturation of the activa-
tions and overly linear behavior. If activations in the
network become too saturated during learning, the
gradients do not propagate well and units become in-
active. If the functions behave in a linear fashion,
complex interactions will not be grasped. We con-
sider the three most commonly-used non-linearities,
i.e. the sigmoid, the hyperbolic tangent and the rec-
tified linear unit of which the random HP selection
set-up randomly selects one.
[Figure 2 panels: sigm(x), tanh(x) and ReLU(x) plotted against x for x from -10 to 10.]
Figure 2: Nonlinear activation functions for input x (left: sigmoid, right: tanh, below: ReLU).
Sigmoid activation unit The sigmoid activation
unit performs the following function for a hidden neuron
$j$ in layer $h$ on the input $x$ received from layer
$(h-1)$: $\sigma(a) = \frac{1}{1 + e^{-a}}$ with $a = w_{\cdot j}^{T} x + b_{hj}$, where $x$
is the output of the previous layer (or the input features
if it concerns the first hidden layer), $b_{hj}$ is the
bias associated with hidden unit $j$ in layer $h$ and
$w_{\cdot j}$ is the weight vector associated with hidden unit $j$.
The activation function curve is shown in Figure 2
(left). As can be seen, the non-linearity squeezes the
input value a into a [0, 1] interval and is symmetric
around 0.5. The interval limits have a nice intuitive
interpretation: a neuron will not fire at all (0) or will
fire at maximum frequency (1).
However, two major drawbacks are linked to its
use. Firstly, the sigmoid nonlinearity suffers from the
vanishing gradient problem: the characteristics of its
gradient function lead to near-zero or zero gradients
as it saturates at both tails for 0 and 1. This in-
fluences the global gradient calculation during back-
propagation, and may kill the gradient, making it
difficult for the parameters to be learned. There-
fore, initial weights must be chosen carefully, as
large absolute weights lead to highly-saturated neurons.
The weights are initialized from the following
uniform distribution: $\mathrm{Uniform}(-r, r)$ with
$r = 4\sqrt{6/(\text{fan-in} + \text{fan-out})}$, with fan-in and fan-out the
number of incoming and outgoing connections
respectively [19]. This initial setting activates the
sigmoid in its linear region, so that the main linear
behavior is learned first. Secondly, its non-zero
mean and the computation of the exponential func-
tion both result in slow learning in highly-layered ar-
chitectures [31].
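The initialization rule can be sketched in plain Python as follows. The function name, the layer sizes and the seed are hypothetical; the scale factor of 4 is the one stated in the text for the sigmoid unit.

```python
import math
import random

def init_weights(fan_in, fan_out, scale=4.0, rng=None):
    """Draw a fan_in x fan_out weight matrix from Uniform(-r, r),
    with r = scale * sqrt(6 / (fan_in + fan_out))."""
    rng = rng or random.Random(0)
    r = scale * math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-r, r) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, r

W, r = init_weights(fan_in=100, fan_out=50)
print(round(r, 4))  # 0.8
```

Because the bound shrinks as the layer gets wider, the summed input to each unit stays small at the start of training, which keeps the sigmoid away from its saturated tails.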
Hyperbolic tangent unit The hyperbolic tangent
unit uses the following function:
$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$, with $a = w_{\cdot j}^{T} x + b_{hj}$. The function is
shown in Figure 2 (middle), demonstrating that
the hidden neuron values are projected onto a
$[-1, 1]$ interval. Its initial weights are set following
the uniform distribution $\mathrm{Uniform}(-r, r)$ with
$r = 4\sqrt{6/(\text{fan-in} + \text{fan-out})}$, resulting in activating the
tanh function in its linear region. The tanh function
differs from the sigmoid activation in that its output
is zero-centered; however, saturating
neurons still occur.
Rectified Linear Unit (ReLU) The Rectified
Linear Unit function employs a zero-threshold hinge
function: $\mathrm{ReLU}(a) = \max(0, a)$ with $a = w_{\cdot j}^{T} x + b_{hj}$.
The advantages of the ReLU are linked to the dis-
advantages of the other two nonlinearities. Firstly,
the gradient of a ReLU unit will either be constant
(a > 0) or zero (a ≤ 0). In the former case, the
gradient never kills other gradients during backprop-
agation. With a careful selection of the initial weights
(Uniform(−r, r) with r = 2/fan-in), a being zero
does not result in the vanishing gradient problem
as long as the gradient can propagate along some
paths [15]. Moreover, when a fraction of the neurons
is zero, sparse representations are favored through-
out the network. A sparse deep neural network (1)
has the advantage of learning robust models, (2) en-
ables variable-size feature representations, and (3) re-
sults in sparse representations which are more easily
linearly-separable than dense higher-level feature rep-
resentations.
The second advantage of the Rectified Linear Unit
is the fact that no expensive operations are needed
and thus this will result in a faster model compared
to the other two activation functions. Figure 2 (right)
shows the hinge function of the ReLU.
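The saturation behavior of the three nonlinearities can be checked numerically with a short Python sketch. The helper names and the evaluation points are chosen for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def tanh_grad(a):
    return 1.0 - math.tanh(a) ** 2

def relu(a):
    return max(0.0, a)

def relu_grad(a):
    # Constant gradient for a > 0, zero gradient otherwise.
    return 1.0 if a > 0 else 0.0

for a in (-10.0, 0.0, 10.0):
    print(f"a={a:6.1f}  sigmoid'={sigmoid_grad(a):.2e}  "
          f"tanh'={tanh_grad(a):.2e}  relu'={relu_grad(a):.0f}")
```

At a = 10 the sigmoid and tanh gradients are close to zero, whereas the ReLU gradient is still 1. For negative inputs the ReLU gradient is exactly zero, which is what produces sparse activations.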
Learning rate
The learning rate of the stochastic gradient descent
optimization algorithm can influence convergence to-
wards an optimum. If the learning rate is set too
high, optima can be overlooked and the loss function
will increase. In contrast, too small a learning rate
results in very slow convergence.
Our random hyper-parameter search algorithm
draws an initial learning rate α0 log-uniformly be-
tween 0.0001 and 1.0 with an adaptive annealing
scheme. If at iteration t, the previous solution is not
found to be better than the current solution in the
optimization algorithm (given a specific improvement
threshold), $\alpha_t$ is set as follows: $\alpha_t = \frac{t \cdot \alpha_{t-1}}{\max(t, t+1)}$ [2].
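The search space described in the preceding subsections can be sampled with a sketch such as the one below. Several details are assumptions made for illustration: the dictionary keys, the 50% probability of pre-training, rounding the number of hidden units to an integer, and reading the minibatch size as a uniform integer between 20 and 100.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_configuration(rng):
    """Sample one hyper-parameter configuration (hypothetical layout)."""
    n_layers = rng.randint(1, 8)
    return {
        "pretrain": rng.random() < 0.5,               # assumed 50/50 choice
        "n_layers": n_layers,
        "hidden_units": [int(round(log_uniform(rng, 18, 2000)))
                         for _ in range(n_layers)],
        "batch_size": rng.randint(20, 100),           # assumed integer range
        "activation": rng.choice(["sigmoid", "tanh", "relu"]),
        "learning_rate": log_uniform(rng, 1e-4, 1.0),  # initial rate
    }

rng = random.Random(0)
configs = [sample_configuration(rng) for _ in range(100)]
print(configs[0])
```

Drawing the number of hidden units and the learning rate log-uniformly spreads the trials evenly across orders of magnitude, so small learning rates such as 0.0001 to 0.001 receive as many trials as the range 0.1 to 1.0.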
Analysis and Results
To get insight into the performance of deep learning
techniques on behavioral data, three analyses are per-
formed. First, we present a comparison with the
performance of 11 traditional, shallow classifiers and
compare these for statistically significant differences.
The goal is to provide a sound statement regarding
whether and when deep techniques can reach supe-
rior performance for behavioral big data. Secondly,
we analyze the influence of hyperparameter values on
the classification performance for deep learning net-
works. The goal is to get insight into the influence
of the learning network architecture, the characteris-
tics of the optimization process and the unsupervised
generative preprocessing phase. Third, we provide
insight into why deep learning techniques perform
well on behavioral big data. To this end, we focus
on the representation-learning characteristic of the
techniques and assess the value present in these rep-
resentations. The results and interpretations of these
analyses are presented in each of the next subsections.
Comparison with Shallow Techniques
To demonstrate the predictive performance of the
deep learning approach on large-scale behavioral
data, we evaluate its classification performance on
a collection of behavioral data sets put forward as a
benchmark in De Cnudde et al. [10]. We enrich this
collection with data sets capturing people’s likes on
Facebook [28, 6].
The MovieLens and YahooMovies data sets, for
which we predict the age and the gender of the users,
contain data regarding which movies are rated by
the users. The larger MovieLens10m data set is used
to predict the genre of a movie based on the users
rating it. This multi-class classification task is translated
into 18 binary classification problems. The Ecommerce
data set provides product viewing data on an
e-commerce website from which the gender of the
users is inferred. Shopping transactions
in the TaFeng data set are used to predict users’ age.
In the BookCrossing data set, books are rated by
the members of the BookCrossing community and
based on these ratings, the age of the users is pre-
dicted. The LibimSeTi data set contains ratings of
dating profiles from which the gender of the users is
inferred. The 2015 KDD Cup challenge (KDD2015)
concerned the prediction of MOOC dropout based
on fine-grained course interaction data. In the A-Card
data set, location visiting behavior from a government
loyalty card is modeled and a prediction is
made concerning whether a user will use collected
benefits, whether he will stop using his loyalty card
and whether he will visit one of five locations in the
near future [9]. The Fraud data set contains finan-
cial banking transactions between Belgian and for-
eign companies in order to determine whether a com-
pany is involved in fraudulent transactions [27]. Cus-
tomer banking transactions are modeled in Banking
from which a customer’s interest in a banking product
is predicted [37]. Next, customers’ interest in an
online car advertisement is predicted based on their
website viewing behavior in the Car data set. In the
Flickr data set, users tag pictures as their favorite
from which a picture’s number of comments is pre-
dicted. Lastly, the Facebook data sets model users
liking pages on Facebook from which the following
target variables are predicted: intelligence, religion
(Christian vs. Muslim), satisfaction with life, po-
litical belief (liberal vs. conservative), gay for male
and female users, and gender. Table 1 lists the main
characteristics of these data sets and as one can ob-
serve, the dimensionality and the sparsity both are
extremely high.
In [10], a structured comparative study is performed
of 11 widely-used shallow classification techniques
on behavioral data sets. These techniques are variations
of support vector machines, naive Bayes, logistic
regression and a relational classifier called PSN. For details
regarding the techniques, their parameters and
the results, we refer to [10].
Table 2 depicts the results taken from [10] comple-
mented with the results from applying deep learning
(DL). For each data set, the best shallow classifier is
underlined and the best overall classifier is denoted in
boldface. At the bottom, the table also gives the aver-
age rank of the techniques (with the best-performing
technique receiving rank 1) and the number of wins
in terms of performance. Table 3 lists, for each data
set, the hyperparameter (HP) configuration with which
the deep learning results are reached. Note that this
HP configuration is not necessarily the best possible
configuration for that data set: it is the
best-performing one among 100 random configurations
that are tested for each data set, and finetuning
these settings could lead to even better
classification performance.
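
The random configuration search described above can be
sketched as follows. This is a minimal illustration,
not the implementation used in the study: the sampling
ranges, the names sample_configuration and
random_search, and the scoring function
toy_validation_auc are all assumptions. In practice
the scoring function would train a network with the
sampled configuration and return its validation AUC.

```python
import random

NONLINEARITIES = ("sigmoid", "tanh", "relu")


def sample_configuration(rng):
    """Draw one random architecture; all ranges are illustrative assumptions."""
    n_layers = rng.randint(1, 8)
    return {
        "hidden_units": [rng.randint(16, 2048) for _ in range(n_layers)],
        "nonlinearity": rng.choice(NONLINEARITIES),
        "learning_rate": rng.uniform(0.001, 1.0),
    }


def random_search(evaluate, n_configurations=100, seed=0):
    """Return the best of n_configurations random draws and its score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_configurations):
        config = sample_configuration(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_auc(config):
    """Hypothetical stand-in for training a network and measuring validation AUC."""
    return 1.0 / (1.0 + abs(config["learning_rate"] - 0.3))


best_config, best_score = random_search(toy_validation_auc)
print(best_config["nonlinearity"], len(best_config["hidden_units"]), round(best_score, 3))
```

Because the search only keeps the best of a fixed
number of draws, a better configuration may exist
outside the sampled set, which is why further
finetuning could still improve the results.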
Data set      Target variable         Instances   Features   Active elements  Sparsity
MovieLens100k age/gender              943         1,682      100,000          93.6953%
MovieLens1m   age/gender              6,040       3,883      1,000,209        95.7353%
Facebook      IQ                      6,124       128,787    850,181          99.8922%
YahooMovies   age/gender              7,642       106,363    221,330          99.9727%
MovieLens10m  genre                   10,681      69,878     10,000,053       98.6602%
Ecommerce     gender                  15,000      21,880     33,455           99.9898%
Facebook      religion                4,913       128,787    388,973          99.9385%
Facebook      satisfaction with life  30,138      128,787    2,873,825        99.9260%
TaFeng        age                     31,640      23,719     723,449          99.9036%
Facebook      political belief        38,099      128,787    2,214,178        99.9549%
BookCrossing  age                     167,175     337,921    838,364          99.9985%
LibimSeTi     gender                  137,806     220,970    15,656,500       99.9486%
KDD2015       MOOC dropout            120,542     5,891      1,919,150        99.7300%
Facebook      gay (male)              172,191     128,787    11,171,527       99.9496%
A-Card        loyalty/go to           177,761     2,448      435,244          99.9000%
Facebook      gay (female)            214,130     128,787    16,546,926       99.9400%
Facebook      gender                  386,321     128,787    27,718,453       99.9443%
Fraud         fraudulent              858,131     107,345    1,955,912        99.9979%
Banking       interest in product     1,204,726   3,192,554  20,914,516       99.9995%
Car           interest in ad          9,108,905   2,936,810  65,464,708       99.9998%
Flickr        comments                11,195,144  497,472    34,645,469       99.9994%

Table 1: Behavioral data sets.
Data set               DL        MN-NB     MV-NB     LA-SVM-L2 LS-SVM-L1 LS-SVM-L2 PSN       LR-BGD-L1 LR-BGD-L2 LR-SGD-L1 LR-SGD-L2 RBF-SVM
BookCrossing           55.32     56.17     56.05     56.16     54.46     56.24     53.54     54.78     56.25*    55.10     55.13     53.48
Facebook swl           55.67     57.20     56.78     55.57     55.21     56.23     58.32*    55.31     56.23     54.14     54.69     53.12
Banking                66.24     54.79     68.17*    54.25     53.40     54.62     67.08     53.73     54.95     66.28     67.24     53.45
TaFeng                 71.35     71.92*    70.58     70.57     69.52     71.19     69.86     69.80     71.36     65.77     66.05     63.04
YahooMovies age        65.66     59.55     65.20     63.06     63.82     64.57     65.28     63.54     64.84     61.52     61.55     71.94*
Fraud                  77.22*    76.27     77.16     57.13     50.41     55.68     76.87     52.66     55.48     72.75     77.11     69.09
A-Card Goto MAS        77.64*    63.73     76.50     68.58     63.36     63.27     75.51     64.93     64.97     75.55     76.50     64.78
Car                    78.16*    57.71     77.52     69.98     77.14     74.21     71.06     75.53     77.70     72.31     72.09     55.83
MovieLens adventure    78.61*    73.6      59.72     73.52     73.79     73.94     76.57     73.79     73.98     68.86     67.64     70.07
MovieLens 100k gender  79.30*    74.48     69.63     74.73     73.02     77.09     74.71     72.82     77.21     73.78     74.49     75.75
Ecommerce              76.34     77.24     55.85     79.54     75.31     79.60     68.30     76.02     79.62*    79.26     79.26     71.95
MovieLens crime        79.86*    74.59     72.30     75.09     75.07     75.83     76.33     75.16     76.13     72.52     73.07     68.34
Facebook female gay    77.98     76.46     76.55     73.84     75.36     74.41     80.29*    76.16     75.21     75.01     76.06     73.21
MovieLens fantasy      80.83*    61.99     56.13     61.35     60.43     62.27     78.3      61.23     62.22     61.00     61.80     69.85
MovieLens mystery      80.65     59.86     57.32     69.61     69.85     69.70     81.07*    70.32     68.03     61.95     62.80     69.26
YahooMovies gender     81.35*    73.81     79.30     79.92     79.08     80.30     80.18     79.14     80.43     75.36     75.44     78.38
MovieLens romance      81.75*    65.43     73.06     70.16     70.37     69.44     79.56     71.25     68.20     69.31     69.28     61.81
A-Card Goto Permeke    83.35*    77.47     80.84     61.13     64.50     64.34     78.27     63.14     63.14     82.35     82.35     81.86
MovieLens thriller     84.21*    70.00     73.27     74.53     75.16     75.03     84.04     76.75     75.08     72.97     73.45     70.31
MovieLens drama        84.72*    73.65     65.60     82.18     82.76     83.27     76.52     82.84     83.74     79.35     79.17     70.07
KDD2015                85.26*    70.79     81.14     77.95     79.19     79.25     83.73     79.84     79.89     84.43     83.71     81.57
MovieLens children     85.46*    80.97     73.02     79.10     79.78     79.54     83.70     79.83     79.7      76.07     75.05     80.75
A-Card defect          84.92     51.41     85.50*    76.53     75.49     75.14     78.68     70.93     75.39     81.68     81.72     65.65
MovieLens 1m gender    85.63*    80.54     76.84     84.05     83.83     84.83     81.33     84.17     85.20     80.55     80.56     82.57
A-Card Goto Wezenberg  86.28*    79.74     84.57     53.25     56.88     57.00     84.92     55.19     55.21     85.36     85.40     79.37
A-Card Goto Roma       86.76*    70.78     86.60     60.12     56.52     56.59     86.32     58.03     57.96     85.37     85.38     67.79
MovieLens comedy       87.04*    77.02     77.82     84.45     85.03     85.84     78.21     85.29     86.13     82.19     82.09     74.82
Flickr                 87.45*    77.77     85.99     73.48     76.77     76.22     76.63     77.08     76.97     84.38     84.25     80.15
MovieLens action       87.79*    82.07     66.30     85.90     86.48     86.53     82.52     86.76     86.89     83.53     83.37     84.33
MovieLens animation    88.31*    84.83     70.00     87.29     87.53     87.16     85.91     88.04     87.10     80.75     80.55     84.86
A-Card Goto Zoo        88.52*    71.71     86.58     60.34     55.73     55.70     87.05     57.71     57.64     85.30     85.74     72.56
MovieLens scifi        89.08*    76.73     68.08     79.08     78.75     78.85     88.50     80.42     79.66     73.54     73.94     69.64
MovieLens 100k age     89.50*    79.43     77.20     87.09     84.17     87.71     80.21     85.08     87.95     84.10     83.33     81.61
MovieLens documentary  90.04*    83.55     71.40     87.94     87.45     88.44     87.82     87.98     88.57     85.76     85.90     79.95
MovieLens western      90.72     84.22     89.67     84.58     86.00     85.24     91.37*    85.72     85.82     84.94     84.84     69.12
MovieLens 1m age       91.39*    81.96     78.79     90.34     89.80     90.43     83.16     89.92     90.81     87.30     87.13     83.25
A-Card cashout         91.48     55.90     91.54*    74.01     70.87     70.35     83.34     71.12     70.96     90.86     90.88     90.60
Facebook religion      92.02*    82.70     85.55     86.06     85.27     86.65     90.72     85.40     86.91     86.96     87.02     85.67
MovieLens horror       92.42*    90.88     88.80     91.14     91.01     91.08     91.08     91.46     91.27     90.01     89.81     88.62
Facebook male gay      91.78     90.59     89.35     88.23     88.59     88.59     92.51*    88.95     88.89     87.44     89.96     83.24
MovieLens filmnoir     90.12     79.50     71.82     76.40     78.94     76.07     92.90*    78.57     78.60     68.33     69.10     80.17
Facebook political     93.63*    82.66     81.77     83.50     83.36     84.00     83.48     83.53     84.10     78.18     78.26     79.65
Facebook IQ            94.40*    68.11     65.02     67.11     62.38     66.98     62.53     67.29     63.23     65.00     61.78     62.57
MovieLens musical      95.02*    80.50     72.05     81.25     79.61     80.85     90.34     79.43     80.53     75.84     76.08     73.31
Facebook gender        93.47     88.49     88.69     94.95     95.01     95.02     95.03     95.09*    87.89     88.54     85.34     90.05
MovieLens war          95.45*    72.86     70.61     81.18     78.82     81.71     95.23     79.92     80.74     79.80     79.83     77.53
LibimSeTi              99.69*    99.64     99.65     99.68     99.69*    99.68     78.97     99.69*    99.69*    99.65     99.65     99.66

Average ranking (AUC)  1.68      8.21      7.55      6.82      7.50      6.00      4.82      6.27      5.28      7.81      7.38      8.68
Number of wins (AUC)   34        1         3         0         1         0         6         2         2         0         0         1

Table 2: Predictive performance of the models in terms of AUC for binary behavioral data sets. The
highest-achieved performance for a data set, set in boldface in the typeset table, is marked with
an asterisk (*). The best-achieved performance by the shallow classifiers, underlined in the
typeset table, is the highest value among the non-DL columns of a row. (MN-NB = multinomial naive
Bayes, MV-NB = multivariate naive Bayes, LA-SVM-L2 = support vector machine with least absolute
loss function and L2 regularization, LS-SVM-L1 = support vector machine with least squared loss
function and L1 regularization, LS-SVM-L2 = support vector machine with least squared loss function
and L2 regularization, PSN = relational classifier, LR-BGD-L1 = logistic regression with batch
gradient descent and L1 regularization, LR-BGD-L2 = logistic regression with batch gradient descent
and L2 regularization, LR-SGD-L1 = logistic regression with stochastic gradient descent and L1
regularization, LR-SGD-L2 = logistic regression with stochastic gradient descent and L2
regularization, RBF-SVM = support vector machine with RBF kernel.)
From Table 2 it is clear that, overall, deep learning
performs at least as well as, and often better than,
the best shallow techniques. Its average ranking
(1.68) is much lower than that of any other
classifier, and it achieves 34 wins compared to 6 for
the best-performing shallow technique (PSN). For data
sets with low signal-from-noise separability (from
BookCrossing to YahooMovies age in Table 2), how-
ever, deep learning consistently results in lower clas-
sification performance. We validate this finding with
the signal-from-noise separability SNS threshold of
approximately 83% given in [40]. Figure 3 plots
the AUC of deep learning against the highest AUC
reached by any of the shallow classifiers (BEST).
Data points under the diagonal are data sets for
which DL performs better than the best wide tech-
nique. It can be seen that overall for data sets with
an AUC lower than approximately 83%, the shallow
classifiers perform better than deep learning. When
looking at deep learning literature, we indeed find
that the best performance gains are reached for high
SNS data sets where the signal-from-noise separa-
bility of the problem is inherently larger (such as for
object and speech recognition).
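
The two summary rows of Table 2 (average rank and
number of wins) can be computed as in the sketch
below. It uses only four data sets and four
classifiers taken from Table 2, so the resulting
numbers differ from the full table. The helper
ranks_descending is hypothetical, and giving tied
classifiers the average of their ranks is an
assumption.

```python
def ranks_descending(scores):
    """Rank scores so that the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# AUC values copied from Table 2 for a small subset of classifiers and data sets.
classifiers = ["DL", "MV-NB", "PSN", "LR-BGD-L2"]
auc = {
    "Banking": [66.24, 68.17, 67.08, 54.95],
    "Fraud": [77.22, 77.16, 76.87, 55.48],
    "Facebook swl": [55.67, 56.78, 58.32, 56.23],
    "MovieLens war": [95.45, 70.61, 95.23, 80.74],
}

rank_sums = [0.0] * len(classifiers)
wins = [0] * len(classifiers)
for scores in auc.values():
    for j, rank in enumerate(ranks_descending(scores)):
        rank_sums[j] += rank
        if scores[j] == max(scores):
            wins[j] += 1
average_rank = [total / len(auc) for total in rank_sums]

for name, rank, win in zip(classifiers, average_rank, wins):
    print(f"{name:10s} average rank {rank:.2f}, wins {win}")
```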
In line with what is described in [18], the absolute
performance improvement of DL over the other
classifiers generally is not that high. The additional
complexity of deep learning does not result in finding
new and unseen relations in the data, as these already
seem to be accurately described by the shallow
classifiers (BEST). Remarkably, however, in the cases
where DL performs better, it seems to consistently
capture what is learned under the different
assumptions made by the other techniques. We therefore
perform a head-to-head comparison between DL and the
best traditional classification technique for each
data set (BEST). This seems fair, as the DL results
follow from an elaborate hyperparameter optimization
step. Choosing the best shallow classifier for each
data set is similar to selecting the best DL
architectural parameters and can give a better answer
regarding whether deep classifiers really are superior
to wide classifiers. We use the Wilcoxon signed-rank
test to compare DL with BEST and find a p-value of
0.82, which is not significant at the level α = 0.05.
Although the total number of data sets for which DL
performs better is larger, three low-SNS data sets
from the random sample (Ecommerce, Banking and
Facebook female gay) make the comparison not
statistically significant. Thus, the insignificance of
the performance improvement can be attributed to (1)
the additional complexity of deep learning merely
resulting in marginal performance improvements [18]
which do not contribute to statistical power, and (2)
DL performing worse for low-SNS data.

Data set               Number of layers and hidden units per layer  Nonlinearity  Learning rate α0
BookCrossing           [18, 72, 233, 175, 37]                       ReLU          0.43858
Facebook swl           [654, 90]                                    ReLU          0.16641
Banking                [17, 32]                                     ReLU          0.25348
TaFeng                 [194]                                        tanh          0.04343
YahooMovies age        [736, 1392]                                  ReLU          0.84896
Fraud                  [125, 212, 178]                              tanh          0.14347
A-Card Goto MAS        [71, 1470, 174, 51]                          sigmoid       0.54500
Car                    [22, 43]                                     ReLU          0.67745
MovieLens adventure    [47]                                         ReLU          0.89871
MovieLens 100k gender  [545, 52, 166, 184]                          tanh          0.31278
Ecommerce              [687, 45, 109, 31]                           tanh          0.24682
MovieLens crime        [1678, 363, 163, 95]                         tanh          0.01436
Facebook female gay    [67, 243]                                    sigmoid       0.52290
MovieLens fantasy      [63, 24, 324, 45, 1895]                      ReLU          0.18981
MovieLens mystery      [21]                                         ReLU          0.61191
YahooMovies gender     [72, 995, 832, 1939]                         ReLU          0.02167
MovieLens romance      [95, 47, 794]                                ReLU          0.99152
A-Card Goto Permeke    [169, 37, 206, 45, 193]                      ReLU          0.59104
MovieLens thriller     [93]                                         ReLU          0.75717
MovieLens drama        [287, 1764, 32, 65]                          tanh          0.59338
KDD2015                [130, 30, 99, 1388, 1533]                    sigmoid       0.44715
MovieLens children     [213, 69, 21, 182, 1130]                     tanh          0.88797
A-Card defect          [2193, 345, 962]                             ReLU          0.65196
MovieLens 1m gender    [532, 135, 1009]                             ReLU          0.30452
A-Card Goto Wezenberg  [1099, 994, 493, 627]                        tanh          0.07435
A-Card Goto Roma       [63]                                         ReLU          0.24249
MovieLens comedy       [19, 398]                                    tanh          0.00540
Flickr                 [21, 25]                                     ReLU          0.54377
MovieLens action       [45]                                         tanh          0.94335
MovieLens animation    [288, 677, 772]                              ReLU          0.52020
A-Card Goto Zoo        [223, 2344]                                  sigmoid       0.37128
MovieLens scifi        [23]                                         sigmoid       0.56118
MovieLens 100k age     [48, 709]                                    ReLU          0.11159
MovieLens documentary  [521, 78]                                    sigmoid       0.20603
MovieLens western      [1345, 218, 1013]                            ReLU          0.86652
MovieLens 1m age       [1863, 93]                                   tanh          0.14166
A-Card cashout         [27, 277, 1046, 391, 146, 168, 148, 85]      tanh          0.01111
Facebook religion      [1414, 464, 71, 424, 60]                     tanh          0.09468
MovieLens horror       [171, 467]                                   tanh          0.06355
Facebook male gay      [473, 293]                                   sigmoid       0.10999
MovieLens filmnoir     [1242]                                       tanh          0.60603
Facebook political     [27, 146]                                    tanh          0.76728
Facebook IQ            [129, 32]                                    sigmoid       0.52109
MovieLens musical      [67, 656, 20, 1204]                          ReLU          0.75767
Facebook gender        [76, 367]                                    ReLU          0.10510
MovieLens war          [248, 35, 694, 855]                          tanh          0.52687
LibimSeTi              [96, 348, 36, 800, 47]                       sigmoid       0.01922

Table 3: Best hyperparameter selection for each data set from the random hyperparameter search
procedure for the deep learning classification techniques.
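
The head-to-head comparison between DL and BEST relies
on the Wilcoxon signed-rank test. The sketch below
shows the mechanics of the test on eight data sets
taken from Table 2, with BEST read off as the highest
shallow AUC per row. It is an illustration only: the
function name is hypothetical, the p-value comes from
a normal approximation without tie or continuity
correction (a rough choice for so few pairs), and the
result on this subset is not expected to match the
p-value reported for all data sets.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using a normal approximation.

    Returns (w_plus, w_minus, p_value). Zero differences are dropped.
    """
    # AUC values carry two decimals, so rounding removes floating-point noise.
    diffs = [round(a - b, 2) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1  # average rank for tied absolute differences
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# DL and BEST (highest shallow AUC) for eight rows of Table 2:
# BookCrossing, Facebook swl, Banking, TaFeng, Fraud, Car, KDD2015, Facebook IQ.
dl = [55.32, 55.67, 66.24, 71.35, 77.22, 78.16, 85.26, 94.40]
best = [56.25, 58.32, 68.17, 71.92, 77.16, 77.70, 84.43, 68.11]

w_plus, w_minus, p_value = wilcoxon_signed_rank(dl, best)
print(w_plus, w_minus, round(p_value, 3))
```

Because the test works on ranks of the absolute
differences, a few data sets on which DL loses by a
clear margin can offset many data sets on which it
wins by a small margin.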
Influence of Hyperparameters
When training the models and evaluating their gen-
eralization capacity, no improvement is found when
using a pre-training step. Figure 4 shows the test
classification error on the MovieLens1m data set for
500 different random configurations for which only
the pre-training flag was changed. It is clear from this
comparison that not pre-training the network results
in a more robust and lower test classification error.
Furthermore, pretraining the network substantially
increases the training time, which is already very
high for this type of data.
The reason for the inferior performance of pretrain-
ing could be the following. The input constitutes one
aspect of a person’s behavior e.g. location visiting
data and the model needs to learn to predict another
aspect of that person’s behavior e.g. whether that
person would trade in loyalty points. When pretrain-
ing a model in an unsupervised fashion, a generative
model learns the variations present in the input data
without taking into account the target classification
variable. The applications for which DL has proven
superior such as object recognition, are often per-
formed in a generative manner and the pretraining
phase contributes as the signal to be learned is
already highly present in the data. For example, in
the case of hand-drawn digits (for which pretraining
has indeed proven valuable), learning the variations
in the input with an unsupervised pretraining step
indeed captures which number a digit represents.
Learning input variations in human behavioral data
without taking into account the specific class mem-
bership variable is much more complex, as often both
are only related to a small extent (e.g. location vis-
iting behavior and loyalty prediction). Tailoring a
classification model by learning the underlying dis-
tribution of a data set has already been shown to be
a delicate task. Indeed, in previous predictive analy-
sis literature, discriminative models have consistently
outperformed generative models [10, 39]. For behav-
ioral data, especially, the underlying distribution of
human behavior is not known and until now, no fit-
ting feature generation model has been put forward.
Attempts have been made in [25] demonstrating the
fit of a Wallenius event model for human choice be-
havior. This finding is also corroborated by recent
research in deep learning. As the field gains maturity,
more researchers apply the models to more complex
data with lower signal-from-noise separability. This
coincides with a decrease in employing the pretrain-
ing generative step, which could indeed imply that a
target variable is valuable in learning from complex
data sets.

Figure 3: AUC for deep learning (DL, horizontal axis) plotted against the highest AUC reached by
the wide classification techniques (BEST, vertical axis) for all data sets in Table 1.
Figure 5 depicts the test classification error for the
MovieLens1m data set when varying the nature of
the activation function for 500 random parameter
settings. Overall, the tanh nonlinearity gives the
best results, and this holds for the bulk of the data
sets analyzed in our study. Although ReLU units
provide faster models and could be recommended in very
high-dimensional settings for reducing computational
complexity, their sparse representations result in
lower predictive performance.
When analyzing high-dimensional behavioral data for
predictive purposes, it makes sense that each fine-
grained behavioral aspect contributes to the final pre-
diction. This has also been found in [10, 38, 6]; i.e. L2
regularization outperforms L1 regularization in this
context when taking into account the raw features.
In this multi-layered setting, each separate neuron in
higher layers also seems to contribute to the predic-
tion.
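
The difference between the dense representations of
sigmoid and tanh units and the sparse representations
of ReLU units can be illustrated with a small sketch.
The pre-activation values and the helper
exact_zero_fraction are assumptions chosen for
illustration; they are not taken from the trained
models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def exact_zero_fraction(activation, pre_activations):
    """Share of hidden units whose output is exactly zero (an inactive unit)."""
    outputs = [activation(z) for z in pre_activations]
    return sum(1 for value in outputs if value == 0.0) / len(outputs)


# Hypothetical pre-activations of five hidden units for one instance.
pre_activations = [-2.0, -0.5, -0.1, 0.5, 2.0]

for name, activation in [("sigmoid", sigmoid), ("tanh", math.tanh), ("relu", relu)]:
    share = exact_zero_fraction(activation, pre_activations)
    print(f"{name:8s} share of inactive units: {share:.1f}")
```

With ReLU, every unit with a negative pre-activation
drops out of the representation, whereas with tanh
and sigmoid all units keep contributing a nonzero
value to the next layer.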
Finally, when looking at the effect of the number of
hidden layers on the test classification error in
Figure 6, no consistently better-performing setting
can be found over one data set (MovieLens1m), nor over
all data sets (see the configurations in Table 3).
Practitioners should therefore resort to randomly
testing several configurations with a varying number
of hidden layers and choose the best-performing
architecture setting.

Figure 4: Test classification error for the MovieLens1m data set for predicting gender without
unsupervised pre-training (left) and with pre-training (right).
Distributed Feature Representation
One of the advantages of linear, shallow classifiers lies
in their comprehensibility. The weights given by the
model to each of the input features can be inspected
and interpreted by humans. In some domains, model
comprehensibility is mandatory before a model can
be deployed in a business setting and explaining how
the model’s predictions originate can help acceptance
within an organization [16, 17, 35, 36].
Table 4 lists the features with the highest weights
given by logistic regression with L2 regularization
for the following classification tasks: predicting (1)
gender and (2) age from movie rating data, and
predicting (3) liberal political beliefs and (4) high
intelligence from Facebook like data. Looking in more
detail at these top-ranked features, we can observe
that they indeed make sense. When predicting whether a
user is male based on movie rating data, we can see
that higher weights are given to movies mainly
targeted towards male users such as Starship Troopers,
Star Trek and Apollo 13. Looking at the weights which
are discriminative for identifying younger users, we
can see horror movies like Scream and I Know What You
Did Last Summer as well as movies clearly targeted at
younger people such as The Princess Bride and Willy
Wonka and the Chocolate Factory. For the Facebook data
set as well, the feature weights are intuitive.
Features such as Barack Obama, The Daily Show and The
Colbert Report clearly say something about a person’s
liberal mindset. When predicting whether a user has a
high IQ, TV shows such as The Big Bang Theory and Lost
receive higher weights.

Figure 5: Test classification error for the MovieLens1m data set for predicting gender with the
sigmoid non-linearity (left), the tanh function (middle) and the ReLU non-linearity (right).

Figure 6: Test classification error for the MovieLens1m data set dependent on the number of hidden
layers (1 to 8) for predicting age (top) and predicting gender (bottom).
As mentioned previously, the deep learning mod-
els learn distributed representations based on com-
positionality in the fine-grained features. Each neu-
ron in this multi-layered setting assigns weights to
lower-level concepts and thus allows for a complex
many-to-many combination of all features on differ-
ent levels. In contrast, shallow techniques such as
logistic regression and support vector machines can
be viewed as consisting of one neuron, assigning fea-
ture weights on one level only (see Table 4). We now
take a closer look at some of the neurons identified
by deep learning on the lowest level and try to assess
whether the distributed low-level representations
really do identify extra nuances in the features.
Table 5 shows five neurons in the first hidden layer
of the best architecture (see Table 3) and the features
for which they are most activated when predicting
gender for the MovieLens1m data set. We observe
that each neuron captures additional, separate nu-
ances (bottom row in Table 5) besides only taking
into account the obvious relation to the target vari-
able as is the case for a single-neuron setting. Neuron
1 for example captures genre in the features by as-
signing higher weights to romance and drama movies.
Neuron 2 focuses on action movies and comedies tai-
lored towards male users, while neuron 4 discrimi-
nates towards male drama movies. Neuron 3 clearly
becomes active for sport movies and neuron 5 iden-
tifies movies mostly targeted towards female users.
The process of interpreting neurons requires domain-
specific knowledge and clearly multiple interpreta-
tions are possible. Moreover, this approach becomes
infeasible when a much larger number of neurons is
present [20]. However, we mainly aim at demonstrat-
ing additional nuances learned by deep architectures
in comparison to shallow ones, thereby demonstrat-
ing the value in distributed feature representation in
this setting.
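
The inspection of first-layer neurons described above
amounts to ranking the input features by their weight
for each neuron. A minimal sketch follows, assuming
the first-layer weights are available as a
features-by-neurons matrix. The function name
top_features_per_neuron is hypothetical, and the
weight values are invented for illustration (only the
movie titles are taken from Table 5).

```python
def top_features_per_neuron(weights, feature_names, k=3):
    """List the k features with the highest weight for every hidden neuron.

    weights[i][j] is the weight from input feature i to hidden neuron j.
    """
    n_neurons = len(weights[0])
    top = []
    for j in range(n_neurons):
        ranked = sorted(range(len(feature_names)),
                        key=lambda i: weights[i][j], reverse=True)
        top.append([feature_names[i] for i in ranked[:k]])
    return top


feature_names = ["Bed of Roses", "Little Women", "Kingpin",
                 "Cool Runnings", "Apollo 13", "Patton"]
# Invented weights for two neurons; not values learned in the study.
weights = [
    [0.9, -0.2],
    [0.8, -0.1],
    [-0.3, 0.7],
    [-0.1, 0.6],
    [0.1, 0.2],
    [-0.4, 0.1],
]

for j, names in enumerate(top_features_per_neuron(weights, feature_names, k=2), start=1):
    print(f"Neuron {j}: {', '.join(names)}")
```

Labelling the resulting lists (for example as romance
or sport movies) remains a manual step that requires
domain knowledge.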
For the Facebook like data set, Table 6 shows four
neurons that help identify a liberal mindset and the
features for which these neurons have high weights.
Again, each neuron can be associated with a higher-
level category that groups similar likes such as mostly
male or female interests, hobbies or rock bands. Note
that an intuitive interpretation is not straightforward
for all neurons in the network.
In addition to these illustrations, we attempt to
formally show the added value present in these dis-
tributed feature representations. To this end, we
transform the original high-dimensional data sets
onto three feature-reduced data sets and assess
their prediction capability. The distributed features
learned by DL are compared with (1) a feature
selection technique which uses L1 regularization to
identify relevant features and with (2) a feature
engineering technique using SVD which projects the
high-dimensional features onto as many features as
there are neurons in the first layer of the DL
architecture (see Table 3).
The results of this analysis are depicted in Table 7.
For 40 out of 47 data sets, the latent features learned
by DL result in better predictive performance com-
pared to the selected or latent features resulting from
the other two approaches. This demonstrates the value
of the representations learned by DL and provides
insight into why deep learning techniques perform
well on big behavioral data. Moreover, these results
open up further research opportunities in terms of us-
ing the distributed nature of first-layer neurons (and
by consequence, higher-level neurons) for purposes of
feature engineering.
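
Using the first-layer neurons as a reduced feature set
can be sketched as follows for one sparse, binary
behavioral instance. The sketch assumes a tanh
nonlinearity and stores only the weights of a few
features. The function name first_layer_features and
all weight values are hypothetical.

```python
import math


def first_layer_features(active_features, weights, biases):
    """Map one sparse binary instance to its first-layer activations.

    active_features: ids of the features that are present (value 1).
    weights[i][j]: weight from feature i to hidden neuron j.
    Only active features contribute, so the cost depends on the number
    of active elements rather than on the full dimensionality.
    """
    hidden = []
    for j, bias in enumerate(biases):
        pre_activation = bias + sum(weights[i][j] for i in active_features)
        hidden.append(math.tanh(pre_activation))
    return hidden


# Invented weights for three features and two hidden neurons.
weights = {0: [0.5, -0.5], 3: [0.5, 0.5], 7: [-1.0, 0.25]}
biases = [0.0, 0.0]

# An instance in which only features 0 and 3 are active.
reduced = first_layer_features({0, 3}, weights, biases)
print(reduced)
```

The resulting vector has one entry per first-layer
neuron and can be passed to any classifier, in the
same way as features obtained from L1-based selection
or an SVD projection.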
Conclusion
In the past, deep learning techniques have resulted in
significant performance improvements in fields such
as object recognition and NLP. This paper performs
a first exploratory study as to whether these improve-
ments also extend to behavioral big data. We demon-
strate the usefulness in learning deep, distributed
representations of the many fine-grained behavioral
features, while shedding light as to when and why
deep learning performs well. As a practical
contribution, we provide guidance regarding favorable
hyperparameter configurations for learning these
models.

Rank  MovieLens (gender)               MovieLens (age)                          Facebook (liberal)    Facebook (intelligence)
1.    Ulee’s Gold                      Scream                                   Barack Obama          Inception
2.    Starship Troopers                Trainspotting                            The Daily Show        The Big Bang Theory
3.    Star Trek: First Contact         I Know What You Did Last Summer          Rent                  How I Met Your Mother
4.    Dr. Strangelove                  Liar Liar                                The L Word            Avatar
5.    Monty Python and the Holy Grail  Pink Floyd - The Wall                    Dixie Chicks          Fight Club
6.    In the Name of the Father        Kiss the Girls                           Charmed               Star Wars
7.    The Ice Storm                    Scream 2                                 Yoga                  TED
8.    Tomorrow Never Dies              Rumble in the Bronx                      Amelie                Paulo Coelho
9.    A Clockwork Orange               Spawn                                    Weeds                 Lost
10.   Four Weddings and a Funeral      The Princess Bride                       Brokeback Mountain    Music
11.   Six Degrees of Separation        Heathers                                 Crash                 Dexter
12.   The Great Dictator               Monty Python and the Holy Grail          Little Miss Sunshine  The Mentalist
13.   The Blues Brothers               Chasing Amy                              The Colbert Report    Sheldon Cooper
14.   The Peacemaker                   Mystery Science Theater 3000: The Movie  The Beatles           My Personality
15.   The Abyss                        Star Wars                                Green Day             The Office
16.   McHale’s Navy                    Tomorrow Never Dies                      Philosophy            Disney’s The Lion King
17.   Patton                           Willy Wonka and the Chocolate Factory    CNN                   Megan Fox
18.   Sphere                           Swingers                                 Sex and the City      Pink Floyd
19.   Apollo 13                        Romeo and Juliet                         True Blood            Burn Notice
20.   Gattaca                          Much Ado About Nothing                   Travelling            Futurama

Table 4: Top weights with highest coefficient for logistic regression with L2 regularization. The
higher scores indicate higher probability of being (1) male when predicting gender, (2) young when
predicting age for the MovieLens data set, (3) liberal when predicting political preference for the
Facebook data set, and (4) high IQ when predicting intelligence for the Facebook data set.

Neuron 1 (Romance, drama): The Mirror has Two Faces; Bed of Roses; Old Yeller; Kiss the Girls;
    Fly Away Home; The Adventures of Priscilla; To Kill a Mockingbird; Michael; Little Women;
    When a Man Loves a Woman; The Fan; Rebecca; Some Kind of Wonderful; The Postman;
    Searching for Bobby Fischer; Strictly Ballroom; The Sound of Music; Stand by Me; The Shining;
    Everyone Says I Love You
Neuron 2 (Male, action, comedy): McHale’s Navy; The Blues Brothers;
    Mystery Science Theater 3000: The Movie; Tomorrow Never Dies; Starship Troopers;
    Monty Python and the Holy Grail; Four Weddings and a Funeral; Star Trek: First Contact;
    Apollo 13; The Great Dictator; 12 Angry Men; Big Night; Ulee’s Gold;
    Six Degrees of Separation; Duck Soup; Maverick; Dr. Strangelove; Volcano; The Wedding Singer;
    A Clockwork Orange
Neuron 3 (Sport, buddy): Kingpin; Cool Runnings; Coneheads; Ed
Neuron 4 (Male, drama): Apollo 13; Kundun; Raging Bull; White Squall; Pulp Fiction;
    The Madness of King George; Under Siege; Ran; Patton; The Deer Hunter;
    Star Trek: First Contact
Neuron 5 (Female): Everyone Says I Love You; Little Women; Romeo and Juliet;
    The Truth About Cats and Dogs; The Full Monty; Beauty and the Beast;
    Searching for Bobby Fischer; The Sound of Music; The Mirror has Two Faces;
    Breakfast at Tiffany’s; Strictly Ballroom; Stealing Beauty; A Christmas Carol;
    Secrets and Lies; The Pillow Book; Love Jones; Money Talks; Interview with the Vampire;
    Bed of Roses; A Family Thing

Table 5: Top weights with highest coefficient for five neurons in the deep learning algorithm for
predicting gender on the MovieLens data set. Each neuron captures additional fine-grained
information.

Neuron 1 (Female): Gilmore Girls; Travel; Matchbox 20; Grey’s Anatomy; Counting Crows; Seinfeld;
    Ferris Bueller’s Day Off; Coyote Ugly; Napoleon Dynamite; Sex and the City
Neuron 2 (Male): Two and a Half Men; AFI; Kings of Leon; Iron Maiden; Nip/Tuck; Jim Croce;
    Guns N’ Roses; Led Zeppelin
Neuron 3 (Hobbies): Cooking; Heroes; Reading; Summer Nights; Music; Writing; laughing
Neuron 4 (Rock bands): Green Day; Oasis; Rascal Flatts; Incubus;
    Fear and Loathing in Las Vegas; Fall Out Boy

Table 6: Top weights with highest coefficient for four neurons in the deep learning algorithm for
predicting a liberal mindset on the Facebook data set. Each neuron captures additional fine-grained
information.
Taking into account that the deep learning con-
figurations were randomly selected, one can imagine
the improvements possible when finetuning these ar-
chitectures. It is important to note that although
deep learning seems to capture what any of the wide
classification techniques grasp, the marginal improve-
ments of more complex models overall are quite small.
The results, however, do show that deep learning is
not recommended in problem settings where data is
characterized by low signal-from-noise separability.
A second important finding is the fact that a genera-
tive pretraining phase does not improve performance
in this discriminative setting. We conjecture that,
unlike in object recognition tasks, the input varia-
tions learned through the pretraining phase do not
contribute in terms of the final classification task.
This can originate from an inappropriate underlying
generative process which is not tailored towards hu-
man behavioral data in relation to the subsequent
prediction task. Another explanation that we put
forward is that the unsupervised learning can not
efficiently grasp the variations in the complex data
without taking into account the target variable.
Future research could focus on developing an
unsupervised generative process which better fits the
process of human-generated behavior. Alternatively,
since the target variable is often not clearly related
to the input data, it seems worthwhile to investigate
a supervised pretraining step such that input data
variations can be explicitly related to the variable
of interest.
The comparative analysis stresses that the performance
of deep learning classifiers is not attributable to
the additional complexity. The hierarchical,
distributed representations learned by the deep models
demonstrate that predictive power may result from
these representations. Although not all neurons are
easily interpretable and domain knowledge is often
needed, one can see that more nuances are captured in
comparison to local feature representation models.
These nuances are related to a compositional hierarchy
of the features in relation to the target variable.
Taking into account the inherent complexity of
behavioral data, the distributed representations
contribute to better predictive performance and
provide decision makers with insight and intuitions
into the data on the lowest hierarchy level. A broad
avenue for further research lies in the interpretation
of the higher-level layers consisting of multiple
combinations of lower-level concepts.

Data set               DL      Feature selection  Feature engineering
BookCrossing           54.37   52.24              51.60
Facebook swl           54.21   51.64              50.87
Banking                64.33   55.58              55.59
TaFeng                 64.43   60.62              63.32
YahooMovies age        61.44   57.91              55.13
Fraud                  72.9    53.49              58.40
A-Card Goto MAS        73.60   69.37              71.50
Car                    61.43   54.99              58.65
MovieLens adventure    50.09   57.50              54.54
MovieLens 100k gender  72.83   61.55              65.35
Ecommerce              75.05   57.12              58.15
MovieLens crime        56.78   52.71              51.92
Facebook female gay    63.09   60.56              59.88
MovieLens fantasy      67.09   50.00              56.28
MovieLens mystery      80.44   78.55              79.77
YahooMovies gender     77.35   66.89              70.86
MovieLens romance      73.1    55.74              50.70
A-Card Goto Permeke    63.28   62.64              62.95
MovieLens thriller     85.40   72.22              72.93
MovieLens drama        58.89   54.11              58.34
KDD2015                60.67   62.99              61.64
MovieLens children     73.72   70.76              69.72
A-Card defect          81.26   80.13              82.69
MovieLens 1m gender    78.52   74.10              78.14
A-Card Goto Wezenberg  54.68   52.76              53.70
A-Card Goto Roma       81.05   72.33              74.31
MovieLens comedy       57.28   59.19              54.38
Flickr                 73.46   54.60              57.13
MovieLens action       59.53   59.49              55.62
MovieLens animation    70.41   62.67              60.69
A-Card Goto Zoo        82.85   71.27              72.06
MovieLens scifi        87.19   50.00              68.07
MovieLens 100k age     84.91   75.95              81.61
MovieLens documentary  63.61   54.52              59.09
MovieLens western      82.87   67.21              57.44
MovieLens 1m age       86.57   82.00              85.04
A-Card cashout         89.55   80.87              83.41
Facebook religion      82.33   75.44              78.09
MovieLens horror       69.26   58.50              58.00
Facebook male gay      82.44   81.09              81.32
MovieLens filmnoir     59.51   62.25              53.52
Facebook political     77.60   78.22              77.11
Facebook IQ            85.99   80.33              62.09
MovieLens musical      77.50   76.58              86.64
Facebook gender        88.54   62.89              61.23
MovieLens war          90.97   54.02              53.97
LibimSeTi              97.24   97.22              97.20

Table 7: AUC reached by transforming the original high-dimensional data sets onto a reduced feature
set by taking the first-layer neurons (DL), by applying feature selection with L1 regularization,
and by applying feature engineering with SVD.
On a more practical level, we have tried to provide
insight into the implementation characteristics of
deep learning. The multitude of hyperparameters can
be overwhelming for researchers and practitioners
when applying deep learning to this or any new
application. Two hyperparameter choices are found to
have a consistent effect. First, dense
representations throughout the models, obtained more
specifically by employing tanh activation functions,
achieve better predictive performance. Second,
pretraining does not achieve better predictive
performance. One of the main drawbacks of applying
deep learning is its computational and
implementation complexity. This hurdle might
currently render deep learning infeasible in a
practical setting, as training on very fine-grained
data sets often takes multiple days. However, as the
availability of GPUs on cloud computing instances
becomes more widespread and Theano’s support for
multi-GPU use becomes less experimental, training
times could be reduced substantially.
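To illustrate what a dense representation means in this context, the sketch below passes one sparse binary instance through a single hidden layer and compares tanh with a rectifier activation. It is only an illustration under stated assumptions: the weights are random placeholders rather than learned parameters, and the layer sizes, the helper names `layer`, `relu` and `fraction_zero`, and the number of active features are all hypothetical.

```python
import math
import random


def layer(active_inputs, weights, biases, activation):
    """Activations of one fully connected layer for a sparse binary input.

    active_inputs: indices of the features that equal 1 for this instance.
    weights[j][i]: weight from input feature i to hidden neuron j.
    """
    return [
        activation(b + sum(w[i] for i in active_inputs))
        for w, b in zip(weights, biases)
    ]


def relu(z):
    """Rectifier activation, included only for comparison with tanh."""
    return max(0.0, z)


def fraction_zero(values):
    """Share of activations that are exactly zero (a simple sparsity measure)."""
    return sum(1 for v in values if v == 0.0) / len(values)


random.seed(0)
n_inputs, n_hidden = 1000, 50   # hypothetical sizes
# Random placeholder weights; in a trained network these would be learned.
weights = [[random.gauss(0.0, 0.1) for _ in range(n_inputs)]
           for _ in range(n_hidden)]
biases = [0.0] * n_hidden
# One instance with 20 active behavioral features out of 1000.
active_inputs = random.sample(range(n_inputs), 20)

tanh_features = layer(active_inputs, weights, biases, math.tanh)
relu_features = layer(active_inputs, weights, biases, relu)

tanh_sparsity = fraction_zero(tanh_features)   # dense: no exact zeros expected
relu_sparsity = fraction_zero(relu_features)   # sparse: negative inputs map to 0
```

With tanh, every hidden neuron yields a nonzero value in (-1, 1), so the instance is described by all neurons at once, whereas the rectifier sets every neuron with a negative pre-activation to exactly zero.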
References
[1] Y. Bengio. Learning deep architectures for AI.
Foundations and Trends in Machine Learning,
2(1):1–127, 2009.
[2] J. Bergstra and Y. Bengio. Random search for
hyper-parameter optimization. Journal of Ma-
chine Learning Research, 13(Feb):281–305, 2012.
[3] D. Chicco, P. Sadowski, and P. Baldi. Deep au-
toencoder neural networks for gene ontology an-
notation predictions. In Proceedings of the 5th
ACM Conference on Bioinformatics, Computa-
tional Biology, and Health Informatics, pages
533–540. ACM, 2014.
[4] D. Ciresan, U. Meier, and J. Schmidhuber.
Multi-column deep neural networks for image
classification. In Computer Vision and Pat-
tern Recognition (CVPR), 2012 IEEE Confer-
ence on, pages 3642–3649. IEEE, 2012.
[5] D. C. Ciresan, U. Meier, J. Masci,
L. Maria Gambardella, and J. Schmidhu-
ber. Flexible, high performance convolutional
neural networks for image classification. In IJ-
CAI Proceedings-International Joint Conference
on Artificial Intelligence, volume 22, page 1237.
Barcelona, Spain, 2011.
[6] J. Clark and F. Provost. Matrix-factorization-
based dimensionality reduction in the predic-
tive modeling process: a design science perspec-
tive. Technical report, Department of Infor-
mation, Operations, and Management Sciences,
New York University, USA, 2016.
[7] C. Cortes and V. Vapnik. Support-vector
networks. Machine Learning, 20(3):273–297, 1995.
[8] G. E. Dahl, N. Jaitly, and R. Salakhutdinov.
Multi-task neural networks for QSAR predictions.
arXiv preprint arXiv:1406.1231, 2014.
[9] S. De Cnudde and D. Martens. Loyal to your
city? a data mining analysis of a public ser-
vice loyalty program. Decision Support Systems,
73:74–84, 2015.
[10] S. De Cnudde, D. Martens, T. Evgeniou, and
F. Provost. A benchmarking study of classifica-
tion techniques for behavioral data. 2017.
[11] L. Deng. Three classes of deep learning architec-
tures and their applications: a tutorial survey.
APSIPA transactions on signal and information
processing, 2012.
[12] D. Erhan, Y. Bengio, A. Courville, P.-A.
Manzagol, P. Vincent, and S. Bengio. Why
does unsupervised pre-training help deep learn-
ing? Journal of Machine Learning Research,
11(Feb):625–660, 2010.
[13] T. Fawcett. An introduction to ROC analysis.
Pattern Recognition Letters, 27(8):861–874, 2006.
[14] X. Glorot and Y. Bengio. Understanding the
difficulty of training deep feedforward neural
networks. In AISTATS, volume 9, pages 249–256,
2010.
[15] X. Glorot, A. Bordes, and Y. Bengio. Deep
sparse rectifier neural networks. In Proceedings
of the Fourteenth International Conference on
Artificial Intelligence and Statistics, pages 315–
323, 2011.
[16] M. S. Gonul, D. Onkal, and M. Lawrence. The
effects of structural characteristics of
explanations on use of a DSS. Decision Support
Systems, 42(3):1481–1493, 2006.
[17] S. Gregor and I. Benbasat. Explanations from
intelligent systems: Theoretical foundations and
implications for practice. MIS quarterly, pages
497–530, 1999.
[18] D. J. Hand. Classifier technology and the illusion
of progress. Statistical science, 21(1):1–14, 2006.
[19] G. Hinton. A practical guide to training
restricted Boltzmann machines. Momentum,
9(1):926, 2010.
[20] G. E. Hinton. Distributed representations. 1984.
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh. A
fast learning algorithm for deep belief nets. Neu-
ral computation, 18(7):1527–1554, 2006.
[22] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero,
and L. Heck. Learning deep structured semantic
models for web search using clickthrough data. In
Proceedings of the 22nd ACM International
Conference on Information & Knowledge Management,
pages 2333–2338. ACM, 2013.
[23] J. Bergstra, O. Breuleux, F. Bastien,
P. Lamblin, and R. Pascanu. Theano: a CPU and GPU
math expression compiler. In Proceedings of the
Python for Scientific Computing Conference (SciPy).
[24] E. Junque de Fortuny, D. Martens, and
F. Provost. Predictive modeling with big data:
is bigger really better? Big Data, 1(4):215–226,
2013.
[25] E. Junque de Fortuny, D. Martens, and
F. Provost. Wallenius naive Bayes. 2013.
[26] E. Junque de Fortuny, M. Stankova, J. Moeyer-
soms, B. Minnaert, F. Provost, and D. Martens.
Corporate residence fraud detection. In In-
ternational Conference on Knowledge Discovery
and Data Mining (SIGKDD), pages 1650–1659.
ACM, 2014.
[27] E. Junque de Fortuny, M. Stankova, J. Moeyer-
soms, B. Minnaert, F. Provost, and D. Martens.
Corporate residence fraud detection. In Pro-
ceedings of the 20th ACM SIGKDD interna-
tional conference on Knowledge discovery and
data mining, pages 1650–1659. ACM, 2014.
[28] M. Kosinski, D. Stillwell, and T. Graepel.
Private traits and attributes are predictable from
digital records of human behavior. Proceedings of
the National Academy of Sciences,
110(15):5802–5805, 2013.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional
neural networks. In Advances in neural informa-
tion processing systems, pages 1097–1105, 2012.
[30] Y. LeCun, Y. Bengio, and G. Hinton. Deep
learning. Nature, 521(7553):436–444, 2015.
[31] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R.
Muller. Efficient backprop. In Neural networks:
Tricks of the trade, pages 9–48. Springer, 2012.
[32] K. Li and T. C. Du. Building a targeted mobile
advertising system for location-based services.
Decision Support Systems, 54(1):1–8, 2012.
[33] J. Liu, P. Dolan, and E. R. Pedersen. Person-
alized news recommendation based on click be-
havior. In International Conference on Intelli-
gent User Interfaces (IUI), pages 31–40. ACM,
2010.
[34] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl,
and V. Svetnik. Deep neural nets as a method
for quantitative structure–activity relationships.
Journal of chemical information and modeling,
55(2):263–274, 2015.
[35] D. Martens, B. Baesens, T. Van Gestel, and
J. Vanthienen. Comprehensible credit scoring
models using rule extraction from support vec-
tor machines. European journal of operational
research, 183(3):1466–1476, 2007.
[36] D. Martens and F. Provost. Explaining data-
driven document classifications. MIS Quarterly,
38(1), 2014.
[37] D. Martens, F. Provost, J. Clark, and E. J.
de Fortuny. Mining massive fine-grained behav-
ior data to improve predictive analytics. MIS
quarterly, 40(4), 2016.
[38] A. Y. Ng. Feature selection, L1 vs. L2
regularization, and rotational invariance. In
Proceedings of the twenty-first international
conference on Machine learning, page 78. ACM, 2004.
[39] A. Y. Ng and M. I. Jordan. On discriminative vs.
generative classifiers: A comparison of logistic
regression and naive Bayes. Advances in Neural
Information Processing Systems (NIPS), 14:841,
2002.
[40] C. Perlich, F. Provost, and J. S. Simonoff. Tree
induction vs. logistic regression: A learning-
curve analysis. Journal of Machine Learning Re-
search, 4(Jun):211–255, 2003.
[41] B. Ramsundar, S. Kearnes, P. Riley, D. Webster,
D. Konerding, and V. Pande. Massively multi-
task networks for drug discovery. arXiv preprint
arXiv:1502.02072, 2015.
[42] M. Rokeach. Understanding human values. Si-
mon and Schuster, 2008.
[43] R. Salakhutdinov, A. Mnih, and G. Hinton.
Restricted Boltzmann machines for collaborative
filtering. In Proceedings of the 24th international
conference on Machine learning, pages 791–798.
ACM, 2007.
[44] G. Shmueli. Research dilemmas with behavioral
big data. Big data, 5(2):98–119, 2017.
[45] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence
to sequence learning with neural networks. In
Advances in neural information processing sys-
tems, pages 3104–3112, 2014.
[46] A. Van den Oord, S. Dieleman, and
B. Schrauwen. Deep content-based music
recommendation. In Advances in neural infor-
mation processing systems, pages 2643–2651,
2013.
[47] W. Verbeke, D. Martens, and B. Baesens. Social
network analysis for customer churn prediction.
Applied Soft Computing, 14(3):431–446, 2014.
[48] A. Vieira. Predicting online user behaviour us-
ing deep learning algorithms. arXiv preprint
arXiv:1511.06247, 2015.
[49] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding. Data
mining with big data. Transactions on Knowl-
edge and Data Engineering, 26(1):97–107, 2014.