

DEPARTMENT OF ENGINEERING MANAGEMENT

An exploratory study towards applying and demystifying deep learning classification on behavioral big data

Sofie De Cnudde, David Martens & Foster Provost

UNIVERSITY OF ANTWERP Faculty of Applied Economics City Campus

Prinsstraat 13, B.226

B-2000 Antwerp

Tel. +32 (0)3 265 40 32

Fax +32 (0)3 265 47 99

www.uantwerpen.be


FACULTY OF APPLIED ECONOMICS

DEPARTMENT OF ENGINEERING MANAGEMENT

An exploratory study towards applying and demystifying deep learning classification on behavioral big data

Sofie De Cnudde, David Martens & Foster Provost

RESEARCH PAPER 2018-002 JANUARY 2018

University of Antwerp, City Campus, Prinsstraat 13, B-2000 Antwerp, Belgium

Research Administration – room B.226

phone: (32) 3 265 40 32

fax: (32) 3 265 47 99

e-mail: [email protected]

The research papers from the Faculty of Applied Economics are also available at www.repec.org (Research Papers in Economics - RePEc)

D/2018/1169/002


An exploratory study towards applying and demystifying deep learning classification on behavioral big data

Sofie De Cnudde, David Martens, Foster Provost

Abstract

The superior performance of deep learning algorithms in fields such as computer vision and natural language processing has fueled increased interest in these algorithms in both research and practice. Ever since, many studies have applied these algorithms to other machine learning contexts with other types of data in the hope of achieving comparably superior performance. This study departs from the latter motivation and investigates the application of deep learning classification techniques to big behavioral data, comparing their predictive performance with 11 widely-used shallow classifiers. In addition to the application on a new type of data and a structured comparison of its performance with commonly-used classifiers, this study attempts to shed light on when and why deep learning techniques perform better. Regarding the specific characteristics of applying deep learning on this unique class of data, we demonstrate that an unsupervised pretraining step does not improve classification performance and that a tanh nonlinearity achieves the best predictive performance. The results from applying deep learning on 15 big behavioral data sets show results as good as or better than those of traditionally-used, shallow classifiers. However, no significant performance improvement can be recorded. Investigating when deep learning performs better, we find that worse performance is obtained for data sets with low signal-from-noise separability. In order to gain insight into why deep learning generally performs well on this type of data, we investigate the value of the distributed, hierarchical characteristic of the learning process. The neurons in the distributed representation seem to identify more nuances in the many behavioral features as compared to shallow classifiers. We demonstrate these nuances in an intuitive manner and validate them through comparison with feature engineering techniques. This is the first study to apply and validate the use of nonlinear deep learning classification on fine-grained, human-generated data while proposing efficient configuration settings for its practical implementation. As deep learning classification is often characterized as a black-box approach, we also provide a first attempt at disentangling when and why these techniques perform well.

Introduction

Over the last decade, the machine learning field has experienced increased attention towards deep learning techniques. Experimental research in this area has demonstrated significant improvements over conventional machine learning techniques in fields such as object recognition [4, 29, 5] and natural language processing [45, 22]. Deep learning techniques originate from representation learning, where raw data is fed in and models automatically detect efficient, distributed representations for either pattern detection or classification [30]. By stacking nonlinear functions, hidden and complex data patterns can be detected without human intervention. This attention can be attributed to (1) an immense increase in available data, (2) increased chip processing capabilities and the use of GPUs for faster parallel computations, (3) decreasing costs of hardware, and (4) advances in machine learning research regarding these deep networks [11]. Both the inspiration from neural brain activity and its theoretical aspects contribute to the attractiveness of deep neural networks.

The class of data which is the subject of our study is behavioral big data. As more and more aspects of people's lives migrate online, people leave massive trails of both active and passive footprints which are increasingly being recorded and quantified. Behavioral big data is thus becoming omnipresent, harboring major potential for predictive analysis [49]. We define behavioral big data following Shmueli [44]: very large and rich high-dimensional data on human actions and/or interactions. These form a testimony of an entity's behavior captured through fine-grained, modular features. Customer transactions with a bank, web surfers' web visiting behavior, mobile phone users' visited locations and Facebook like data are just a few examples where each unique account number, webpage, location or Facebook page corresponds to a feature. In previous research, this data has been shown to be very informative in a predictive setting and can reveal a person's personality traits [28], interest in banking products [37], interest in a news article [33], interest in a mobile ad [32], tendency to churn [47] or tendency to commit fraudulent activities [26].

Behavioral data, however, originates from complex and largely unknown underlying processes, which complicates its analysis [25, 44]. Moreover, this data is characterized by high dimensionality and sparsity. When modeling web surfers' web visiting behavior, the collection of all possible webpages one can visit is huge, resulting in very large, high-dimensional data sets. Also, the limit on so-called behavioral capital [24] implies that among all possible webpages, a person can only visit a limited number due to restrictions such as time or money. The latter results in highly sparse data.

Traditionally, machine learning research for very high-dimensional data, such as textual data, mostly employs shallow models [11], which contain one layer transforming the raw input features in a linear or nonlinear fashion onto a specific feature space. Examples of such techniques are linear or kernelized support vector machines [7]. These shallow classifiers have proven successful on machine learning problems originating from a wide variety of applications and for both small and large predictive problems. However, when confronted with data from complex real-world problems such as human behavior, many researchers have raised the question whether the shallow design of these techniques suffices [1, 11] and whether a deep architecture can provide a significant performance improvement. Moreover, these techniques consider the data features from a local perspective throughout the learning process [20]. Each feature is represented by a one-hot encoded vector, implying that each is equally different from all other features. In many real-world high-dimensional settings, this local view does not hold, and exploiting the distributed nature of the features allows for much better generalization towards unseen, complex instances.

From social science research, behavioral data is suspected to contain complex, distributed and hierarchical relations between its features [20, 42]. When capturing people's movie preferences in a behavioral data set, each movie is represented by exactly one feature in a one-to-one relation (local representation). The main disadvantage of this approach is that two movies targeted towards a younger audience (such as, for example, Toy Story and Indiana Jones) are considered equally similar (or different) from one another as they are considered similar (or different) from R-rated movies such as Saw or The Blair Witch Project. A distributed representation transforms the raw features through many-to-many relations onto a new representation which is able to capture more fine-grained similarities between the independent features. Regarding hierarchical relations between features, on the lowest level, a user can be represented by each movie he or she rates. On a higher level, a subset of movies can be telling of a person's low-level interests. One group of movies (which is naturally represented by a neuron in the framework of deep learning) can for example detect whether a user has an affinity with LGBT movies, whether he has an interest in sports, or whether his favorite director is Alfred Hitchcock. On an even higher level, these interests can reveal political preferences or religious beliefs. For example, a person watching movies such as Brokeback Mountain or Cowspiracy can be considered to have a more liberal mindset. This line of thinking follows the value-attitude-behavior model [42] used in behavioral science, which states that people's social cognitions resulting in actions are organized in a compositional structure. Values are one's stable beliefs on the highest level of abstraction and are constructed of basic beliefs, which in turn give rise to value orientations. The latter influence a person's attitude, which finally leads to concrete human behavior. The parallel between this cognition hierarchy and the hierarchical nature of representation learning could ideally help provide insight into general human behavior or strengthen hypotheses formulated in social science research. Moreover, interpreting this hierarchy provides us with means to frame deep learning's performance on this type of data.

Since the initial successes of deep learning in computer vision and natural language processing, deep learning has also been successfully applied to other types of high-dimensional data such as toxicological data [8, 41], biomedical data [3] and recommender systems [43, 46]. Due to these many successes, one might wonder whether this superior performance also extends to other fields in predictive machine learning research [14]. In traditional research fields, theory precedes experiments and experimental results can often be clearly linked to theoretical principles. The main drawback of deep learning research, however, is the lack of full theoretical knowledge regarding why these models work so well [11]. The goal of this work is twofold. First, we investigate whether superior results can be reached for behavioral big data, a class of data becoming increasingly common in the machine learning field. Second, this work attempts to get a high-level, preliminary insight into the link between classifier performance and the reasons behind this performance for human behavioral data.

In previous research specifically aimed towards high-dimensional behavioral data, deep learning has been applied to movie preference data [43] and e-commerce activity data [48]. However, in both studies, the inherent high-dimensionality of these data sets is reduced. The first study considers absent behavior as missing data and subsequently omits the corresponding missing features. In the second study, the fine-grained data is preprocessed with dimensionality reduction techniques.

This work attempts to contribute in four ways. First, this study presents, to the best of our knowledge, the first demonstration and validation of the usefulness of deep learning for a novel type of data, namely behavioral data. Second, we analyze in a structured manner the performance of deep learning techniques on large behavioral data and compare it with 11 conventional shallow methods to assess whether the former can provide significant performance improvements for these data sets. Third, we provide guidance for researchers and practitioners regarding the practical implementation details that characterize these architectures. Finally, the study validates additional insights related to the hierarchical learning process of these techniques, thereby gaining insight into how they learn.

The rest of this paper is structured as follows. First, the deep-learning approach is set out, followed by a description of the experimental set-up in which attention is given to the hyper-parameter selection step. We subsequently present and discuss the results. Finally, we conclude our work and present avenues for further research.

Deep Learning

When using deep learning for classification purposes, we have a data set $D$ which is an indexed set of $n$ data points $x_i$ with $n$ corresponding labels $y_i$, with $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$ for $i = 1, \ldots, n$. A classification algorithm $A$ takes a subset of the data set $D$, called a training set $D_{\text{train}} = (X_{\text{train}}, Y_{\text{train}})$, as input and learns a hypothesis function $f$ as an approximation of the true data distribution:

$$f = A_\lambda(X_{\text{train}}, Y_{\text{train}}) \qquad (1)$$

with model hyperparameters $\lambda$. In the specific context of deep learning, the number of layers $L$ is an example of a hyperparameter. The hyperparameters $\lambda$ are learned on a separate subset of $D$ referred to as a validation set $D_{\text{val}} = (X_{\text{val}}, Y_{\text{val}})$. The classification model learns the hypothesis $f$ by minimizing a loss function $L$ over a set of parameters $\theta$:

$$\min_\theta L(f(X_{\text{train}}), Y_{\text{train}}). \qquad (2)$$

The final classification performance is tested on a separate test set $D_{\text{test}} = (X_{\text{test}}, Y_{\text{test}})$.
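To make the formulation in Eqs. (1)-(2) concrete, the following minimal Python sketch splits a toy data set into training, validation and test parts, selects between two hypothetical values of the hyperparameters on the validation set, and reports test AUC. The synthetic data, the use of scikit-learn's MLPClassifier and all variable names are illustrative assumptions; the experiments in this paper were run with Theano.

```python
# Minimal sketch of Eqs. (1)-(2): learn f on D_train, pick hyperparameters
# lambda on D_val, report performance on D_test. Illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = (rng.rand(2000, 500) < 0.02).astype(float)       # sparse binary "behavior" matrix
y = (X[:, :10].sum(axis=1) > 0).astype(int)           # toy target variable

# D split into D_train, D_val and D_test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_auc, best_f = -np.inf, None
for lam in [(64,), (64, 32)]:                          # two candidate values of lambda
    f = MLPClassifier(hidden_layer_sizes=lam, max_iter=200, random_state=0)
    f.fit(X_tr, y_tr)                                  # minimise the loss over theta (Eq. 2)
    auc = roc_auc_score(y_val, f.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_f = auc, f

print("test AUC:", roc_auc_score(y_te, best_f.predict_proba(X_te)[:, 1]))
```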

The deep learning model (of which a concrete example is visualized in Figure 1) minimizes the loss function $L$ over its parameters $\theta$ (consisting of weight parameters $W$ and bias parameters $b$) through back-propagation used in conjunction with stochastic gradient descent [19]. The back-propagation training algorithm works in two stages: (1) propagation and (2) weight update. Starting with randomly chosen small initial values for the weights $W$ and the biases $b$ for each hidden layer $h$, a batch of training samples is inputted to the model (forward pass). The model computes the corresponding activations and the output values and compares them to the real labels. Next, the gradient of the error is used in a backward pass to update the model's parameters $W$ and $b$ for each layer at iteration $t$:

$$W_h(t+1) = W_h(t) + \alpha \frac{\partial L}{\partial W_h} + \xi(t) \qquad (3)$$

$$b_h(t+1) = b_h(t) + \alpha \frac{\partial L}{\partial b_h} + \xi(t), \qquad (4)$$

with $\alpha$ a learning rate and $\xi$ an error component. Using stochastic gradient descent implies that the loss function $L$ must be differentiable. The negative log likelihood minimizes the negative likelihood of the correct class and is defined as

$$\text{NLL} = -\sum_{i=0}^{n} \log P(Y = y_i \mid x_i, \theta). \qquad (5)$$

Figure 1: Example of a deep learning network with 2 hidden layers and their associated bias vectors b and weight matrices W.
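As an illustration of the two-stage procedure (forward pass, then gradient-based weight update) and of Eqs. (3)-(5), the sketch below performs one backpropagation step for a toy network with a single hidden layer. All names and sizes are assumptions made for the example; the update subtracts the gradient, which is the usual direction when minimizing the loss, and the noise term xi is omitted.

```python
# Hedged numpy sketch of one backpropagation step for a one-hidden-layer network
# with a logistic output. Parameter names (W1, b1, W2, b2) are illustrative.
import numpy as np

rng = np.random.RandomState(0)
n, m, h = 64, 100, 16                      # batch size, input dim, hidden units
X = rng.rand(n, m); y = rng.randint(0, 2, size=n)

W1 = rng.uniform(-0.1, 0.1, (m, h)); b1 = np.zeros(h)
W2 = rng.uniform(-0.1, 0.1, (h, 1)); b2 = np.zeros(1)
alpha = 0.1                                # learning rate

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# forward pass
H = sigmoid(X @ W1 + b1)                   # hidden activations
p = sigmoid(H @ W2 + b2).ravel()           # P(Y=1 | x, theta)
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. (5), averaged over the batch

# backward pass: gradients of the NLL with respect to each parameter
d_out = (p - y)[:, None] / n               # dL/d(pre-activation of the output)
dW2 = H.T @ d_out;   db2 = d_out.sum(axis=0)
d_hid = (d_out @ W2.T) * H * (1 - H)       # chain rule through the sigmoid
dW1 = X.T @ d_hid;   db1 = d_hid.sum(axis=0)

# weight update (Eqs. 3-4, conventional descent direction, no noise term)
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
print("NLL after forward pass:", nll)
```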

The objective function (2) of the deep learning problem formulation comes with some difficulties. It defines a non-convex optimization space characterized by multiple local optima, contributing to an inherently complex learning process. Using a local optimization technique such as gradient descent may leave the model stuck in local optima, resulting in poor generalization performance. Moreover, with a large number of layers and many hidden units, the number of model parameters can be massive, which can result in overfitting.

A second issue faced by deep learning algorithms, despite their excessive modeling capability, is that they are often condemned to being a black-box approach due to the many layers and the nonlinearities in the model.

Materials and Methods

Deep learning architectures are characterized by a large number of hyper-parameters (HP) [19]. The hyper-parameters associated with a machine learning model $A$ are selected based on their performance on the separate out-of-sample validation set $(X_{\text{val}}, Y_{\text{val}})$. Two types of HPs can be distinguished: (1) hyper-parameters associated with the model, such as the number of hidden units per layer, and (2) hyper-parameters associated with the optimization algorithm, such as the learning rate of stochastic gradient descent. The hyper-parameter optimization problem can be modeled as follows:

$$\lambda_{\text{opt}} = \min_{\lambda_{hp}} L(f(X_{\text{val}}), Y_{\text{val}}), \qquad (6)$$

with $\lambda_{\text{opt}}$ the optimal HP set, which is found through cross-validation [2] over a separate validation set $X_{\text{val}}$. Each $\lambda_{hp}$ is a collection of values for each of the $K$ parameters: $(HP_1, \ldots, HP_K)$.

In order to select appropriate values for the HPs in an empirical setting, a structured method is needed to search the hyper-parameter space [2]. When deep learning models are trained in the literature, HPs are often either chosen based on researchers' experience in the specific field or through a structured or heuristic exploration of the HP space such as manual search, grid search or random search. Since we are training deep neural networks on a new and specific type of data, taking over parameter values originating from experimental research in other fields is not advisable. Moreover, different data sets are evaluated, for which clearly separately-optimized HPs should be used [2].

Bergstra and Bengio [2] state that, from the large collection of all hyper-parameters, mostly only a small subset is relevant (referred to as low effective dimensionality). Their study empirically and theoretically demonstrates that non-adaptive random hyper-parameter search is more efficient than manual search and/or grid search in high-dimensional search spaces. While a high-dimensional grid may give even coverage inside the grid, the coverage in the separate subspaces is less evenly distributed. In contrast, random search explores a wider range in each of the subspaces.

We use a random hyper-parameter search procedure [2] to determine 100 parameter configurations on a separate part of the data set $(X_{\text{val}}, Y_{\text{val}})$. Since these configurations are optimized for negative log likelihood, the 10 best configurations are chosen and their AUC is evaluated [13]. The specific values and ranges for each hyper-parameter are given in the next subsections. For each of the data sets, five-fold cross-validation is subsequently used on the remaining part of the data. During learning, progress is monitored by calculating the negative log likelihood on a separate part of the data set, which is used to prevent overfitting by early stopping. The AUC reported in our study is calculated on a separate test set afterwards.
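A minimal sketch of this search procedure is given below: 100 configurations are drawn at random, ranked by validation negative log likelihood, and the ten best are then scored on AUC. The helpers sample_configuration and fit_network are hypothetical placeholders standing in for the actual hyper-parameter draw and the Theano training routine.

```python
# Hedged sketch of the random hyper-parameter search described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

def sample_configuration():
    # stands in for the real hyper-parameter draw (layers, units, batch size, ...)
    return {"C": float(np.exp(rng.uniform(np.log(1e-3), np.log(1e3))))}

def fit_network(cfg, X_tr, y_tr):
    # placeholder model; the paper trains a deep network here
    return LogisticRegression(C=cfg["C"], max_iter=500).fit(X_tr, y_tr)

def negative_log_likelihood(model, X, y):
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def random_search(X_tr, y_tr, X_val, y_val, n_draws=100, keep=10):
    scored = []
    for _ in range(n_draws):
        cfg = sample_configuration()
        model = fit_network(cfg, X_tr, y_tr)
        scored.append((negative_log_likelihood(model, X_val, y_val), cfg, model))
    scored.sort(key=lambda t: t[0])                     # lower NLL is better
    # the kept configurations are subsequently compared on AUC
    return [(roc_auc_score(y_val, m.predict_proba(X_val)[:, 1]), cfg)
            for _, cfg, m in scored[:keep]]
```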

We run the random hyper-parameter search procedure and the actual learning algorithm on AWS g2.2xlarge instances with the Theano framework for deep learning [23]. These instances have 8 virtual CPUs, 15 GB RAM and one NVIDIA GRID GPU.

Unsupervised pre-training

Unsupervised pre-training was one of the main reasons for the increased attention towards deep learning in 2006, due to significant classification performance improvements [21]. This pre-training phase uses a stack of representation-learning algorithms (such as RBMs [19]) and separately optimizes the learned representation of each layer in an unsupervised fashion. The goal is to use the learned representation of the unsupervised stage for a classification task with the same input data distribution. The idea behind greedy layer-wise pre-training originates from the following two beliefs: (1) the initial parameters can have an effect on the quality of the parameter optimization process and subsequently influence model performance during model finetuning, and (2) the representation learned by a generative model on unlabeled input data can also be a good representation for a subsequent classification task [1].
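As a rough illustration of greedy layer-wise pre-training, the sketch below trains a small stack of Bernoulli RBMs with one step of contrastive divergence (CD-1) and feeds each layer's representation into the next. This is a toy construction under our own simplifying assumptions, not the pre-training code used in the experiments.

```python
# Hedged sketch of greedy layer-wise pre-training with Bernoulli RBMs (CD-1).
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_rbm(V, n_hidden, lr=0.05, epochs=5):
    """Train one RBM on binary input V and return (weights, hidden biases)."""
    n, d = V.shape
    W = rng.normal(0, 0.01, (d, n_hidden))
    b_h = np.zeros(n_hidden); b_v = np.zeros(d)
    for _ in range(epochs):
        # positive phase
        p_h = sigmoid(V @ W + b_h)
        h = (rng.rand(*p_h.shape) < p_h).astype(float)
        # negative phase (one Gibbs step)
        p_v = sigmoid(h @ W.T + b_v)
        p_h_neg = sigmoid(p_v @ W + b_h)
        # CD-1 parameter updates
        W += lr * (V.T @ p_h - p_v.T @ p_h_neg) / n
        b_h += lr * (p_h - p_h_neg).mean(axis=0)
        b_v += lr * (V - p_v).mean(axis=0)
    return W, b_h

# stack: the hidden representation of each RBM feeds the next one
X = (rng.rand(500, 200) < 0.05).astype(float)
layer_sizes, inp, init_params = [128, 64], X, []
for nh in layer_sizes:
    W, b = pretrain_rbm(inp, nh)
    init_params.append((W, b))      # used to initialise the supervised network
    inp = sigmoid(inp @ W + b)
```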

Previous research has demonstrated the immense value of pre-training a neural network [21, 12]. Erhan et al. [12] empirically analyze the effect of pre-training and conclude that it mainly acts as a regularizer, influencing the starting point of the supervised training step. However, several studies also find the opposite, e.g. in speech recognition and in chemical activity prediction [1, 34]. Also, the study by Erhan et al. [12] was performed before the widespread use of many modern deep learning-related concepts such as ReLUs [1] and only takes into account the relation with the number of layers. Consequently, unsupervised pre-training has recently been used mainly in the field of NLP.

In the random HP search, we randomly choose whether or not to pre-train the network through the use of a restricted Boltzmann machine (RBM) [19].

Number of layers and number of hidden units

Both the number of hidden layers and their number of units have an impact on the generalization capability of the network. A trade-off is needed between generalization on the one hand and overfitting on the other.

We randomly select a number of hidden layers between 1 and 8. For each layer, a number of hidden units is drawn log-uniformly between 18 and 2000. Both ranges are based on values described in Bergstra and Bengio [2].
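For illustration, one way to draw such an architecture, assuming a uniform draw over the number of layers and a log-uniform draw over the units as stated above, is:

```python
# Sketch of sampling an architecture: 1 to 8 hidden layers, each with a
# log-uniformly drawn number of units between 18 and 2000.
import numpy as np

rng = np.random.RandomState(0)

def sample_architecture():
    n_layers = rng.randint(1, 9)                         # uniform over 1..8
    low, high = np.log(18), np.log(2000)
    return [int(round(np.exp(rng.uniform(low, high)))) for _ in range(n_layers)]

print(sample_architecture())    # prints one sampled layer-size list
```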

Minibatch size

Choosing the size of the minibatch for stochastic gradient descent is a two-criterion trade-off influencing convergence and training time. When increasing the size of the batch, more efficient use can be made of the parallel matrix-matrix multiplications on the GPU. Although the gradient estimate becomes more reliable with larger minibatch sizes, fewer batches reduce the number of weight updates, thereby slowing convergence.

Moreover, closely approximating the true gradient by increasing the size of the minibatch might not be the best way to spend computation time in a non-convex optimization space. Instead, exploring the search space with frequent updates is a more appropriate strategy in this setting. Furthermore, a small minibatch size also acts as a regularizer, introducing noise when approximating the gradient through a lower number of training set instances [19].

Following [2, 19], the size of the minibatch is randomly chosen with equal probability between 20 and 100. Each epoch, each fold is randomly shuffled before minibatches are selected, in order to speed up convergence [19].
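A small sketch of this minibatch scheme, with an assumed training loop in which one SGD update would be performed per minibatch:

```python
# Sketch: batch size drawn uniformly in [20, 100]; instances reshuffled at the
# start of every epoch before minibatches are cut out.
import numpy as np

rng = np.random.RandomState(0)
batch_size = rng.randint(20, 101)

def iterate_minibatches(X, y, batch_size, rng):
    idx = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

X = rng.rand(1000, 50); y = rng.randint(0, 2, 1000)
for epoch in range(3):
    for xb, yb in iterate_minibatches(X, y, batch_size, rng):
        pass                                   # one SGD update per minibatch goes here
```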

Nonlinear activation function

The activation function of a neural network brings non-linearity into the learning procedure, empowering the model to learn distributed feature representations. When choosing an activation function, care must be taken regarding the saturation of the activations and overly linear behavior. If activations in the network become too saturated during learning, the gradients do not propagate well and units become inactive. If the functions behave in a linear fashion, complex interactions will not be grasped. We consider the three most commonly-used non-linearities, i.e. the sigmoid, the hyperbolic tangent and the rectified linear unit, of which the random HP selection set-up randomly selects one.

Figure 2: Nonlinear activation functions for input x (left: sigmoid, right: tanh, below: ReLU).

Sigmoid activation unit. The sigmoid activation unit performs the following function for a hidden neuron $j$ in layer $h$ on the input $x$ received from layer $(h-1)$: $\sigma(a) = \frac{1}{1+e^{-a}}$ with $a = w_{\cdot j}^T x + b_j^h$, where $x$ is the output of the previous layer (or the input features if it concerns the first hidden layer), $b_j^h$ is the bias associated with hidden layer $h$ and $w_{\cdot j}$ is the weight vector associated with hidden unit $j$.

The activation function curve is shown in Figure 2 (left). As can be seen, the non-linearity squeezes the input value $a$ into a $[0, 1]$ interval and is symmetric around 0.5. The interval limits have a nice intuitive interpretation: a neuron will not fire at all (0) or will fire at maximum frequency (1).

However, two major drawbacks are linked to its use. Firstly, the sigmoid nonlinearity suffers from the vanishing gradient problem: the characteristics of its gradient function lead to near-zero or zero gradients as it saturates at both tails towards 0 and 1. This influences the global gradient calculation during backpropagation and may kill the gradient, making it difficult for the parameters to be learned. Therefore, initial weights must be chosen carefully, as large absolute weights lead to highly-saturated neurons. The weights are initialized from the uniform distribution $\text{Uniform}(-r, r)$ with $r = 4\sqrt{6/(\text{fan-in} + \text{fan-out})}$, with fan-in and fan-out the number of incoming and outgoing connections respectively [19]. This initial setting activates the sigmoid in its linear region, also resulting in first learning the main linear behavior. Secondly, its non-zero mean and the computation of the exponential function both result in slow learning in highly-layered architectures [31].

Hyperbolic tangent unit. The hyperbolic tangent unit uses the following function: $\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$, with $a = w_{\cdot j}^T x + b_j^h$. The function is shown in Figure 2 (middle), demonstrating that the hidden neuron values are projected onto a $[-1, 1]$ interval. Its initial weights are set following the uniform distribution $\text{Uniform}(-r, r)$ with $r = 4\sqrt{6/(\text{fan-in} + \text{fan-out})}$, resulting in activating the tanh function in its linear region. The tanh function differs from the sigmoid activation in that its output is zero-centered; however, it is clear that saturating neurons also occur.

Rectified Linear Unit (ReLU). The Rectified Linear Unit function employs a zero-threshold hinge function: $\text{ReLU}(a) = \max(0, a)$ with $a = w_{\cdot j}^T x + b_j^h$. The advantages of the ReLU are linked to the disadvantages of the other two nonlinearities. Firstly, the gradient of a ReLU unit will either be constant ($a > 0$) or zero ($a \leq 0$). In the former case, the gradient never kills other gradients during backpropagation. With a careful selection of the initial weights ($\text{Uniform}(-r, r)$ with $r = 2/\text{fan-in}$), $a$ being zero does not result in the vanishing gradient problem as long as the gradient can propagate along some paths [15]. Moreover, when a fraction of the neurons is zero, sparse representations are favored throughout the network. A sparse deep neural network (1) has the advantage of learning robust models, (2) enables variable-size feature representations, and (3) results in sparse representations which are more easily linearly separable than dense higher-level feature representations.

The second advantage of the Rectified Linear Unit is the fact that no expensive operations are needed, and thus it results in a faster model compared to the other two activation functions. Figure 2 (right) shows the hinge function of the ReLU.
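For reference, the three candidate non-linearities and the initialization ranges quoted above can be written out as follows; the helper init_range is our own naming and simply reproduces the ranges as stated in the text.

```python
# The three candidate activation functions and their initialization ranges (sketch).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))            # squashes into [0, 1]

def tanh(a):
    return np.tanh(a)                          # zero-centred, squashes into [-1, 1]

def relu(a):
    return np.maximum(0.0, a)                  # zero-threshold hinge

def init_range(fan_in, fan_out, activation):
    if activation in ("sigmoid", "tanh"):
        return 4.0 * np.sqrt(6.0 / (fan_in + fan_out))   # Uniform(-r, r)
    return 2.0 / fan_in                                   # range given above for ReLU

rng = np.random.RandomState(0)
r = init_range(fan_in=500, fan_out=128, activation="tanh")
W = rng.uniform(-r, r, size=(500, 128))        # initial weight matrix for one layer
```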

Learning rate

The learning rate of the stochastic gradient descent optimization algorithm can influence convergence towards an optimum. If the learning rate is set too high, optima can be overshot and the loss function will increase. In contrast, too small a learning rate results in very slow convergence.

Our random hyper-parameter search algorithm draws an initial learning rate $\alpha_0$ log-uniformly between 0.0001 and 1.0, with an adaptive annealing scheme. If at iteration $t$ the previous solution is not found to be better than the current solution in the optimization algorithm (given a specific improvement threshold), $\alpha_t$ is set as follows: $\alpha_t = \frac{t \cdot \alpha_{t-1}}{\max(t, t+1)}$ [2].
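A sketch of this annealing rule is shown below; the per-iteration loss and the improvement threshold are assumed placeholder values used only to drive the example.

```python
# Sketch of the adaptive learning-rate annealing: alpha_0 drawn log-uniformly in
# [0.0001, 1.0]; when an iteration does not improve the objective enough, the
# rate is shrunk as alpha_t = t * alpha_{t-1} / max(t, t+1).
import numpy as np

rng = np.random.RandomState(0)
alpha = float(np.exp(rng.uniform(np.log(1e-4), np.log(1.0))))   # alpha_0

def anneal(alpha_prev, t):
    return t * alpha_prev / max(t, t + 1)

prev_loss, threshold = np.inf, 1e-3
for t in range(1, 100):
    loss = 1.0 / t                        # placeholder for the validation loss at iteration t
    if prev_loss - loss <= threshold:     # no sufficient improvement
        alpha = anneal(alpha, t)
    prev_loss = loss
```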

Analysis and Results

To get insight into the performance of deep learning techniques on behavioral data, three analyses are performed. First, we present a comparison with the performance of 11 traditional, shallow classifiers and test these for statistically significant differences. The goal is to provide a sound statement regarding whether and when deep techniques can reach superior performance for behavioral big data. Second, we analyze the influence of hyperparameter values on the classification performance of deep learning networks. The goal is to get insight into the influence of the learning network architecture, the characteristics of the optimization process and the unsupervised generative preprocessing phase. Third, we provide insight into why deep learning techniques perform well on behavioral big data. To this end, we focus on the representation-learning characteristic of the techniques and assess the value present in these representations. The results and interpretations of these analyses are presented in each of the next subsections.

Comparison with Shallow Techniques

To demonstrate the predictive performance of the deep learning approach on large-scale behavioral data, we evaluate its classification performance on a collection of behavioral data sets put forward as a benchmark in De Cnudde et al. [10]. We enrich this collection with data sets capturing people's likes on Facebook [28, 6].

The MovieLens and YahooMovies data sets, for which we predict the age and the gender of the users, contain data regarding which movies are rated by the users. The larger MovieLens10m data set predicts the genre of a movie based on the users rating it. This multi-class classification task is translated into 18 binary classification problems. The Ecommerce data set provides product viewing data on an e-commerce website from which the gender of the users is inferred. Shopping transactions in the TaFeng data set are used to predict users' age. In the BookCrossing data set, books are rated by the members of the BookCrossing community and, based on these ratings, the age of the users is predicted. The LibimSeTi data set contains ratings of dating profiles from which the gender of the users is inferred. In 2015, the KDD cup challenge KDD2015 constituted the prediction of MOOC dropout based on fine-grained course interaction data. In the A-Card data set, location visiting behavior from a government loyalty card is modeled and a prediction is made concerning whether a user will use collected benefits, whether he will stop using his loyalty card and whether he will visit one of five locations in the near future [9]. The Fraud data set contains financial banking transactions between Belgian and foreign companies, used to determine whether a company is involved in fraudulent transactions [27]. Customer banking transactions are modeled in Banking, from which a customer's interest in a banking product is predicted [37]. Next, customers' interest in an online car advertisement is predicted based on their website viewing behavior in the Car data set. In the Flickr data set, users tag pictures as their favorite, from which a picture's number of comments is predicted. Lastly, the Facebook data sets model users liking pages on Facebook, from which the following target variables are predicted: intelligence, religion (Christian vs. Muslim), satisfaction with life, political belief (liberal vs. conservative), gay for male and female users, and gender. Table 1 lists the main characteristics of these data sets and, as one can observe, the dimensionality and the sparsity are both extremely high.

In [10], a structured comparative study of 11 widely-used shallow classification techniques on behavioral data sets is performed; the techniques are variations of support vector machines, naive Bayes, logistic regression and a relational classifier called PSN. For details regarding the techniques, their parameters and the results, we refer to [10].

Table 2 depicts the results taken from [10], complemented with the results from applying deep learning (DL). For each data set, the best wide classifier is underlined and the best overall classifier is denoted in boldface. At the bottom, the table also gives the average rank of the techniques (with the best-performing technique receiving rank 1) and the number of wins in terms of performance. Table 3 depicts the best hyper-parameter configuration for which the results with deep learning are reached. Note that this HP configuration is not necessarily the best possible configuration for that data set. It is the best-performing one among the 100 random configurations that are tested for each data set and, obviously, finetuning these settings could lead to even better classification performance.

Data set | Target variable | Instances | Features | Active elements | Sparsity
MovieLens100k | age/gender | 943 | 1,682 | 100,000 | 93.6953%
MovieLens1m | age/gender | 6,040 | 3,883 | 1,000,209 | 95.7353%
Facebook | IQ | 6,124 | 128,787 | 850,181 | 99.8922%
YahooMovies | age/gender | 7,642 | 106,363 | 221,330 | 99.9727%
MovieLens10m | genre | 10,681 | 69,878 | 10,000,053 | 98.6602%
Ecommerce | gender | 15,000 | 21,880 | 33,455 | 99.9898%
Facebook | religion | 4,913 | 128,787 | 388,973 | 99.9385%
Facebook | satisfaction with life | 30,138 | 128,787 | 2,873,825 | 99.9260%
TaFeng | age | 31,640 | 23,719 | 723,449 | 99.9036%
Facebook | political belief | 38,099 | 128,787 | 2,214,178 | 99.9549%
BookCrossing | age | 167,175 | 337,921 | 838,364 | 99.9985%
LibimSeTi | gender | 137,806 | 220,970 | 15,656,500 | 99.9486%
KDD2015 | MOOC dropout | 120,542 | 5,891 | 1,919,150 | 99.7300%
Facebook | gay (male) | 172,191 | 128,787 | 11,171,527 | 99.9496%
A-Card | loyalty/go to | 177,761 | 2,448 | 435,244 | 99.9000%
Facebook | gay (female) | 214,130 | 128,787 | 16,546,926 | 99.9400%
Facebook | gender | 386,321 | 128,787 | 27,718,453 | 99.9443%
Fraud | fraudulent | 858,131 | 107,345 | 1,955,912 | 99.9979%
Banking | interest in product | 1,204,726 | 3,192,554 | 20,914,516 | 99.9995%
Car | interest in ad | 9,108,905 | 2,936,810 | 65,464,708 | 99.9998%
Flickr | comments | 11,195,144 | 497,472 | 34,645,469 | 99.9994%

Table 1: Behavioral data sets.

Table 2: Predictive performance of the models in terms of AUC for binary behavioral data sets (columns: DL, MN-NB, MV-NB, LA-SVM-L2, LS-SVM-L1, LS-SVM-L2, PSN, LR-BGD-L1, LR-BGD-L2, LR-SGD-L1, LR-SGD-L2, RBF-SVM; one row of AUC values per data set, followed by the average ranking and the number of wins of each technique). The highest-achieved performance for a data set is indicated in boldface. The best-achieved performance by the shallow classifiers is underlined. (MN-NB = multinomial naive Bayes, MV-NB = multivariate naive Bayes, LA-SVM-L2 = support vector machine with least absolute loss function and L2 regularization, LS-SVM-L1 = support vector machine with least squared loss function and L1 regularization, LS-SVM-L2 = support vector machine with least squared loss function and L2 regularization, PSN = relational classifier, LR-BGD-L1 = logistic regression with batch gradient descent and L1 regularization, LR-BGD-L2 = logistic regression with batch gradient descent and L2 regularization, LR-SGD-L1 = logistic regression with stochastic gradient descent and L1 regularization, LR-SGD-L2 = logistic regression with stochastic gradient descent and L2 regularization, RBF-SVM = support vector machine with RBF kernel.)

From Table 2 it is clear that, overall, deep learning learns at least as well as or better than any of the best shallow techniques. Its average ranking (1.68) is much lower than that of any of the other classifiers and it achieves twice the number of wins compared to the best-performing shallow technique. For data sets with low signal-from-noise separability (from BookCrossing to YahooMovies age in Table 2), however, deep learning consistently results in lower classification performance. We validate this finding with the signal-from-noise separability (SNS) threshold of approximately 83% given in [40]. Figure 3 plots the AUC of deep learning against the highest AUC reached by any of the shallow classifiers (BEST). Data points under the diagonal are data sets for which DL performs better than the best wide technique. It can be seen that, overall, for data sets with an AUC lower than approximately 83%, the shallow classifiers perform better than deep learning. When looking at the deep learning literature, we indeed find that the best performance gains are reached for high-SNS data sets, where the signal-from-noise separability of the problem is inherently larger (such as for object and speech recognition).

In line with what is described in [18], the absolute performance improvement of DL over any of the other classifiers generally is not that high. The additional complexity from deep learning does not result in finding new and unseen relations in the data, as these already seem to be accurately described by the shallow classifiers (BEST). However, remarkably, in the cases where DL performs better, it seems to consistently be able to capture what has been learned under the assumptions made by the other techniques. We therefore perform a head-to-head comparison between DL and the best traditional classification technique for each data set (BEST). This seems fair, as applying DL resulted from an elaborate hyperparameter optimization step. Choosing the best shallow classifier for each data set is similar to selecting the best DL architectural parameters and can give a better answer regarding whether deep classifiers really are superior to wide classifiers. We use the Wilcoxon signed-rank test to compare DL with BEST and find a p-value of 0.82 at a significance level α = 0.05. Although the total number of data sets for which DL performs better is larger, three low-SNS data sets from the random sample (Ecommerce, Banking and Facebook female gay) make the comparison not statistically significant. Thus, the insignificance of the performance improvement can be attributed to (1) the additional complexity of deep learning merely resulting in marginal performance improvements [18] which do not contribute to statistical power, and (2) DL performing worse for low-SNS data.
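For completeness, the head-to-head comparison described above can be reproduced with a standard Wilcoxon signed-rank test as sketched below; the AUC arrays shown are hypothetical placeholders, not the values from Table 2.

```python
# Hedged sketch of the DL vs. BEST head-to-head comparison using the
# Wilcoxon signed-rank test over per-data-set AUC values.
import numpy as np
from scipy.stats import wilcoxon

auc_dl   = np.array([0.78, 0.85, 0.91, 0.66, 0.89])   # hypothetical placeholder values
auc_best = np.array([0.77, 0.86, 0.90, 0.68, 0.88])

stat, p_value = wilcoxon(auc_dl, auc_best)
print("p-value:", p_value, "-> significant at alpha=0.05:", p_value < 0.05)
```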

Data set | Number of layers and hidden units per layer | Nonlinearity | Learning rate α0
BookCrossing | [18, 72, 233, 175, 37] | ReLU | 0.43858
Facebook swl | [654, 90] | ReLU | 0.16641
Banking | [17, 32] | ReLU | 0.25348
TaFeng | [194] | tanh | 0.04343
YahooMovies age | [736, 1392] | ReLU | 0.84896
Fraud | [125, 212, 178] | tanh | 0.14347
A-Card Goto MAS | [71, 1470, 174, 51] | sigmoid | 0.54500
Car | [22, 43] | ReLU | 0.67745
MovieLens adventure | [47] | ReLU | 0.89871
MovieLens 100k gender | [545, 52, 166, 184] | tanh | 0.31278
Ecommerce | [687, 45, 109, 31] | tanh | 0.24682
MovieLens crime | [1678, 363, 163, 95] | tanh | 0.01436
Facebook female gay | [67, 243] | sigmoid | 0.52290
MovieLens fantasy | [63, 24, 324, 45, 1895] | ReLU | 0.18981
MovieLens mystery | [21] | ReLU | 0.61191
YahooMovies gender | [72, 995, 832, 1939] | ReLU | 0.02167
MovieLens romance | [95, 47, 794] | ReLU | 0.99152
A-Card Goto Permeke | [169, 37, 206, 45, 193] | ReLU | 0.59104
MovieLens thriller | [93] | ReLU | 0.75717
MovieLens drama | [287, 1764, 32, 65] | tanh | 0.59338
KDD2015 | [130, 30, 99, 1388, 1533] | sigmoid | 0.44715
MovieLens children | [213, 69, 21, 182, 1130] | tanh | 0.88797
A-Card defect | [2193, 345, 962] | ReLU | 0.65196
MovieLens 1m gender | [532, 135, 1009] | ReLU | 0.30452
A-Card Goto Wezenberg | [1099, 994, 493, 627] | tanh | 0.07435
A-Card Goto Roma | [63] | ReLU | 0.24249
MovieLens comedy | [19, 398] | tanh | 0.00540
Flickr | [21, 25] | ReLU | 0.54377
MovieLens action | [45] | tanh | 0.94335
MovieLens animation | [288, 677, 772] | ReLU | 0.52020
A-Card Goto Zoo | [223, 2344] | sigmoid | 0.37128
MovieLens scifi | [23] | sigmoid | 0.56118
MovieLens 100k age | [48, 709] | ReLU | 0.11159
MovieLens documentary | [521, 78] | sigmoid | 0.20603
MovieLens western | [1345, 218, 1013] | ReLU | 0.86652
MovieLens 1m age | [1863, 93] | tanh | 0.14166
A-Card cashout | [27, 277, 1046, 391, 146, 168, 148, 85] | tanh | 0.01111
Facebook religion | [1414, 464, 71, 424, 60] | tanh | 0.09468
MovieLens horror | [171, 467] | tanh | 0.06355
Facebook male gay | [473, 293] | sigmoid | 0.10999
MovieLens filmnoir | [1242] | tanh | 0.60603
Facebook political | [27, 146] | tanh | 0.76728
Facebook IQ | [129, 32] | sigmoid | 0.52109
MovieLens musical | [67, 656, 20, 1204] | ReLU | 0.75767
Facebook gender | [76, 367] | ReLU | 0.10510
MovieLens war | [248, 35, 694, 855] | tanh | 0.52687
LibimSeTi | [96, 348, 36, 800, 47] | sigmoid | 0.01922

Table 3: Best hyperparameter selection for each data set from the random hyperparameter search procedure for the deep learning classification techniques.


Influence of Hyperparameters

When training the models and evaluating their generalization capacity, no improvement is found when using a pre-training step. Figure 4 shows the test classification error on the MovieLens1m data set for 500 different random configurations for which only the pre-training flag was changed. It is clear from this comparison that not pre-training the network results in a more robust and lower test classification error. Furthermore, pretraining the network heavily influences the training time of the network, which already is very high for this type of data.

The reason for the inferior performance of pretraining could be the following. The input constitutes one aspect of a person's behavior, e.g. location visiting data, and the model needs to learn to predict another aspect of that person's behavior, e.g. whether that person would trade in loyalty points. When pretraining a model in an unsupervised fashion, a generative model learns the variations present in the input data without taking into account the target classification variable. The applications for which DL has proven superior, such as object recognition, are often performed in a generative manner, and the pretraining phase contributes because the signal to be learned is already highly present in the data. For example, in the case of hand-drawn digits (for which pretraining has indeed proven valuable), learning the variations in the input with an unsupervised pretraining step indeed captures the number that digit belongs to. Learning input variations in human behavioral data without taking into account the specific class membership variable is much more complex, as often both are only related to a small extent (e.g. location visiting behavior and loyalty prediction). Tailoring a classification model by learning the underlying distribution of a data set has already been shown to be a delicate task. Indeed, in previous predictive analysis literature, discriminative models have consistently outperformed generative models [10, 39]. For behavioral data especially, the underlying distribution of human behavior is not known and, until now, no fitting feature generation model has been put forward. Attempts have been made in [25], demonstrating the fit of a Wallenius event model for human choice behavior. This finding is also corroborated by recent research in deep learning. As the field gains maturity, more researchers apply the models to more complex data with lower signal-from-noise separability. This coincides with a decrease in employing the pretraining generative step, which could indeed imply that a target variable is valuable in learning from complex data sets.

Figure 3: AUC for deep learning (DL) plotted against the highest AUC reached by the wide classification techniques (BEST) for all data sets in Table 1.

Figure 5 depicts the test classification error for the MovieLens1m data set when varying the nature of the activation function for 500 random parameter settings. It can be seen that overall the tanh nonlinearity gives the best results, and this is found on the bulk of the data sets analyzed in our study. Although ReLU units provide faster models and could be recommended in very high-dimensional settings for reducing computational complexity, their sparse representations result in lower predictive performance. When analyzing high-dimensional behavioral data for predictive purposes, it makes sense that each fine-grained behavioral aspect contributes to the final prediction. This has also been found in [10, 38, 6]; i.e. L2 regularization outperforms L1 regularization in this context when taking into account the raw features. In this multi-layered setting, each separate neuron in higher layers also seems to contribute to the prediction.

Finally, when looking at the effect of the number of hidden layers on the test classification error in Figure 6, no consistently better-performing setting can be found over one data set (MovieLens1m), nor over all data sets (see the configurations in Table 3). Practitioners should therefore resort to randomly testing several configurations with a varying number of hidden layers and choose the best-performing architecture setting.

Figure 4: Test classification error for the MovieLens1m data set for predicting gender without unsupervised pre-training (left) and with pre-training (right).

Distributed Feature Representation

One of the advantages of linear, shallow classifiers lies in their comprehensibility. The weights given by the model to each of the input features can be inspected and interpreted by humans. In some domains, model comprehensibility is mandatory before a model can be deployed in a business setting, and explaining how the model's predictions originate can help acceptance within an organization [16, 17, 35, 36].

Table 4 lists the features with the highest weights

given by logistic regression with L2 regularization

for the following classification tasks: predicting (1)

gender and (2) age from movie rating data, and

sigmoid tanh relu

20

25

30

35

40

45

Tes

t cla

ssifi

catio

n er

ror

Figure 5: Test classification error for the MovieLens1mdata set for predicting gender with the sig-moid non-linearity (left), the tanh function(middle) and the ReLU non-linearity (right).

Looking in more detail at these top-ranked features, we can observe that they indeed make sense. When predicting whether a user is male based on movie rating data, we see that higher weights are given to movies mainly targeted towards male users, such as Starship Troopers, Star Trek and Apollo 13. Looking at the weights which are discriminative for identifying younger users, we see horror movies like Scream and I Know What You Did Last Summer, as well as movies clearly targeted at younger people, such as The Princess Bride and Willy Wonka and the Chocolate Factory. For the Facebook data set as well, the feature weights are intuitive.



Figure 6: Test classification error for the MovieLens1m data set dependent on the number of hidden layers for predicting age (top) and predicting gender (bottom).

Features such as Barack Obama, The Daily Show and The Colbert Report clearly say something about a person's liberal mindset. When predicting whether a user has a high IQ, TV shows such as The Big Bang Theory and Lost receive higher weights.

As mentioned previously, the deep learning models learn distributed representations based on compositionality in the fine-grained features. Each neuron in this multi-layered setting assigns weights to lower-level concepts and thus allows for a complex many-to-many combination of all features on different levels. In contrast, shallow techniques such as logistic regression and support vector machines can be viewed as consisting of one neuron, assigning feature weights on one level only (see Table 4). We now take a closer look at some of the neurons identified by deep learning on the lowest level and try to assess whether the distributed low-level representation really does identify extra nuances in the features.

Table 5 shows five neurons in the first hidden layer of the best architecture (see Table 3) and the features for which they are most activated when predicting gender for the MovieLens1m data set. We observe that each neuron captures additional, separate nuances (bottom row in Table 5) besides only taking into account the obvious relation to the target variable, as is the case in a single-neuron setting. Neuron 1, for example, captures genre in the features by assigning higher weights to romance and drama movies. Neuron 2 focuses on action movies and comedies tailored towards male users, while neuron 4 discriminates towards male drama movies.


Neuron 3 clearly becomes active for sport movies, and neuron 5 identifies movies mostly targeted towards female users. The process of interpreting neurons requires domain-specific knowledge, and clearly multiple interpretations are possible. Moreover, this approach becomes infeasible when a much larger number of neurons is present [20]. However, we mainly aim at demonstrating the additional nuances learned by deep architectures in comparison to shallow ones, thereby demonstrating the value of distributed feature representation in this setting.
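The neuron-level inspection used for Tables 5 and 6 can be sketched as follows, with a scikit-learn network standing in for the deep architecture of this study; the fitted model mlp and the list feature_names are assumed placeholders.

# Sketch: for each neuron in the first hidden layer, list the input features
# (behaviors) with the largest incoming weights, as in Tables 5 and 6.
import numpy as np

# mlp is an already fitted MLPClassifier; mlp.coefs_[0] has shape
# (n_features, n_hidden_units) and holds the input-to-first-layer weights.
first_layer = mlp.coefs_[0]

for j in range(min(5, first_layer.shape[1])):       # inspect five neurons
    top = np.argsort(first_layer[:, j])[::-1][:10]  # ten strongest features
    print(f"Neuron {j + 1}:")
    for idx in top:
        print("   ", feature_names[idx])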

For the Facebook like data set, Table 6 shows four neurons that help identify a liberal mindset and the features for which these neurons have high weights. Again, each neuron can be associated with a higher-level category that groups similar likes, such as mostly male or female interests, hobbies or rock bands. Note that an intuitive interpretation is not straightforward for all neurons in the network.

In addition to these illustrations, we attempt to show more formally the added value present in these distributed feature representations. To this end, we transform the original high-dimensional data sets into three feature-reduced data sets and assess their predictive capability. The distributed features learned by DL are compared with (1) a feature selection technique which uses L1 regularization to identify relevant features, and (2) a feature engineering technique using SVD which projects the high-dimensional features onto as many components as there are neurons on the first layer of the DL architecture (see Table 3).
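A sketch of this comparison with generic components is shown below. The fitted network mlp (assumed to use tanh activations), the raw train and test matrices and the use of logistic regression as the downstream classifier are illustrative assumptions rather than the exact setup of this paper.

# Sketch: build three k-dimensional representations of the raw behavior matrix
# (first-layer DL activations, L1-based feature selection, truncated SVD) and
# compare them with the same downstream classifier and AUC.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def first_layer_activations(mlp, X):
    # tanh activations of the first hidden layer of a fitted MLPClassifier
    return np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])

k = mlp.coefs_[0].shape[1]   # number of first-layer neurons (assumed smaller than the number of raw features)

l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    threshold=-np.inf, max_features=k).fit(X_tr, y_tr)
svd = TruncatedSVD(n_components=k, random_state=0).fit(X_tr)

representations = {
    "DL first layer": (first_layer_activations(mlp, X_tr), first_layer_activations(mlp, X_te)),
    "L1 selection":   (l1_selector.transform(X_tr), l1_selector.transform(X_te)),
    "SVD":            (svd.transform(X_tr), svd.transform(X_te)),
}

for name, (Z_tr, Z_te) in representations.items():
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(Z_te)[:, 1])
    print(f"{name:15s} test AUC = {auc:.3f}")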

The results of this analysis are depicted in Table 7. For 40 out of 47 data sets, the latent features learned by DL result in better predictive performance compared to the selected or latent features resulting from the other two approaches. This demonstrates the value of the representations learned by DL and gives insight as to why deep learning techniques perform well on big behavioral data. Moreover, these results open up further research opportunities in terms of using the distributed nature of first-layer neurons (and, by consequence, higher-level neurons) for purposes of feature engineering.

Conclusion

In the past, deep learning techniques have resulted in significant performance improvements in fields such as object recognition and NLP. This paper performs a first exploratory study as to whether these improvements also extend to behavioral big data. We demonstrate the usefulness of learning deep, distributed representations of the many fine-grained behavioral features, while shedding light on when and why deep learning performs well.


Rank | MovieLens (gender) | MovieLens (age) | Facebook (liberal) | Facebook (intelligence)
1. Ulee's Gold | Scream | Barack Obama | Inception
2. Starship Troopers | Trainspotting | The Daily Show | The Big Bang Theory
3. Star Trek: First Contact | I Know What You Did Last Summer | Rent | How I Met Your Mother
4. Dr. Strangelove | Liar Liar | The L Word | Avatar
5. Monty Python and the Holy Grail | Pink Floyd - The Wall | Dixie Chicks | Fight Club
6. In the Name of the Father | Kiss the Girls | Charmed | Star Wars
7. The Ice Storm | Scream 2 | Yoga | TED
8. Tomorrow Never Dies | Rumble in the Bronx | Amelie | Paulo Coelho
9. A Clockwork Orange | Spawn | Weeds | Lost
10. Four Weddings and a Funeral | The Princess Bride | Brokeback Mountain | Music
11. Six Degrees of Separation | Heathers | Crash | Dexter
12. The Great Dictator | Monty Python and the Holy Grail | Little Miss Sunshine | The Mentalist
13. The Blues Brothers | Chasing Amy | The Colbert Report | Sheldon Cooper
14. The Peacemaker | Mystery Science Theater 3000: The Movie | The Beatles | My Personality
15. The Abyss | Star Wars | Green Day | The Office
16. McHale's Navy | Tomorrow Never Dies | Philosophy | Disney's The Lion King
17. Patton | Willy Wonka and the Chocolate Factory | CNN | Megan Fox
18. Sphere | Swingers | Sex and the City | Pink Floyd
19. Apollo 13 | Romeo and Juliet | True Blood | Burn Notice
20. Gattaca | Much Ado About Nothing | Travelling | Futurama

Table 4: Top weights with highest coefficient for logistic regression with L2 regularization. Higher scores indicate a higher probability of being (1) male when predicting gender and (2) young when predicting age for the MovieLens data set, (3) liberal when predicting political preference for the Facebook data set, and (4) high IQ when predicting intelligence for the Facebook data set.

Neuron 1 Neuron 2 Neuron 3 Neuron 4 Neuron 5

The Mirror has Two Faces McHale’s Navy Kingpin Apollo 13 Everyone Says I Love YouBed of Roses The Blues Brothers Cool Runnings Kundun Little WomenOld Yeller Mystery Science Theater 3000: The Movie Coneheads Raging Bull Romeo and JulietKiss the Girls Tomorrow Never Dies Ed White Squall The Truth About Cats and DogsFly Away Home Starship Troopers Pulp Fiction The Full MontyThe Adventures of Priscilla Monty Python and the Holy Grail The Madness of King George Beauty and the BeastTo Kill a Mockingbird Four Weddings and a Funeral Under Siege Searching for Bobby FischerMichael Star Trek: First Contact Ran The Sound of MusicLittle Women Apollo 13 Patton The Mirror has Two FacesWhen a Man Loves a Woman The Great Dictator The Deer Hunter Breakfast at Tiffany’sThe Fan 12 Angry Men Star Trek: First Contact Strictly BallroomRebecca Big Night Stealing BeautySome Kind of Wonderful Ulee’s Gold A Christmas CarolThe Postman Six Degrees of Separation Secrets and LiesSearching for Bobby Fischer Duck Soup The Pillow BookStrictly Ballroom Maverick Love JonesThe Sound of Music Dr. Strangelove Money TalksStand by Me Volcano Interview with the VampireThe Shining The Wedding Singer Bed of RosesEveryone Says I Love You A Clockwork Orange A Family Thing

Romance, drama Male, action, comedy Sport, buddy Male, drama Female

Table 5: Top weights with highest coefficient for five neurons in the deep learning algorithm for predicting gender on the MovieLens data set. Each neuron captures additional fine-grained information.


Neuron 1 Neuron 2 Neuron 3 Neuron 4

Gilmore Girls Two and a Half Men Cooking Green DayTravel AFI Heroes OasisMatchbox 20 Kings of Leon Reading Rascal FlattsGrey’s Anatomy Iron Maiden Summer Nights IncubusCounting Crows Nip/Tuck Music Fear and Loathing in Las VegasSeinfeld Jim Croce Writing Fall Out BoyFerris Bueller’s Day Off Guns N’ Roses laughingCoyote Ugly Led ZeppelinNapoleon DynamiteSex and the City

Female Male Hobbies Rock bands

Table 6: Top weights with highest coefficient for four neurons in the deep learning algorithm for predicting a liberal mindset on the Facebook data set. Each neuron captures additional fine-grained information.

As a practical contribution, we provide guidance regarding favorable hyperparameter configurations for learning these models.

Taking into account that the deep learning configurations were randomly selected, one can imagine the improvements possible when fine-tuning these architectures. It is important to note that, although deep learning seems to capture what any of the wide classification techniques grasp, the marginal improvements of the more complex models are overall quite small. The results do show, however, that deep learning is not recommended in problem settings where the data is characterized by low signal-from-noise separability.

A second important finding is that a generative pretraining phase does not improve performance in this discriminative setting. We conjecture that, unlike in object recognition tasks, the input variations learned through the pretraining phase do not contribute to the final classification task. This can originate from an inappropriate underlying generative process, one not tailored towards human behavioral data in relation to the subsequent prediction task. Another explanation we put forward is that the unsupervised learning cannot efficiently grasp the variations in the complex data without taking the target variable into account. Future research could focus on developing an unsupervised generative process which better fits the process of human-generated behavior. Alternatively, since the target variable is often not clearly related to the input data, it seems worthwhile to investigate a supervised pretraining step, such that input data variations can be explicitly related to the variable of interest.
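As an illustration of the kind of comparison underlying this finding, the sketch below contrasts a purely discriminative model on the raw behaviors with a pipeline in which an unsupervised restricted Boltzmann machine first learns a representation without using the labels. scikit-learn's BernoulliRBM is a simplified stand-in for the pretraining procedure of this study, and the train and test splits are assumed placeholders.

# Sketch: does unsupervised (generative) pretraining help? Compare a purely
# discriminative model on the raw binary behaviors with a pipeline in which a
# Bernoulli RBM is first fitted without labels and its hidden representation
# is then fed to the same discriminative model.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

raw_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pretrained_model = Pipeline([
    ("rbm", BernoulliRBM(n_components=100, learning_rate=0.05,
                         n_iter=20, random_state=0)),    # unsupervised step
    ("clf", LogisticRegression(max_iter=1000)),           # supervised step
]).fit(X_tr, y_tr)

for name, model in (("raw features", raw_model), ("RBM pretraining", pretrained_model)):
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:16s} test AUC = {auc:.3f}")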

The comparative analysis stresses that the performance of deep learning classifiers is not attributable to the additional complexity. The hierarchical, distributed representations learned by the deep models demonstrate that predictive power may result from these representations.


Data set DL Feature selection Feature engineering

BookCrossing 54.37 52.24 51.60
Facebook swl 54.21 51.64 50.87
Banking 64.33 55.58 55.59
TaFeng 64.43 60.62 63.32
YahooMovies age 61.44 57.91 55.13
Fraud 72.9 53.49 58.40
A-Card Goto MAS 73.60 69.37 71.50
Car 61.43 54.99 58.65
MovieLens adventure 50.09 57.50 54.54
MovieLens 100k gender 72.83 61.55 65.35
Ecommerce 75.05 57.12 58.15
MovieLens crime 56.78 52.71 51.92
Facebook female gay 63.09 60.56 59.88
MovieLens fantasy 67.09 50.00 56.28
MovieLens mystery 80.44 78.55 79.77
YahooMovies gender 77.35 66.89 70.86
MovieLens romance 73.1 55.74 50.70
A-Card Goto Permeke 63.28 62.64 62.95
MovieLens thriller 85.40 72.22 72.93
MovieLens drama 58.89 54.11 58.34
KDD2015 60.67 62.99 61.64
MovieLens children 73.72 70.76 69.72
A-Card defect 81.26 80.13 82.69
MovieLens 1m gender 78.52 74.10 78.14
A-Card Goto Wezenberg 54.68 52.76 53.70
A-Card Goto Roma 81.05 72.33 74.31
MovieLens comedy 57.28 59.19 54.38
Flickr 73.46 54.60 57.13
MovieLens action 59.53 59.49 55.62
MovieLens animation 70.41 62.67 60.69
A-Card Goto Zoo 82.85 71.27 72.06
MovieLens scifi 87.19 50.00 68.07
MovieLens 100k age 84.91 75.95 81.61
MovieLens documentary 63.61 54.52 59.09
MovieLens western 82.87 67.21 57.44
MovieLens 1m age 86.57 82.00 85.04
A-Card cashout 89.55 80.87 83.41
Facebook religion 82.33 75.44 78.09
MovieLens horror 69.26 58.50 58.00
Facebook male gay 82.44 81.09 81.32
MovieLens filmnoir 59.51 62.25 53.52
Facebook political 77.60 78.22 77.11
Facebook IQ 85.99 80.33 62.09
MovieLens musical 77.50 76.58 86.64
Facebook gender 88.54 62.89 61.23
MovieLens war 90.97 54.02 53.97
LibimSeTi 97.24 97.22 97.20

Table 7: AUC reached by transforming the original high-dimensional data sets onto a reduced feature set by taking the first-layer neurons (DL), by applying feature selection with L1 regularization, and by applying feature engineering with SVD.


Although not all neurons are easily interpretable and domain knowledge is often needed, one can see that more nuances are captured in comparison to local feature representation models. These nuances are related to a compositional hierarchy of the features in relation to the target variable. Taking into account the inherent complexity of behavioral data, the distributed representations contribute to better predictive performance and provide decision makers with insight and intuitions into the data at the lowest hierarchy level. A broad avenue for further research lies in the interpretation of the higher-level layers, which consist of multiple combinations of lower-level concepts.

On a more practical level, we have tried to provide insight into the practical implementation characteristics of deep learning. The multitude of hyperparameters can be overwhelming for researchers and practitioners when applying deep learning to this or any new application. Two hyperparameter choices are found to matter consistently. First, dense representations throughout the model achieve better predictive performance, more specifically by employing tanh activation functions. Second, pretraining does not achieve better predictive performance. One of the main drawbacks of applying deep learning is its computational and implementation complexity. This hurdle might currently render it infeasible in a practical setting, with training on very fine-grained data sets often taking multiple days. However, as the availability of GPUs on cloud computing instances becomes more widespread and Theano's support for multi-GPU use becomes less experimental, the computational burden could be reduced considerably.

References

[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[2] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

[3] D. Chicco, P. Sadowski, and P. Baldi. Deep autoencoder neural networks for gene ontology annotation predictions. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 533–540. ACM, 2014.

[4] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.

[5] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1237. Barcelona, Spain, 2011.

[6] J. Clark and F. Provost. Matrix-factorization-based dimensionality reduction in the predictive modeling process: a design science perspective. Technical report, Department of Information, Operations, and Management Sciences, New York University, USA, 2016.

[7] C. Cortes and V. Vapnik. Support vector machine. Machine Learning, 20(3):273–297, 1995.

[8] G. E. Dahl, N. Jaitly, and R. Salakhutdinov. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014.

[9] S. De Cnudde and D. Martens. Loyal to your city? A data mining analysis of a public service loyalty program. Decision Support Systems, 73:74–84, 2015.

[10] S. De Cnudde, D. Martens, T. Evgeniou, and F. Provost. A benchmarking study of classification techniques for behavioral data. 2017.

[11] L. Deng. Three classes of deep learning architectures and their applications: a tutorial survey. APSIPA Transactions on Signal and Information Processing, 2012.

[12] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[13] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256, 2010.

[15] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[16] M. S. Gonul, D. Onkal, and M. Lawrence. The effects of structural characteristics of explanations on use of a DSS. Decision Support Systems, 42(3):1481–1493, 2006.

[17] S. Gregor and I. Benbasat. Explanations from intelligent systems: Theoretical foundations and implications for practice. MIS Quarterly, pages 497–530, 1999.

[18] D. J. Hand. Classifier technology and the illusion of progress. Statistical Science, 21(1):1–14, 2006.

[19] G. Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.

[20] G. E. Hinton. Distributed representations. 1984.

[21] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[22] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.

[23] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, and R. Pascanu. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

[24] E. Junque de Fortuny, D. Martens, and F. Provost. Predictive modeling with big data: is bigger really better? Big Data, 1(4):215–226, 2013.

[25] E. Junque de Fortuny, D. Martens, and F. Provost. Wallenius naive Bayes. 2013.

[26] E. Junque de Fortuny, M. Stankova, J. Moeyersoms, B. Minnaert, F. Provost, and D. Martens. Corporate residence fraud detection. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1650–1659. ACM, 2014.

[27] E. Junque de Fortuny, M. Stankova, J. Moeyersoms, B. Minnaert, F. Provost, and D. Martens. Corporate residence fraud detection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1650–1659. ACM, 2014.

[28] M. Kosinski, D. Stillwell, and T. Graepel. Private traits and attributes are predictable from digital records of human behavior. National Academy of Sciences, 110(15):5802–5805, 2013.

[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[30] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[31] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[32] K. Li and T. C. Du. Building a targeted mobile advertising system for location-based services. Decision Support Systems, 54(1):1–8, 2012.

[33] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. In International Conference on Intelligent User Interfaces (IUI), pages 31–40. ACM, 2010.

[34] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik. Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling, 55(2):263–274, 2015.

[35] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183(3):1466–1476, 2007.

[36] D. Martens and F. Provost. Explaining data-driven document classifications. MIS Quarterly, 38(1), 2014.

[37] D. Martens, F. Provost, J. Clark, and E. J. de Fortuny. Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, 40(4), 2016.

[38] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM, 2004.

[39] A. Y. Ng and A. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS), 14:841, 2002.

[40] C. Perlich, F. Provost, and J. S. Simonoff. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4(Jun):211–255, 2003.

[41] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.

[42] M. Rokeach. Understanding human values. Simon and Schuster, 2008.


[43] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791–798. ACM, 2007.

[44] G. Shmueli. Research dilemmas with behavioral big data. Big Data, 5(2):98–119, 2017.

[45] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[46] A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, pages 2643–2651, 2013.

[47] W. Verbeke, D. Martens, and B. Baesens. Social network analysis for customer churn prediction. Applied Soft Computing, 14(3):431–446, 2014.

[48] A. Vieira. Predicting online user behaviour using deep learning algorithms. arXiv preprint arXiv:1511.06247, 2015.

[49] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding. Data mining with big data. Transactions on Knowledge and Data Engineering, 26(1):97–107, 2014.
