a novel feature engineering framework in digital …of advertising display, and to maximize the...

International Journal of Artificial Intelligence and Applications (IJAIA), Vol.10, No.4, July 2019

A Novel Feature Engineering Frameworkin Digital Advertising Platform

Saeid SOHEILY-KHAH and Yiming WU

SKYLADS Research Team, Paris, France

Abstract. Digital advertising is growing massively all over the world, and, nowadays, is the bestway to reach potential customers, where they spend the vast majority of their time on the Internet.While an advertisement is an announcement online about something such as a product or service,predicting the probability that a user do any action on the ads, is critical to many web applications.Due to over billions daily active users, and millions daily active advertisers, a typical model shouldprovide predictions on billions events per day. So, the main challenge lies in the large design spaceto address issues of scale, where we need to rely on a subset of well-designed features. In thispaper, we propose a novel feature engineering framework, specialized in feature selection using theefficient statistical approaches, which significantly outperform the state-of-the-art ones. To justifyour claim, a large dataset of a running marketing campaign is used to evaluate the efficiency ofthe proposed approaches, where the results illustrate their benefits.

Keywords: Digital Advertising, Ad Event Prediction, Feature Engineering, Feature Selection,Statistical Test, Classification, Big Data.

1 Introduction

Digital advertising, which has only been around for two decades, is one of the mosteffective manners for all sizes companies and businesses to expand their reach, findnew clients, and diversify their revenue streams. In the case, an ad event predictionsystem is critical to many web applications including recommender systems, websearch, sponsored search, and display advertising [1,2,3,4,5], and is a hot researchdirection in computational advertising [6,7].

The event prediction is defined to estimate the ratio of events such as videos,clicks or conversions to impressions of advertisements that will be displayed. Theimpression, which also called an ad (advertisement) view, refers to the point inwhich an advertisement is viewed once by a user, or displayed once on a web page.In general, ads are announcements online about something such as a product orservice, and the principal components in a marketer’s paid advertising campaigns.They are sold on a ’Pay-Per-Click’ (PPC) basis or even ’Pay-Per-Acquisition’(PPA), meaning the company only pays for ad clicks, conversions or any otherpre-defined actions, not ad views. Hence, the Click-Through Rate (CTR) and theConversion Rate (CVR) are very important indicators to measure the effectivenessof advertising display, and to maximize the expected value, one needs to predict

DOI: 10.5121/ijaia.2019.10403 21


the likelihood that a given ad will be an event, in the accurate way possible. Asresult, the ad prediction systems are essential to predict the probabilities of a userdoing an action on the ad or not, and the performance of prediction model has adirect impact on the final advertiser and publisher revenues and plays a key role inthe advertising systems. However, due to the information of advertising properties,user properties, and context environment, the ad event prediction is very fancy,challenging and complicated, and is a large-scale learning problem. It can influenceads pricing, and pricing impacts the advertisers return on their investments andrevenue for the publishers.

Nowadays, almost all web applications, in multi-billion dollar digital advertisingand marketing industry, relied heavily on the ability of learned models to predict adevent rates accurately, quickly and reliably [8,9,10]. Hence, even very small amountof improvement in ad event prediction accuracy would yield greater revenues in thehundreds of millions of dollars. While, with over billions daily active users (e.g. 3.8billion internet users, 5 billion unique mobile users, and around 2.8 billion activesocial media users) and over millions active advertisers (e.g. more than 5 millionactive advertiser on Facebook, 8 million business profile, with more than 1 millionactive advertisers globally on Instagram), a typical industrial model should providepredictions on billions of events per day. Therefore, one of the main challenges indigital advertising industry lies in the large design space to address issues of scale. Inthe case, we need to rely on a set of well-designed features, for an efficient ad eventprediction system. However, to capture the underlying data patterns, selecting,encoding, and normalizing the proper features has also pushed the field.

In this research paper, we discuss in detail the machine learning methods for adevent prediction, propose a novel feature engineering framework and a dedicatedapproach to predict more effectively whether an ad will be an event or not. Thenovel enhanced framework mainly facilitates feature selection using the proposedstatistical techniques, thereby enabling us to identify, a set of relevant features. Inthe next section, we review the state-of-the-art of different classification techniqueswhich widely used for ad event prediction applications. Next, in Section 3, we discussmachine learning and data mining approaches for feature engineering includingfeature selection, feature encoding and feature normalizing (scaling). In Section 4,we characterize the proposed feature engineering framework which could be directlyapplicable in any ad event prediction system. The deeply conducted experimentsand results obtained on a running marketing campaign are discussed in Section 5.Lastly, Section 6 concludes the paper.

The main contributions of this research work are to a) introduce an enhancedframework for ad event-prediction by analyzing the huge amount of real-world data,where the framework includes pipelines for data pre-processing, feature selection,encoding and scaling, as well as training and prediction process, b) propose twonovel adjusted statistical measures for feature selection, called adjusted chi-squared

22


test and adjusted mutual information, and c) illustrate through deeply experimentalstudies that the introduced framework significantly outperform the alternative ones.Notice that, to simplify, in the remainder of the paper, we consider the events asclicks. Of course, all the design choices, experiments and results can however bedirectly extended to any other events such as conversions, videos, etc.

2 State-of-the-art

Over the last decade, researchers have proposed many models for digital advertis-ing and marketing that are usually based on machine learning methods, and there-fore, different classification techniques such as logistic regression, gradient boosting,naive Bayesian, neural network and random forest have been widely used for adevent prediction applications.

2.1 Logistic regression

In statistical modeling, regression analysis is a statistical process to estimate therelationships among features. It includes many techniques for modeling and ana-lyzing several features, when the focus is on the relationship between a dependentfeature and one or more independent features (or predictors). More explicitly, re-gression analysis helps one understand how the typical value of the dependentfeature changes when any one of the independent features is varied, while the otherindependent features are held fixed. Regression is the one of the most basic andcommonly used predictive analysis, where there are a variety of different regres-sion approaches such as linear regression, nonlinear regression, logistic regression,nonparametric regression, etc.

In the literature, (logistic) regression models have been used by many researchersto solve the ad event prediction problems for advertising [11,12,13,8]. In the case,logistic regression predicts the possibility of an event as:

P (y = 1|x) = hθ(x) = 11 + exp(−θTx) (1)

and the probability of non event as:

P (y = 0|x) = 1− hθ(x) (2)

where x stands for all the feature in the data, and y is the event label (1 stands foran event and 0 stands for no event.). Lastly, θ is the weight that logistic regressionassigned to all the feature. Note that the θ will change through the time, until itreach a optimized point, when the cost function is minimized.

23


2.2 Gradient boosting

One of the most powerful machine learning techniques, for different regression andclassification problems, is gradient boosting. It produces a prediction model in theform of hybrid weak models, typically decision trees [14]. The boosting notion cameout of the idea of whether a weak learning model can be changed to become better.The gradient boosting builds the model in a stage-wise manner like other boostingapproaches do, and generalizes them by allowing the optimization of a loss function.The term ’gradient boosting’ represents ’gradient descent’ plus ’boosting’, wherethe learning procedure sequentially fits novel models to provide a more accurateresponse estimation. In a nutshell, the principle idea behind it, is to constructthe novel base-learners to have maximal correlation with the negative gradient ofthe loss function, associated with the whole hybrid model. Mathematically, it is asequence of ensemble models, where each ensemble models can be regarded as atree based model. Gradient boosting technique, practically, is widely used in manyprediction applications due to its easy use, efficiency, accuracy and feasibility [8,15],as well as the learning applications [16,17].

2.3 Bayesian classifier

Bayesian classifiers, which refer to any classifier based on Bayesian probability ora naive Bayes classifier, are statistical approaches that predict class membershipprobabilities. They work based on the Bayes’ rules (alternatively Bayes’ law orBayes’ theorem), and the features are assumed to be conditionally independent.However, practically, even in spite of assumption of the features dependencies, theyprovide satisfying results. In general, Bayesian classifiers are easy to implement andfast to evaluate, where they just need a small number of training data to estimatetheir parameters. The main disadvantage is that Bayesian classifiers make a verystrong assumption on the shape of data distribution. In addition, they can notlearn interactions between the features, and suffer from zero conditional probabilityproblem (division by zero) [18], where one simple solution would be to add someimplicit examples. Furthermore, computing the probabilities for continuous featuresis not possible by the traditional method of frequency counts. Nevertheless, somestudies have found that, with an appropriate pre-processing, Bayesian classifiers canbe comparable in performance with other classification algorithms [10,19,20,21]. Inthe last decade, there is a very interesting research work by Microsoft team wherethey used a Bayesian online learning algorithm for the event prediction [10].

2.4 Neural network

A neural network is a network or circuit of neurons (or nodes), or according to whatis considered acceptable today, an artificial neural network, composed of artificial

24


neurons. The neural networks are modeled based on the same analogy to the humanbrain working, and are a kind of artificial intelligence based approaches for ad eventprediction problems [22]. Neural networks algorithms benefit from their learningprocedures to learn the relationship between inputs and outputs by adjusting thenetwork weights and biases, where the weight refers to strength of connectionsbetween two units (i.e. nodes) at different layers. Thus, they are able to predict theaccurate class label of the input data.

In the litterateurs, the neural networks also have been used by many researchersin the ad event prediction scope. In [23], authors extended Convolutional NeuralNetwork (CNN) for click prediction, however they are biased towards interactionsbetween the neighboring features. Most up to date, authors, in [24], proposed afactorization machine-supported neural network algorithm, to investigate potentialof training neural networks to predict ad clicks based on the categorical features.However, it is limited by the capability of factorization machines. In general, amongdeep learning frameworks, Feed Forward Neural Networks (FFNN) and Probabilis-tic Neural Networks (PNN) are claimed to be the most competitive algorithms forpredicting ad events.

2.5 Random forest

Random forest [25] is an ensemble learning approach for regression, classificationand such other tasks as estimating the feature importance, which operates by con-structing a plenty of decision trees. Particularly, random forest technique is a com-bination of decision trees that all together produce predictions and deep intuitionsinto the data structure. While in standard decision trees, each node is split usingthe best split among all the features based on the Gini index, in random forest,each node is split among a small subset of randomly selected input features to growthe tree at each node. Therefore, in random forest, each tree will learn the dataindependently, the predicted result will be generated from the voting of the trees.For each tree during its learning process, the tree splits on one feature which canminimize the Gini index, defined as:

G =M∑m=1

1∑k=0

pmk(1− pmk) (3)

and with considering the depth of tree as |T |, it can be formulated as:

G =M∑m=1

1∑k=0

pmk(1− pmk) + α|T | (4)

where m stands for mth leaf, k stands for the kth class, and α is a regularizationparameter. The pmk is the fraction of samples in class k, in leaf m. Here, for each

25


split, the aim is to minimize the Gini index, and the predicted result of an obser-vation is the most commonly occurring class in the leaf which the observation isin.

This strategy yields to perform very well in comparison with many other classi-fiers such as support vector machine, neural network and nearest neighbor. Indeed,it makes them robust against the over-fitting problem as well as an effective tool forclassification and prediction [26,27]. However, applying decision trees and randomforests to display advertising, has additional challenges due to having categoricalfeatures with very large cardinality and the sparse nature of the data, in the liter-ature, many researchers have used them in predicting ad events [28,29,30].

Nevertheless, one of the most vital and necessary steps in any event predictionsystem is to mine and extract features that are highly correlated with the estimatedtask. Moreover, many experiment studies are conducted to show that the featureengineering improves the accuracy of ad event prediction systems. The traditionalevent prediction models mainly depend on the design of features, while the featuresare artificially selected, encoded and processed. In addition, many successful solu-tions in both academia and industry rely on manually constructing the syntheticcombinatorial features [31,32]. Because, the data sometimes has a complex mappingrelationship, and taking into account the interactions between the features is vital.In the next section, we discuss about state-of-the-art of the feature engineeringapproaches, which can be considered as the core problem to digital advertising andmarketing, prior to introduce our proposed approach in feature engineering andevent prediction.

3 Feature engineering

Feature engineering is the fundamental to the application of machine learning, dataanalysis and mining as well as mostly all artificial intelligence tasks, and generally,is difficult, costly and expensive. In any artificial intelligence or machine learningalgorithm (e.g. predictive and classification models), the features in the data arevital, and they dramatically influence the results we are going to achieve. Therefore,the quantity and quality of the features (and data) have huge influence on whetherthe model is good or not. In the following, we discuss the data pre-processing processand feature engineering in more detail and we present the most well-used methodsin the case.

3.1 Feature selection

Feature selection is the process of finding a subset of useful features and removingirrelevant features to use in the model construction. It can be used for a) simplifica-tion of models to make them easier to expound, b) reduce training time consump-tion, c) avoid the curse of dimensionality and etc. In simple words, feature selection

26


determines the accuracy of a model and helps remove useless correlation in thedata that might diminish the accuracy. In general, there are three types of featureselection algorithms: embedded methods, wrapper methods and filter methods.

Embedded methods Embedded approaches learn which features best contributeto model accuracy while it is being created. It means that some learning algorithmscarry out the feature selection as part of their overall operation, such as randomforest and decision tree, where we can evaluate the importance of features on a clas-sification task. The most common kind of embedded feature selection approachesare regularization (or penalization) methods, which inset additional restrictions intothe optimization of a predictive algorithm. Decision trees and random forests arethe other commonly used techniques in embedded feature selection. But, how doesrandom forest select features?

Random forest technique consists of hundreds of decision trees, where each ofthe trees built over a random extraction of features as well as data, and therefore,none of the trees will not encounter all the features. This characteristic guaranteesthat the trees be de-correlated and less prone to over-fitting. Each node of the treeis a question based on one or combination of features, where the tree divides intotwo groups. The more similar among themselves in one group and the different fromthe ones in the other group. Lastly, the importance of each feature is derived fromhow ’pure’ each of the groups is. However, in the case, creating such a good modelis challenging due to the time complexity of training models.

Wrapper methods Wrapper methods consider the selection of a subset of usefulfeatures as a search problem, where different combinations are constructed, eval-uated and then compared to other ones. A predictive model used to assess thecombination of features and assign a score based on the accuracy of the model,where the search process could be stochastic, methodical, or even heuristic. In thefollowing, we discuss two well-known wrapper methods to select the best feature(or subset of features), in more details.

Genetic algorithmMathematically, feature selection is formulated as a combinatorial optimizationproblem, where the optimization function is the generalization performance of thepredictive model, represented by the error on a selection data. As a complete selec-tion of features would evaluate lots of different combinations, the process requireslots of computational work and even impracticable, if the number of features is big.Therefore, we need intelligent methods which allow to perform feature selection inpractice.

One of the most advanced algorithms for feature selection is Genetic Algorithm(GA), which is a stochastic method for function optimization based on the mechan-

27


ics of natural genetics and biological evolution. GA, initially introduced in [33,34],is a type of optimization algorithm to find the optimal solution(s) of a given com-putational problem that maximizes or minimizes a particular function. It operateson a population of individuals to produce better and better approximations. In thecase, each individual represents a predictive model, and the number of genes is thetotal number of features. Genes are binary values and represent the inclusion ornot of particular features in the model. The main steps of the proposed featureselection algorithm using GA are as follows:

– Create and initialize the individuals in the population– Fitness assignment: assign the fitness to each individual– Selection: choose the individuals to recombine for the next generation– Crossover: recombine the selected individuals to generate a new population– Mutation: change randomly the value of some features in the offsprings– Repeat the above step till termination criterion is satisfied

Generally, the genes of the individuals are usually initialized at random. Toevaluate the fitness, we need to train the classification (predictive) model with thetrain data, and then evaluate its error with the test data. Obviously, a high errorrate means a low fitness, and therefore, those individuals with greater fitness willhave a greater probability of being selected for recombination. After that, selectionoperator chooses the individuals according to their fitness level, that will recombinefor the next generation. Once the selection operator has chosen a pre-defined sizeof the population, the crossover operator recombines the selected individuals togenerate a new population. In practice, the crossover operator picks two individualsat random and combines them to get offsprings (children) for the new population,until the new population has the same size than the old one. Since the aboveoperator can generate offsprings that are very similar to the parents, it might causea new generation with low diversity. The mutation operator solve this problemby changing the value of some features in the offsprings at random (randomlychanging the state of the gene). The whole fitness assignment, selection, crossoverand mutation process is repeated until a stopping criterion is satisfied. As a sequel,genetic algorithms can select the best subset of features for the predictive model,but they require a lot of computation.

In a nutshell, the most common advantages of the genetic algorithms are:a) they usually perform better than traditional feature selection techniques,b) they can manage data with huge number of features,c) they don’t need specific knowledge about the under study problem,andd) they can be easily paralelized in computer clusters.

However, they might be very expensive in computational terms, since evalua-tion of each individual requires building a classification (or predictive) model.

28


Sequential searchSequential feature selection is one way of dimensionality reduction to avoid over-fitting by reducing the complexity of the model. It learns which features are mostinformative at each time step, and then chooses the next feature depending onthe already selected features. So, it automatically selects a subset of features thatare most relevant to the problem to reduce the generalization error or to improvecomputational efficiency of the model by removing irrelevant features.

Feature subset selection is vital in a number of situations, such as: a) featuresmay be expensive to obtain and (/or) train, b) we may want to extract meaningfulrules from the classifier. Further more, fewer features means fewer parameters foroptimization, mining and learning tasks, as well as reducing complexity and run-time. To do so, one need a search strategy to select candidate subsets and anappropriate objective function to evaluate the candidates. Since the exhaustiveevaluation of feature subsets, for N number of features, involves 2N combinations,a good search strategy is therefore needed to explore the optimal combination offeatures. In addition, the objective function evaluates the subsets and returns ameasure of their goodness criteria, a feedback signal used by the search strategy toselect new candidate subsets.

In simple words, given a feature set x = {xi|i = 1 : N}, we want to find thesubset of x as xsubset = {xj |j = 1 : M}, w.r.t. j < i, which optimizes a pre-definedobjective function, ideally the probability of correct classification. Therefore, thesearch strategy corresponds to the following steps:

– Initialize the algorithm with an empty set (xsubset =[]). so that the M , the sizeof the subset, is 0.

– If the feature xk, k ∈ 1 : N , temporary added to xsubset, maximizes the objectivefunction, then add it to subset xsubset as an additional feature.

– Repeat the above step till there is no improvement in the objective function orthe termination criterion is satisfied.

Of course, the objective function (or the termination criterion) could be anypre-defined function, which achieves the user goal.

Filter methods Filter feature selection approaches apply a statistical test toassign a goodness scoring value to each feature. The features are ranked by theirgoodness score, and then, either selected to removed from the data or to be kept.These filter selection methods are generally univariate and consider the featureindependently (e.g. chi-square test), or in some cases, with regard to the dependentfeature (e.g. correlation coefficient scores). In a nutshell, they are independent tothe type of the predictive model, and therefore, the result of a filter method wouldbe more general rather than a wrapper approach.

29


Typically, the filter approaches use one evaluation criteria from the intrinsicconnections between the features to score a feature (or even a subset of features).The common evaluation criteria are correlation, mutual information, distance andconsistency metrics.

3.2 Feature encoding

In machine learning, when we have categorical features, we often have a major issue:how to deal with categorical features? Practically, one can postpone the problemusing a data mining or machine learning model which handle the categorical features(e.g. k-modes clustering), or deal with the problem (e.g. label encoding, one-hotencoding, binary encoding)

When we use such a learning model with categorical features, we mostly havethree types of models: a) models handling categorical features accurately, b) modelshandling categorical features incorrectly, or c) models do not handling the categor-ical features at all.

Therefore, there is a need to deal with the following problem. Feature encodingpoints out to transforming a categorical feature into one or multiple numeric fea-tures. One can use any mathematical or logical approach to convert the categoricalfeature, and hence, there are many methods to encode the categorical features, suchas: a) numeric encoding, which assigns an arbitrary number to each feature cate-gory, b) one-hot encoding which converts each categorical feature with m possiblevalues into m binary features, with one and only one active, c) binary encodingto hash the cardinalities into binary values, d) likelihood encoding to encode thecategorical features with the use of target (i.e. label) feature. From a mathematicalpoint of view, it means a probability of the target, conditional on each categoryvalue, and e) feature hashing, where a one-way hash function convert data into avector (or matrix) of features.

3.3 Feature scaling

Most of the times, the data will contain features highly varying in units, scalesand ranges. Since, most of the machine learning and data mining algorithms useEucledian distance between two data points in their computations, this makes aproblem. To suppress this effect, we need to bring all features to the same level ofunit, scale or range. This can be attained by scaling. Therefore, feature scaler is autility that converts a list of features into a normalized format suitable for feeding indata mining and learning algorithms. In practice, there are four common methodsto perform feature scaling: a) min-max scaling to rescale the range of features in[0, 1] or [-1, 1], b) mean normalisation to normalize the values between -1 and 1,c) standardisation, which swaps the values by their Z scores, and d) unit vector,

30


where feature scaling is done in consideration of the entire feature vector to be ofunit length.

Notice that, generally, in any algorithm that computes distance or assumesnormality (such as nearest-neighbor), we need to scale the features, while featurescaling is not indispensable in modeling trees, since tree based models are notdistance based models and can handle varying scales and ranges of features. Asmore examples, we can speed up the gradient descent method by feature scaling,and hence, it could be favorable in training a neural network, where doing a featuresscaling in naive Bayes algorithms may not have much effect.

4 The design choices

Algorithm 1 presents briefly the proposed feature engineering framework, where inthe following, we explain in detail the different steps of proposed feature learningapproach for ad event prediction system.

Algorithm 1 The proposed feature engineering frameworkinput: <data> raw data, thresholdoutput: featurescat (categorical features), featuresnum (numerical features)

function pre-processing(data)remove duplication datarebild (or drop) missing (or incomplete) valuesremove (redundant) features with zero (and low) variance

return data

function feature selection(data)run proposed adjusted chi-squared-test (or adjusted mutual information)

return featurescat, featuresnum

function feature encoder(featurescat)for each feature i in featurescat

d value ⇐ number of distinct values for feature iif d value>threshold

do string indexerdo feature hasher

else (<threshold)do one-hot-encoder

end ifend for

return encoded featurescat

function feature scaler(featuresnum)do normalization

return normalized featuresnum

31


Before doing any data processing and learning algorithm, we need to know theenvironment of a digital advertising (marketing). Marketing is the set of activities acompany uses to advertise and sell products, services, jobs, etc. It includes reachingexisting customers with additional offers and attracting new customers. Marketingagencies utilize various strategies, from simple ads to offering discounts, to enticepeople to buy their goods. Thereby, the most important terms in a digital adver-tising market environment are:a) advertisers who pay the money to get their advertisement shown; they run mar-keting campaigns and interpret the data,b) demand side platform (DSP) which used to purchase advertising from a market-place in an automated fashion,c) publishers who get money for showing the ads on their websites,d) sell-side platform (SSP),e) display ads,f) advertising agencies, often referred to creative agencies, who create, make plan,and handle advertising and sometimes other forms of promotion and marketing fortheir clients,g) audience or online users who are in such away the prospective buyers, andh) campaigns which are the efforts made, usually over a pre-defined time period,to advertise, and as result generate some events (e.g. sell something).Notice that, many DSPs partner with third-party data providers to offer the adver-tisers as much information as possible. Additionally, many DSPs allow customersto import their own data from a data management platform (DMP).

In the case, there are lots of sensors to record data from advertisers, publish-ers, audience, online users and etc, over time, and therefore, there are plenty ofrecorded information, attributes and measures in an executed marketing campaign.For instance, the logs services enable advertisers to access the raw, event-level datagenerated through the online platform in different ways. However, we are not in-terested in all of them. Lots of recorded information and data are not useful oravailable for us, even they increase the complexity. So, at the first step, we prunethe raw data, before doing any mining task.

4.1 Data cleaning and pre-processing

While unreliable data has a highly destructive effect on the performance, data clean-ing is the process of detecting and refining (or deleting) corrupt, outlier, anomalyor inaccurate data from a dataset. It refers to identifying defective, inaccurate, er-roneous, inconsistent or irrelevant parts of the data, prior to replacing, modifying,removing or even reclaiming the dirty or coarse data, as well as removing any du-plicate data. In simple words, the data cleaning converts data from an original rawform into a more convenient format.

32


Typically, in the case of digital marketing, when we receive the data, there arelots of duplication values, because of lack of centralized and accurate data gathering,recording or perfect online report generator tools. Of course, knowing the source ofduplication can help a lot in the cleaning process. However, in the more autonomousway, for data cleaning, even we can rely on the historical data. Another stage indata cleaning is rebuilding missing or incomplete data, where there are differentsolutions depending on the kind of problem such as time series analysis, machinelearning, etc, and it is very difficult to provide a general solution. But, before doingthe data cleaning task, we have to figure out the reason why data goes missing,whereas the missing values happen in different manners, such as at random or not atrandom. Missing at random means that the data trends to be missing is not relevantto the missing data, but it is related to some of the observed data. Additionally,some missing values have nothing to accomplish with their hypothetical values orwith the values of other features (i.e. variables). In the other hand, the missing datacould be not at random. For instance, people with high salaries generally do notwant to reveal their incomes, or the females generally do not want to reveal theirages. Here the missing value in ’age’ feature is impacted by the ’gender’ feature.So, we have to be really careful before removing any missing data. Finally, whenwe figure out why the data is missing, we can decide to abandon missing values, orto fill.

As a summary, the most important key benefits of data cleaning in digital mar-keting are: a) having accurate view of customers, users and audiences; b) improvingdata integration; and c) increasing revenue and productivity.

The customers and online users are the exclusive sources of any analysis andmining task and are central to all business decision making. However, they arealways changing. Their natures, behaviours, likes and dislikes, their habits, as wellas their expectations are in a constant stage of change. Hence, we need to remainon top of those fluctuations in order to make smart decisions. Also, integrating datais important to gain a complete view of the customers, users and audiences, andwhen data is dirty and laden with duplicates and errors, data integration processwill be difficult. In typical, multiple channels and multiple departments often collectduplicate data for the same user. With data cleaning, we can omit the useless andduplicate data and integrate it more effectively;

4.2 Feature selection

In feature selection, we rely on the filter methods, and we try to fit two proposedadjusted statistical measures (i.e. mutual information, chi square test) to the ob-served data, then we select the features with the highest statistics. Suppose we havea label variable (i.e. the event label) and some feature variables that describe thedata. We calculate the statistical test between every feature variable and the labelvariable and observe the existence of a relationship between the variable and the

33


label. If the label variable is independent of the feature variable, we can discardthat feature variable. In the following, we present the proposed statistical measuresin detail.

Adjusted chi-squared test A very popular feature selection method is chi-squared test (χ2-test). In statistics, the chi-squared test is applied to test the in-dependence of two events (i.e. features), where two events X and Y are defined tobe independent, if P (XY ) = P (X)P (Y ) or, equivalently, P (X|Y ) = P (X) andP (Y |X) = P (Y ). The formula for the χ2 is defined as:

χ2df =

∑ (Oi − Ei)2

Ei(5)

where the subscript df is the degree of freedom, O is the observed value and E theexpected value. The degrees of freedom (df) is equal to:

df = (r − 1)× (c− 1) (6)

where r is the number of levels for one categorical feature, and c is the numberof levels for the other categorical feature. After taking the following chi-squarestatistic, we need to find p-value in the chi-squared table, and decide whether toaccept or reject the null hypothesis (H0). The p-value is the probability of observinga sample statistic as extreme as the test statistic, and the null hypothesis is thecase that two categorical features are independent. Generally, small p-values rejectthe null hypothesis, and very large p-values means that the null hypothesis shouldnot be rejected. As result, the chi-squared test gives a p-value, which tells if the testresults are significant or not. In order to perform a chi-squared hypothesis test andget the p-value, one need a) the degree of freedom and b) the level of significance(α), while the default value is 0.05 (5%).

Like all non-parametric data, the chi-squared test is robust with respect to thedistribution of the data [35]. However, it has difficulty of interpretation when thereare large numbers of categories in the features, and tendency to produce relativelyvery low p-values, even for insignificant features. Furthermore, chi-squared test issensitive to sample size, which is why several approaches to handle large data havebeen developed [36]. When the cardinality is low, it would be a little difficult to geta null hypothesis rejection, whereas a higher cardinality will be more intended toresult a rejection.

In feature selection, we usually calculate the p-values of each feature, thenchoose those ones which are smaller than the ’preset’ threshold. Normally, we useα = 0.95, which stands for a threshold of 5% significance. For those features withinthis threshold, smaller p-value stands for better feature. However as mentionedbefore, higher cardinality will always cause a lower p-value. This means that thefeatures with higher cardinality, e.g. user identification, or site URLs, are always

34


having lower p-values, and in turn, to be a better feature, which may not alwaysbe true.

In order to find a more reliable measure other than simply using p-value fromchi-squared test, we proposed a new measure by adding a regularization term onthe p-values (pv) of features, called ’adjusted p-value’ (padj). The new proposedstatistical measure, padj, is defined as:

padj =χ2

1−pv,df − χ21−α,df

χ21−α,df

(7)

where α is the level of significance, and df is the degrees of freedom. By using thisquantity, we are penalizing on the features with higher cardinality. Simply to say,we are trying to see how further by percentage the critical value correspondingto pv, the χ2

1−pv,df , is crossing the critical value χ21−α,df , corresponding to a given

significance level 1− α. Note that, the χ21−α,df will be very big for high cardinality

features due to higher degree of freedom, and it is regarded as a penalization term.The penalization could also be softer, if we take the logarithm of the critical

value χ21−α,df . In the case, the adjusted p-value with soft penalization, padj, can be

formulated as:

padj =χ2

1−pv,df − χ21−α,df

log(χ21−α,df )

(8)

For the two proposed above measures, higher value stands for better feature.

Adjusted mutual information Similar to the Chi-square test, the Mutual Infor-mation (MI) is a statistic quanitity which measures how much a categorical variabletells another (mutual dependence between the two variables). The mutual informa-tion has two main properties; a) it can measure any kind of relationship betweenrandom variables, including nonlinear relationships [37], and b) it is invariant underthe transformations in the feature space that are invertible and differentiable [38].Therefore, it has been addressed in various kinds of studies with respect to featureselection [39,40]. It is formulated as:

I(X;Y ) =∑

x∈X,y∈YP (x, y) log

( P (x, y)P (x)P (y)

)(9)

where I(X;Y ) stands for the ’mutual information’ between two discrete variablesX and Y , P (x, y) is the joint probability of X and Y , and P (x) and P (y) are themarginal probability distribution of X and Y, respectively. The MI measure is anon-negative value, and it is easy to deduce that if X is completely statisticallyindependent from Y , we will get P (x, y) = P (x)P (y) s.t. x ∈ X, y ∈ Y , whichindicates a MI value of 0. The MI is bigger than 0, if X is not independent from Y .

35


In simple words, the mutual information measures how much knowing one ofthe variables reduces uncertainty about the other one. For example, if X and Yare independent, then knowing X does not give any information about Y and viceversa, so their mutual information is zero.

In the case of feature selection using the label column (Y), if MI is equal to 0,then X is considered as a ’bad’ single feature, while the bigger MI value suggestsmore information provided from the feature X, which should be remained in theclassification (predictive) model. Furthermore, when it comes to optimal featuresubset, one can maximize the mutual information between the subset of selectedfeatures xsubset and the label variable Y , as:

ζ = arg maxsubset

I(xsubset;Y ) (10)

where |subset| = k, and k is the number of features to select. The quantity ζ iscalled ’joint mutual information’, and its maximizing is an NP -hard optimizationproblem.

However, the mutual information is subject to a Chi-square distribution, thatmeans we can convert the mutual information to a p-value as a new quantity. Thisnew adjusted measure, called MIadj, will be more robust than the standard mutualinformation. Also, we can rule out those features that is not significant based onthe calculated p-value. Normally:

2N ×MI ∼ χ2df (11)

where the N stands for the number of data samples, and similar as before, dfis degree of freedom. So simply to say, our proposed new filtering rule (MIadj) isdefined by:

2N × I(X;Y )− χ0.95,df(X,Y ) > 0 (12)

The bigger MIadj, the better is. Some features will be ruled out if their newadjusted measures are negative, which indicate the mutual information are notsignificant comparing to their degrees of freedom.

4.3 Feature encoding

After selecting the proper features, we need to format the data, which can beaccepted by the training model. Practically, we do the training and predictionprocess tasks using Spark in Yarn, because we have nearly forty million records toanalyze on a weekly basis. Specifically, we use the StringIndexer, which encodes thefeatures by the rank of their occurrence times, for high cardinality features abovea predefined threshold, and one hot encoder for the features whose unique levelsless then the predefined threshold. We also hash the high cardinality features to

36


ensure that we are formatting the data without loosing too much information. In anutshell, in our ad event prediction case, there are some extremely high cardinalityfeatures like user ids, or page urls with millions levels on weekly basis. To keep mostof the useful information without facing the risk of explosion of feature numbersat the meantime, it would be preferable to hash them rather than doing one-hotencoding.

4.4 Feature scaling

In the last step of our proposed feature engineering framework, using the max-minscaling method, we normalize the features, if needed.

5 Experimental study

In this section, we first describe the dataset used to conduct our experiments, thenspecify the validation process, prior to present and discuss the results that weobtained.

5.1 Data description

To clarify our claim in ad event prediction system, we used a large real-worlddataset of a running marketing campaign. The dataset is a private activity reportfrom MediaMath digital advertising platform, is very huge, and the entire datasetis stored on cloud storage (i.e. Amazon S3) of Amazon Web Services (AWS). Itcomprises over 40 millions of recorded ads data (on weekly basis), each one withmore than 80 pieces of information, which can be categorized in two main group:a) attributes which can be considered as features, and b) event labels for machinelearning and mining tasks.

Attributes (features) The input features for machine learning algorithms, suchas user id, site url, browser, date, time, location, advertiser, publisher and channel.Notice, that it is highly probable that one generates new features from the existenceones. For instance, from start-time and stop-time, we can produce the durationfeature, or from date, we can generate the new feature day of week, which will bemore meaningful in the advertising campaign.

Measures (labels) Measures are the target variables data which acts as labelsin machine learning and data mining algorithms, such as impressions, clicks, con-versions, videos, and spend. Note that, the mentioned measures (i.e. labels) areneeded for supervised learning algorithms, while in non-supervised algorithm onecan ignore them.

37


5.2 Validation process

In our experimental studies, we here compare the proposed feature engineeringframework in the ad event prediction system with the state-of-the-art and thewell-used feature engineering based event prediction methods. To do so, for ourcomparisons, we rely on the accuracy, recall or true positive rate, precision orpositive predictive value, F1-score, which is a harmonic mean of precision and recall,as well as the area under precision-recall curve (AUC-PR), which are commonlyused in the literature, to evaluate each method.

The accuracy measure lies in [0, 100] in percentage, and true positive rate(recall), positive predictive value (precision), and F1-score lie within a range of [0,1]. The higher index, the better the agreement is. In the other side, precision-recallis a useful measure of success of prediction when the classes are very imbalanced,and the precision-recall curve shows the trade-off between precision and recall fordifferent threshold. A high area under the curve represents both high recall andhigh precision. High scores illustrate that the predictor (or classifier) is returningaccurate results (high precision), as well as returning a majority of all positiveresults (high recall). Notice that, the Precision-Recall (PR) summarizes such acurve as the weighted mean of precisions achieved at each threshold, with theincrease in recall from the previous threshold used as the weight.

In the experiments, for all approaches, the parameters as well as the training andtesting sets are formed by k-fold cross validation in the ratio of 80% and 20% of thedata, respectively. For example, for the random forest algorithm two parameters aretuned: number of trees and minimum sample leaf size. Finally, the results reportedhereinafter are averaged after 10 repetitions of the corresponding algorithm.

5.3 Experimental results

In the context of event prediction, the accuracy, recall, precision, F1-score, as wellas the area under PR curve (AUC-PR), for the various tested methods, are reportedin the Table 1. Many papers in the literature have shown that the random forest areamong the most efficient methods to be considered, as far as heterogeneous multi-variate data are concerned [41,42]. Hence, we build our proposed event predictionalgorithm on the basis of a Random Forest (RF) classifier. To facilitate the big dataanalysis task, we do the data pre-processing, training, and ad event prediction byrunning spark jobs on Yarn. The results in bold correspond to the best assessmentvalues.

Table 1 shows the comparison of performances of a simple classifier versus thecase of doing a feature engineering before running the classifier, on historical dataof a running campaign. Here, in the context of feature selection, in addition to thefilter methods, i.e. standard chi-squared test (χ2) and standard mutual information(MI), we compare our proposed statistical model with two well-known wrapper

38


Table 1: Comparison of performances based on the different feature engineering methodsAccuracy Recall Percision F1-score AUC-PR

RF 99.61 0.050 0.013 0.022 0.005RF + Feature Eng. (χ2) 99.92 0.000 0.000 0.000 0.005RF + Feature Eng. (χ2-padj) 99.94 0.051 0.044 0.047 0.010RF + Feature Eng. (χ2-padj) 99.92 0.000 0.000 0.000 0.006RF + Feature Eng. (MI) 99.92 0.000 0.000 0.000 0.006RF + Feature Eng. (MIadj) 99.91 0.008 0.023 0.012 0.006RF + Feature Eng. (GA) 99.92 0.000 0.000 0.000 0.004RF + Feature Eng. (SS) 99.92 0.000 0.000 0.000 0.003

approaches, i.e. genetic algorithm (GA) and sequential search (SS). Notice that, infilter methods, we consider only top 20 features. Of course, one can simply find thebest number of features using a k-fold cross validation technique.

As has been pointed out in Table 1, while in feature selection using the standardstatistical approaches (i.e. χ2 and MI) and the wrapper methods (i.e. GA and SS),the random forest classifier can not provide any good results for accuracy, recall,precision, F1-score and the area under precision-recall curve, using our proposedstatistical measures (i.e. χ2-padj and MIadj), we generally outperform the results.Also, it is plain to see that the RF classifier with considering feature selection basedon the proposed χ2-padj has the best results, and outperforms significantly theprecision, recall, F1-score as well as the area under precision-recall curve. However,for this case, the soft penalized version of χ2-padj (i.e. χ2-padj) does not providevery good result, but still it is better than the standard χ2 and is comparable withthe standard mutual information measure. Furthermore, as demonstrated, usingthe proposed adjusted version of mutual information in feature selection process,provides better results rather than using the standard mutual information measure.Lastly, the worst result belongs to the case of using genetic algorithm (GA) andsequential search (SS) in the feature engineering process.

To verify our claim and consolidate the comparative results, we use a Wilcoxonsigned rank test, which is a non-parametric statistical hypothesis test to effectivelydetermine whether our proposed adjusted statistic measures are significantly out-perform the classifier (using the alternative quantities) or not. Tables 2 presents thetwo-sided p-value for the hypothesis test, while the results in bold indicate the sig-nificantly different ones. The p-value is the probability of observing a test statisticmore extreme than the observed value under the null hypothesis. Notice that, thenull hypothesis is strongly rejected when the p-values are lower than a pre-definedcriterion, almost always set to 0.05. It means that the differences between the twotested classifiers are significant and the uniform hypothesis is accepted as p-valuesare greater than 0.05. Based on the p-values, we can justify that using the proposedadjusted measures in feature selection, the classifier leads to significantly better re-

39


sults than the others. Note that the difference between the pairs of classifiers resultsfollows a symmetric distribution around zero and to be more precise, the reportedp-values are computed from all the individual results after some repetitions of thecorresponding algorithm.

Table 2: P-values: Wilcoxon testRF + Featrue Engineering

(χ2) (χ2-padj) (χ2-padj) (MI) (MIadj) (GA) (SS)RF 0.034 0.004 0.073 0.073 0.798 0.036 0.036RF+Feat. Eng. (χ2) 0.011 0.157 0.157 0.011 0.157 0.157RF+Feat. Eng. (χ2-padj) 0.011 0.011 0.011 0.011 0.011RF+Feat. Eng. (χ2-padj) 1.000 0.026 0.157 0.157RF+Feat. Eng. (MI) 0.026 0.157 0.157RF+Feat. Eng. (MIadj) 0.011 0.011RF+Feat. Eng. (GA) 0.157

As can be seen from Table 2, with regard to the p-values of Wilcoxon test,the random forest classifier with considering feature selection method using theproposed χ2-padj measure, brings a significant improvement compared to the othermethods. In addition, the proposed MIadj is almost performing significantly differentthan the other approaches.

Table 3: Feature selection: top-k extracted features using different statistical test (rank sorted)χ2 χ2-padj χ2-padj MI MIadj

os id id vintage mm uuid mm uuid publisher idcategory id os id contextual data contextual data site id

contextual data fold position page url page url site urlweek hour advertiser id user agent site id main pageconcept id conn speed id strategy id strategy id supply source id

overlapped b pxl watermark creative id site url exchange idweek hour part browser name site id publisher id category id

site id browser language zip code id creative id creative idnorm rr week hour ip address user agent channel typepage url width site url main page concept id

strategy id cross device flag overlapped b pxl zip code id campaign idapp id browser version aib recencies concept id browser name

creative id week hour part main page campaign id browser versionbrowser version device id city code id city code id model name

zip code id app id concept id supply source id homebiz type idform factor city campaign id exchange id os id

supply source id week part hour part channel type category id os nameprebid viewability category id homebiz type id model name brand name

brand name norm rr os id norm rr form factorexchange id dma id form factor channel type norm rr

40


To become closely acquainted with selected features, Table 3 shows the toptwenty selected features using different feature selection models (i.e. statistical test)which we consider in our experimental study. In the term of time consumption, allthe proposed adjusted statistical test (i.e. χ2-padj, χ2-padj, MIadj ), have more orless the same time complexity with the standard ones (i.e. χ2, MI), without anysignificant difference. Furthermore, Table 4 illustrates the best subset of featuresusing different wrapper methods (i.e. GA and SS). Provided results using featureengineering with GA are obtained by considering the population size equals to 100,crossover and mutation probability equals to 0.2, and after 10 iteratons.

Table 4: Feature selection: best subset of features using different wrapper methodsMethod Selected featuresGenetic algorithm publisher id, category id, day of week,week hour part, mm creative size

aib recencies

Sequential search publisher id, category id, advertiser id, dma id, aib recencies, day ofweek, aib pixel ids

Lastly, to have a closer look at the ability of the proposed statistical quantities,Figure 1 shows the comparison of the area under PR curve for different feature en-gineering methods on the basis of a random forest classifier. Note that, we considerthe area under PR curve, since it is more reliable rather than ROC curve, becauseof the unbalanced nature of data. The higher values are the better performance.

Fig. 1: Comparison of AUC-PR curve based of different feature engineering methods

41


6 Conclusion

This paper proposes a novel feature engineering framework for ad event predictionin digital advertising platform which has been applied on big data. In this case,we introduce two new statistical measures which can be used for feature selection:i) the adjusted chi-squared test and ii) the adjusted mutual information. Whilethe standard statistical models (i.e. chi-squared test and mutual information) havesome problems such as difficulty of interpretation and sensitivity to sample size,the proposed statistical ones overcome these difficulties. Furthermore, in featureencoding step before training the model, we used a practical and reliable pipeline toencode very large (categorical) data. The efficiency is analyzed on a large historicaldata from a running campaign. The results illustrate the benefits of the proposedadjusted chi-squared test and the adjusted mutual information, which outperformthe alternative ones with respect to different metrics, i.e. accuracy, precision, recall,F1-score and the area under precision-recall curve. Lastly, to determine that theproposed approaches is significantly better than the other described methods, aWilcoxon signed-rank test is used. While, in this paper, we focus on the singlefeatures, the idea of combined features can be a proper proposition to gain betterresult. Hence, investigate the combination of some features to generate more usefulfeatures, to further increase the prediction performance of the imbalanced case,which is typical in the context of digital advertising, can be an interesting suggestionfor future works.

References

1. H. Cheng and E. Cantu-Paz, “Personalized click prediction in sponsored search,” in Proceedingsof the Third ACM International Conference on Web Search and Data Mining, WSDM ’10,(New York, NY, USA), pp. 351–360, ACM, 2010.

2. Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T.-Y. Liu, “Sequentialclick prediction for sponsored search with recurrent neural networks,” in Proceedings of theTwenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1369–1375, AAAI Press, 2014.

3. O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction fordisplay advertising,” ACM Trans. Intell. Syst. Technol., vol. 5, pp. 61:1–61:34, Dec. 2014.

4. A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov, “A neural click model for web search,”in Proceedings of the 25th International Conference on World Wide Web, pp. 531–541, 2016.

5. H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Cor-rado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide &deep learning for recommender systems,” in Proceedings of the 1st Workshop on Deep Learningfor Recommender Systems, DLRS 2016, (New York, NY, USA), pp. 7–10, ACM, 2016.

6. C. Li, Y. Lu, Q. Mei, D. Wang, and S. Pandey, “Click-through prediction for advertisingin twitter timeline,” in Proceedings of the 21th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, KDD ’15, pp. 1959–1968, ACM, 2015.

7. J. Chen, B. Sun, H. Li, H. Lu, and X.-S. Hua, “Deep ctr prediction in display advertising,” inProceedings of the 24th ACM International Conference on Multimedia, MM ’16, (New York,NY, USA), pp. 811–820, ACM, 2016.

42


8. M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks: Estimating the click-through rate for new ads,” in Proceedings of the 16th International Conference on WorldWide Web, WWW ’07, (New York, NY, USA), pp. 521–530, ACM, 2007.

9. D. Agarwal, B. C. Chen, and P. Elango, “Spatio-temporal models for estimating click-throughrate,” in WWW ’09: Proceedings of the 18th international conference on World wide web, (NewYork, NY, USA), pp. 21–30, ACM, 2009.

10. T. Graepel, J. Q. n. Candela, T. Borchert, and R. Herbrich, “Web-scale bayesian click-throughrate prediction for sponsored search advertising in microsoft’s bing search engine,” in Proceed-ings of the 27th International Conference on International Conference on Machine Learning,ICML’10, (USA), pp. 13–20, Omnipress, 2010.

11. O. Chapelle, “Modeling delayed feedback in display advertising,” in Proceedings of the 20thACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14,(New York, NY, USA), pp. 1097–1105, ACM, 2014.

12. M. J. Effendi and S. A. Ali, “Click through rate prediction for contextual advertisment usinglinear regression,” CoRR, vol. abs/1701.08744, 2017.

13. H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips,E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos,and J. Kubica, “Ad click prediction: a view from the trenches,” in Proceedings of the 19th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.

14. J. H. Friedman, “Stochastic gradient boosting,” Comput. Stat. Data Anal., vol. 38, pp. 367–378, Feb. 2002.

15. I. Trofimov, A. Kornetova, and V. Topinskiy, “Using boosted trees for click-through rateprediction for sponsored search,” in Proceedings of the Sixth International Workshop on DataMining for Online Advertising and Internet Economy, ADKDD ’12, (New York, NY, USA),pp. 2:1–2:6, ACM, 2012.

16. C. J. C. Burges, “From RankNet to LambdaRank to LambdaMART: An overview,” tech. rep.,Microsoft Research, 2010.

17. K. S. Dave and V. Varma, “Learning the click-through rate for rare/new ads from similar ads,”in Proceedings of the 33rd International ACM SIGIR Conference on Research and Developmentin Information Retrieval, SIGIR ’10, (New York, NY, USA), pp. 897–898, ACM, 2010.

18. P. Domingos and M. Pazzani, “On the optimality of the simple bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2, pp. 103–130, 1997.

19. R. Entezari-Maleki, A. Rezaei, and B. Minaei-Bidgoli, “Comparison of classification methodsbased on the type of attributes and sample size.,” JCIT, vol. 4, no. 3, pp. 94–102, 2009.

20. M. Kukreja, S. A. Johnston, and P. Stafford, “Comparative study of classification algorithmsfor immunosignaturing data.,” BMC Bioinformatics, vol. 13, p. 139, 2012.

21. A. C. Lorena, L. F. Jacintho, M. F. Siqueira, R. D. Giovanni, L. G. Lohmann, A. C. de Car-valho, and M. Yamamoto, “Comparing machine learning classifiers in potential distributionmodelling,” Expert Systems with Applications, vol. 38, no. 5, pp. 5268 – 5275, 2011.

22. G. Zhou, C. Song, X. Zhu, X. Ma, Y. Yan, X. Dai, H. Zhu, J. Jin, H. Li, and K. Gai, “Deepinterest network for click-through rate prediction,” CoRR, vol. abs/1706.06978, 2017.

23. Q. Liu, F. Yu, S. Wu, and L. Wang, “A convolutional click prediction model,” in Proceedingsof the 24th ACM International on Conference on Information and Knowledge Management,CIKM ’15, (New York, NY, USA), pp. 1743–1746, ACM, 2015.

24. W. Zhang, T. Du, and J. Wang, “Deep learning over multi-field categorical data: A case studyon user response prediction,” in ECIR, 2016.

25. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.26. A. Liaw and M. Wiener, “Classification and regression by random forest,” R News, vol. 2,

no. 3, pp. 18–22, 2002.27. S. Soheily-Khah, P. Marteau, and N. Bechet, “Intrusion detection in network systems through

hybrid supervised and unsupervised machine learning process: A case study on the iscxdataset,” in 2018 1st International Conference on Data Intelligence and Security (ICDIS),pp. 219–226, April 2018.

43


28. K. Dembczynski, W. Kotlowski, and D. Weiss, “Predicting ads âĂŹ click-through rate withdecision rules,” in WWW2008, Beijing, China, 2008.

29. I. Trofimov, A. Kornetova, and V. Topinskiy, “Using boosted trees for click-through rateprediction for sponsored search,” in Proceedings of the Sixth International Workshop on DataMining for Online Advertising and Internet Economy, ADKDD ’12, (New York, NY, USA),pp. 2:1–2:6, ACM, 2012.

30. L. Shi and B. Li, “Predict the click-through rate and average cost per click for keywords usingmachine learning methodologies,” in Proceedings of the International Conference on IndustrialEngineering and Operations ManagementDetroit, Michigan, USA, 2016.

31. Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, “Deep crossing: Web-scalemodeling without manually crafted combinatorial features,” in Proceedings of the 22Nd ACMSIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, (NewYork, NY, USA), pp. 255–262, ACM, 2016.

32. S. Soheily-Khah and Y. Wu, “Ensemble learning using frequent itemset mining for anomalydetection,” in International Conference on Artificial Intelligence, Soft Computing and Appli-cations (AIAA 2018), 2018.

33. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Boston,MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1st ed., 1989.

34. M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA, USA: MIT Press, 1998.35. M. L. McHugh, “The chi-square test of independence,” B. M., vol. 23, pp. 143–149, 2013.36. D. Bergh, “Sample size and chi-squared test of fit: A comparison between a random sample

approach and a chi-square value adjustment method using swedish adolescent data.,” In PacificRim Objective Measurement Symposium Conference Proceedings, pp. 197–211, 2015.

37. T. M. Cover and J. A. Thomas, Elements of Information Theory 2nd Edition (Wiley Seriesin Telecommunications and Signal Processing). Wiley-Interscience, July 2006.

38. S. Kullback, Information Theory And Statistics. Dover Pubns, 1997.39. S. Cang and H. Yu, “Mutual information based input feature selection for classification prob-

lems,” Decision Support Systems, vol. 54, no. 1, pp. 691 – 698, 2012.40. J. R. Vergara and P. A. Estevez, “A review of feature selection methods based on mutual

information,” Neural Computing and Applications, vol. 24, pp. 175–186, Jan. 2014.41. M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of

classifiers to solve real world classification problems?,” J. Mach. Learn. Res., vol. 15, pp. 3133–3181, Jan. 2014.

42. M. Wainberg, B. Alipanahi, and B. J. Frey, “Are random forests truly the best classifiers?,”J. Mach. Learn. Res., vol. 17, pp. 3837–3841, Jan. 2016.

Authors

Saeid SOHEILY KHAH graduated in (software) computer engineering, and hereceived master degree in artificial intelligence and robotics. Then he received hissecond master degree in information analysis and management from Skarbek uni-versity, Warsaw, Poland, in 2013. In May 2013, he joined to the LIG (Laboratoired’Informatique de Grenoble) at Universite Grenoble Alpes as a doctoral researcher.He successfully defended his dissertation and got his Ph.D in Oct 2016. Instantly,he joined to the IRISA at Universite Bretagne Sud as a postdoctoral researcher.

44


Lastly, in Oct 2017, he joined Skylads (artificial intelligence and big data analyticscompany) as a research scientist. His research interests are machine learning, datamining, cyber security system, digital advertising and artificial intelligence.

Yiming WU received his B.S.E.E. degree from the Northwestern PolytechnicalUniversity, Xi’an in China. He received his Ph.D. degree in Electrical Engineeringfrom University of Technology of Belfort-Montbeliard, Belfort, France, 2016. Hejoined Skylads as a data scientist in 2018, and his research has addressed topics onmachine learning, artificial intelligence and digital advertising.

45

a novel feature engineering framework in digital …of advertising display, and to maximize the...

Documents