a benchmark of prevalent feature selection algorithms on a ...1212104/fulltext01.pdf · a benchmark...
TRANSCRIPT
IN DEGREE PROJECT MEDICAL ENGINEERING,SECOND CYCLE, 30 CREDITS
, STOCKHOLM SWEDEN 2018
A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
ANETTE KNIBERG
DAVID NOKTO
KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH
A Benchmark of Prevalent Feature
Selection Algorithms on a Diverse
Set of Classification Problems
ANETTE KNIBERG
DAVID NOKTO
This master thesis was performed in collaboration with Nordron AB
Supervisor: Torbjörn Nordling (CEO and Founder)
Degree project, Second cycle, 30 credits, 2018
KTH Royal Institute of Technology
Medical Engineering
School of Engineering Sciences in Chemistry,
Biotechnology and Health
SE -100 44 Stockholm, Sweden
i
Abstract Feature selection is the process of automatically selecting important features from data. It is an
essential part of machine learning, artificial intelligence, data mining, and modelling in general.
There are many feature selection algorithms available and the appropriate choice can be difficult.
The aim of this thesis was to compare feature selection algorithms in order to provide an
experimental basis for which algorithm to choose. The first phase involved assessing which
algorithms are most common in the scientific community, through a systematic literature study in
the two largest reference databases: Scopus and Web of Science. The second phase involved
constructing and implementing a benchmark pipeline to compare 31 algorithms’ performance on 50
data sets.
The selected features were used to construct classification models and their predictive performances
were compared, as well as the runtime of the selection process. The results show a small overall
superiority of embedded type algorithms, especially types that involve Decision Trees. However,
there is no algorithm that is significantly superior in every case. The pipeline and data from the
experiments can be used by practitioners in determining which algorithms to apply to their
respective problems.
Keywords
Feature selection, variable selection, attribute selection, machine learning, data mining, benchmark,
classification
iii
Sammanfattning Variabelselektion är en process där relevanta variabler automatiskt selekteras i data. Det är en
essentiell del av maskininlärning, artificiell intelligens, datautvinning och modellering i allmänhet.
Den stora mängden variabelselektionsalgoritmer kan göra det svårt att avgöra vilken algoritm som
ska användas. Målet med detta examensarbete är att jämföra variabelselektionsalgoritmer för att ge
en experimentell bas för valet av algoritm. I första fasen avgjordes vilka algoritmer som är mest
förekommande i vetenskapen, via en systematisk litteraturstudie i de två största
referensdatabaserna: Scopus och Web of Science. Den andra fasen bestod av att konstruera och
implementera en experimentell mjukvara för att jämföra algoritmernas prestanda på 50 data set. De
valda variablerna användes för att konstruera klassificeringsmodeller vars prediktiva prestanda,
samt selektionsprocessens körningstid, jämfördes. Resultatet visar att inbäddade algoritmer i viss
grad är överlägsna, framför allt typer som bygger på beslutsträd. Det finns dock ingen algoritm som
är signifikant överlägsen i varje sammanhang. Programmet och datan från experimenten kan
användas av utövare för att avgöra vilken algoritm som bör appliceras på deras respektive problem.
Nyckelord
Variabelselektion, maskininlärning, datautvinning, klassificering
v
Acknowledgements We thank our supervisor Torbjörn Nordling at Nordron AB for the opportunity, and the support we
novices have received throughout this work. This thesis has given us a glimpse into a field that is
interesting both from an engineering and philosophical point-of-view. We are also grateful to our
families for boundless patience and encouragement.
Anette Kniberg
David Nokto
vii
Table of Contents
Abstract ............................................................................................... i
Sammanfattning ................................................................................ iii
List of Figures and Tables ................................................................ ix
1 Introduction ................................................................................... 1
2 Theory ........................................................................................... 5
2.1.1 Ethics ...............................................................................................................................6
2.2.1 Wrapper ...........................................................................................................................9
2.2.2 Embedded .................................................................................................................... 10
2.2.3 Filter .............................................................................................................................. 10
2.2.4 Univariate or multivariate selection .......................................................................... 10
2.3.1 Confusion matrix ......................................................................................................... 11
2.3.2 Accuracy ...................................................................................................................... 12
2.3.3 F-measure .................................................................................................................... 12
2.3.4 Runtime ........................................................................................................................ 13
2.6.1 Parametric vs nonparametric statistical testing ...................................................... 15
2.6.2 Multiple comparisons across multiple datasets ...................................................... 16
3 Method ......................................................................................... 17
3.1.1 Choosing feature selection algorithms ..................................................................... 17
3.1.2 Choosing predictors ................................................................................................... 18
3.1.3 Choosing and pre-processing datasets .................................................................... 19
3.2.1 Implementation ............................................................................................................ 19
3.2.2 Hyperparameter Settings ............................................................................................ 20
4 Results ........................................................................................ 23
5 Discussion .................................................................................. 31
5.1.1 Runtime ......................................................................................................................... 31
5.1.2 Predictive performance ............................................................................................... 32
5.2.1 Related research: other FSA benchmark studies .................................................... 32
5.2.2 The custom benchmark pipeline ................................................................................ 33
5.2.3 The choice of nonparametric statistical tests .......................................................... 33
5.2.4 The choice of subset size and hyperparameters ..................................................... 34
6 Conclusions ................................................................................ 35
Bibliography ..................................................................................... 37
Appendix A FS definitions ............................................................... 43
Appendix B Search String and Terminology .................................. 45
Appendix C Algorithm search results ............................................ 47
Appendix D Experimental data sets ............................................... 49
Appendix E Software tools and pseudocode ................................. 51
Appendix F Results from experiments ........................................... 53
ix
List of Figures and Tables
Figure 1: The increase of size and dimensionality of datasets. ....................................................1
Figure 2: Classification vs. regression problems. ..........................................................................5
Figure 3: Curse of dimensionality. ...................................................................................................8
Figure 4: Process of Sequential Forward Selection (SFS).............................................................9
Figure 5: Process of Sequential Backward Selection (SBS). ..................................................... 10
Figure 6: Basic experimental process. ......................................................................................... 11
Figure 7: Over- and underfitted models. ....................................................................................... 14
Figure 8: Overview for one run in the pipeline. ............................................................................ 20
Figure 9: Tukey-boxplots of estimated accuracies and F-measures ........................................ 24
Figure 10: Comparison of estimated accuracies and F-measures ............................................ 25
Figure 11: Tukey-boxplots of estimated rescaled runtimes from slowest to fastest (0-1). ..... 27
Figure 12: Comparison of estimated and rescaled runtimes ..................................................... 28
Figure 13: Discerning FSAs with best overall performance. ...................................................... 29
Figure E2.1: Pseudocode for a part of the program………...……………………………………….52
Table 1: Simple data set with one feature and one output label ...................................................6
Table 2: A fictive diabetes diagnostic data set ...............................................................................7
Table 3: Summary of FSA types .................................................................................................... 11
Table 4: A 3x3-dimensional confusion matrix .............................................................................. 12
Table 5: Table 4 after micro averaging for B-class ...................................................................... 12
Table 6: Cancer detection .............................................................................................................. 12
Table 7: Examples of runtime rescaling. ...................................................................................... 13
Table 8: A four - fold cross-validation ........................................................................................... 14
Table 9: FSAs used in the experiment .......................................................................................... 17
Table 10: Predictors used in the experiment. .............................................................................. 19
Table 11: Hyperparameter settings for each FSA ........................................................................ 21
Table 12: Hyperparameters for each predictor ............................................................................ 21
Table 13: Best performing group of FSAs regarding predictive performance and runtime ... 23
Table 14: Comparisons of various feature selection benchmarks ............................................ 33
Table C1: Algorithms with the top 30 number of publications ……………………………………47
Table D1: Data sets used in the experiment .…………………………………………………………49
Table F1: Summary statistics of overall performance of each FSA across datasets………….53
1
1 Introduction
This chapter explains the purpose of the project by introducing the topic and briefly explaining its importance in a broad context and the current state, including problems and limitations. The problems addressed by the
project are specified, along with aims and objectives, to which approaches and delimitations are presented.
In recent years, the size of collected data sets are increasing, both regarding the number of
observations and number of dimensions (Fig. 1). More data means potentially greater predicting
power, discoveries and understanding of phenomenon. However, large amounts of raw data by itself
is not particularly useful for analysis and must be processed to extract patterns or insight (1). One
such pre-processing step is to determine which features reflect the underlying structure of the data.
Figure 1: The increase of size and dimensionality of datasets.The data points were collected from the UCI Data Set
Repository (2).
Feature selection is the process of automatically selecting a subset of features from the original data
set. Its importance for machine learning, artificial intelligence, statistics, data mining and model
building is undisputed and the use will increase as data sets grow (3–5). Feature selection removes
irrelevant, redundant and noisy data, which reduces computation costs, improves predictive
performance and model interpretability (6).
There are many feature selection algorithms (FSAs) available but knowing which one to use for a
given scenario is not an easily answered question (5). Several studies have concluded that different
FSAs vary greatly in performance when applied on the same data (7). Others have claimed that all
algorithmic performance is equal when averaged on every possible type of problem (8). Some have
found that incorrectly applied FSAs can reduce predictive performance (1,9).
Finding the optimal feature selection algorithm remains an open research question and many
benchmark studies have been undertaken. The problems with many existing benchmarks are
1E+00
1E+02
1E+04
1E+06
1E+08
1985 1990 1995 2000 2005 2010 2015 2020
Num
ber
of
insta
nces
Year
1E+00
1E+02
1E+04
1E+06
1E+08
1985 1990 1995 2000 2005 2010 2015 2020
Num
ber
of
featu
res
Year
2
● No or poor motivation for the choice of algorithms included.
● Small number of algorithms and data sets included.
It therefore remains unclear for inexperienced practitioners from different fields which FSAs to use.
Aim
This report is intended for practitioners who are not data scientists, statisticians or machine
learning experts. The aim is to benchmark common feature selection algorithms on a large and
diverse set of classification problems. This diversity might help readers find results on data sets that
are similar to their own. Hopefully it will help users in choosing an algorithm and serve as an
introduction to the field. We also hope to entice the more experienced reader to try out alternative
algorithms that are not routinely used within their field. Additionally, we will construct an
experimental pipeline for benchmarking FSAs. The resulting software will be available for Nordron
AB to extend and develop further into an interactive benchmarking tool.
Objectives
The objectives consist of answering the following questions:
1. Which feature selection algorithms are the most prevalent in literature?
2. Taking the previous objective into account, which feature selection algorithm(s) has the
overall best performance?
Approach
The first objective was addressed quantitively, by searching certain scientific databases for how
publications existed for each FSA. The second objective was solved by constructing a pipeline
capable of performing the experiment and analysing the results. By testing 31 feature selection
algorithms on 50 data sets using three performance metrics, we acquired a large amount of
information that can be used for comparison.
Delimitations
● The number of selected features was set to 50% of the original features for all algorithms.
● Feature extraction techniques were not included.
● Performance in this work is evaluated in terms of accuracy, F-measure and runtime.
● No algorithms were written from scratch, only pre-made software packages were used.
● Due to the number of algorithms involved and the intended reader, this work focuses on the
application, not on the inner workings of the algorithms.
● The quality (feature engineering) of the data sets were not questioned.
3
Thesis overview
The remaining chapters are structured as follows:
Chapter 2. Presents the theoretical background related to the thesis. It explains the concepts of
machine learning, feature selection, how to evaluate algorithmic performance and statistical
analysis.
Chapter 3. Explains the literature study, experimental setup.
Chapter 4. The results are presented in terms of predictive performance and runtime.
Chapter 5. The results are discussed.
Chapter 6. The conclusions are presented. It further brings up recommendations for practitioners
and suggestions for future research.
Appendix A-F. Contains relevant but not essential information such as raw data from the
experiments, search queries, definitions and software package information.
4
5
2 Theory
This chapter explains the theory that is relevant for understanding the thesis. It starts by describing machine
learning in a broader sense and works towards more detailed subjects. It explains the concepts: machine
learning, feature selection, performance evaluation, cross-validation, hyperparameter optimisation and
strategies for analysis.
Machine learning
Machines learning is a scientific discipline concerned with algorithms that automate model
building, in order to make data-driven predictions or analysis. It borrows from several other fields
such as psychology, genetics, neuroscience and statistics (6). The applications are many and varied,
to name but a few: medical diagnostics (10), image recognition (11), text categorisation (12), DNA
analysis (13), recommendation systems (14), fraud detection (15), social media news feeds (16),
search engines (17) and self-driving vehicles (18).
A central concept of human and machine learning is called the Classification Problem. It can be
seen as a variant of the famous philosophical Problem of Induction (19), which questions how one
can generalise from seen examples to unseen future observations. In classification, this means
determining what category a new observation belongs to (Fig. 2.a). Knowing the class (or label) of
an object means that one can foresee its properties and act accordingly. It is an important technique
since it solves many practical problems such as deciding if an email is spam by reading the title or
determining the gender of a person by looking at an image. The classes in these cases are categorical
or discrete. If the class values are continuous it is a regression problem (Fig. 2.b). An example is to
determine the price of a used car by looking at mileage, age etc.
Figure 2: Classification vs. regression problems. Classification problem (A): The two shapes belong to different classes.
The line is the decision boundary made by the trained model and determines how the model classifies new examples.
Regression problem (B): The line represents a model that has been fitted to data.
(A) (B)
6
There are several ways to categorise learning types. If a problem has known class labels, it is called a
supervised machine learning problem (also called function approximation). A problem without
known class labels is an unsupervised problem. Some structure must then be derived from the
relationships of the data points. Semi-supervised problems have both labelled and unlabelled data.
The labelled data is used as additional information to improve unsupervised learning. All three
types of learning can be applied to both classification and regression problems. This thesis concerns
supervised learning algorithms applied on classification problems.
There are many terms used to refer to machine learning algorithms and the related concepts. In this
report learning algorithms are referred to as predictors, while the term algorithm refers to both
predictors and feature selection algorithms. The terms class and label are used to denote categories
of data instances.
Each type of predictor has a different strategy for how it builds models depending on the underlying
mathematics. The process of building models by loading data into the predictor is called fitting or
training. It can be regarded as the search for a function whose input is an instance of the data and
the output is a label. Given the data set shown in Table 1, a predictor would likely produce a simple
function where the output is the input squared, 𝑦(𝑥) = 𝑥2. When this function is given a new input
such as 10, it will make the prediction that the label is 100.
Table 1: Simple data set with one feature and one output label.
instances 1 2 3 4 5 6 7
labels 1 4 9 16 25 36 49
Both the predictive performance and training time of the model depends on several factors, such as
the data quality, size and dimensionality, in addition to the underlying mathematics and
mechanisms of the predictor. Having access to bigger data sets means that the predictor gets more
training examples which generates - but does not guarantee - more accurate models. It also
increases the training time since more instances must be processed. Predictors with simple
underlying mathematics are trained faster than complex ones since less computation is needed to
build the model.
2.1.1 Ethics
Machine learning is having a huge impact on the world and consequently raises a score of ethical
questions. Poorly designed implementations of machine learning not only inherit but can amplify
biases and prejudices. If a predictor is trained on a data set that contains biases it may project these
upon use (20–22). An example of this regards risk assessment algorithms used to predict future
criminal behaviour, of which one was shown to assign more false positives to certain groups of
people (23). Some systems decide what news articles are displayed in social media, potentially
misrepresenting and biasing public opinion and world awareness (24). Other situations raise the
question of culpability such as who to blame when a self-driving car collides (25). There is also the
question of artificial intelligence rendering a large portion of the human workforce redundant. Frey
and Osborne (2017) estimate that 47 % of US jobs have a high risk (>0.7) of becoming automatable
in the next few decades (26). Many machine learning systems are black boxes and thus unfair
decisions on subjective problems could happen unbeknownst to the user or even the designer. In
summary, this development puts a great weight on responsible data collection, design transparency
and accountability. How to design fair algorithms is an open research question (27,28).
7
Feature selection
“Make everything as simple as possible, but no simpler” - Albert Einstein (as paraphrased by
Roger Sessions in the New York Times, 8 January 1950)
The features of a data set are the semantic properties that describe the object or phenomenon of
interest. When a data set is represented as a table, the features are usually the columns and each
instance is a row. In supervised problems, there is also a column denoting class membership (Table
2). Consequently, every instance is made up by a feature vector, and a corresponding class label.
Table 2: A fictive diabetes diagnostic data set. Each instance represents a patient.
High insulin resistance
Weight (kg)
Family history of disease
Height (cm)
Shoe size (paris point)
Favourite movie
Label
yes 123 yes 172 40 “Minority Report”
positive
yes 81 no 190 47 “Ex Machina” negative
no 95 yes 163 37 “Her” negative
There are many possible definitions of feature selection. Different aspects of the process have
received varying emphasis depending on the researcher stating the definition. A proposed general
definition, and used for this thesis is:
“Feature selection is the process of including or/and excluding some features or data in modelling
in order to achieve some goal, typically a trade-off between minimising the cost of collecting the
data and fulfilling a performance objective on the model.” - Torbjörn Nordling (Assistant
Professor, Founder of Nordron AB, December 2016)
The difficulty in finding fully overlapping definitions could be because there are several types of
FSAs with varying objectives. A collection of definitions rewritten as mathematical optimisation
problems (Appendix A), resulted in three points that are central to the concept:
1. A chosen performance objective on the resulting model such as reduced training time or
improved predictive accuracy.
2. The reduced feature vector—the feature subset—should be as small as possible.
3. The label distribution of the subset should be as close as possible to the original label
distribution.
Features selection can thus be viewed as a pre-processing step applied on data before it is used to
train a predictor. FSAs can themselves use predictors, and some predictors perform internal feature
selection as part of the training process. In the examples in Table 2, a FSA would likely omit the
“Shoe size” and “Favourite movie” columns since they contain useless information for diabetes
diagnostics. The technique should not be confused with feature extraction, where features are
altered, such as reducing the number of features by combining them; feature selection omits
features without altering the remaining ones.
There is no guarantee that every feature is correlated with the label. A feature is relevant if its
removal reduces predictive performance. Irrelevant features have low or no correlation with the
label. Redundant features have a correlation with the label but their removal does not reduce
performance due to the presence of another feature; both being relevant on their own. An example
of this could be the same value given in two different units, represented as two different features.
(29)
8
There are several reasons for removing features from data sets. Features with low or no correlation
can lead to overfitting, meaning that the model is overly complex and describes noise and random
error rather than underlying causality. The consequence is reduced predictive performance since the
model does not generalise well on unseen data (6). Overfitting has been described as the most
important problem in machine learning (30). Other causes and remedies for overfitting are
mentioned in sections 2.4 Cross-validation and 2.5 Hyperparameter optimisation.
Another reason for using FS is higher learning efficiency. Since the data set is less complex the
learning process is faster. Additionally, a data set with more features requires more instances for the
predictor to find the best solution. This is due to the density of the instances decreasing as the
dimensionality increases, known as the Curse of Dimensionality (Fig. 3). Since the number of
instances often is fixed, reducing the number of features is tantamount to increasing the amount of
instances (6).
Figure 3: Curse of dimensionality. The three figures show the same data points plotted in one (A), two (B) and three (C)
dimensions. Even though it is possible to draw a cleaner decision boundary in three dimensions than in two, the predictive
performance will probably be worse due to increased data sparsity.
(A)
(B)
(C)
9
Feature selection also improves model interpretability. A predictor trained on a high-dimensional
data set will naturally produce a complex model, making it difficult to understand the modelled
phenomenon.
Attempts have been made to categorise FSAs depending on how a feature subset is chosen:
wrapper, embedded and filter. However, there is no established unified framework and there are
discrepancies between the definitions of these categories. The descriptions below were formulated
after analysis of eight publications that produced category definitions (1,4–7,31–33)
2.2.1 Wrapper
This category combines a search strategy with a predictor to select the best subset. By training with
different candidate subsets and comparing their performances, the subset that gives the best
performance is kept. An example: a data set consists of three features 𝐷 = {𝑥1, 𝑥2, 𝑥3}. Three subsets
𝑆 ⊂ 𝐷, 𝑆1 = {𝑥1, 𝑥2}, 𝑆2 = {𝑥1, 𝑥3}, 𝑆3 = {𝑥2, 𝑥3} are used separately to train the same type of
predictor, producing three models. The performance of the models is then compared on a chosen
performance measure such as accuracy. The subset that yields the best performing model is selected
as the final subset. Note that in this example only 42.8 % of all feature combinations are examined.
The only way to guarantee finding the best solution would require testing all seven feature
combinations with an exhaustive search (1).
An exhaustive search is only feasible if the number of features is small (1,29). A data set with 90
features, such as the Libras data set in this thesis, has 290 - 1 possible subsets. If 100 000 subsets can
be examined each second it would take more than 4∙1018 years to finish. Applying heuristic search
strategies enables finding adequate solutions in a shorter timeframe, without testing all possible
feature combinations. Some examples of heuristic search strategies are GRASP (34), Tabu Search
(35), Memetic Algorithm (36), Sequential forward selection (SFS) and Sequential backward
selection (SBS) (37).
The two strategies used in this work are SFS and SBS. The former works by starting with an empty
feature set and sequentially adds the feature that yields the best score until the desired subset size is
reached. The latter starts with the full feature vector and sequentially removes the feature that yields
the worst score. Both algorithms have the same complexity but generate different candidate subsets,
leading to different solutions as illustrated in Figures 4 and 5. In this thesis, the names of wrapper
algorithms are combinations of the internal predictor and search strategy, for example: AdaBoost
SBS, Decision Tree SFS and Perceptron SFS.
Figure 4: Process of Sequential Forward Selection (SFS). The objective function, for instance predictive accuracy, is
represented by J(∙). Starting from the top, features are added to the best performing subset. The best subset is found to be
the combination of x1, x2 and x3 (circled).
𝐽(𝑥1) = 0.45 𝐽(𝑥4) = 0.4 𝐽(𝑥3) = 0.5 𝑱(𝒙𝟐) = 𝟎. 𝟕
𝐽(𝑥2𝑥1) = 0.55 𝐽(𝑥2𝑥4) = 0.5 𝑱(𝒙𝟐𝒙𝟑) = 𝟎. 𝟖
𝑱(𝒙𝟐𝒙𝟑𝒙𝟏) = 𝟎. 𝟗
𝐽(𝑥2𝑥3𝑥4) = 0.4
𝐽(𝑥2𝑥3𝑥1𝑥4) = 0.7
10
Figure 5: Process of Sequential Backward Selection (SBS). The objective function, for instance predictive accuracy, is
represented by J(∙). Starting from the top, features are removed from the best performing subset. The best subset is found to
be the combination of features x1 and x3 (green circle). In this problem, a better subset is found using SBS, since SFS (Fig. 4)
never examines the combination of x1 and x3.
2.2.2 Embedded
Another way to use predictors for feature selection is to examine the model structure instead of
performance. This can be illustrated in the case when the trained predictor is a linear polynomial.
The model is represented by the function 𝑦, �̅� is the feature vector and 𝐶̅ is a vector of coefficients:
𝑦(�̅�) = 𝐶̅�̅� = 𝐶1𝑥1 + 𝐶2𝑥2 + 𝐶3𝑥3 + 𝐶4𝑥4
These coefficients can be viewed as weights that the predictor has given each feature, which indicate
their importance in the generated model. By choosing some threshold value of 𝐶̅ or length of �̅�,
unimportant features are omitted. If for example it is decided that half the features shall be omitted
and 𝐶3 > 𝐶2 > 𝐶1 > 𝐶4 the resulting subset will be 𝑆 = {𝑥3, 𝑥2}.
This type of embedded method, where the learning algorithm is trained only once, will be referred
to as standard embedded. Another type is Recursive Feature Elimination (RFE) (38) which checks
the feature weights and iteratively re-trains the predictor after removing the least important feature
with each iteration. In this thesis, the names of the embedded methods are the names of the internal
predictors used, with the suffix ‘RFE’ added on the versions that use recursive feature elimination.
Some examples are Random Forest, Random Forest RFE, Perceptron and LASSO.
2.2.3 Filter
Filter methods assess feature importance without the use of predictors by examining the general
characteristics of the data set. If for example a feature value barely varies across all instances, it is
ineffective for differentiating between labels. Therefore, calculating the 𝜒2 or F-test statistic for a
feature column are ways to estimate correlation with the label. Other filter methods, such as
Minimum Redundancy Maximum Relevance (mRMR) and Correlation-based Feature Selection
(CFS), also try to eliminate redundant features by calculating feature-to-feature correlation, thus
further improving the data set quality.
2.2.4 Univariate or multivariate selection
Filter and embedded methods rank features according to their individual importance while wrapper
methods choose entire subsets at a time (Table 3). Theoretically, the complexity of univariate
methods should therefore scale linearly with the number of features (39) and thus be faster than
wrappers, especially when dealing with high-dimensional data sets (40). Wrappers, on the other
𝐽(𝑥1𝑥2) = 0.55 𝐽(𝑥2𝑥3) = 0.8 𝑱(𝒙𝟏𝒙𝟑) = 𝟎. 𝟗𝟓
𝑱(𝒙𝟏𝒙𝟐𝒙𝟑𝒙𝟒) = 𝟎. 𝟕
𝐽(𝑥1𝑥2𝑥4) = 0.5
𝑱(𝒙𝟏𝒙𝟐𝒙𝟑) = 𝟎. 𝟗
𝐽(𝑥2𝑥3𝑥4) = 0.4
𝐽(𝑥1𝑥3𝑥4) = 0.3
𝐽(𝑥1) = 0.45
𝐽(𝑥3) = 0.5
11
hand, directly test the predictive performance of entire feature subsets. This excels in situations
where a combined set of features is optimal, regardless of their individual importance.
Table 3: Summary of FSA types.
Involves internal predictors Basis of selection Uni- or multivariate
Wrapper Yes Model performance Multivariate
Embedded Yes Model structure Univariate
Filter No Statistical measures Univariate
Performance evaluation
The performance of a predictor depends on the quality of the data used for training. Since FS is a
pre-processing step applied on the data, the performance of the predictor implicitly represents the
performance of the FSA (Fig 6). The general process is described in the following steps:
Figure 6: Basic experimental process. The last step is used for evaluation.
1. The data set is split into two sets, one for training and one for testing.
2. The training set is run through the FSA which produces a subset with less features.
3. The subset is used to train an external predictor.
4. The test set (without its labels) is inputted into the predictor which produces predictions, in
other words, guesses the label of each instance.
5. The predictions are compared to the true labels in order to measure predictive performance.
This approach has some problems that need to be resolved. Different predictors have different
biases and might favour certain FSAs. It is therefore prudent to choose several predictors with
varying underlying mathematics. It is also wise to test performance with different metrics. Firstly, a
predictor may perform optimally on one metric and suboptimally on another (41). Secondly, the
performance metrics in turn have various strengths and weaknesses. Thirdly, practitioners in
different scientific fields prefer different metrics (and predictors) (41). The measures used in this
experiment are accuracy, F-measure and runtime. In order to understand the former two, it is
necessary to understand confusion matrices.
2.3.1 Confusion matrix
Several performance metrics for predictions are derived from the confusion matrix: a n×n-
dimensional matrix with n being the number of classes in the dataset (Table 4). All the predictions
on the diagonal are correct. A n×n-dimensional can be reduced to a 2×2 matrix by applying micro
averaging (Table 5), enabling calculation of other compressed performances metrics, such as
accuracy and F-measure. True positives and true negatives are the number of correctly predicted
positive and negative cases respectively. False positives and false negatives are the number of
falsely predicted positive and negative cases and are also known as Type I and Type II errors
respectively.
12
Table 4: A 3x3-dimensional confusion matrix.There are for example 17 instances of A but seven of these are falsely
classified as B and C i.e., false negatives.
Actual Class
A B C
Predicted Class
A 10 3 1
B 5 10 9
C 2 3 10
Table 5: Table 4 after micro averaging for B-class.
Actual Class
B Not B
Predicted Class B 10 14
Not B 6 23
2.3.2 Accuracy
The degree of which the predictions match the reality that is being modelled. It is defined as the
number of true positives (TP) and true negatives (TN) divided by the total amount of predictions,
which includes false positives and negatives (FP and FN):
𝐴𝐶𝐶 = 𝑇𝑃 + 𝑇𝑁
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
Accuracy can be misleading when a data set has an imbalanced class distribution. Consider a cancer
data set with 90500 negative cases and 1000 positive cases. The predictor in Table 6 would get a 99
% accuracy even though it missed half of the cancer cases. Since many real-world data sets have
imbalanced class distributions (6) accuracy alone is insufficient as a performance metric.
Table 6: Cancer detection.
Actual Class
Positive Negative
Predicted Class
Positive 500 500
Negative 500 90 000
2.3.3 F-measure
This is a mean between precision and recall. Precision (also called positive predictive value) is the
portion of predicted positives that are actually positive. Low precision happens when few of the
positive predictions are true. Recall (also called true positive rate and sensitivity) is the portion of
true positives that are predicted positive. Low recall happens when few of the true positive cases are
predicted at all. An algorithm with a good F-measure has both high precision and recall since it is
heavily penalised if either has a small value. F-measure is therefore better than accuracy when
dealing with imbalanced data sets. In the case with Table 6, the algorithm would only get a F-
measure of 50 %. However, it does not consider the number of true negatives which is a weakness
13
when the negative class label is interesting. Precision (P), recall (R) and F-measure (F) are
calculated as follows:
𝑃 =𝑇𝑃
𝑇𝑃 + 𝐹𝑃
𝑅 =𝑇𝑃
𝑇𝑃 + 𝐹𝑁
𝐹 = 2𝑃 ⋅ 𝑅
𝑃 + 𝑅
2.3.4 Runtime
The time it takes for a FSA to produce a subset. Speed is important when dealing with large, high-
dimensional data sets. It also enables scrupulous calibrations and reruns which in turn can improve
predictive performance.
In contrast to comparing FSAs runtime using only one data set, comparisons across multiple data
sets requires further treatment. Calculating the mean of the runtimes for a FSA across data sets is
not prudent since large data sets would result in misleading outliers. It would also make it difficult
to visualise and compare all runtimes. This problem is addressed by rescaling the runtimes for each
FSA and dataset. 𝑥′ is the rescaled runtime, 𝑥 is the initial runtime, 𝑥𝑚𝑎𝑥 and 𝑥𝑚𝑖𝑛 are the largest and
smallest runtimes in the set respectively (42). The rescaled values range from 0 to 1 with 0 being the
slowest and 1 being the fastest in the set. An example is shown in Table 7.
𝑥′ = 1 −𝑥 − 𝑥𝑚𝑖𝑛
𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛
Table 7: Examples of runtime rescaling.
Algorithm Runtime Rescaled runtime
Data Set 1 Data Set 2 Data Set 1 Data Set 2
FSA 1 15 300 0.81 1
FSA 2 45 8600 0.11 0.14
FSA 3 5 450 1 0.98
FSA 4 60 10 000 0 0
Cross-validation
Training and testing a predictor on the same data can lead to overfitted models since the instances
used for training and testing are the same. To know how well an algorithm performs on unknown
data it is therefore important to test the performance with unseen instances. A number of instances
are chosen to be the training set and the rest is the testing set. This is known as a train/test split and
enables a more realistic estimate of the predictive performance. A problem however, is that the
performance highly depends on how the instances are distributed between the two sets; they may
for example have disparate class distributions. A way to bypass this is to reuse the data several times
with different equally sized folds, train the predictor separately with each training fold, test the
performance with each test fold and average the measured performances. In Table 8 each partition
is used three times for training and once for testing. The three training sets in the training fold are
used in union to train the predictor in each run. This is known as cross-validation (CV) and is a
better estimate of performance on unknown data. Another advantage of CV is a more economic use
of data since every instance is recycled.
14
Table 8: A four - fold cross-validation.
Train set Test set
First fold 1 2 3 4
Second fold 2 3 4 1
Third fold 3 4 1 2
Fourth fold 4 1 2 3
The type of CV applied in this experiment is called Stratified K-fold where the class distribution of
the entire data set is approximately represented in each fold. This is done to reduce the variability of
the predictor performance between folds. It has been proven to be superior to regular CV in most
cases (29).
Hyperparameter optimisation
Hyperparameters are settings that need to be specified for many algorithms and have a big impact
on performance, such as model flexibility and overfitting. Figure 7 show models that are over- and
underfitted respectively due to poor hyperparameter choices. In order to do a fair benchmark of
machine learning algorithms it is important they perform optimally (43). Hyperparameter
optimisation (HPO) is therefore a vital step. Determining the correct hyperparameters is an
optimisation problem that depends on the data set and performance metric in question. It must
therefore be performed anew for both the FSAs and predictors every time a data set is processed.
Figure 7: Over- and underfitted models. Overfitted models (top row) and underfitted models (bottom row).
15
There are several variants of HPO, however to preserve reproducibility grid searches were
implemented in this experiment. In grid searches, all parameters are manually chosen and the
program exhaustively produces FSAs and predictors using each combination of parameter values.
The procedure is computationally expensive but prudent when comparing algorithms since it does
not involve randomness. An example is the Support Vector Machine FSA with two varied
hyperparameters 𝐶 and 𝛾:
𝐶 ∈ {2−3, 2, 23, 25, 29, 213}
𝛾 ∈ {2−15, 2−11, 2−7, 2−5, 2, 23}
Since both parameters have six values with each pair is used once, it is run 36 times within each
cross-validation fold. Each time it produces a new–but not necessarily–unique feature subset, which
is then sent to the predictor. The predictor in turn is trained 36 times for each of its own
hyperparameter settings. In this benchmark, the reason for also performing HPO on the predictors
even though only the FSAs are being compared, is to ensure the predictors’ default settings do not
randomly favour particular FSAs.
Feature subset size can also be considered a hyperparameter. Many FSAs require this to be
specified, but the optimal number of features is most often unknown. If the subset is too large, it
may eliminate the purpose of FSA; if the subset size is too small, it may omit relevant features and
result in a biased predictor. In practice one usually tests the performance in a grid search fashion
with subsets of varying sizes and picks the size with the best performance (4).
Analysing benchmark results
In order to discern which FSAs perform better, statistical hypothesis testing can be used. The many
different strategies available can easily become overwhelming, each with their various benefits and
limitations. The following section explains the two main categories of these and how to choose
between them. The subsequent section explains briefly the issue with multiple comparisons, as with
the case of using several different data sets, describing specifically the method that was used in this
thesis for data analysis.
2.6.1 Parametric vs nonparametric statistical testing
Parametric hypothesis tests make strong assumptions about the distribution of the underlying data
such as normality, and can be powerful in rejecting a false null hypothesis when conditions are met.
In contrast, nonparametric tests are less powerful but make weaker assumptions, allowing for
nonnormality (44). Machine learning and data mining communities have voiced concerns regarding
misuse and misinterpretations of hypothesis testing, which can lead to misinformed conclusions.
Strict conditions on data distribution can incorrectly be assumed fulfilled resulting in overly
confident results and false rejection of the null hypothesis. Yet as with nonparametric tests, the
decreased power can result in interesting finds being missed, effectively blocking potential new
discoveries (45).
When choosing between methods of statistical analysis, the distribution of the performance scores
needs inspection. Departures from normality can be seen with Tukey boxplots of the performance
score for each FSA as modality, skewness and unequal variances. However, distributions can be
deemed pseudo-normal depending on how severe these violations are, which can still warrant for a
parametric method. If severe, the less powerful conservative nonparametric method is necessary,
but offers consolation regarding observational value. In the case of FSA comparison, the value of
mean performances across different datasets lack meaning, whereas a more interesting
observational value is how the FSAs rank on each dataset. Nonparametric techniques in general
make use of these individual ranks, in contrast to parametric techniques which use absolute values.
16
2.6.2 Multiple comparisons across multiple datasets
There are many ways of comparing algorithms, and the choice depends on the experimental setting
and conditions. The following is an example of multiple testing:
1. State a null hypothesis for every pair of algorithms in the study, such as “A is equivalent to
B”, and “A is equivalent to C”.
2. Calculate the p-value for every pair, using for instance pairwise t test.
3. Reject or retain each null hypothesis based on the p-values.
The problem with this procedure is well known, and referred to as the multiple comparisons,
multiplicity or multiple testing problem. In essence: with many null hypothesis, there is risk of one
getting rejected by chance. A choice is therefore required among statistical techniques that can deal
with multiple comparisons.
Although there is yet no established strategy for comparing several predictors on several datasets,
among the most well-known nonparametric techniques is the Friedman test. Iman and Davenport
(1980) formulated an extension to the Friedman’s 𝜒𝐹2 statistic in order to increase power (46); it
was therefore the chosen technique for our experiment. Formally, 𝑘 FSAs are applied on 𝑁 datasets.
Let 𝑟𝑖𝑗 be the rank of FSA 𝑗 on dataset 𝑖 regarding performance score. Using the average rank 𝑅𝑗 =
1
𝑁∑ 𝑟𝑖
𝑗𝑖 , the Friedman statistic 𝜒𝐹
2 is calculated:
𝜒𝐹2 =
12𝑁
𝑘(𝑘 + 1) [∑ 𝑅𝑗2 −
𝑗
𝑘(𝑘 + 1)2
4 ]
If the ranks are tied, the average of the tied ranks can be assigned. Missing values can be handled by
receiving rank zero, and adjusting 𝑁 to 𝑁′ = 𝑁 − 𝑐𝑜𝑢𝑛𝑡 𝑜𝑓 𝑚𝑖𝑠𝑠𝑖𝑛𝑔 𝑣𝑎𝑙𝑢𝑒𝑠 before averaging. The
Iman-Davenport 𝐹𝐹 statistic is further calculated using
𝐹𝐹 =(𝑁 − 1)𝜒𝐹
2
𝑁(𝑘 − 1) − 𝜒𝐹2
which in turn is distributed along the 𝐹 distribution with (𝑘 − 1) and (𝑘 − 1)(𝑁 − 1) degrees of
freedom.
Multiple-hypothesis tests check if there exists at least one pair of FSA that are significantly different.
Which pair or pairs that differ requires post hoc testing. The rationale of the Nemenyi post hoc test
is that two FSAs differ in performance if their corresponding average ranks 𝑅𝑗 differ by at least the
critical distance (CD), defined as
𝐶𝐷 = 𝑞𝛼√𝑘(𝑘 + 1)
6𝑁
with the Nemenyi critical value 𝑞 at significance level 𝑎, which is derived from the studentised
range statistic divided by √2 (46).
17
3 Method
This chapter further explains the approaches presented in chapter 1. The resulting choice of FSA, predictors
and datasets are presented, followed by a description of the experimental setup for FSA implementation.
Literature study
The literature study resulted in 31 feature selection algorithms, three predictors and 50 data sets to
include in the experiments. The following three sections describe how these were selected.
3.1.1 Choosing feature selection algorithms
The first criterion for inclusion of a FSA in this benchmark was prevalence. With the vast body of
different algorithms, it can be difficult to discern which ones are most used by researchers. Due to
diverse terminology it was necessary to have a comprehensive search string to uncover as many
publications as possible. For this purpose, the list found in (47) was modified and extended. This set
of terms is referred to as search string Feature Selection (ssFS) and is made available in Appendix
B. Using ssFS to search Scopus, Web of Science and Google Scholar, FSA names were collected from
review articles to a total of 102 algorithms. Every synonym and abbreviation for each name were
used in combination with ssFS for searches of FSA publications in Web of Science and Scopus. For
example, the search string used for the algorithm LASSO was (“LASSO” OR “least absolute
shrinkage and selection operator”) AND (“feature selection” OR “attribute selection” OR “variable
selection” OR …). The number of hits acted as an approximation of a method’s prevalence. The top
results of these searches are found in Appendix C, along with a discussion about the difficulties of
such a search.
The second criterion was to include as many categories of FSAs as possible, with a variety of
underlying mathematics; the assumption being that different types have varying advantages and
limitations. Taking the intended reader/ user into account, the choice of FSAs was also restricted to
the availability of pre-made software packages. Some highly prevalent methods, such as Genetic
Algorithm and Multilayer Perceptron, were omitted since they were judged to require a high
expertise to implement. The final chosen FSAs are listed in Table 9.
Table 9: FSAs used in the experiment.
Name Machine Learning
Category Basic description
FSA Category
Source
AdaBoost
Tree Ensemble
Builds models with weighted combinations of simple decision trees. Weights are assigned depending on how the trees misclassify.
Embedded
(48)
AdaBoost RFE
AdaBoost SBS
Wrapper
AdaBoost SFS
ANOVA Statistical Selects the features with the highest variance.
Filter (49)
Correlation Based FS (CFS)
Statistical
Selects features that have a high correlation with the label and low correlation with each other.
Filter (7)
Chi Squared Statistical Selects the features with the highest variance.
Filter (50)
18
Name Machine Learning
Category Basic description
FSA Category
Source
Decision Tree
Decision Trees
Builds a tree structure where the nodes are features and leaves are labels. Parent nodes have a higher feature-to-label correlation.
Embedded
(51)
Decision Tree RFE
Decision Tree SBS
Wrapper
Decision Tree SFS
Fast Correlation Based FS (FCBF)
Statistical
Selects features that have a high correlation with the label and low correlation with each other.
Filter (52)
Least Angle Regression (LARS)
Regularization and linear models
Builds model by drawing a regression line and estimates feature importances by calculating the sum of squared errors.
Embedded
(53)
Least Absolute Shrinkage and Selection Operator (LASSO)
(54)
LASSOLARS (53)
Linear Regression -
Low Variance Statistical Selects the features with the highest variance.
Filter (49)
Minimum Redundancy Maximum Relevance (mRMR)
Statistical
Selects features that have a high correlation with the label and low correlation with each other.
Filter (55)
Perceptron
Neural Networks
Iteratively trains predictive neurons and assigns feature weights depending on misclassification.
Embedded
(56)
Perceptron RFE
Perceptron SBS
Wrapper
Perceptron SFS
Random Forest
Tree Ensemble
Splits data and trains a separate decision tree for each split. It then averages the decision trees.
Embedded
(57)
Random Forest RFE
Random Forest SBS
Wrapper
Random Forest SFS
ReliefF Statistical
Selects features that have a high correlation with the label and low correlation with each other.
Filter (58)
Support Vector Machines (linear) Support Vector Machines (linear) RFE Support Vector
Machines
Draws a linear decision boundary and maximises its margin. Bigger margins mean more informative features.
Embedded
(59) Support Vector Machines (nonlinear) SBS Support Vector Machines (nonlinear) SFS
Draws a nonlinear decision boundary and maximises its margin. Bigger margins mean more informative features.
Wrapper
3.1.2 Choosing predictors
The predictors were chosen to have different underlying mathematics. A thorough search for
prevalence was not performed, however inspiration was taken from the work of Wu et al. (60). Their
19
work attempted to identify the top ten predictors used in data mining. The final choice is presented
in Table 10.
Table 10: Predictors used in the experiment.
Name Type Source
Gaussian Naive Bayes (NB) Bayesian (61)
K-Nearest Neighbour (KNN) Instance based -
Decision Trees (DT) Decision Trees (51)
3.1.3 Choosing and pre-processing datasets
To reduce the risk of introducing bias by a particular choice of data set, the study consisted of 50
data sets from different scientific fields and preferably derived from real world measurements. The
number of instances, features and classes varied, however upper limitations to these were imposed
due to increased computational costs. Well known data sets were used; avoiding customisations as
much as possible to conserve reproducibility. Therefore, to fit the required input type for the
algorithms, the data types were limited to numerical. Categorical text, features and labels were in a
few cases transformed into appropriate integers. Data sets with missing values were avoided, since
these require choosing among many available imputation strategies, of which the optimal choice
remains as an open research question. However, if < 5 % of all instances had missing values, the
dataset was included with the instances removed. Time series were also avoided, since these require
special treatment in terms of sampling. In summary, data sets fulfilled following requirements for
inclusion in the study:
• has an upper limit on the number of instances and features in the order of 105 and 102
respectively,
• is well known, preferably highly cited in prominent journals,
• has categorical labels, in other words is a classification problem,
• consists of numerical datatypes,
• is not a time series with data points that are progressing through time,
• is not a multi-label classification task, meaning one instance can only belong to one class,
• has no or very few (< 5 %) instances with missing values.
The data sets and their corresponding information can be found in Appendix D.
Experimental Setup
The performance of 31 FSAs on 50 data sets were compared using predictions from three predictors,
over which the scores were averaged. For contrast, performance was also measured without feature
selection, and was included in the comparison. Implementations in Python and HPO settings of
algorithms are explained in the following sections.
3.2.1 Implementation
One observation consisted of one FSA applied on one data set. The resulting subsets were used to
train one predictor resulting in a performance measure. The program was written in Python, using
pre-existing software packages for the algorithms. Links to these are listed in Appendix E, along
20
with pseudocode for part of the program summarised below. One run of the pipeline was performed
in following steps, which are illustrated in Figure 8:
1. Load and split data. The data set is loaded from a csv-file and split into five CV folds. 80
% of each fold is used for training and the rest for testing. Each fold is stratified so the class
distribution mirrors the complete data set. The training set is relayed to the FSA while the
test set is kept for later evaluation.
2. Feature selection. The training set is processed by the FSA which outputs an array of
subsets, one for each hyperparameter setting. The runtime of this process is noted and
averaged over hyperparameter settings.
3. Prediction. The subsets are used to train three predictors, resulting in a model for each
subset and their individual hyperparameter settings. The models are then applied on the
test set to get predicted lables.
4. Evaluation. The accuracy and F-measure are calculated by comparing the predictions to
the true class labels.
5. Validation. The best accuracy and F-measure values from each fold are saved and
averaged to get the cross-validated performance score. The mean of the runtimes from each
fold is calculated.
Figure 8: Overview for one run in the pipeline. Shows what happens within each CV-fold. The Validation step happens
outside the folds. Note: The number of folds, feature subsets and predictions (represented as sheets) are only illustrative.
3.2.2 Hyperparameter Settings
Hyperparameter settings for each FSA and predictor are presented in Tables 11 and 12, with
magnitudes chosen above and below the default code setting. For reasons discussed under the title
‘The choice of subset size and hyperparameters’ in section 5.2.4, we chose to have all FSAs return
subsets of the same size.
21
Table 11: Hyperparameter settings for each FSA. The FSAs without hyperparameters are not included. N = total number
of features, M = total number of instances.
FSA Hyperparameter Values
AdaBoost, AdaBoost RFE, AdaBoost SBS, AdaBoost SFS
Number of inner decision
stumps 10, 20, 40, 80, 160
Learning rate 1, 2, 4, 16, 32
DT, DT RFE, DT SBS, DT SFS
Number of features examined
when looking for node splits. √𝑁, 0.2𝑁, 0.4𝑁, 0.8𝑁
Minimum number of samples
required for node splits 1, 4,
𝑀
8,𝑀
4
LASSO Alpha 0,05, 0.1, 0.25, 0.5, 1, 1.5
LASSOLARS
Perceptron, Perceptron RFE, Perceptron SBS, Perceptron
SFS Number of iterations 3, 5, 10, 20
RF, RF RFE, RF SBS, RF SFS
Number of inner decision trees 10, 20, 40, 160
Number of features examined
when looking for node split auto, 0.2, 0.4, 0.8
Minimum number of samples
required for node split 1, 4,
𝑀
8,𝑀
4
ReliefF Number of nearest miss points 2, 5, 10, 20
SVM, SVM RFE, SVM SBS, SVM SFS C 2−3, 2, 23, 25, 29, 213
Gamma 2−15, 2−11, 2−7, 2−5, 2, 23
Table 12: Hyperparameters for each predictor. Gaussian Naive Bayes had no hyperparameters. N = total number of
features, M = total number of instances.
FSA Hyperparameter Values
K-Nearest Neighbour Number of neighbours to consider 2, 4,𝑀
4,𝑀
2
Decision Tree
Number of features examined when looking for split √𝑁, 0.2𝑁, 0.4𝑁, 0.8𝑁
Minimum number of samples required for split 1, 4,𝑀
8,𝑀
4
22
23
4 Results
This chapter is the presentation of the experimental results. Initially the results of the comparison are
summarized, followed by more detailed presentations of predictive performance and runtimes respectively.
The chapter concludes with a closer description of the comparison between FSAs for overall performance.
Summary
The result from the experiments for each performance measure consisted of a m×n-matrix: m
datasets and n FSAs, containing performance scores averaged over the three predictors. If any
combination of FSA, predictor and dataset failed to produce a score, even within a single CV fold,
the experiment was noted as failed for that particular FSA-dataset combination. Consequently, this
led to some sparsity in the results matrix, but the magnitude was judged not to impede the
comparison.
The best performing group of FSAs is presented in Table 13. Both accuracy and F-measure shared
the same best performing group, all of which were embedded types. With addition to SVM RFE, this
group consisted of the only FSAs that performed better than omitting feature selection entirely, the
case from which the remaining 24 FSAs were indistinguishable. Runtimes revealed a different
group, with types varying between filter and embedded. One FSA appeared in both of the best
performing groups: Decision Tree. Two FSAs were present among the best performing within one
measure while also having noteworthy high comparative performance in the other: Decision Tree
RFE and Lasso.
Table 13: Best performing group of FSAs regarding predictive performance and runtime. The types of FSA are
specified: Embedded (E) and Filter (F). Underlined FSA are present in both groups. FSAs with dotted underline are present
in only one of the groups, but had noteworthy high comparative performance in the other.
Predictive performance Runtime
FSA Type FSA Type
Random Forest RFE
Random Forest
Decision Tree RFE
Decision Tree
AdaBoost RFE
AdaBoost
E
E
E
E
E
E
Low Variance Filter
Decision Tree
Anova
Chi Squared
Lasso
Perceptron
LassoLars
F
E
F
F
E
E
E
In the following sections, the background to this summary is presented in more detail. For the
curious reader, summary statistics are found in Appendix F.
Comparison of predictive performance: accuracy and F-measure
To assess the distribution of the results, Tukey boxplots were used for each feature selection method
and for both accuracy and F-measure (Fig. 9). The mean (red box) and median (red horizontal line)
are both superimposed in addition to the underlying raw accuracies, overlaid as semi-transparent
dots. All distributions exhibited large but similar interquartile ranges (IQR) and most displayed
negative skew. The skewness, along with modal tendencies, suggested departure from normality,
which warranted a nonparametric statistical method for significant comparison.
24
Figure 9: Tukey-boxplots of estimated accuracies (top) and F-measures (bottom) for each FSA, across all datasets.
Overlaid raw data points are illustrated with semi-transparent red dots. Whiskers (dashed blue lines) extend to Q3 + 1.5 ∙ IQR
(upper) and Q1-1.5 ∙ IQR (lower). Singular data points outside these limits are considered possible outliers (additional black
“-”). Within the boxes, the mean (red squares) and median (red horizontal lines) across are displayed.
The null-hypothesis H0 stated for accuracy and F-measure respectively:
• H0, Accuracy : There is no difference in estimated accuracy between FSAs
• H0, F-measure: There is no difference in estimated F-measure between FSAs
25
The calculated FF statistic was FF ≈ 27.01 and was distributed according to the F-distribution with
31 (upper) and 1519 (lower) degrees of freedom. The calculation depended on 𝑘 number of FSAs
(including “No FS”) and 𝑁 number of data sets as 𝑘 = 32, and 𝑁 = 50 respectively. The tabular
critical F -value for F (31,1519) = 1.46. Since FF > F (31,1519), both H0, Accuracy and H0, F-measure are
rejected at significance level 0.05.
To find which FSA or groups of FSAs that differ, the Nemenyi post-hoc test was performed using 𝑘
and 𝑁 from earlier and tabular Nemenyi critical value 𝑞0.05 = 3.78, resulting in calculated Nemenyi
critical distance (CD) at significance level 0.05 as CD0.05 ≈ 7.09. The performance of two FSAs can
be considered different at significant level 0.05 if their average ranks 𝑅𝑗 differ by at least CD0.05,
illustrated as FSAs with non-overlapping bars in the following graphs for accuracy and F-measure
(Fig. 10).
Figure 10: Comparison of estimated accuracies (top) and F-measures (bottom) between FSAs across all datasets.
Dots represent the average ranking (AR) of each FSA, with upper and lower boundaries representing the Nemenyi critical
distance (CD). The FSAs are ordered from lowest (best performance) to highest (worst performance) AR (+/-0.5 CD0.05).
Groups with non-overlapping bars are considered significantly different (a = 0.05).
26
Random Forest RFE was observed as the top performing FSA regarding predictions. However, since
the following five algorithms cannot be distinguished as different at the required significance level, a
group of the best performing FSAs was determined. Both performance metrics shared the same
group:
• Random Forest RFE
• Random Forest
• Decision Tree RFE
• Decision Tree
• AdaBoost RFE
• AdaBoost
Upon comparing with the case of omitting feature selection: FSAs that performed better with
significance was the aforementioned group of six FSAs with addition of SVM RFE. The remaining 24
FSAs were indistinguishable from “NoFS” in Figure 10, meaning that no FSAs performed worse
than not performing FS at all.
Comparison of runtimes
Analogous to the previous section, the distribution of rescaled runtimes from slowest to fastest (0-1)
for each FSA on each dataset were inspected with raw data overlaying Tukey boxplots. The mean
(red boxes) and median (red horizontal line) are both superimposed in addition to the underlying
raw runtimes, as semi-transparent dots. Since the dispersion varied greatly between FSAs,
complementary subplots were added, which divided groups of FSAs within a fitting range of values
(next page Fig. 11, note that y-axes are scaled differently between plots). The observed skewness,
along with modal tendencies strongly implied nonnormality. In combination with the largely
varying degree of dispersion between different FSAs, a nonparametric statistical method was a
justified choice over parametric for comparison.
27
Figure 11: Tukey-boxplots of estimated rescaled runtimes from slowest to fastest (0-1). Top: Measured and rescaled
runtimes for each FSA, performed on all available datasets. Overlaid raw data points are illustrated with semi-transparent red
dots. Whiskers (dashed blue lines) extend to Q3 + 1.5 ∙ IQR (upper) and Q1-1.5 ∙ IQR (lower). Singular data points outside
these limits are considered possible outliers (additional black “-”). Within the boxes, the mean (red squares) and median (red
horizontal line) of runtimes across data sets are displayed. Bottom: The top box plots, but with FSAs divided into three
separate plots with differently scaled y-axis. Note: the FSAs in all plots are sorted in descending order with respect to their
bottom outlier/minimal value.
28
The null hypothesis H0 for runtime stated:
H0,Runtime : There is no difference in estimated runtime between FSAs.
With 𝑘 = 31 number of FSAs and 𝑁 = 50 number of datasets, the calculated FF statistic was FF ≈
89.62, distributed according to the F-distribution with 30 (upper) and 1470 (lower) degrees of
freedom. The tabular critical F-value for F (30,1470) = 1.46. Since FF > F (31,1519), H0,Runtime is
rejected at significance level 0.05.
As with predictive performance, to discern which FSA, or groups of FSAs differ, the Nemenyi post-
hoc test was performed, resulting in CD0.05 ≈ 6.85 using previously defined 𝑘 and 𝑁 and tabular
critical value 𝑞0.05 = 3.76. The runtimes of two FSAs can be considered different at significance level
0.05 if their corresponding average ranks differ by at least the calculated CD0.05, which translates to
FSAs with non-overlapping bars in the following plot (Fig. 12).
Figure 12: Comparison of estimated and rescaled runtimes between FSAs across all datasets. Dots represent the
average rank (AR) of each FSA across datasets, with upper and lower boundaries together representing the Nemenyi critical
distance (CD). The FSAs are ordered from lowest (best performance) to higher (worst performance) AR (+/-0.5 CD0.05).
Groups with non-overlapping bars are considered significantly different (a = 0.05).
Low Variance Filter was observed as the fastest FSA. However, since the following six algorithms
cannot be distinguished as different with confidence, a group of the fastest FSAs is presented below:
• Low Variance Filter
• Decision Tree
• Anova
• Chi Squared
• Lasso
• Perceptron
• LassoLars
29
Discerning best overall performance
The overall best performance was attributed to the FSA or group of FSAs that achieve among the
best average ranks with respect to both predictive performance (accuracy and F-measure) and
runtime. Consequently, plots presenting CD for accuracy, F-measure (Fig.10) and runtime (Fig.12)
from earlier sections were compared, considering only the best performing group and the
consecutive statistically indistinguishable group of FSAs (Fig.13).
Figure 13: Discerning FSAs with best overall performance.Plots of average ranks with Nemenyi critical distances are
used for accuracy (Fig.10), F-measure (Fig.10) and runtime (Fig.12) from earlier sections. Only the best performing group
(within rectangle) and the consecutive, statistically indistinguishable group of FSAs are compared. Decision Tree (red) was
present in all the best performing groups of FSAs. Decision Tree RFE (orange) and LASSO (blue) were present in a best
performing group with regards to one measure while simultaneously present in the consecutive group of the other.
As presented in Figure 13, Decision Tree (red) was present in the best performing groups of FSAs
with respect to both predictive performance and runtime. Decision Tree RFE (orange) was in the
best performing group regarding prediction (left and middle plot), while being indistinguishable
from the majority of best performing FSAs regarding runtime (right plot). Analogously, Lasso (blue)
was present in best performing group regarding runtime, while being indistinguishable from the
majority of the FSAs in the best performing group regarding predictive performance.
30
31
5 Discussion
This chapter discusses the results and the resulting insights, in addition to factors of weakness and strengths
in different parts of the study.
The performance of the examined feature selection algorithms
The experiment produced two top performing groups of FSAs: one for predictive performance and
one for runtime. It is notable that all the top performing FSAs regarding predictive performance
were of type embedded, and involved predictors that use tree structures: Decision Tree, Random
Forest and AdaBoost (Table 13). The respective wrapper versions of these were also top ranked
among wrapper methods, except for AdaBoost SFS. The produced models seem to include good
estimations of feature importance. The predictive performances of these methods were not
significantly different (Fig.10) but Decision Tree RFE was found to also be fast (Fig. 13). However,
Decision Tree was the only FSA present in both best performing groups. Within the framework of
this experiment, it can therefore be considered as the best performing FSA.
Apart from a few filter types, the majority of the best performing FSAs regarding speed were also of
type embedded. Lasso in particular was found to be among the fastest FSAs while also having good
predictive performance (Fig. 13). However, generalisations based on FSA types is approached with
caution, since the number of FSAs included in the study within each category is unbalanced. This is
kept in mind along with the further discussions below.
5.1.1 Runtime
It was unexpected to find any of the slowest FSAs to be of filter type (Fig. 12). Filter methods that
only assess feature-to-label correlation, i.e. only assess feature relevance (Low Variance Filter,
ANOVA and Chi Squared) were among the fastest of all FSA. However, two filter methods that in
addition calculate feature-to-feature correlation (feature redundancy), ReliefF and CFS, were among
the slowest. Apparently calculating redundancy can take a longer time than training learning
algorithms.
Several FSAs had both wrapper and embedded versions. These were expected to have the same
ordering of runtimes, from fastest to slowest: embedded, embedded RFE, SFS, SBS. This is intuitive
since it reflects the amount of training rounds and the feature vector sizes within each round.
Standard embedded only involves one training round while the number of rounds for embedded
RFE depends on the size of the original feature vector. Both wrapper types have a similar number of
rounds, however SFS starts with one feature and sequentially increases the number while SBS starts
with all features and sequentially decreases the number. The two should be equally fast, but in our
benchmark we imposed a fixed 50% restriction on subset size. This means that SBS becomes slower
than SFS, since it trains on larger feature vectors. Both wrappers have several times more training
rounds than the embedded types. It was therefore surprising that Random Forest and SVM had a
different ordering, with wrapper versions faster than embedded: SFS, SBS, embedded, embedded
RFE for Random Forest and embedded, SFS, SBS, embedded RFE for SVM (Fig.11). Although these
cannot be distinguished from each other with statistical confidence (Fig. 12), it may be that it
generally takes a longer time for these algorithms to extract feature importances from the model
structure than it does to construct it.
The runtime results indicate that univariate (filter and embedded) FSAs in several cases are slower
than multivariate methods (wrappers). Most papers consistently claim the opposite
32
(4,5,32,33,40,62,63). Our results might have reflected this if data sets with higher dimensionality
were included.
5.1.2 Predictive performance
We believe that the fixed restriction on subset sizes had large implications on the results. For
instance, if the structure of a data set in question could have been sufficiently represented by less
features, noisy features would have been retained and thus degrade performance. Conversely, if
more than half of the features was optimal, the subset size restriction would also have degraded
performance. This could have especially affected the wrapper methods since they examine feature
combinations instead of individual feature importances; the experimental setup might have put the
wrapper methods at a disadvantage. Other search strategies than SFS and SBS might have handled
the subset size limitation better.
A noteworthy observation is that the means and medians of the predictive performances were so
similar across all FSAs (Fig. 9). Despite the varied underlying mathematics of the FSAs and the
different kinds of data sets involved, it is interesting that the performances were so similar. We
believe that this is due to a combination of the following reasons:
• There could be larger differences in predictive performance, however since the feature
subset size was fixed at 50 % reduction, it minimised the likelihood for FSAs to choose
largely different feature subsets. If this is the case, adding more data sets would increase the
differences in performance.
• The various strengths and weaknesses of the FSAs are evened out across the problems in the
experiment. The starting assumption when comparing FSAs across many data sets was: if
algorithm A has a slightly better performance than algorithm B, this difference will be
accentuated in proportion to how many times A and B are compared. Paradoxically,
considering the No Free Lunch Theorem (64), adding more data sets would not accentuate
but minimise the difference.
The first point might be solved by designing a more intricate experiment where every subset size is
examined for every FSA. One approach is to add inner CV loops within each FSA that test which size
is optimal, saving the results from all subset sizes for later comparison.
Although beyond the scope of this study, a way to address the second point could be to assess FSA
performance in relation to characteristics of the data sets. This could be general quantitative
properties such as the number of instances, number of features or feature-to-instance-ratio. For a
more advanced approach, Wang et al., (62) suggest a series of data set meta-features on which the
basis of a FSA recommendation method could be developed. The raw data generated from our
benchmark can be used to explore such an idea.
Method discussion
The following sections presents related research and the rationale behind the choice of method.
Practical limitations, including other viable approaches and suggestions for improvement are
discussed.
5.2.1 Related research: other FSA benchmark studies
Several papers have been published where FSA benchmarks are performed using a similar strategy
as in this thesis. Table 14 shows how many data sets, FSAs and predictors have been included as
well as the chosen performance metrics. Even though the general structure is similar, the various
problems and perspectives differ between the studies. Generally, there is a big difference in the
33
choice of data sets. Many of these studies examine FSAs for specific implementations and the data
sets are chosen to reflect this.
Table 14: Comparisons of various feature selection benchmarks.
Number of data sets Number of FSAs Number of predictors Performance measures Source
50 31 3 Accuracy, F-measure, Runtime This thesis
30 25 3 Accuracy (10)
7 8 1 AUC, Residual Sum of Squares (9)
16 6 2 Accuracy (65)
115 22 5 Accuracy (62)
7 2 1 Accuracy (66)
1 6 1 ROC, F-measure (67)
11 11 4 Accuracy (5)
An alternative strategy is to use synthetic data sets where all feature importances are known in
advance (33). Instead of testing predictive performance these experiments examine how many
relevant, irrelevant and redundant features have been selected. This has the advantage of testing
FSAs without a layer of obfuscation added by predictors. Conversely, synthetic data sets do not
reveal how well FSAs perform on real world data, since the structures and noise of these are difficult
to mimic.
5.2.2 The custom benchmark pipeline
For an experiment with these amounts of data sets and algorithms, the combination of data set, FSA
and predictor is large. It becomes prodigious when also considering combinations of
hyperparameter settings. To manually initialise a separate run for each combination would be
infeasible. There exists to our knowledge no platform that offers a complete pipeline to solve this
problem and contains most of the algorithms listed in Table 9. Due to these reasons, it was
necessary to develop the pipeline ourselves.
One specific point in which the pipeline could have been improved regards the granularity of the
experiments. In our pipeline, the results were averaged over CV folds before comparing the FSAs. In
retrospect, we could have saved intermediate results for each fold, allowing for comparison on a
lower level. This would have revealed potential variability between folds. To our knowledge, existing
platforms which automate FS do not allow for fold-wise comparison, which we could have
addressed during the building stage of our pipeline. Ultimately, we kept our setup but chose a
conservative statistical test to address this decreased granularity, along with reasons discussed in
the following section.
5.2.3 The choice of nonparametric statistical tests
Upon inspection of side-by-side Tukey box plots (Fig. 9) of the performance, nonnormality was
noticable, such as modality and skewness. Runtimes displayed very unequal variances between
FSAs (Fig. 11), which is a requirement for some parametric methods. In light of the inspection of
distribution, the more conservative nonparametric test was used. A more powerful parametric
34
method would probably have revealed greater differences and discriminate more FSAs from each
other.
5.2.4 The choice of subset size and hyperparameters
As mentioned in section 2.5 ‘Hyperparameter optimisation’, subset size can be considered a
hyperparameter. The performance of each size can be tested in a grid search fashion and the one
with the best performance can be picked. However, this strategy is computationally expensive since
the predictors must be re-trained with each subset size, just as with other hyperparameters. We
chose instead to have all FSAs return subsets of the same size, omitting 50 % of the features for each
data set by default. This is computationally prudent since we are not interested in the best
predicting performance that can be achieved on a given dataset, but rather how FSAs compare
under the same circumstances. Keeping the subset size fixed thus eliminated a potential source of
variability. Unfortunately, this setting was not possible for one of the algorithms: FCBF. It has an
internal filter that omits features it deems redundant and tampering with this setting would alter
the algorithm in unknown ways.
Ideally, the choice of hyperparameters to include in HPO would be more tailored to the task at hand.
In this work, the same range of was used on many different problems. It was therefore remarkable
that no FSAs performed significantly worse than not performing feature selection at all (Fig. 10),
even though the latter allows for all features to be used for prediction. Considering the crude HPO
and the fixed subset size reduction, this demonstrates the utility of FS even under suboptimal
conditions.
35
6 Conclusions
This chapter presents the conclusions of this work, recommendations for designing a benchmark
pipeline and what we hope to see in future research.
A comparison of predictive performance and runtime of 31 prevalent feature selection algorithms on
50 classification problems has been performed. The results revealed a leading group of FSAs that
can be used as an initial starting point for practitioners.
For predictive performance (Group 1): Random Forest RFE, Random Forest, Decision Tree RFE,
Decision Tree, AdaBoost RFE, AdaBoost.
For speed (Group 2): Low Variance Filter, Decision Tree, Anova, Chi Squared, Lasso, Perceptron,
LassoLars.
Within these two groups, viable choices for overall performance can include Decision Tree RFE and
Lasso. Decision Tree RFE could not be distinguished from the majority of Group 2. Analogously,
Lasso could not be distinguished from the majority of Group 1. Ultimately, Decision Tree is the
motivated choice in terms of overall performance, since it was present in both groups.
Recommendations
In the design of a benchmarking pipeline, we recommended to start very simple and build from the
beginning to the end. Examples of this is to use fast, easily understood algorithms, a simple
train/test split without CV and a fraction of the data set’s instances to speed up learning. Only after
a functioning output is attained should each part of the system be improved until the output is
optimal. It is recommended to store the various subsets produced by the FSAs and the models
constructed by the predictors instead of just their performance scores. This can be done with
functions such as the pickle module in Python and greatly helps in avoiding time-consuming reruns.
It is also strongly recommended to use parallel or distributed processing, since it considerably
improves several aspects, such as better HPO and more reruns when mistakes are found.
Future research
When benchmarking FSAs it would be interesting add some performance metrics that were not
included in this study, such as a stability index. Given a fixed subset size, the index measures how
reliably a FSA pick the same features given that other factors are varied such as hyperparameters or
data set splits. One such measure is the Kuncheva Index (68). Other metrics for predictive
performance such as Matthews correlation coefficient or AUC (Area under the Receiver Operating
Curve) might be better scores when with characteristics such as unbalanced datasets, but requires
reducing multi-class problems to binary ones.
Benchmarking experiments are greatly alleviated by having access to a high number of algorithms,
performance metrics and data sets without having to write bridges between different programming
languages. This is especially true for practitioners or scientists whose main field in not within the
machine learning discipline. Hopefully, easy-to-use, open source machine learning libraries such as
Scikit-learn and the ASU feature selection repository will continue to grow. It is our opinion that
this will have the greatest effect on feature selection and machine learning research and by
extension, research fields that require data processing and inference.
36
37
Bibliography 1. Yusta SC. Different metaheuristic strategies to solve the feature selection problem. Pattern Recognit Lett. 2009
Apr;30(5):525–34. Available from: http://dx.doi.org/10.1016/j.patrec.2008.11.012
2. University of California, Irvine S of I and CS. UCI Machine Learning Repository. 2013 Available from:
https://archive.ics.uci.edu/ml/datasets.html
3. Hettmansperger TP, McKean JW, Sheather SJ. Robust Nonparametric Methods. J Am Stat Assoc. 2000
Dec;95(452):1308–12. Available from: http://www.tandfonline.com/doi/abs/10.1080/01621459.2000.10474337
4. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al. Feature Selection: A Data Perspective. ACM Comput
Surv. 2017 Jan 29;50(6):1–45. Available from: http://dl.acm.org/citation.cfm?doid=3161158.3136625
5. Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. A review of feature selection methods on synthetic data.
Knowl Inf Syst. 2012 Mar 30;34(3):483–519. Available from: http://link.springer.com/10.1007/s10115-012-0487-8
6. Sammut C, Webb G, editors. Encyclopedia of Machine Learning. 1st ed. New York: Springer Science; 2011. Preface,
402, 744 p.
7. Hall M. Correlation-based Feature Selection for Machine Learning. University of Waikato; 1999 Available from:
https://www.cs.waikato.ac.nz/~mhall/thesis.pdf
8. Wolpert DH. The Existence of A Priori Distinctions Between Learning Algorithms. Neural Comput. 1996
Oct;8(7):1391–420. Available from: http://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1391
9. Eklund M, Norinder U, Boyer S, Carlsson L. Benchmarking Variable Selection in QSAR. Mol Inform.
2012;31(2):173–9. Available from: http://doi.wiley.com/10.1002/minf.201100142
10. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med.
2001;23(1):89–109.
11. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going Deeper with Convolutions. 2014 Sep 16;
Available from: http://arxiv.org/abs/1409.4842
12. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1--47.
13. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. Nature
Research; 2015 May 7;16(6):321–32. Available from: http://www.nature.com/doifinder/10.1038/nrg3920
14. Portugal I, Alencar P, Cowan D. The Use of Machine Learning Algorithms in Recommender Systems: A Systematic
Review. Expert Syst Appl. 2017; Available from: http://linkinghub.elsevier.com/retrieve/pii/S0957417417308333
15. Danenas P. Intelligent Financial Fraud Detection and Analysis: A Survey of Recent Patents. Recent Patents Comput
Sci. 2015 May 5;8(1):13–23. Available from:
http://www.eurekaselect.com/openurl/content.php?genre=article&issn=2213-2759&volume=8&issue=1&spage=13
16. Luckerson V. Here’s How Facebook’s News Feed Actually Works | TIME [Internet]. TIME. 2015 Available from:
http://time.com/3950525/facebook-news-feed-algorithm/
17. Sullivan D. Google uses RankBrain for every search, impacts rankings of "lots" of them [Internet].
Search Engine Land. 2016. Available from: http://searchengineland.com/google-loves-rankbrain-uses-for-every-
search-252526
18. Fehrenbacher K. How Tesla is ushering in the age of the learning car. Fortune. 2015 Available from:
http://fortune.com/2015/10/16/how-tesla-autopilot-learns/
19. Empiricus S. Outlines of Pyrrhonism. Loeb Class. Cambridge: Harvard University; 1933. 283 p.
20. Viner K. How technology disrupted the truth. The Guardian. 2016; Available from:
https://www.theguardian.com/media/2016/jul/12/how-technology-disrupted-the-truth
21. Diakopoulos N. Understanding bias in computational news media [Internet]. Nieman Lab. 2012 [cited 2016 Dec
29]. Available from: http://www.niemanlab.org/2012/12/nick-diakopoulos-understanding-bias-in-computational-
news-media/?cid=4911734
22. Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Man is to Computer Programmer as Woman is to
Homemaker? Debiasing Word Embeddings.; Available from: https://arxiv.org/pdf/1607.06520.pdf
23. Larson J, Mattu S, Kirchner L, Angwin J. How We Analyzed the COMPAS Recidivism Algorithm - ProPublica.
Propublica. 2016 Available from: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-
algorithm
38
24. Tufekci Z. Machine intelligence makes human morals more important [Internet]. TED Talk; 2016 Available from:
https://www.ted.com/talks/zeynep_tufekci_machine_intelligence_makes_human_morals_more_important
25. Vlasic B, Boudette EN. As U.S. Investigates Fatal Tesla Crash, Company Defends Autopilot System [Internet]. The
New York Times. 2016 Available from: http://www.nytimes.com/2016/07/13/business/tesla-autopilot-fatal-crash-
investigation.html?_r=1
26. Frey CB, Osborne MA. The future of employment: How susceptible are jobs to computerisation? Technol Forecast
Soc Change. 2017 [cited 2017 Jan 3];114:254–80. Available from:
http://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf
27. Joseph M, Kearns M, Morgenstern J, Roth A. Fairness in Learning: Classic and Contextual Bandits. 2016 May 23;
Available from: http://arxiv.org/abs/1605.07139
28. Friedler SA, Scheidegger C, Venkatasubramanian S. On the (im)possibility of fairness. 2016 Sep 23; Available from:
http://arxiv.org/abs/1609.07236
29. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997 Dec;97(1–2):273–324. Available from:
http://www.sciencedirect.com/science/article/pii/S000437029700043X
30. Domingos P. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. 1st
ed. Basic Books; 2015. 68 p.
31. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
32. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct
1;23(19):2507–17. Available from: http://www.ncbi.nlm.nih.gov/pubmed/17720704
33. Belanche LA, González FF. Review and Evaluation of Feature Selection Algorithms in Synthetic Problems. 2011;13.
Available from: http://arxiv.org/abs/1101.2320
34. Feo TA, Resende MGC. Greedy Randomized Adaptive Search Procedures. J Glob Optim. Kluwer Academic
Publishers; 1995 Mar;6(2):109–33. Available from: http://link.springer.com/10.1007/BF01096763
35. Glover F, Laguna M. Tabu search. Kluwer Academic Publishers; 1997.
36. França PM, Mendes A, Moscato P. A memetic algorithm for the total tardiness single machine scheduling problem.
Eur J Oper Res. 2001;132(1):224–42.
37. Pudil P, Novovičová J, Kittler J. Floating search methods in feature selection. Pattern Recognit Lett. Elsevier
Science Inc.; 1994 Nov 1;15(11):1119–25. Available from: http://dl.acm.org/citation.cfm?id=197121.197134
38. Guyon I, Weston J, Barnhill S, Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines.
Mach Learn. Kluwer Academic Publishers; 2002;46(1/3):389–422. Available from:
http://link.springer.com/10.1023/A:1012487302797
39. Ruiz R, Aguilar-Ruiz JS, Riquelme JC. Best Agglomerative Ranked Subset for Feature Selection. J Mach Learn Res.
2008;4:148–62. Available from: http://www.jmlr.org/proceedings/papers/v4/ruiz08a/ruiz08a.pdf
40. Janecek A, Gansterer WN, Demel M, Ecker G. On the Relationship Between Feature Selection and Classification
Accuracy. J Mach Learn Res Work Conf Proc. 2008;4:90–105. Available from:
http://www.jmlr.org/proceedings/papers/v4/janecek08a/janecek08a.pdf
41. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. Proceedings of the 23rd
international conference on Machine learning - ICML ’06. New York, New York, USA: ACM Press; 2006 p. 161–8.
Available from: http://dl.acm.org/citation.cfm?id=1143844.1143865
42. Livingstone D. A Practical Guide to Scientific Data Analysis. 1st ed. West Sussex: John Wiley & Sons, Ltd; 2009. 58
p.
43. Balabin RM, Smirnov S V. Variable selection in near-infrared spectroscopy: benchmarking of feature selection
methods on biodiesel data. Anal Chim Acta. 2011 Apr;692(1–2):63–72. Available from:
http://www.sciencedirect.com/science/article/pii/S0003267011003539
44. Logan M. Biostatistical design and analysis using R: a practical guide. Wiley-Blackwell; 2010. 546 p.
45. Salzberg SL. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Min Knowl Discov.
Kluwer Academic Publishers; 1997;1(3):317–28. Available from:
http://link.springer.com/10.1023/A:1009752403260
46. Demšar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res. 2006;7(Jan):1–30.
39
47. Spolaor N, Monard MC, Lee HD. A systematic review to identify feature selection publications in multi-labeled data.
Sao Carlos, SP, Brazil; 2012. Available from: http://www.icmc.usp.br/~biblio/BIBLIOTECA/rel_tec/RT_374.pdf
48. Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J
Comput Syst Sci. Academic Press; 1997;55(1):119–39.
49. Fisher RA. On the ‘Probable Error’ of a Coefficient of Correlation Deduced from a Small Sample. Metron. 1921;1:3–
32.
50. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of
variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag Ser 5. Taylor
& Francis Group; 1900 Jul;50(302):157–75. Available from:
http://www.tandfonline.com/doi/abs/10.1080/14786440009463897
51. Breiman L, Friedman J, Stone C, Ohlshen RA. Classification and Regression Trees. 1st ed. Belmont: Wadsworth;
1984 Available from: http://www.maa.org/press/maa-reviews/classification-and-regression-trees
52. Yu L, Liu H. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Fawcett T,
Mishra N, editors. Int Conf Mach Learn. Washington, DC; 2003;2:856–63. Available from:
http://www.aaai.org/Papers/ICML/2003/ICML03-111.pdf
53. Tibshirani R, Johnstone I, Hastie T, Efron B. Least angle regression. Ann Stat. Institute of Mathematical Statistics;
2004 Apr;32(2):407–99. Available from: http://projecteuclid.org/Dienst/getRecord?id=euclid.aos/1083178935/
54. Tibshirani. Robert. Regression Shrinkage and Selection via the Lasso. J R Stat Soc Ser B. 1996;58(1):267–88.
Available from: https://www.jstor.org/stable/2346178?seq=1#page_scan_tab_contents
55. Hanchuan Peng, Fuhui Long, Ding C. Feature selection based on mutual information criteria of max-dependency,
max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1226–38. Available
from: http://ieeexplore.ieee.org/document/1453511/
56. Rosenblatt F. The perceptron - a perceiving and recognizing automaton. Buffalo; 1957.
57. Breiman L. Random Forests. Mach Learn. Kluwer Academic Publishers; 2001;45(1):5–32. Available from:
http://link.springer.com/10.1023/A:1010933404324
58. Kononenko I, Simec E, Sikonja MR-. Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl
Intell. 1997;7:39--55. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.4740
59. Vapnik V, Lerner A. Pattern Recognition using Generalized Portrait Method. Autom Remote Control.
1963;24(6):774–80.
60. Wu X, Kumar V, Quinlan RJ, Ghosh J, Yang Q, Motoda H, et al. Top 10 algorithms in data mining. Knowl Inf Syst.
2008;14:1–37.
61. Price R, Bayes T. An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes,
communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.. Philosophical Transactions (1683-1775).
Royal Society of London; 1763. p. 53:370–418. Available from: https://archive.org/details/philtrans09948070
62. Wang G, Song Q, Sun H, Zhang X, Xu B, Zhou Y. A Feature Subset Selection Algorithm Automatic Recommendation
Method. 2014 Feb 3;1–34. Available from: http://arxiv.org/abs/1402.0570
63. Caruana R, de Sa VR. Benefitting from the variables that variable selection discards. J Mach Learn Res. JMLR.com;
2003;3:1245–64. Available from: http://www.jmlr.org/papers/volume3/caruana03a/caruana03a.pdf
64. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. IEEE Press; 1997
Apr;1(1):67–82. Available from: http://ieeexplore.ieee.org/document/585893/
65. Hall MA, Holmes G. Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl
Data Eng. 2003 Nov;15(6):1437–47. Available from: http://www.scopus.com/inward/record.url?eid=2-s2.0-
0242410408&partnerID=tZOtx3y1
66. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014 Jan;40(1):16–28.
Available from: http://www.sciencedirect.com/science/article/pii/S0045790613003066
67. Ramaswami M, Bhaskaran R. A Study on Feature Selection Techniques in Educational Data Mining. J Comput.
2009 Dec;1(1):7–11. Available from: http://arxiv.org/abs/0912.3924
68. Kuncheva LI. A Stability Index for Feature Selection. AIAP’07 Proceedings of the 25th conference on Proceedings of
the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications. Innsbruck, Austria: ACTA
40
Press; 2007. p. 390–5. Available from: http://dl.acm.org/citation.cfm?id=1295303.1295370
69. Fisher RA. The Use of Multiple Measurements in Taxonomic Problems. Ann Eugen. Blackwell Publishing Ltd; 1936
Sep;7(2):179–88. Available from: https://archive.ics.uci.edu/ml/datasets/Iris
70. Elter M, Schulz-Wendtland R, Wittenberg T. The prediction of breast cancer biopsy outcomes using two CAD
approaches that both emphasize an intelligible decision process. Med Phys. 2007 Nov;34(11):4164–72. Available
from: http://sci2s.ugr.es/keel/dataset.php?cod=86
71. Thrun SB, Bala J, Bloedorn E, Bratko I, Cestnik B, Cheng J, et al. The MONK’s Problems A Performance
Comparison of Different Learning Algorithms. 1991. Available from:
https://archive.ics.uci.edu/ml/datasets/MONK%27s+Problems
72. Nolan JR. Computer Systems That Learn: An Empirical Study of the Effect of Noise on the Performance of Three
Classification Methods. Loudonville; 1991. Available from:
http://www.interfacesymposia.org/I02/I2002Proceedings/NolanJames/NolanJames.pdf
73. Nakai K, Kanehisa M. Expert system for predicting protein localization sites in gram-negative bacteria. Proteins
Struct Funct Genet. 1991 Oct;11(2):95–110. Available from: http://www.ncbi.nlm.nih.gov/pubmed/1946347
74. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP Learning Algorithm to Forecast
the Onset of Diabetes Mellitus. Proceedings of the Annual Symposium on Computer Application in Medical Care.
American Medical Informatics Association; 1988. p. 261. Available from:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245318/
75. Nash W, Sellers T, Talbot S, Cawthorn A, Ford W. The population biology of Abalone (Haliotis species) in Tasmania.
I. Blacklip abalone (H rubra) from the North Coast and the Islands of Bass Strait. Taroona, Tasmania: Sea Fisheries
Division, Marine Research Laboratories; 1978. Available from: https://archive.ics.uci.edu/ml/datasets/Abalone
76. Gil D, Girela JL, De Juan J, Gomez-Torres MJ, Johnsson M. Predicting seminal quality with artificial intelligence
methods. Expert Syst Appl. 2012 Nov;39(16):12564–73. Available from:
https://archive.ics.uci.edu/ml/datasets/Fertility
77. Mangasarian O, Street N, Wolberg W. Breast Cancer Diagnosis and Prognosis via Linear Programming. SIAM News
[Internet]. 1990;23(5):1–18. Available from:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
78. Lim T-S, Loh W-Y, Shih Y-S. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three
Old and New Classification Algorithms. Mach Learn. Kluwer Academic Publishers; 2000;40(3):203–28. Available
from: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
79. Jossinet J. Variability of impedivity in normal and pathological breast tissue. Med Biol Eng Comput. 1996
Sep;34(5):346–50. Available from: https://archive.ics.uci.edu/ml/datasets/Breast+Tissue
80. Venkata Ramana B, Surendra Prasad Babu M, Venkateswarlu NB. A Critical Comparative Study of Liver Patients
from USA and INDIA: An Exploratory Analysis. Int J Comput Sci Issues. 2012;9(3):506–16. Available from:
https://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29
81. Heck D, Knapp J, Capdevielle JN, Schatz G, Thouw T. CORSIKA: a Monte Carlo code to simulate extensive air
showers. [Internet]. Karlsruhe, Germany; 1998. Available from: http://sci2s.ugr.es/keel/dataset.php?cod=102
82. Cortez P, Cerdeira A, Almeida F, Matos T, Reis J. Modeling wine preferences by data mining from physicochemical
properties. Decis Support Syst. Elsevier Science Publishers B. V.; 2009 Nov;47(4):547–53. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=209
83. Detrano R, Janosi A, Steinbrunn W, Pfisterer M, Schmid JJ, Sandhu S, et al. International application of a new
probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol. 1989 Aug 1;64(5):304–10.
Available from: http://archive.ics.uci.edu/ml/datasets/Heart+Disease
84. Corani G, Benavoli A, Demšar J, Mangili F, Zaffalon M. Statistical comparison of classifiers through Bayesian
hierarchical modelling. 2016 Sep 28; Available from: http://arxiv.org/abs/1609.08905
85. Silva PFB, Marçal ARS, da Silva RMA. Evaluation of Features for Leaf Discrimination. In: Kamel M, Campilho A,
editors. Image Analysis and Recognition ICIAR 2013 Lecture Notes in Computer Science, vol 7950. Berlin: Springer
Berlin Heidelberg; 2013. p. 197–204. Available from: https://archive.ics.uci.edu/ml/datasets/Leaf
86. Alimoglu F. Combining Multiple Classifiers For Pen-Based Handwritten Digit Recognition. Bogazici University;
41
1994. Available from: http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
87. Lucas DD, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, et al. Failure analysis of parameter-induced
simulation crashes in climate models. Geosci Model Dev. Copernicus GmbH; 2013 Aug 7;6(4):1157–71. Available
from: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes
88. Sikora M, Wróbel Ł. Application of rule induction algorithms for analysis of data collected by seismic hazard
monitoring systems in coal mines. Arch Min Sci. 2010;55(1):91–114. Available from:
https://archive.ics.uci.edu/ml/datasets/seismic-bumps
89. Siebert JP. Vehicle Recognition Using Rule Based Methods. Glasgow, Scotland; 1987. Available from:
http://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29
90. Diaconis P, Efron B. Computer-Intensive Methods in Statistics. Sci Am. 1983;248(5):116–30. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=100
91. Antal B, Hajdu A. An ensemble-based system for automatic screening of diabetic retinopathy. Knowledge-Based
Syst. 2014;60:20–7. Available from:
https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set
92. Vision Group. Image Segmentation Data Set [Internet]. University of Massachusetts; 1990. Available from:
https://archive.ics.uci.edu/ml/datasets/Image+Segmentation
93. Breiman L. Bias, Variance, and Arcing Classifiers [Internet]. Berkeley; 1996. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=106
94. Ayres-de-Campos D, Bernardes J, Garrido A, Marques-de-Sá J, Pereira-Leite L. Sisporto 2.0: A program for
automated analysis of cardiotocograms. J Matern Fetal Med. 2000 Sep;9(5):311–8. Available from:
https://archive.ics.uci.edu/ml/datasets/Cardiotocography
95. Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. Exploiting nonlinear recurrence and fractal scaling
properties for voice disorder detection. Biomed Eng Online. BioMed Central; 2007 Jun 26;6(23). Available from:
https://archive.ics.uci.edu/ml/datasets/Parkinsons
96. Sigillito VG, Wing SP, Hutton L V., Baker KB. Classification of radar returns from the ionosphere using neural
networks. Johns Hopkins APL Tech Dig. 1989;10(3):262–6. Available from:
https://archive.ics.uci.edu/ml/datasets/Ionosphere
97. Güvenir HA, Demiröz G, Ilter N. Learning differential diagnosis of erythemato-squamous diseases using voting
feature intervals. Artif Intell Med. 1998 Jul;13(3):147–65. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=60
98. Landsat Satellite Data Set. Knowledge Extraction based on Evolutionary Learning.. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=71
99. Brodatz P. Textures: A Photographic Album for Artists and Designers [Internet]. New York: Dover Publications,Inc.;
1966. Available from: http://sci2s.ugr.es/keel/dataset.php?cod=72
100. Cios KJ, Wedding DK, Liu N. CLIP3: Cover learning using integer programming. Kybernetes [Internet]. MCB UP
Ltd; 1997 Jul;26(5):513–36. Available from: http://sci2s.ugr.es/keel/dataset.php?cod=185
101. Hong Z-Q, Yang J-Y. Optimal discriminant plane for a small number of samples and design method of classifier on
the plane. Pattern Recognit. Elsevier Science Inc.; 1991 Jan;24(4):317–24. Available from:
https://archive.ics.uci.edu/ml/datasets/Lung+Cancer
102. Cranor LF, LaMacchia BA. Spam! Commun ACM. ACM; 1998 Aug 1;41(8):74–83. Available from:
http://sci2s.ugr.es/keel/dataset.php?cod=109
103. Gorman RP, Sejnowski TJ. Analysis of hidden units in a layered network trained to classify sonar targets. Neural
Networks. 1988;1(1):75–89. Available from: http://sci2s.ugr.es/keel/dataset.php?cod=85
104. Kaynak C. Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition.
Bogazici University; 1995. Available from: http://scikit-
learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
105. Dias DB, Madeo RCB, Rocha T, Biscaro HH, Peres SM. Hand movement recognition for Brazilian Sign Language: A
study using distance-based neural networks. International Joint Conference on Neural Networks. Atlanta, USA:
IEEE; 2009. p. 697–704. Available from: https://archive.ics.uci.edu/ml/datasets/Libras+Movement
42
43
Appendix A FS definitions
In order to infer a robust definition of the feature selection problem various papers that produced a
definition were examined during the pre-study on the subject. Since these definitions were
presented in varying ways (some as text, some as mathematical formulas), they have been rewritten
with a common mathematical notation in order to simplify comparison.
44
45
Appendix B Search String and Terminology
The following search string was used for retrieving results regarding feature selection (ssFS):
”feature selection” OR ”feature reduction” OR ”feature ranking” OR ”attribute selection” OR ”attribute reduction” OR ”attribute ranking” OR ”variable selection” OR ”variable reduction” OR ”variable ranking” OR ”feature subset selection”
OR ”feature subset reduction” OR ”attribute subset selection” OR ”attribute subset reduction” OR ”variable subset
selection” OR ”variable subset reduction” OR ”selection of feature” OR ”selection of features” OR ”reduction of feature”
OR ”reduction of features” OR ”ranking of feature” OR ”ranking of features” OR ”selection of attribute” OR ”selection of
attributes” OR ”reduction of attribute” OR ”reduction of attributes” OR ”ranking of attribute” OR ”ranking of attributes”
OR ”selection of variable” OR ”selection of variables” OR ”reduction of variable” OR ”reduction of variables” OR ”ranking
of variable” OR ”ranking of variables” OR ”selection of feature subset” OR ”selection of feature subsets” OR ”selection of
attribute subset” OR ”selection of attribute subsets” OR ”selection of variable subset” OR ”selection of variable subsets”
OR ”reduction of feature subset” OR ”reduction of feature subsets” OR ”reduction of attribute subset” OR ”reduction of
attribute subsets” OR ”reduction of variable subset” OR ”reduction of variable subsets” OR ”ranking of feature subset” OR
”ranking of feature subsets” OR ”ranking of attribute subset” OR ”ranking of attribute subsets” OR ”ranking of variable
subset” OR ”ranking of variable subsets” OR ”dimensionality reduction” OR ”reduction of dimensionality” OR ”dimension
reduction” OR ”feature subspace selection” OR ”feature subspace reduction” OR ”attribute subspace selection” OR
”attribute subspace reduction” OR ”variable subspace selection” OR ”variable subspace reduction” OR ”selection of
feature subspace” OR ”selection of feature subspaces” OR ”selection of attribute subspace” OR ”selection of attribute
subspaces” OR ”selection of variable subspace” OR ”selection of variable subspaces” OR ”reduction of feature subspace”
OR ”reduction of feature subspaces” OR ”reduction of attribute subspace” OR ”reduction of attribute subspaces” OR
”reduction of variable subspace” OR ”reduction of variable subspaces” OR ”ranking of feature subspace” OR ”ranking of
feature subspaces” OR ”ranking of attribute subspace” OR ”ranking of attribute subspaces” OR ”ranking of variable
subspace” OR ”ranking of variable subspaces” OR ”sparse regularization” OR ”sparse regularisation” OR ”sparse
regression”
The diverse terminology in machine learning and related fields can make it difficult for novice
researchers to find their way around the scientific landscape. The following points are groups of
terms that are used interchangeably.
• target variable, class, label, predictor, category, kind, dependent variable, criterion variable,
response variable
• sample, instance, data point, row, observation
• feature, attribute, variable, parameter, predictor, characteristic, independent variable,
descriptor, property
• continuous FS, ranked FS, weighted FS, univariate
• binary selection, subset selection, subspace selection, multivariate
• learning algorithm, classifier, learner, predictor, estimator
• train, learn, fit, model, model construction, model selection
46
B.1 Difficulties with accessing FSA prevalence
There are several factors to consider when regarding the prevalence search:
• There is a general difficulty in finding the optimal set of terms for FS since there is no
standardised taxonomy and since it is applicable across fields. There is a trade-off between
ensuring broad scope across disciplines while limiting the irrelevant results.
• The search string was only applied to titles, abstracts and keywords. There are surely cases
where the method names are only displayed in the full text.
• Some FS methods have names that result in false positives such as “INTERACT” or “gain
ratio”.
• Using ssFS when searching does not guarantee that the algorithm in question is used for
feature selection and not in other tasks.
• Only searching in WoS and Scopus does not cover all papers published and the contents of
scientific databases are constantly changed.
Even with all the shortcomings of this approach we still believe that it is a better way to determine
which FS methods are common than what we have found in various publications. Researchers seem
to pick them semi-arbitrarily from their own experience and intuition since no research paper read
during the literature study had any concrete explanation for choosing the algorithms they did. In
some cases, the algorithms might have been cherry picked to suit the experiment and exalt the
researchers new, proposed algorithm.
47
Appendix C Algorithm search results
Table C2:Algorithms with the top 30 number of publications (hits) in WoS and Scopus
FS method name Hits in Web of Science
Hits in Scopus
String used with ssFS (Appendix A)
Support Vector Machines (SVM) 5888 6533 "Support Vector Machines" OR "SVM"
Genetic algorithms (search method) 2487 3248 "genetic algorithm" OR "genetic algorithms" OR "GA"
Decision Tree (DT) 2213 2413 “decision tree” OR “decision trees”
LASSO 1549 941 "lasso" OR "least absolute shrinkage and selection operator"
Random forest (RT) 471 857 "random forest"
Perceptron (P) 401 718 "perceptron"
Information Gain 366 642 "information gain" OR "IG"
Multilayer Perceptron 353 585 "Multilayer Perceptron" OR "MLP"
AdaBoost (AB) 310 463 "adaboost" OR "ada boost"
Single Radial Basis Function 274 426 "Single Radial Basis Function" OR "RBF"
ANOVA 159 380 "ANOVA" OR "analysis of variance"
Euclidean distance 218 359 "Euclidean distance"
mRMR (minimum Redundancy Maximum Relevance)
248 315
"mrmr" OR "minimum Redundancy Maximum Relevance" OR "Maximum Relevance minimum Redundancy"
Correlation-based Feature Selection (CFS)
276 301
"correlation-based Feature Selection" OR "CFS" OR "correlation based feature selection" OR "correlation based fs" OR "correlation-based fs"
t-test 204 290 "t-test" OR "t test"
Simulated annealing (search method) 167 252 "Simulated annealing"
ReliefF 154 239 "relieff" OR "relief-f" OR "relief f"
Sequential forward selection 162 231 "Sequential forward selection" OR "SFS"
Chi2 119 230 "chi2" OR "chi (2)" OR "chi 2" OR "chi squared" OR "chi-2" OR "chi^2"
Successive projections algorithm 214 184 "successive projections algorithm" OR "SPA"
Recursive Feature Elimination for Support Vector Machines (SVM-RFE)
142 162 "SVM-RFE" OR "Recursive Feature Elimination for Support Vector Machines"
Gain Ratio 73 116 "gain ratio"
Tabu Search (search method) 79 107 "tabu search"
Markov blanket 101 104 "markov blanket"
Uninformative variable elimination (UVE)
78 104 "uninformative variable elimination" OR "UVE"
MARS 32 69 "mars" OR "multi angle regression and shrinkage"
Fast correlation based feature selection (FCBF)
39 57 "Fast correlation based feature selection" OR "FCBF" OR "fast correlation based FS"
f-test 27 47 "f-test" OR "ftest"
Memetic Algorithms (search method) 40 44 "memetic algorithms" OR "memetic algorithm"
Stepwise multiple linear regression 40 44 "Stepwise multiple linear regression" OR "MLR-step"
48
49
Appendix D Experimental data sets
Table D1: Data sets used in the experiment.The ‘Source’ column shows references to the first published work where the
data sets were introduced and links to where they can be downloaded. Some original sources could not be found; however,
the data sets can be found in the UCI (https://archive.ics.uci.edu/ml/datasets.html) or KEEL repositories
https://archive.ics.uci.edu/ml/datasets.html).
Name Field Number of
instances Number of
features Number of
classes Source
Bank_aut Image recognition 1372 4 2
Iris Biology 150 4 3 (69)
Mammographic Medicine 830 5 2 (70)
Monk2 Synthetic 432 6 2 (71)
Appendicitis Medicine 106 7 2 (72)
Ecoli Biology 336 7 8 (73)
Leddigits Computer Science 500 7 10 (51)
Pi_diabetes Medicine 768 8 2 (74)
Yeast Biology 1484 8 10 (73)
Abalone Biology 4174 8 28 (75)
Fertility Medicine 100 9 2 (76)
Breast cancer Wisconsin Medicine 683 9 2 (77)
Contraceptive Sociology 1473 9 3 (78)
Breast_tissue Medicine 106 9 6 (79)
Glass Chemistry 214 9 7
Liver Medicine 579 10 2 (80)
Magic Physics 19 020 10 2 (81)
Page_blocks Image recognition 5473 10 5
Whitewine Chemistry 4898 11 11 (82)
Redwine Chemistry 1599 11 11 (82)
Wine Chemistry 178 13 3
Cleveland Medicine 297 13 5 (83)
Marketing Business 6876 13 9 (84)
Vowel Speech recognition 990 13 11
Leaf Biology 340 14 40 (85)
Pendigits Image Recognition 3498 16 10 (86)
Cms Physics 540 18 2 (87)
Seismic Geology 2584 18 2 (88)
Vehicle Image recognition 846 18 4 (89)
Hepatitis Medicine 80 19 2 (90)
Band Materials science 365 19 2
Dia_retina Medicine 1151 19 2 (91)
50
Name Field Number of
instances Number of
features Number of
classes Source
Image_seg Image recognition 210 19 7 (92)
Statlog Image recognition 2310 19 7 (92)
Ring Synthetic 7400 20 2 (93)
Twonorm Synthetic 7400 20 2 (93)
Cardiotocography Medicine 2126 20 10 (94)
Thyroid Medicine 7200 21 3
Parkinsons Medicine 195 22 2 (95)
Ionosphere Physics 351 34 2 (96)
Derma Medicine 366 34 6 (97)
Satimage Image recognition 6435 36 7 (98)
Waveform Physics 5000 40 3 (51)
Texture Image recognition 5500 40 11 (99)
Heart Medicine 267 44 2 (100)
Lung_cancer Medicine 27 56 3 (101)
Spam Text analytics 4658 57 2 (102)
Sonar Physics 208 60 2 (103)
Digits Image recognition 1797 64 17 (104)
Libras Image recognition 360 90 15 (105)
51
Appendix E Software tools and pseudocode
E.1 Software tools and packages
Anaconda
Package manager and collection of hundreds of open source python packages.
https://www.continuum.io/downloads
Pandas
Open source library for data analysis.
http://pandas.pydata.org/
Scikit-learn
Collection of tools for machine learning and data analysis. Build on NumPy, SciPy and matplotlib.
http://scikit-learn.org/stable/
Scikit-feature
Collection of tools for feature selection. Based on Scikit-learn.
http://featureselection.asu.edu/
E.2 Pseudocode
The following pseudocode (Fig. E2.1) displays a part of the program used to generate the results for
one run. One run means getting result for a single predetermined dataset and FS method.
52
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66 67
begin; //Using a function in python sklearn to get data indices for k stratified CV folds, i.e k different training/test splits (80/20) of S, each with the same class distribution as S. CALL stratifiedCV with Y =(y1,...yM) and k=5 RETURNING list of k trainTest_indx // setting train/test subsets of S as 5 different folds, using the indices SET folds = S[trainTest_indx] = ((Strain, Stest)1,..,(Strain, Stest)k)
SET foldres = ∅ FOR each (Strain, Stest) in folds: CALL do_FS with (Strain, Stest) and FSname: //using map in prerequisites // function do_FS
SET a grid for iteration over all possible FS hyperparameter settings SET all_red_subsets = ∅ CALL start_timer FOR each params of n combinations of FS hyperparameter settings:
CALL sklearn_FS with params RETURNING red_subset = (S’train, S’test)1..n APPEND red_subset to all_red_subsets
ENDFOR; CALL stop_timer RETURNING seconds CALCULATE fs_durationTime, by averaging seconds over n
RETURNING all_red_subsets = S’(S’train, S’test)1..n and fs_durationTime
SET pred_scores = ∅ FOR each pred_name in predictor_names: //using map in prerequisites CALL do_predict with all_red_subsets and pred_name: //function do_predict
SET a grid for iteration over possible predictor hyperparameter settings SET all_predictors = ∅ SET all_predY = ∅ FOR each params of m combinations of predictor hyperparameter settings:
CALL sklearn_createPredictor with params RETURNING predictor APPEND predictor to all_predictors = ((predictor1),..,(predictor)m)
ENDFOR; FOR each red_subset in all_red_subsets: FOR each predictor in all_predictors:
CALL sklearn_fitPredict with predictor and red_subset RETURNING predictions of y ∈ (S’test) as ( predY)1..m APPEND predY to all_predY = (predY1..m)1..n
ENDFOR; ENDFOR;
RETURNING all_predY SET all_metricscores = ∅ FOR each metricname in metric_names: //using map in prerequisites
GET all y ∈ (Stest) as all_ytest CALL do_calcMetric with metricname, all_ytest and all_predY
//function do_calcMetric SET scores = ∅ FOR each (predY1..m) in all_predY:
FOR each predY in (predY)1..m: CALL sklearn_calcMetric with predY and all_ytest RETURNING score and APPEND to scores
ENDFOR; ENDFOR;
GET best_score from scores RETURNING best_score
APPEND best_score to all_metricscores ENDFOR; APPEND all_metricscores to pred_scores APPEND pred_scores and fs_durationTime to foldres ENDFOR; COMPUTE CV_score as average over k folds, FOR every performance metric and predictor COMPUTE CV_fsRuntime as average across k folds RETURNCV_score, CV_fsRuntime and name of dataset, FS method, predictors and metrics as result end;
---------------------------- Prerequisites: //A mapping of names of different types of FS methods, predictors and metrics to functions. Example: CASE methodname OF: “Adaboost” : CALL do_adaboost //does FS with adaboost “Decision Tree”: CALL do_decisionTree //trains, predicts with Decision Tree … ENDCASE; Input: //the full dataset(S), consisting of N features(x) and corresponding class(y) S = (x1, x2, …, xN, y) //names required for mapping to functions and describing final results datasetname, FSname, predictor_names, metric_names Output: CV_score and total_fsRuntime together with names of dataset, FS method, predictor and metric -------------------------------
Figure E2.1: Pseudocode for a part of the program used to generate the results.
53
Appendix F Results from experiments
The summary statistics of overall estimated performance across datasets for each FSA and
performance metric are presented in Table 17, in which the best values are underlined. The number
of failed experiments are discerned through the sample sizes below 50.
Table F1. Summary statistics of overall performance of each FSA across datasets. Runtime lacks results for No FS,
since no feature selection was performed prior prediction. Sample size depicts the amount of data sets included successfully
(50 in total were included in the experiments, but some experiments failed). The best scores are underlined, i.e., the highest
value for mean and median and the lowest for interquartile range (IQR) and average rank (AR).
FSA Sample
size
Accuracy F-measure Runtime
Mean Median IQR AR Mean Median IQR AR Mean Median IQR AR
No FS 50 0.749 0.805 0.249 20.210 0.704 0.731 0.319 19.680 - - - -
AB 47 0.788 0.847 0.234 7.000 0.747 0.794 0.274 7.426 0.950 0.975 0.075 16.140
AB RFE 46 0.785 0.832 0.244 6.772 0.743 0.796 0.269 6.935 0.690 0.797 0.494 25.326
AB SBS 48 0.753 0.810 0.236 17.698 0.705 0.765 0.301 18.490 0.483 0.567 0.674 27.938
AB SFS 50 0.728 0.770 0.288 21.080 0.681 0.704 0.342 21.400 0.535 0.604 0.722 26.560
Anova 50 0.743 0.797 0.291 19.570 0.695 0.718 0.337 19.930 0.996 1.000 0.001 4.660
CBF 49 0.731 0.780 0.295 22.204 0.702 0.739 0.310 22.102 0.438 0.517 0.726 28.041
Chi
Squared 35 0.697 0.717 0.290 22.514 0.679 0.677 0.289 22.586 0.997 0.999 0.002 5.186
DT 50 0.785 0.834 0.232 6.250 0.747 0.780 0.277 6.400 0.999 1.000 0.001 4.340
DT RFE 50 0.784 0.837 0.249 5.350 0.746 0.811 0.261 5.630 0.994 0.995 0.007 10.160
DT SBS 50 0.754 0.810 0.244 17.980 0.709 0.727 0.312 18.270 0.980 0.984 0.019 14.600
DT SFS 50 0.752 0.815 0.257 17.830 0.705 0.727 0.334 17.940 0.987 0.989 0.011 12.340
FCBF 48 0.653 0.657 0.319 25.115 0.593 0.591 0.407 25.448 0.976 0.985 0.032 15.052
LARS 50 0.731 0.784 0.264 22.770 0.691 0.714 0.314 21.720 0.984 0.999 0.004 9.030
LASSO 50 0.763 0.810 0.244 13.260 0.724 0.767 0.275 12.530 0.997 0.999 0.001 5.320
LASSOLA
RS 47 0.754 0.806 0.262 15.404 0.714 0.752 0.313 14.638 0.998 0.999 0.004 6.790
Linear
Regression
(LR)
50 0.732 0.781 0.285 22.760 0.692 0.719 0.294 22.000 0.988 0.999 0.003 8.130
Low
Variance
(LV)
50 0.728 0.785 0.285 23.110 0.678 0.685 0.296 23.740 1.000 1.000 0.000 1.130
54
FSA Sample
size
Accuracy F-measure Runtime
Mean Median IQR AR Mean Median IQR AR Mean Median IQR AR
mRMR 50 0.739 0.789 0.282 20.630 0.692 0.731 0.314 21.170 0.936 0.939 0.070 20.220
P 50 0.764 0.815 0.245 13.920 0.723 0.752 0.270 13.570 0.999 0.999 0.002 5.680
P RFE 50 0.763 0.816 0.241 13.310 0.722 0.743 0.283 13.070 0.991 0.994 0.010 11.080
P SBS 50 0.739 0.792 0.265 21.270 0.693 0.732 0.300 21.360 0.914 0.939 0.076 20.770
P SFS 50 0.738 0.786 0.250 23.240 0.695 0.727 0.343 22.470 0.927 0.947 0.067 19.170
RF 50 0.800 0.857 0.234 2.740 0.766 0.823 0.260 2.870 0.931 0.968 0.093 18.340
RF RFE 50 0.798 0.847 0.225 2.460 0.763 0.828 0.258 2.670 0.572 0.719 0.665 27.060
RF SBS 50 0.761 0.823 0.250 14.250 0.720 0.741 0.310 13.700 0.939 0.941 0.076 20.000
RF SFS 50 0.761 0.816 0.250 13.670 0.715 0.740 0.308 14.050 0.946 0.953 0.071 18.560
reliefF 50 0.742 0.796 0.244 18.770 0.694 0.735 0.296 19.200 0.541 0.679 0.941 24.360
SVM with
Linear
Kernels
50 0.763 0.820 0.243 13.250 0.722 0.733 0.276 13.390 0.972 0.986 0.044 14.220
SVM RFE 50 0.763 0.816 0.246 12.330 0.726 0.755 0.283 11.710 0.824 0.915 0.209 22.800
SVM SBS 45 0.723 0.779 0.265 20.733 0.673 0.711 0.308 21.189 0.795 0.924 0.194 21.733
SVM SFS 44 0.727 0.772 0.248 20.295 0.677 0.689 0.320 20.602 0.844 0.938 0.193 20.386
55
TRITA CBH-GRU-2018:32
www.kth.se