Luleå University of Technology
Study Report
Machine Learning in Pervasive Computing
Author:
Samuel Idowu
Supervisor:
Prof. Christer Åhlund
Co-Supervisor:
Olov Schelén
Co-Supervisor:
Robert Brännström
Pervasive and Mobile Computing
Department of Computer Science, Electrical and Space Engineering
September 2013
Abstract
The increase in data quantities and in the number of pervasive systems has resulted in many decision-making systems. Most of these systems employ Machine Learning (ML) in various practical scenarios and applications. The enormous amount of data generated by sensors can be useful in decision-making systems. The rising number of sensor-driven pervasive systems presents interesting research questions on how to adapt and apply existing ML techniques effectively to the domain of pervasive computing. In the face of the data deluge, ML has proved viable in many application areas, such as data mining and self-customizing programs, and could have a great impact on the field of pervasive computing.
The objective of this study is to present the underlying concepts of ML techniques that can be applied to problems in the domain of pervasive and mobile computing. The scope of this study covers the three primary types of ML: supervised, unsupervised and reinforcement learning. In the process of providing this fundamental knowledge of ML, we present some conceptual terms of ML and the steps required in developing an ML system, including how such work can have impact on domains outside the scope of ML.
Our findings show that previous works in the area of ubiquitous computing have successfully applied supervised learning and reinforcement learning methods; hence, this study focuses more on supervised learning and reinforcement learning. In conclusion, we discuss some basic performance evaluation metrics and methods for obtaining reliable classifier estimates, such as cross-validation and leave-one-out validation.
Contents
Abstract i
List of Figures iv
List of Tables v
1 Machine Learning in Pervasive Computing . . . . . . . . . . . . . . . . . . 1
2 Related Study and Scope of Study . . . . . . . . . . . . . . . . . . . . . . 2
3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1 Machine Learning, Data Mining and Artificial Intelligence . . . . . 4
3.2 Significance of Machine Learning . . . . . . . . . . . . . . . . . . . 5
3.3 A Machine Learning Example . . . . . . . . . . . . . . . . . . . . . 5
4 Types of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 9
4.4 Common Machine Learning Paradigms and Categories . . . . . . . 10
5 Essential Steps in Machine Learning . . . . . . . . . . . . . . . . . . . . . 11
5.1 Impacting Real World Outside Machine Learning . . . . . . . . . . 16
6 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.1 Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.1.1 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . 17
6.1.2 Naive Bayes classifier . . . . . . . . . . . . . . . . . . . . 18
6.2 Instance-Based Learners . . . . . . . . . . . . . . . . . . . . . . 23
6.3 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.5.1 Discrete Markov Model . . . . . . . . . . . . . . . . . . . 36
6.5.2 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . 37
6.6 Comparing Supervised Learning Algorithms . . . . . . . . . . . . . 40
7 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.1 Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.1.1 k-means clustering . . . . . . . . . . . . . . . . . . . . . . 43
7.2 Association Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.2.1 Apriori algorithm . . . . . . . . . . . . . . . . . . . . . . 45
8 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.1 Elements of Reinforcement Learning . . . . . . . . . . . . . . . . . 48
8.1.1 Passive Reinforcement Learning . . . . . . . . . . . . . . 50
8.1.2 Active Reinforcement Learning . . . . . . . . . . . . . . . 52
9 Performance Evaluation in Machine Learning . . . . . . . . . . . . . . . . 53
9.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.2 Accuracy and Error rate . . . . . . . . . . . . . . . . . . . . . . . . 55
9.3 Sensitivity and Specificity . . . . . . . . . . . . . . . . . . . . . . . 56
9.4 Precision, Recall and F-measure . . . . . . . . . . . . . . . . . . . 56
9.5 Methods for Obtaining Reliable Evaluation Measures . . . . . . . . 57
10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bibliography 60
List of Figures
1 Training and evaluation phase using a training set and a test set respectively 7
2 Common algorithms used in ML . . . . . . . . . . . . . . . . . . . . . . . 10
3 Major steps involved in designing a learning phase . . . . . . . . . . . . 12
4 Main Evaluation factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5 Three Stages of machine learning research program . . . . . . . . . . . . . 16
6 Sensor-readings training data set . . . . . . . . . . . . . . . . . . . . . . . 22
7 Calculated mean and variance . . . . . . . . . . . . . . . . . . . . . . . . . 23
8 Illustration of kNN classification . . . . . . . . . . . . . . . . . . . . . . . 24
9 A decision tree in flowchart form . . . . . . . . . . . . . . . . . . . . . . . 27
10 Support Vector Machines (SVM) optimal hyperplane and maximum margin 31
11 Complex data types for mining [1] . . . . . . . . . . . . . . . . . . . . . . 35
12 A simple Markov chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
13 Second order Markov model . . . . . . . . . . . . . . . . . . . . . . . . . . 36
14 Partially observable system . . . . . . . . . . . . . . . . . . . . . . . . . . 37
15 Apriori principle illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 46
16 Learning agent - environment . . . . . . . . . . . . . . . . . . . . . . . . . 49
List of Tables
1 Lenses Data Set [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Approaches to define the distance between two instances (x and y) [3] . . 25
3 State sequence probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 HMM probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Comparing learning algorithms (**** represents the best and * represents the worst) [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6 A simple itemset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Confusion matrix with totals for positive and negative examples . . . . . 54
8 Example of a confusion matrix with totals for positive and negative examples 54
Abbreviations
ML Machine Learning
CART Classification and Regression Tree
kNN k–Nearest Neighbor
ANNs Artificial Neural Networks
SVM Support Vector Machines
SVR Support Vector Regression
ID3 Iterative Dichotomiser 3
MDP Markov Decision Process
DUE Direct Utility Estimation
DP Dynamic Programming
ADP Adaptive Dynamic Programming
TD Temporal Difference
QP Quadratic Problem
SMO Sequential Minimal Optimization
GPS Global Positioning System
AI Artificial Intelligence
NLP Natural Language Processing
HMM Hidden Markov Model
HCI Human Computer Interaction
CLS Concept Learning System
IID Independent and Identically Distributed
1 Machine Learning in Pervasive Computing
There is a growing interest in the field of Machine Learning (ML). This interest can be partly attributed to the big data era, which has led to a deluge of data in modern times [4]. The bulk of the data today is generated by sensors (e.g. context-aware sensing devices), which periodically generate information about contexts such as location and time [5]. Other sensing devices monitor the state of structures such as buildings by observing their vibration [6]. These examples and many others contribute to the data deluge of today. The total volume of data generated on earth exceeded one zettabyte (ZB) in 2010, and it will continue to grow exponentially [7]. Estimates also show that the amount of data stored in the world's databases doubles every 20 months [8].
The advantages and potential of data are highly valuable in many fields. However, achieving these benefits requires a considerable amount of effort to derive useful knowledge from a tremendous amount of data. It is therefore no surprise that much attention has shifted towards the automated and effective data-analysis methods that ML provides for large and complex data sets [4].
The research presented in this document aims to build knowledge about ML and its techniques. This involves a comprehensive understanding of ML and of the basis of commonly used algorithms. Since there is no single technique or algorithm that wins for every data set or problem scenario in ML [9], it is crucial for us to become well versed in the suitability of various algorithms for different types of data sets and problems. We shall later use this experience to motivate the choice of algorithms appropriate for specific problems. We shall continue this work by applying the understanding gained in the field of ML to two problem cases. The first case involves activity recognition on a wooden bridge based on the movement of the bridge. The second case applies ML in mobile networks, specifically the selection of access points or base stations based on prior network performance.
Section 2 describes the scope of and motivations for this study. Section 3 covers the definition of ML, its significance and its relationship with data mining and artificial intelligence. This section also presents the important terms in ML using a simple classification problem. Section 4 describes the three main types of ML and discusses ML paradigms. Section 5 presents the general steps in developing an ML system. Sections 6, 7 and 8 discuss general techniques under the supervised, unsupervised and reinforcement learning types respectively. In these sections, we present the basic ideas behind the techniques. We do not delve too deeply into the trickier issues, but explain the underlying concepts. In Section 9, we present commonly used metrics for evaluating the accuracy of ML tasks.
2 Related Study and Scope of Study
In this section, we give and justify the scope of this study. The two main factors that motivate the scope of this study are:
• previous research in pervasive computing environments that applies ML, and
• the nature of our research problems, to which we shall apply ML as future work.
Much of the ML research in pervasive computing environments has favored the use of supervised learning (see Section 4.1) in many sensor-data applications such as activity recognition [10] [11] [12], intelligent environments [13], mobility prediction [9], structural health monitoring [14] [15] and Human Computer Interaction (HCI) [16]. Pervasive computing applications are by nature context-aware; they have as possible input the entire observable state of the environment [17]. A pervasive computing system is capable of basing computational decisions on interactions within its environment as a whole. This is possible since pervasive systems typically include partial knowledge of their physical environment with the help of sensing devices (e.g. location sensors, energy meters) and an almost complete understanding of the computational environment (e.g. networked device state, application state and service state) [17]. Appropriately employed supervised ML techniques remove the limitation of having to manually specify a decision process for each contextual situation, thereby increasing automation. For instance, instead of expressing the rules for an action, the system can be provided with examples, under varying contextual situations, of when that action should be taken [17].
Some related projects that have employed forms of supervised learning include the following. The ContAct project [10] involves a system that derives user activity and availability from sensor data in an office space. It uses a Naive Bayes classifier to learn user activity and availability from sensor data according to given user feedback. A predictive building energy optimization system [18] controls and improves the efficiency and reliability of building operations without requiring large amounts of additional capital investment; it applies a predictive ML model known as Support Vector Regression (SVR) to the building's historical energy use. The Lumiere project [19] at Microsoft Research applied Bayesian models to develop techniques and a platform for reasoning about the goals and needs of software users as they work with software, using a large amount of example data from users and experts. Its objective is to learn proper interactions with the user according to perceived user activities in order to provide assistance within a software environment. These systems directly or indirectly learn a model from a history of contextual situations, perceived with the help of contextual sensors, and the generated model then helps in decision-making or in the prediction of future contexts.
Supervised learning can be considered limited in systems that require some form of direct feedback to facilitate a learning process; supervised learning alone is not adequate for continuous learning from interaction [20]. In interactive problems, it is mostly impractical to obtain examples of desired behavior that are both correct and representative of all situations in which a learning agent has to act [20]. Direct feedback enables a system to learn from its own experience. Reinforcement learning (see Section 4.3) is a form of ML suitable for these types of systems. As an example, in the MavHome project [21], the environment is represented using a hierarchical Hidden Markov Model (HMM), and a reinforcement learning algorithm is employed to predict environmental preferences based on sensors within the environment.
Our research interest centers on applying ML to research problems in pervasive computing. For future work, possible domains include activity recognition on bridges, model-based access network selection and sensor-data-driven energy optimization. In view of our research interest and of prior works that have employed ML, we limit the scope of this study to the basis of common supervised learning algorithms such as naive Bayes, decision trees, SVM and HMM. The scope also includes common unsupervised learning tasks (i.e. clustering analysis and association analysis) and reinforcement learning: its principal elements and the key types of reinforcement learning.
3 Machine Learning
Machine Learning (ML) applies suitable algorithms to data for a learning problem. Tom Mitchell [22] defined a well-posed learning problem as follows: a computer program is said to learn from experience E (observed examples) with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Therefore, we can define ML as that which learns from experience with respect to some class of tasks and a performance measure [22].
In general terms, ML can be defined as a set of methods that automatically detect patterns (general regularities) in empirical data, such as sensor data or databases, and then use the discovered patterns to predict future data or to execute other kinds of decision making under uncertainty [4]. Various disciplines utilize ML, including the obvious ones such as computer science and statistics as well as many other fields, from politics to the geosciences [23]. ML research today focuses on automatically learning patterns in large data sets of complex data types [1].
This section discusses the relationship of ML to the two principal fields that often apply ML algorithms, namely Artificial Intelligence (AI) and data mining. It also presents the significance of ML and its applications. Lastly, we introduce the relevant terms in ML using a simple supervised learning problem.
3.1 Machine Learning, Data Mining and Artificial Intelligence
Data mining and artificial intelligence are two popular fields that regularly apply ML techniques. Data mining, also known as knowledge discovery from databases, advanced data analysis and ML [24], combines methods and tools from at least three areas: ML, statistics and databases [25]. In the field of artificial intelligence, it is difficult to regard an intelligent system that lacks a learning capability as a truly intelligent system [26]. The ultimate objectives of these fields are quite similar, although there are slight differences in the approach to and the use of ML in each field. Data mining bridges other technical areas, including databases, human-computer interaction and statistical analysis [24]. While data mining centers on ML techniques, it also involves other essential steps such as database creation and maintenance, data formatting and cleansing, data visualization and summarization, and the use of human expert knowledge to formulate the inputs to the learning algorithm and to evaluate the empirical patterns it discovers [24]. AI attempts to understand and build intelligent entities. These entities require the capability to adapt to new circumstances and to detect and extrapolate patterns [27]. AI employs various ML techniques to provide this vital ability. In general, AI also involves other capabilities such as natural language processing, knowledge representation and automated reasoning [27].
3.2 Significance of Machine Learning
In the face of the current data explosion, the question arises of how to handle such data efficiently. It is crucial for systems to process data both efficiently and quickly. Recent research in ML tends towards adapting existing techniques and algorithms to cater for complex data such as data streams. Recent studies also seek efficient means of utilizing memory and processor cycles when dealing with huge amounts of data [28]. Applications of ML algorithms in data mining include medical record analysis and credit card fraud detection [24]. ML has also been used for understanding and predicting customer purchase behavior, manufacturing processes and the personal interests of web users [24] (e.g. Amazon and Netflix product recommendations). Another use of ML is in complex systems that cannot be programmed by hand. Such applications include autonomous helicopters, handwriting recognition, computer vision and Natural Language Processing (NLP). Though the earlier classic ML methods can produce acceptable results with simple data sets in several applications, in recent times, due to the huge volume of data and complex data types, much research is progressing on delivering the best and most effective ways of knowledge discovery and ML [1].
3.3 A Machine Learning Example
Before we go further in this report, it will be beneficial to give a simple ML example and establish some common terminology that we shall use henceforth. To make this as effective as possible, we shall use an example of a common ML task: we describe an expert system that is capable of lens prescription using the values presented in Table 1. The presented data is from the UCI ML repository [2]. As we discuss this system, we shall emphasize some common terms that are essential in ML.
Age of patient Prescription Astigmatic Tear rate Lenses
1 young myope no reduced none
2 young myope no normal soft
3 young myope yes reduced none
4 young myope yes normal hard
5 young hypermetrope no reduced none
6 young hypermetrope no normal soft
7 young hypermetrope yes reduced none
8 young hypermetrope yes normal hard
9 pre-presbyopic myope no reduced none
10 pre-presbyopic myope no normal soft
11 pre-presbyopic myope yes reduced none
12 pre-presbyopic myope yes normal hard
13 pre-presbyopic hypermetrope no reduced none
14 pre-presbyopic hypermetrope no normal soft
15 pre-presbyopic hypermetrope yes reduced none
16 pre-presbyopic hypermetrope yes normal none
17 presbyopic myope no reduced none
18 presbyopic myope no normal none
19 presbyopic myope yes reduced none
20 presbyopic myope yes normal hard
21 presbyopic hypermetrope no reduced none
22 presbyopic hypermetrope no normal soft
23 presbyopic hypermetrope yes reduced none
24 presbyopic hypermetrope yes normal none
Table 1: Lenses Data Set [2]
The expert system needs to know some information about its patient to be able to prescribe lenses correctly, just as in the case of a real human optician. To keep the illustration simple, we use four information values about each patient. These are often called features, attributes or covariates [23] [4], and they form the first four columns of Table 1. Each row in Table 1 is called an instance, which consists of a set of features. Features can be basic data types (e.g. nominal or numerical) [8]. Features can also be more complex structured data such as images, texts, emails and raw sensor data [4]. In this example, the first feature, age of patient, is a nominal data type taking one value from the set {young, pre-presbyopic, presbyopic}. The second, third and fourth features are binary data types, taking the values myope/hypermetrope, yes/no and reduced/normal respectively. In real-world scenarios, data is not always as friendly as it appears in Table 1. Data often contains a very large number of features (often thousands of dimensions), possibly with missing values, which makes a learning task more difficult. The curse of dimensionality [29] refers to the phenomena that arise when dealing with such high-dimensional data.
Each example has four different features and one target variable. The target variable is known as a class or label [23] [4], and it is what the expert system will try to predict. In this illustration, the class (target variable) is Lenses, which takes a value from the three-item set {none, soft, hard}. The class of an instance could in principle be anything, but it is mostly assumed to be a categorical or nominal variable from some finite set [4]. The lens prescription problem that the described expert system will handle can be solved by a widely used technique in ML known as classification. A classification task answers questions such as: how do we decide when to prescribe soft lenses rather than another lens type [23]? There are several problems in this category of ML tasks, and one can certainly argue that it is the most explored aspect of ML, both in application and in research [30]. Several algorithms can help accomplish the classification task, but for now we shall only discuss the basic idea behind the classification problem in general. What we need to do is teach the system how, for instance, human opticians perform prescriptions. We need to provide some quality example instances and allow the system to learn from them. These examples are known as a training set [23]. The training set in Table 1 has 24 examples. The classification task can then be solved by employing algorithms that find some connection between the features and the target variable.
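To make the preceding terms concrete, the short pure-Python sketch below (our own illustration, not part of the original expert system) classifies a patient by finding the training instance whose nominal feature values overlap most with the new instance, a simple nearest-neighbour rule. Only a handful of Table 1's 24 rows are reproduced for brevity.

```python
# A minimal sketch of the classification task: each instance is four nominal
# feature values, and the class is the lens type to prescribe. The rows below
# are a subset of the Lenses Data Set in Table 1.
training_set = [
    # (age, prescription, astigmatic, tear rate) -> lenses
    (("young", "myope", "no", "reduced"), "none"),
    (("young", "myope", "no", "normal"), "soft"),
    (("young", "myope", "yes", "normal"), "hard"),
    (("presbyopic", "hypermetrope", "no", "normal"), "soft"),
    (("presbyopic", "myope", "yes", "normal"), "hard"),
    (("presbyopic", "hypermetrope", "yes", "reduced"), "none"),
]

def classify(instance):
    """Predict the class of `instance` from its most similar training example."""
    def similarity(features):
        # "connection between the features and the target variable":
        # here, simply the number of matching nominal feature values
        return sum(a == b for a, b in zip(features, instance))
    best_features, best_label = max(training_set, key=lambda ex: similarity(ex[0]))
    return best_label

# A new, unseen patient: the system supplies the missing class value.
print(classify(("young", "hypermetrope", "no", "normal")))  # -> soft
```

This matches row 6 of Table 1 (young, hypermetrope, no, normal, soft); a real learner would of course generalise from all 24 examples rather than memorise a handful.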
Figure 1: Training and evaluation phase using a training set and a test set respectively.
In the field of ML, it is vital to know how accurately a system performs; this process is known as evaluation (see Section 9). To test the illustrated lens predictor system, we need a set of examples, called a test set, which is different from the training set. The major difference between the test set and the training set is that a test set only has the feature values; no class value is given for each example. The missing class is what the system should predict. After a learning phase with the training set and an evaluation phase with a test set, the next step is determining whether the classifier's level of accuracy is sufficient. Figure 1 shows an illustration of a training phase and an evaluation phase. The output of a classification algorithm is called a classifier, model or knowledge representation.
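The evaluation phase in Figure 1 can be sketched as follows; the toy classifier and the three test instances below are our own illustrative assumptions, standing in for a real learned model and a held-out test set whose true classes are known to the evaluator.

```python
# A minimal sketch of evaluation: apply the classifier to held-out examples
# and measure the fraction of predictions that match the known classes.
def evaluate(classifier, test_set):
    """Fraction of test instances whose predicted class matches the true class."""
    correct = sum(classifier(features) == label for features, label in test_set)
    return correct / len(test_set)

# A toy classifier standing in for the learned model: prescribe no lenses
# whenever the tear production rate (fourth feature) is "reduced".
def toy_classifier(features):
    return "none" if features[3] == "reduced" else "soft"

test_set = [
    (("young", "myope", "no", "reduced"), "none"),
    (("young", "myope", "no", "normal"), "soft"),
    (("presbyopic", "myope", "yes", "normal"), "hard"),
]
print(round(evaluate(toy_classifier, test_set), 2))  # 2 of 3 correct -> 0.67
```

Section 9 refines this single accuracy number into finer-grained metrics such as precision, recall and the confusion matrix.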
4 Types of Machine Learning
ML is divided into three principal types: supervised learning (the predictive learning approach), unsupervised learning (the descriptive learning approach) and reinforcement learning [4]. Supervised learning methods learn from given examples; the case given in Section 3.3 belongs to this group. In situations where we are not fortunate enough to have example data from which to learn, we can apply the unsupervised form of ML. The aim of unsupervised learning is to find interesting patterns in input data. It is largely employed in the data mining field and is sometimes referred to as knowledge discovery [4]. In reinforcement learning, the task is to learn how to behave when given a series of reinforcements: rewards or punishments. Unlike the previous types, reinforcement learning is less commonly used [4].
4.1 Supervised Learning
In this approach, the task is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(x_i, y_i)}, i = 1, ..., N, where D is the training set and N is the cardinality of the training set [4]. Each example input x_i is a D-dimensional vector of values representing the features of an instance, and the output y_i is the class or label of the corresponding features [4]. In relation to the illustration given in Section 3.3, D is the complete lenses data set [2], where x_i represents the features age, prescription, astigmatic and tear rate, the label variable y_i represents lenses, and the number of instances in the training set, N, is 24 [2].
The type of the output variable y_i differentiates the two types of tasks performed under supervised learning. In cases similar to the example in Section 3.3, where the output variable y_i is categorical or nominal, the task is referred to as classification or pattern recognition [4] [23]. Common algorithms used for classification tasks include Naive Bayes, C4.5 decision trees and k–Nearest Neighbor (kNN). The second supervised learning task is referred to as regression, where the output variable y_i is a real-valued scalar [4] [23]. A variant of regression where the class space Y has some natural ordering, such as grades A–F, is called ordinal regression [4]. Common algorithms used for regression include linear regression as well as Classification and Regression Tree (CART), which can also be used for classification tasks.
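The distinction between the two tasks can be illustrated with a small sketch. All data values below are invented; the regression half fits a line by ordinary least squares, the simplest form of the linear regression mentioned above.

```python
# Classification: D = {(x_i, y_i)} with nominal labels y_i.
D_classification = [((1.0, 2.0), "soft"), ((3.0, 1.0), "hard")]

# Regression: y_i is a real-valued scalar; we fit y ~ a*x + b by least squares.
D_regression = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1), (4.0, 8.0)]

def fit_line(pairs):
    """Ordinary least-squares slope and intercept for one input feature."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    return slope, mean_y - slope * mean_x

a, b = fit_line(D_regression)
print(round(a, 2), round(b, 2))  # -> 1.99 0.05
```

The learned mapping is the same idea in both cases; only the type of y_i (nominal versus real-valued) changes.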
4.2 Unsupervised Learning
Unsupervised learning methods only use inputs, D = {x_i}, i = 1, ..., N, and the goal is to find "interesting patterns" in the data [4]. No class or label value is given for the data D [23]. This is a much less well-defined problem, since the exact pattern to detect is not specified and there is no obvious error metric to use (i.e. we cannot compare our prediction y for a given x to an observed value) [4]. Unsupervised learning is commonly used in tasks such as clustering, where the interest is in grouping similar items together, and association analysis. Density estimation tasks are unsupervised learning tasks that discover the statistical values that describe a data set [23]. Dimension reduction tasks are applied to reduce the number of columns (i.e. features) of data sets in high-dimensional spaces.
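As a sketch of such a task, the snippet below runs a naive one-dimensional k-means (discussed further in Section 7.1) on invented, unlabeled numbers; the algorithm has to discover the two groups entirely on its own.

```python
def kmeans_1d(data, k=2, iterations=10):
    """Naive 1-D k-means: alternate point assignment and centroid update."""
    centroids = sorted(data)[:k]  # naive initialisation: the k smallest points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:  # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # recompute centroids (keep the old one if a cluster went empty)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

# Two well-separated groups emerge without any labels being provided.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data))
```

Note the contrast with supervised learning: no y_i appears anywhere, so there is no error metric against known answers, only the grouping itself.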
4.3 Reinforcement Learning
Reinforcement learning is learning what to do, i.e. how to map situations to actions with the goal of maximizing a numerical reward signal [20]. The actions to be taken to achieve the goal are not given, as they are in other types of ML; instead, the learner must discover which actions give the most reward by trying them [20]. Russell et al. describe reinforcement learning as a form of learning where the agent learns from a series of reinforcements, rewards or punishments [27]. The most important aspects of a reinforcement learning system are: sensation, the ability to sense the state of the environment; action, the ability to take actions that affect the state of the environment; and goal, the requirement that the system have goals relating to the state of the environment [20]. Algorithms used under this approach of ML include Q-learning and temporal difference learning.
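These three aspects, sensation, action and goal, can be sketched with tabular Q-learning on a hypothetical four-state corridor; the environment, rewards and learning parameters below are all invented for illustration. The agent senses its state, takes a left/right action, is rewarded only on reaching the goal state, and gradually learns action values that favor moving right.

```python
import random

N_STATES, GOAL = 4, 3
ACTIONS = (-1, +1)            # left, right
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

random.seed(0)
for _ in range(200):          # episodes
    state = 0                 # sensation: the agent knows its current state
    while state != GOAL:
        # epsilon-greedy: explore occasionally, otherwise act greedily
        a = (random.randrange(2) if random.random() < epsilon
             else max(range(2), key=lambda i: Q[state][i]))
        nxt = min(max(state + ACTIONS[a], 0), N_STATES - 1)  # action
        reward = 1.0 if nxt == GOAL else 0.0                 # goal
        # Q-learning update: move Q towards reward + discounted future value
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

# The greedy policy learned for the non-goal states (1 means "move right").
policy = [max(range(2), key=lambda i: Q[s][i]) for s in range(GOAL)]
print(policy)
```

No example of the desired behavior was ever supplied; the reward signal alone shapes the policy, which is exactly what distinguishes this type from supervised learning.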
4.4 Common Machine Learning Paradigms and Categories
It is common practice to categorize algorithms into paradigms based on similarities in their basic assumptions about representation, performance methods and learning algorithms [30]. Langley et al. present five ML paradigms: neural networks (including perceptron-based learning and ANNs), instance-based learning, genetic algorithms, rule induction and analytic learning [30]. Kotsiantis (2007) [24] also identifies five different paradigms within supervised learning. These are
• logic-based algorithms, which include decision trees and rule-based classifiers,
• perceptron-based techniques, which include single-layered and multi-layered perceptrons (ANNs),
• statistical learning algorithms, such as Naive Bayes and Bayesian networks,
• instance-based learning, which includes kNN, and lastly,
• SVMs.
Figure 2: Common algorithms used in ML
ML algorithms can also be grouped based on application tasks. A survey paper [31] presents the top 10 data mining algorithms. As mentioned earlier, data mining extensively employs ML techniques. In the study, the nominated ML algorithms are placed into 10 different groups: classification, statistical learning, association analysis, link mining, clustering, bagging and boosting, sequential patterns, integrated mining, rough sets and graph mining. The following list shows the grouping of the 18 nominated algorithms, from which the final top 10 were selected. Figure 2 shows a few common ML algorithms and their respective categories and paradigms.
1. Classification – C4.5, CART, kNN, Naive Bayes
2. Statistical learning – SVM, EM
3. Association analysis – Apriori, FP–Tree
4. Link mining – PageRank, HITS
5. Clustering – K-Means, BIRCH
6. Bagging and Boosting – AdaBoost
7. Sequential Patterns – GSP, PrefixSpan
8. Integrated Mining – CBA
9. Rough Sets – Finding reduct
10. Graph Mining – gSpan
5 Essential Steps in Machine Learning
Kotsiantis (2007) and Harrington (2012) provide the general steps required in developing a supervised ML system [3] [23]. These steps help to derive a classifier that best suits the specific problem at hand.
1. Identify and collect data – It is essential to identify the type of data that may be used in the learning system. This is essential because learning varies between different data types, and data properties such as size can affect the choice of learning techniques that a learning system requires. For example, real-valued time-series data and static nominal data might require different data pre-processing or ML techniques. Data can be collected manually or automatically depending on the nature of the application. Data source options are endless: data can be collected by scraping and extracting data from a website, an RSS feed or an API [23]. In the pervasive computing field, however, sensors are the usual source of data. There are many occasions when we might require publicly available data for experimental purposes; the UCI ML repository is a popular repository that provides useful data for such purposes [2]. In a few cases, knowledge about the fields (attributes, features) that are most informative can be exploited in data collection, which gives a more refined and minimized data set for learning. In cases where this is not possible, the simplest method is that of "brute force", which means measuring everything available [3]. After such data collection, the data might not be directly suitable for learning because of noise or missing values, so it requires significant pre-processing [32].
Figure 3: Major steps involved in designing a learning phase [3]
2. Pre-process data – This step often includes pre-processing of the data, such as
outlier removal and replacement of missing data points. Various techniques exist
for handling missing data [33] and for outlier detection; Hodge et al. (2004)
present a survey of the pros and cons of contemporary techniques for outlier
(noise) detection. It is essential to prepare the data in a usable format fashioned
after the specific algorithm under consideration: some algorithms require features
in a special format, some can deal with string types, some with integer types,
and some with both [23].
3. Definition of the training set (and the test set) – The training set is the heart of
supervised learning, and it is important to use an appropriate training set in order
to achieve an accurate prediction model. In this step, the input data are represented
as features that map to an observed output class or label from previous tasks. It
is common to have input data that require some form of transformation or feature
extraction in order to increase the accuracy of the target classifier. A standard
practice is to separate a subset of the collected data as test data, which comes in
handy in the evaluation step.
4. Selection of the algorithm – Since there are various classification techniques, and
prediction accuracy depends on the algorithm selected, it becomes necessary
to choose the most fitting technique. Choosing the most suitable algorithm
requires some examination of the data properties and their suitability for the
respective classifiers. It is also important to consider system resources and
algorithm requirements. Figure 3 shows a loop that allows preliminary testing of
algorithms and evaluation of their performance; this flow makes it easier to select
a satisfactory algorithm while considering the necessary factors.

How does one choose the best algorithm for a data set? In general, there are a
number of factors that need to be considered for algorithm selection. The crucial
ones are prediction accuracy, speed, interpretability (credibility versus comprehensibility)
and simplicity. Prediction accuracy relates to how well an algorithm learns from
a given data set. Speed relates to the time taken to learn from a given data set as
well as the time it takes to predict unknown instances (i.e. the speed of applying
the generated knowledge model). Interpretability of a generated model may be
desired in situations where understanding the learning process is essential. The
two general perspectives on interpretability of a hypothesis are the black-box and
white-box models: white-box models are generally easy to understand (e.g. decision
trees), while black-box models are difficult to understand (e.g. Support Vector
Machines (SVMs)). Credibility versus comprehensibility relates to the trade-off
between generating an understandable hypothesis (model) and a complex model
with possibly better accuracy. In general, rule-based learning algorithms produce
more understandable models than statistical ones [34]. Simplicity relates to the
amount of fiddling or parameter adjustment the algorithm needs: while some
algorithms require little or no parameter adjustment, others require an appreciable
amount of parameter tweaking.

Of these factors, prediction accuracy is the most important. A learning system
with decent speed, acceptable interpretability and simplicity but poor prediction
accuracy may still be considered a poor learning system. Hence, it is common to
make trade-offs between prediction accuracy and the other factors, as depicted
in Figure 4.
Figure 4: Main evaluation factors and tradeoffs
Other factors that may be taken into consideration when choosing a learning al-
gorithm include sensitivity to outliers, ability to handle missing values, ability
to handle non-vector data, ability to handle class imbalance, efficacy in high
dimensions, support for incremental learning, and extensions from Independent
and Identically Distributed (IID) data to dependent data (e.g. time series). It is
also useful to consider the nature of the features. Statistical learning methods
such as SVMs and ANNs often perform better with multi-dimensional and
continuous attributes, while rule-based systems such as decision trees are more
likely to perform exceptionally well on discrete (categorical) attributes [34]. The
ability to handle class imbalance (i.e. the ratio of classes in the training data)
might also be considered when choosing an algorithm, since the division of the
training data plays a crucial role in evaluating the performance of an algorithm.
If the True Positive (TP) and True Negative (TN) (see Section 9) instances
are almost equal in size, an algorithm tends to construct classifier models with
better performance. On the contrary, if the number of TP instances is extremely
small compared to the number of TN instances, the classifier tends to overfit the
positive instances and hence performs badly during the validation stage [34].
5. Execute the training process – This step involves the actual learning process, in
which the selected algorithm creates a model from the training examples defined
in step 3. The generated output can be stored in a readily usable format to serve
as the knowledge of the learning system.
6. Perform evaluation with the test set – One advantage of supervised learning is
that its accuracy can be easily determined. Evaluation determines the performance
of a supervised learning algorithm when it is necessary to verify the accuracy of
a learned model, and it is common practice to use a separate data set, the test
set (defined in step 3), to evaluate the generated classifier model. Section 9 shows
common measures used for evaluating ML algorithms. Figure 3 also shows a step
for parameter tuning; this applies to algorithms with parameters that can be
tuned to adjust performance, e.g. SVM and kNN.
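As a concrete illustration, the six steps above can be condensed into a minimal pure-Python sketch. The data set and the deliberately trivial "majority class" learner are hypothetical placeholders of our own, not an algorithm discussed in this report; the point is only the shape of the workflow (collect, split, train, evaluate).

```python
import random

random.seed(42)

# Steps 1-2: a hypothetical, already pre-processed data set of
# (feature_vector, label) pairs standing in for collected sensor data.
data = [([i + random.random(), i % 3], "high" if i > 10 else "low")
        for i in range(20)]

# Step 3: define the training set and hold out a test set.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Steps 4-5: a deliberately trivial "algorithm" for illustration only:
# predict the majority class seen in the training set.
def train_majority(train_set):
    counts = {}
    for _, label in train_set:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

model = train_majority(train)

# Step 6: evaluate prediction accuracy on the held-out test set.
accuracy = sum(1 for _, label in test if model == label) / len(test)
print(f"majority class: {model}, test accuracy: {accuracy:.2f}")
```

In a real system the majority-class rule would be replaced by one of the classifiers discussed in Section 6, and the single holdout split by one of the validation schemes of Section 9.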
These steps clearly associate and work best with supervised ML, since they focus on
learning from a training set. Mannila [25], however, identifies five main steps that
should be taken in knowledge discovery from data (i.e. unsupervised ML tasks).
These are:
1. understanding the domain,
2. preparing the data set,
3. discovering patterns (data mining),
4. post-processing of discovered patterns, and
5. putting the results into use.
Figure 5: Three stages of a machine learning research program. Current publishing incentives are highly biased towards the middle row only [35]
5.1 Impacting Real World Outside Machine Learning
As stated earlier, our research focus cuts across ML and pervasive computing. It is
therefore beneficial to identify the challenges that could limit the application of ML
in the pervasive environment, a field outside ML itself.
The steps highlighted in Figure 3 are the main aspects of conducting ML research.
Wagstaff (2012) [35], however, presents the wider stages that should be considered
when applying ML in a real-world domain (see Figure 5). These stages ensure research
whose results maximize the impact of ML on the applied domain. According to
Wagstaff, it is easy to run an existing implementation of an algorithm on a data set
downloaded from the Internet. It is difficult, however, to identify a problem for which
ML may offer a solution, determine what data should be collected, choose or extract
relevant features, select a suitable learning method, select an evaluation method,
interpret the results, involve domain experts, publicize the results to the relevant
scientific community, and persuade users to adopt the technique. To achieve research
that makes a difference in the applied domain, it is essential to follow through on all
these steps, as each one is a necessary component of any research program that aims
to have a real impact on the world outside ML. Wagstaff identifies this lack of
follow-through as a limitation of ML research, because many researchers focus only
on the middle row (i.e. the "machine learning" contribution) of Figure 5. While these
contributions are essential, it is important to give equal focus to the upper and lower
rows of Figure 5. In addition to the lack of follow-through, hyper-focus on benchmark
data sets and on abstract metrics are two tendencies in ML that limit its impact on
the larger world [35].
6 Supervised Learning
In this section, we take a closer look at supervised ML and discuss some common
algorithms used for supervised learning tasks.
6.1 Bayesian Classification
Bayesian algorithms are statistical learning algorithms based on Bayes' theorem.
Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities
from known values. The simplest algorithm in this paradigm is the Naive Bayes algo-
rithm, which assumes that the effect of the value of an attribute on the class attribute
is independent of the values of the other attributes, given the value of the class
attribute. This is commonly termed conditional independence [36]. Bayesian networks
are another type of Bayesian learning algorithm. They are appealing because they
address the overly restrictive conditional independence assumption of Naive Bayes by
modelling the probabilistic relationships between the variables of interest. Naive Bayes
works well with a small amount of data, handles multiple classes and works with
nominal values; it is, however, sensitive to how the input data is prepared [23]. The
Naive Bayes model is sometimes called a Bayesian classifier and in practice can work
surprisingly well [27]. The Naive Bayes algorithm provides a practical learning method
alongside other widely used algorithms such as decision trees and k-nearest neighbor
[36][5][37][12]. It has been successfully used in diagnosis and text-document
classification, and requires relatively little data to give satisfactory results in most
cases [38][39][40].
6.1.1 Bayes Theorem
Let x be a data instance, described by measurements made on n features. In Bayesian
terms, x is considered 'evidence'. Let w_i be some hypothesis that the data instance
x belongs to a specified class w_i from a set of c classes, w_i, i = 1, 2, ..., c. For
classification tasks, we are interested in the value of P(w_i|x), i.e. the probability
that the hypothesis w_i holds given the observed data instance x (i.e. the probability
that x belongs to class w_i, given that we know the feature description of x). The
values of P(w_i), P(x|w_i) and P(x) can be calculated from a given data set, while
Bayes' theorem can be used to calculate the posterior probability P(w_i|x). Bayes'
theorem states that

P(w_i|x) = P(x|w_i) P(w_i) / P(x), where P(x) = Σ_{i=1}^{c} P(x|w_i) P(w_i)

and P(w_i|x) is the posterior probability (a posteriori probability) of class w_i given
the value of x.
For example, suppose a data instance x (representing a patient) comes from a data set
similar to Table 1, and let x be described by the Age and Prescription features only,
with values 'young' and 'myope' respectively. Suppose that w_soft is the hypothesis
that the patient, x, will require a 'Soft' lens. Then P(w_soft|x) reflects the
probability that patient x requires 'Soft' lenses given that we know the age and
prescription. The prior probability P(w_soft) is the probability that any patient will
require 'Soft' lenses, regardless of age, prescription or any other information.
P(w_soft|x) is based on more information about x (the age and prescription) than the
prior probability P(w_soft), which is independent of x. Similarly, P(x|w_soft) is the
probability of x conditioned on w_soft, i.e. the probability that a patient from our
examples is young and has a myope prescription, given that we know that the patient
requires 'Soft' lenses. P(x) is the prior probability of x, i.e. the probability that
a patient from our system is 'young' and has a 'myope' prescription.
6.1.2 Naive Bayes classifier
This is one of the simplest density estimation methods, from which a standard classifi-
cation method in ML is formed. It has properties that make it widely adopted for
various purposes: it is easy to implement, fast to train and use as a classifier, and
deals with missing attributes easily. There are, however, many data sets for which
Naive Bayes does not do well, because attributes are treated as though they were
independent given the class; in addition, adding redundant attributes skews the
learning process. For numeric data, a probability distribution such as the Gaussian
(normal) distribution can be assumed over the data set. The only difference between
nominal and numeric attributes is that, instead of normalizing counts into
probabilities as in the former, we calculate the mean and standard deviation for each
class and each numeric attribute. The probability density function can then be used
to find the probability of a value x given the mean and standard deviation. For cases
where the distribution in the data set is not normal, the density function can easily
be replaced by an appropriate standard estimate of the distribution. Another solution
is to use kernel density estimation, which does not assume any distribution for the
attribute values [3]. A further possibility for numeric data is simply to discretize
the data first. We shall give two practical examples of Naive Bayes classification,
on a discrete and a continuous data type. General discussions of the methods and
merits of Naive Bayes are presented in [41][42].
Example with discrete data type
This example is based on the data set in Table 1. There are three possible classes –
none, soft, hard – and four attributes – age, prescription, astigmatic, tear-rate.
The lens data set [2] contains 24 instances; we shall take the first 23 instances as
the training set and use the last instance as the test data. Our task is then to
classify a new instance, x, with the following attribute values – age: presbyopic,
prescription: hypermetrope, astigmatic: yes, tear-rate: normal.
First, we calculate the prior probabilities, P(w_i), for the classes. This is simply
the ratio of each class to the total training set:

P(class = none) = 14/23
P(class = soft) = 5/23
P(class = hard) = 4/23
Next, we derive the posterior probabilities of x conditioned on each of the three
classes, i.e. P(x|class) for class = none, soft, hard.

For attribute age:
P(age = presbyopic|class = none) = 5/8
P(age = presbyopic|class = soft) = 2/8
P(age = presbyopic|class = hard) = 1/8

For attribute prescription:
P(prescription = hypermetrope|class = none) = 4/7
P(prescription = hypermetrope|class = soft) = 2/7
P(prescription = hypermetrope|class = hard) = 1/7

For attribute astigmatic:
P(astigmatic = yes|class = none) = 7/11
P(astigmatic = yes|class = soft) = 0/11
P(astigmatic = yes|class = hard) = 4/11

For attribute tear-rate:
P(tear-rate = normal|class = none) = 2/11
P(tear-rate = normal|class = soft) = 5/11
P(tear-rate = normal|class = hard) = 4/11
Hence,

P(x|none) = P(age = presbyopic|none) × P(prescription = hypermetrope|none)
            × P(astigmatic = yes|none) × P(tear-rate = normal|none)
          = 5/8 × 4/7 × 7/11 × 2/11
          = 0.04132          (1)
Note that we have a zero probability for P(astigmatic = yes|soft), since class soft
has no instance with attribute value astigmatic = yes in the training data set. This
is problematic, since it wipes out the information in the other probabilities
contributing to P(x|soft). A solution to the zero-count problem is to apply the
Laplace correction [43]: we assume that our training set is large enough that adding
one to each count makes a negligible difference to the estimated probabilities, yet
avoids the case of zero probabilities.
By default, conditional probabilities are derived using

P(X_i = x_k | w_j) = N_ijk / N_j,

and conditional probabilities with the Laplace correction are derived instead using

P(X_i = x_k | w_j) = (N_ijk + 1) / (N_j + k)

where N_ijk is the number of instances in the data set with X_i = x_k and class = w_j,
N_j is the number of instances in the data set with class = w_j, and k is the number
of possible values of X_i.
To cancel the effect of the zero probability, we apply the Laplace correction to the
probabilities for class = soft, i.e. 2/8, 2/7, 0/11, 5/11, giving (2+1)/(8+4),
(2+1)/(7+4), (0+1)/(11+4), (5+1)/(11+4) respectively.

P(x|soft) = P(age = presbyopic|soft) × P(prescription = hypermetrope|soft)
            × P(astigmatic = yes|soft) × P(tear-rate = normal|soft)
          = 3/12 × 3/11 × 1/15 × 6/15
          = 0.001818          (2)
P(x|hard) = P(age = presbyopic|hard) × P(prescription = hypermetrope|hard)
            × P(astigmatic = yes|hard) × P(tear-rate = normal|hard)
          = 1/8 × 1/7 × 4/11 × 4/11
          = 0.00236          (3)
Finally, we compute the (unnormalized) posterior probabilities in order to classify
the instance:

P(none|x) ∝ P(x|none) × P(none) = 14/23 × 0.04132 = 0.025151
P(soft|x) ∝ P(x|soft) × P(soft) = 5/23 × 0.001818 = 0.000395
P(hard|x) ∝ P(x|hard) × P(hard) = 4/23 × 0.00236 = 0.000410

According to Bayes' theorem, x is assigned to class w_i if

P(w_i|x) > P(w_j|x) for all j ≠ i.

∴, since P(none|x) > P(hard|x) > P(soft|x), we classify x as none.
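The discrete worked example above can be reproduced in a few lines of Python. Since Table 1 itself is not reproduced in this text, the priors and conditional counts are copied verbatim from the calculations above; the `score` helper and `K` constant are our own notation, not part of any library.

```python
# Class priors and conditional probabilities copied from the worked
# example above (Table 1 itself is not reproduced here).
priors = {"none": 14 / 23, "soft": 5 / 23, "hard": 4 / 23}

# P(attribute value | class) as (numerator, denominator) pairs, in the
# order age, prescription, astigmatic, tear-rate.
likelihoods = {
    "none": [(5, 8), (4, 7), (7, 11), (2, 11)],
    "soft": [(2, 8), (2, 7), (0, 11), (5, 11)],
    "hard": [(1, 8), (1, 7), (4, 11), (4, 11)],
}

K = 4  # the constant added to denominators in the example's correction

def score(cls, laplace=False):
    """Unnormalized posterior: P(class) * product of P(x_i | class)."""
    p = priors[cls]
    for num, den in likelihoods[cls]:
        if laplace:
            num, den = num + 1, den + K
        p *= num / den
    return p

# The Laplace correction is applied to class 'soft', whose zero count
# would otherwise wipe out the whole product.
scores = {cls: score(cls, laplace=(cls == "soft")) for cls in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # "none", as in the worked example
```

Normalizing by P(x) would not change the argmax, which is why the example can safely compare the unnormalized products.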
Example with continuous data type
When dealing with continuous values, we can assume that the values follow a Gaussian
distribution (probability density function) with mean µ and standard deviation σ,
defined by

g(x, µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))          (4)

so that P(x|w_i) = g(x, µ, σ).
Consider the data set given in Figure 6. It shows the readings of three accelerometer
sensors – sens A, sens B, sens C – which monitor the state of an object. The data was
collected for two different classes (states) of the object – equilibrium and motion.
The task is to predict the state of the object given the sensor readings
x = (sens A = 1.40, sens B = 0.80, sens C = 1.30).

Figure 6: Sensor-readings training data set

From the training set, we calculate the prior probabilities P(Equilibrium) = 6/10 and
P(Motion) = 4/10, and the mean µ and variance σ² given in Figure 7. Assuming a
Gaussian distribution in the training set, and using Eqn (4) with the given mean and
variance, we can calculate the probability P(x|w_i) for both classes.
P(sens_A = 1.40|Equilibrium) = 3.67e−101
P(sens_B = 0.80|Equilibrium) = 0.166
P(sens_C = 1.30|Equilibrium) = 5.684e−162

Figure 7: Calculated mean and variance

The posterior P(Equilibrium|x) ∝ P(x|Equilibrium) × P(Equilibrium) = 2.076e−263.

For the motion class:
P(sens_A = 1.40|Motion) = 0.6142
P(sens_B = 0.80|Motion) = 0.486
P(sens_C = 1.30|Motion) = 1.222

∴ P(x|Motion) = 0.6142 × 0.486 × 1.222 = 0.3647, and the posterior
P(Motion|x) ∝ P(x|Motion) × P(Motion) = 0.1459.
Since P(Motion|x) > P(Equilibrium|x), we predict the state of the system given x
as motion.
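A sketch of the same Gaussian Naive Bayes computation in Python follows. Because Figures 6 and 7 are not reproduced in this text, only the class priors come from the example; the per-sensor means and variances below are illustrative placeholders of our own, so the intermediate densities differ from those above (though the predicted class is still motion for these assumed values).

```python
import math

def gaussian(x, mu, var):
    # Eqn (4): Gaussian density with mean mu and variance var (sigma^2).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Priors from the example; the (mean, variance) pairs per sensor are
# illustrative placeholders, since Figure 7 is not reproduced here.
classes = {
    "equilibrium": (6 / 10, [(0.10, 0.01), (0.75, 0.04), (0.05, 0.02)]),
    "motion":      (4 / 10, [(1.35, 0.20), (0.90, 0.15), (1.10, 0.25)]),
}

x = (1.40, 0.80, 1.30)  # test readings for sens_A, sens_B, sens_C

posteriors = {}
for cls, (prior, stats) in classes.items():
    p = prior
    for xi, (mu, var) in zip(x, stats):
        p *= gaussian(xi, mu, var)  # P(x|class) under independence
    posteriors[cls] = p  # unnormalized posterior P(class|x)

prediction = max(posteriors, key=posteriors.get)
print(prediction)
```

Note that a density value such as 1.222 above can legitimately exceed 1, since Eqn (4) is a density, not a probability.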
6.2 Instance-Based Learners
Instance-based learning is a category of learning algorithms under the umbrella of
statistical learning [3]. It represents knowledge in terms of cases or experiences and
uses flexible matching methods to retrieve these known cases and apply them to new
situations. In contrast to most paradigms, which construct a general, explicit
description of the target model from a given training set, instance-based learning
techniques merely store the training examples. For every new instance, its relationship
to the previously stored examples is analyzed in order to assign a target function
value to the new instance. Examples of instance-based learning include nearest
neighbor and locally weighted regression methods. These techniques are termed lazy
learning [24] methods because they delay the learning process until a new instance
that needs to be classified or predicted arrives. They require less computation time
during the training phase than eager-learning algorithms (such as decision trees,
neural networks and Bayes nets), but more computation time during the classification
phase [3]. Aha (1997) and De Mantaras (1998) give general reviews of instance-based
learners [44][45]. A classic example of an instance-based learner is k-Nearest
Neighbor (kNN), one of the simplest and most straightforward instance-based
classifiers. Hence, in this study, we give a brief description of the nearest
neighbor algorithm.
Figure 8 illustrates how kNN classification works. The test instance (an unclassified
instance denoted by the object with a question mark) should be classified as either
male or female. If k = 3, the nearest 3 neighbors (the instances within the solid-line
circle) consist of 2 males and 1 female, so the test instance is classified as male.
However, if k = 5, the 5 nearest neighbors consist of 3 females and 2 males, so the
test instance is classified as female.
Figure 8: Illustration of kNN classification.
The kNN algorithm is an easy and effective way to classify data. It compares a new
unknown instance (i.e. a feature vector without a label) with every instance in the
training set (the training set is kept in memory for this purpose). It is based on
the principle that the instances within a data set will generally exist in close
proximity to other instances that have similar properties [46]. The comparison
involves calculating the distance of the new instance from the other instances based
on their respective features. The common approach is to normalize all features and
then compute the Euclidean distance of the new instance from the other instances
stored in memory. There are various distance metrics; some of the significant ones
are presented in Table 2 [3]. In practice, a given distance metric should maximize
the distance between two instances of different classes and minimize the distance
between two instances of the same class. Algorithm 6.1 [3] summarizes the kNN
classification method. The top k most similar instances are collected (k is a
non-negative integer, usually less than 20). The final action is to take a majority
vote among the k most similar instances, and the majority class becomes the class of
the unknown instance. To obtain better classification accuracy, many kNN algorithms
apply weighting schemes that alter the distance measurements and the voting influence
of each instance; Wettschereck et al. (1997) [47] present a survey of weighting
schemes.
Euclidean: D(x, y) = [ Σ_{i=1}^{m} |x_i − y_i|² ]^{1/2}
Minkowsky: D(x, y) = [ Σ_{i=1}^{m} |x_i − y_i|^r ]^{1/r}
Manhattan: D(x, y) = Σ_{i=1}^{m} |x_i − y_i|
Chebychev: D(x, y) = max_{i=1..m} |x_i − y_i|
Camberra: D(x, y) = Σ_{i=1}^{m} |x_i − y_i| / |x_i + y_i|
Kendall's Rank Correlation:
D(x, y) = 1 − (2 / (m(m−1))) Σ_{i=1}^{m} Σ_{j=1}^{i−1} sign(x_i − x_j) sign(y_i − y_j)

Table 2: Approaches to define the distance between two instances (x and y) [3]
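For illustration, a few of the metrics in Table 2 translate directly into Python functions. The vectors `x` and `y` below are arbitrary examples; note that the Camberra distance as defined here is undefined whenever x_i + y_i = 0, which a production implementation would need to guard against.

```python
import math

# A few of the distance metrics from Table 2, written for two
# equal-length feature vectors x and y.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def camberra(x, y):
    # Undefined when x_i + y_i == 0; callers must guard against that.
    return sum(abs(xi - yi) / abs(xi + yi) for xi, yi in zip(x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 3.0]
print(euclidean(x, y))   # sqrt(5) ≈ 2.236
print(manhattan(x, y))   # 3.0
print(chebychev(x, y))   # 2.0
```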
Algorithm 6.1: k-Nearest Neighbor(D = {(x_1, c_1), ..., (x_n, c_n)}, X)

  distances ← empty list
  for each instance (x_i, c_i) in D
    do d ← distance(x_i, X)
       append d to distances
  sort distances from lowest to highest
  return (the k nearest instances in distances)
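Algorithm 6.1, combined with the majority vote described above, can be sketched in Python as follows. The two-class training points are hypothetical values loosely mirroring Figure 8, not data from this report, and the Euclidean metric stands in for any of the distances in Table 2.

```python
from collections import Counter

def knn_classify(train, x, k=3):
    """Classify x by majority vote among its k nearest training instances.

    train: list of (feature_vector, class_label) pairs; distances use the
    Euclidean metric, as in the common approach described above.
    """
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    # Sort the training set by distance to x and keep the k nearest.
    neighbors = sorted(train, key=lambda inst: dist(inst[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data loosely mirroring Figure 8's two classes.
train = [([1.0, 1.0], "male"), ([1.2, 0.9], "male"), ([3.0, 3.1], "female"),
         ([2.9, 3.0], "female"), ([1.1, 1.2], "male"), ([3.2, 2.8], "female")]

print(knn_classify(train, [1.05, 1.1], k=3))  # "male"
print(knn_classify(train, [3.0, 3.0], k=3))   # "female"
```

As the text notes, a full implementation would normalize the features first and might weight each neighbor's vote by its distance.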
kNN has a number of advantages that make it an excellent choice for supervised
learning: it is highly accurate in most cases, insensitive to outliers and makes no
assumptions about the data. It works with both numeric and nominal values. A popular
result by Cover and Hart [48] shows that, under certain reasonable assumptions, the
error of the nearest neighbor rule is bounded above by twice the Bayes error [31].
The major drawback of instance-based learning is the high cost of computation, since
the algorithm needs to calculate the distance measurement for every piece of data in
the data set. It also requires a lot of memory, since it has to keep the full data
set. Unlike some other approaches, such as decision trees, kNN gives no insight into
the underlying structure of the data. In terms of classification performance, it does
not necessarily generalize well if the examples are not clustered, and it is
sensitive to the choice of the similarity function (distance metric) used to compare
instances [49]. The choice of k also affects the performance of kNN: if the value of
k is too large, the neighborhood can contain too many instances of false classes,
while an extremely small k gives a result that can be sensitive to noise points.
There is no principled way to choose k, except through some computationally expensive
method such as cross-validation [49].
6.3 Decision Trees
Algorithms in this paradigm use a decision tree as a predictive model. Decision trees
are commonly used for supervised classification as well as regression tasks, and
recent surveys claim that they are the most commonly used technique [50]. They have
been successfully used in various works such as activity recognition [12][11], and
also in location prediction, e.g. of mobile users' locations [36][9][5].
Figure 9 shows a decision tree in the form of a flowchart; it has decision blocks
(rectangles), terminating blocks (ovals) and branches in the form of arrows that lead
to either a decision block or a terminating block. A decision block is a non-leaf
node associated with a feature. The terminating blocks are the tree leaves that
represent class labels, while branches represent conjunctions of features that lead
to the labels. In a regression tree, the terminating block value is usually a real
number, while in a classification tree the terminating block value is a nominal value
belonging to a finite set of possible classes.
There are various types of decision tree in common use today. The most significant
ones among the most widely used algorithms are the C4.5 and CART trees [31]. C4.5
[51] is a successor of the Concept Learning System (CLS) [52] and Iterative
Dichotomiser 3 (ID3) [53] algorithms.

Figure 9: A decision tree in flowchart form

C4.5 can generate classifier models either in the form of decision trees or in the
form of a more comprehensible rule set. A C4.5-generated model is used only for
classification tasks, and it is commonly used as a benchmark against which newer
supervised learning algorithms are compared. By contrast, the CART decision tree,
presented in [54], is a binary decision tree that can be used for both classification
and regression tasks. C4.5 and CART adopt a greedy (i.e. non-backtracking) approach
that constructs trees in a top-down, recursive, divide-and-conquer manner. The
approach starts with a training set of examples and their associated class labels;
the training set is recursively partitioned into smaller subsets as the tree is
built. The method for generating a decision tree is shown in Listing 1.
The core step in decision tree construction is the second procedure in Listing 1: the
process of growing a decision tree by splitting a set S of instances. How do we
select the attribute test that determines the splitting of the training instances?
This process is commonly referred to as feature (attribute) selection: a heuristic
procedure for selecting the attribute that 'best' classifies the training data
according to the class labels. Typically, the feature that best divides the training
set becomes the root node of the tree. There are numerous measures for finding the
feature that best divides the training data, such as the Gini index used in CART
[54], the information gain used in ID3 [53] and the gain ratio used by the C4.5
algorithm [51].
If there are k classes {C1, C2, ..., Ck} and a training set, denoted by S, then:

- If all instances in S belong to the same class Ci, the decision tree is a leaf
  labeled with class Ci.
- Else, if S contains a mixture of classes, choose a test based on a single
  attribute with two or more outcomes {O1, O2, ..., On}. Set S is divided into
  subsets S1, S2, ..., Sn, where Si contains all instances in S with outcome Oi
  for the given test attribute.
- Recursively apply the procedure above to each subset {S1, S2, ..., Sn}.

Listing 1: Pseudo-code for building decision trees
Information gain, used in ID3 as the feature selection measure, determines how
important a given attribute of a feature vector is. The feature with the highest
information gain is selected as the splitting feature. The selected feature minimizes
the information required to classify the instances in the resulting subsets and
reflects the least randomness in those subsets. This measure favors a simple tree by
minimizing the expected number of tests required to classify a given set of
instances. Information gain is a measure of the change in entropy. The entropy of a
set S is the amount of information required to identify or classify an instance in
S; it is also defined as a measure of the uncertainty associated with a random
variable. Lower entropy corresponds to an ordered, patterned variable that is easy
to predict, since there is less uncertainty; higher entropy corresponds to a variable
with a high degree of randomness that is very unpredictable. Entropy is given by

Entropy(S) = − Σ_{i=1}^{n} p_i log2(p_i)

where n is the number of classes in S and p_i is the fraction of instances in S with
output value C_i.
Given a set S of instances with A as one of its attributes, let S_k be the subset of
S with attribute A = k, and let val(A) be the set of all possible values of A. The
information gain is then given by

Gain(S, A) = Entropy(S) − Σ_{k ∈ val(A)} (|S_k| / |S|) × Entropy(S_k)

where |S| and |S_k| denote the sizes of the respective sets.
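The entropy and information gain formulas above translate directly into Python. The four-instance data set below is a hypothetical toy example (its attribute names merely echo Table 1) constructed so that one attribute separates the classes perfectly and the other not at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum p_i * log2(p_i) over the class fractions of S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(instances, attr):
    """Gain(S, A): entropy reduction from splitting instances on attr.

    instances: list of (attribute_dict, class_label) pairs.
    """
    labels = [label for _, label in instances]
    gain = entropy(labels)
    # Subtract the size-weighted entropy of each subset S_k.
    for value in {features[attr] for features, _ in instances}:
        subset = [label for features, label in instances
                  if features[attr] == value]
        gain -= (len(subset) / len(instances)) * entropy(subset)
    return gain

# Toy set: 'tear_rate' separates the classes perfectly, 'age' does not,
# so tear_rate has the higher gain and would be chosen as the split.
data = [({"age": "young", "tear_rate": "reduced"}, "none"),
        ({"age": "young", "tear_rate": "normal"}, "soft"),
        ({"age": "old", "tear_rate": "reduced"}, "none"),
        ({"age": "old", "tear_rate": "normal"}, "soft")]

print(information_gain(data, "tear_rate"))  # 1.0
print(information_gain(data, "age"))        # 0.0
```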
As one of its main drawbacks, information gain is overly sensitive to, and biased
towards, features with a large number of possible outcomes. C4.5 therefore uses an
extension of information gain as its feature selection measure, known as the gain
ratio. The gain ratio attempts to overcome this bias by normalizing the information
gain with the split information value, defined as

SplitInfo_A(S) = − Σ_{i=1}^{k} (|S_i| / |S|) × log2(|S_i| / |S|)
This gives the measure of the information generated by dividing the training set S into k
subsets using a test on attribute A. The gain ratio is then defined as

GainRatio(S, A) = Gain(S, A) / SplitInfo_A(S)
C4.5 selects the attribute that maximizes the gain ratio as the splitting attribute. If a
split is near-trivial, the split information approaches zero and the ratio becomes
unstable. A constraint added to avoid this is to consider only tests whose information
gain is at least as large as the average gain over all tests examined [1]. C4.5 handles
both continuous and discrete attributes. To handle a continuous attribute A, C4.5
creates a threshold h and splits the instances into {A ≤ h, A > h}.
CART uses a different attribute selection measure known as the Gini index. The Gini
index measures the impurity of a set S instead of its entropy, and is given by

Gini(S) = 1 − Σ_{i=1}^{n} p_i²

where n is the number of classes and p_i is the probability that an instance in S belongs
to class C_i.
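The three measures above can be computed directly from class counts. The sketch below is a minimal illustration under our own helper names (they are not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2(p_i) over the class fractions of S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_k |S_k|/|S| * Entropy(S_k),
    where S_k is the subset of rows with attribute attr equal to k."""
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

For a perfectly balanced binary set the entropy is 1 bit and the Gini index is 0.5; an attribute that splits the set into pure subsets attains the maximal information gain.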
A major drawback of decision tree algorithms is that they can generate highly complex
decision trees that over-fit a training set. Overfitting is the use of models or procedures
that violate the Occam’s Razor (also known as parsimony) principle, which calls for
using models and procedures that contain all that is necessary for the modeling but
nothing more [55]. Overfitting results from using more terms than are necessary or
more complicated approaches than are necessary. A good example of overfitting is
using a model that accommodates both curved and linear relationships on a data set
that a linear model fits perfectly. As a negative consequence, overfitting increases the
complexity of a model without any associated benefit in performance. It sometimes can
result in poorer performance than a simple model. In decision trees, overfitting can be
avoided with methods such as pruning. Both C4.5 and CART include pruning, which
takes place after the initial tree has been generated.
Aside from the difference in attribute selection measure, another significant difference
between the C4.5 and CART algorithms is the allowed number of outcomes from each
split test. While C4.5 allows two or more outcomes, CART test outcomes are always
binary. Also, C4.5 uses a single-pass pruning algorithm derived from binomial confidence
limits, while CART prunes trees using a cost-complexity model whose parameters are
estimated by cross-validation [31].
In general, decision trees have many advantages which make them widely accepted for
both classification and regression tasks. A decision tree model (its knowledge
representation) is easy to understand and can be explained with simple boolean logic.
Decision trees also require little pre-processing, unlike other techniques that demand
steps such as data normalization and missing-data replacement. Some ML techniques
are suitable only for numeric features while others are more appropriate for nominal
features (relational rules work with nominal variables, while neural networks work only
with numerical variables); decision trees accommodate both. They are robust and
perform quite well even when assumptions made on the training set are violated in the
test set. Finally, decision trees work well with large amounts of data while requiring
only standard computing resources.
6.4 Support Vector Machines
SVM is a method for the classification of both linear and non-linear data. It uses a
non-linear mapping to transform the original training data into a higher dimension and
then searches for the linear optimal separating hyperplane within that new, higher
dimension. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane. To find the hyperplane, an SVM uses
support vectors (i.e. the essential training examples) and margins (defined by the
support vectors on either side of the hyperplane that separates the two data classes).
In spite of their extremely slow training time, SVMs have attracted considerable
attention because of their high accuracy: they can model complex non-linear decision
boundaries. Classification with a trained SVM is also computationally inexpensive,
since the chosen support vectors provide a compact description of the learned model.
SVMs can be applied to both classification and regression [1]. The idea of SVM revolves
around the notion of
margins. The shortest distance from the hyperplane to one side of its margin is equal to
the shortest distance from the hyperplane to the other side, and the "sides" of the
margin are parallel to the hyperplane [1]. Maximizing the margin (i.e. creating the
largest possible distance between the separating hyperplane and the instances on either
side) has been proven to reduce an upper bound on the expected generalization error
[3]. The method of separating data with a hyperplane exposes SVMs to the risk of
finding trivial solutions that overfit the training data. In practice, SVMs tend to resist
overfitting, even when the number of instances is lower than the number of attributes,
by using regularization techniques that trade training error against model complexity.
For linear SVMs, over-fitting is avoided by careful tuning of the regularization
parameter C; for non-linear SVMs, the same applies through selecting an appropriate
kernel and carefully tuning its parameters.
Figure 10: SVM optimal hyperplane and maximum margin
A separating hyperplane can be written as w · x + b = 0, where w = (w1, w2, ..., wn) is
a weight vector, n is the number of features, and b is a scalar often referred to as the
bias (with −b termed the threshold). If the training data is linearly separable, then a
pair (w, b) exists such that

w^T x_i + b ≥ +1, for all x_i ∈ P
w^T x_i + b ≤ −1, for all x_i ∈ N

with the decision rule given by

f_{w,b}(x) = sgn(w^T x + b)

where P and N indicate the positive and the negative side of the hyperplane respectively.
When it is possible to linearly separate two classes, an optimal separating hyperplane
can be found by minimizing the squared norm of the weight vector. This minimization
can be set up as a convex Quadratic Programming (QP) problem:

Minimize_{w,b} Φ(w) = (1/2) ‖w‖²  (5)

subject to y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.
In the case of linearly separable data, once the optimum separating hyperplane is found,
data points that lie on its margin are called the support vector points and the solution
is represented as a linear combination of only these points while other points are ignored
[3]. Since the number of support vector points selected by the SVM learning algorithm
is usually small, the model complexity is unaffected by the number of features
encountered in the training data. For this reason, SVMs are well suited to learning
tasks where the number of features is large with respect to the number of training
instances [3].
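As a small check of the separability constraints above, the sketch below verifies that a hypothetical pair (w, b) satisfies w^T x_i + b ≥ +1 on the positive instances and ≤ −1 on the negative ones. The toy points and the choice w = (1, 1), b = −3 are our own, purely for illustration:

```python
def decision(w, b, x):
    """f_{w,b}(x) = sgn(w . x + b): +1 on one side of the hyperplane, -1 on the other."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical toy data: P holds positive instances, N negative instances.
P = [(2.0, 2.0), (3.0, 3.0)]
N = [(0.0, 0.0), (1.0, 0.0)]
w, b = (1.0, 1.0), -3.0

# The pair (w, b) separates the classes with a functional margin of at least 1.
assert all(sum(wi * xi for wi, xi in zip(w, x)) + b >= 1 for x in P)
assert all(sum(wi * xi for wi, xi in zip(w, x)) + b <= -1 for x in N)
assert [decision(w, b, x) for x in P + N] == [1, 1, -1, -1]
```

The geometric margin of this hyperplane is 2/‖w‖; the SVM training problem (5) searches for the (w, b) that maximizes it.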
1) Introduce positive Lagrange multipliers α_i, one for each of the inequality
   constraints. This gives the Lagrangian:
       L_P ≡ (1/2)‖w‖² − Σ_{i=1}^{N} α_i y_i (x_i · w + b) + Σ_{i=1}^{N} α_i
2) Minimize L_P with respect to w, b. This is a convex quadratic programming problem.
3) In the solution, those points for which α_i > 0 are called "support vectors".

Listing 2: General pseudo-code for SVMs [3]
The maximum-margin criterion allows the SVM to choose among different candidate
hyperplanes. In cases where the SVM cannot find any separating hyperplane due to
misclassified instances, the problem can be addressed by using a soft margin that
accepts some misclassification of training instances [56].
For most real-world problems, data sets involve data for which no hyperplane exists
that successfully separates the positive instances from the negative instances (i.e.
linearly inseparable data, also called non-linearly separable data or nonlinear data for
short). SVM solves the inseparability problem in two main steps. In the first step, it
maps the original data onto a higher-dimensional space. Once the data has been
transformed into the new, higher space, the second step searches for a linear separating
hyperplane in the new space. The higher-dimensional space is referred to as the transformed feature space,
as opposed to the input space occupied by the training instances [3]. It is known that any
consistent training set can be made separable with an appropriately chosen transformed
feature space of a sufficient number of dimensions. Having linearly separable data in the
transformed feature space once again brings us to a quadratic optimization problem that
can be solved using the linear SVM formulation. The maximum marginal hyperplane
(linear separation) found in the new higher dimension space (transformed feature space)
corresponds to a non-linear separation (nonlinear hypersurfaces) in the original input
space.
Let Φ : R^d → H denote a mapping of the data from the original space to some
higher-dimensional space H. The training algorithm then depends on the data only
through dot products in H, i.e. on functions of the form Φ(x_i) · Φ(x_j). Kernels are a
special class of functions that allow these inner products to be computed directly in
the feature space, without performing the mapping Φ : R^d → H [57]. If there is a
kernel function K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), we only need to use K in
the training algorithm and never need to determine Φ explicitly. Selecting the
appropriate kernel function is vital, because the kernel defines the transformed feature
space in which the training set is classified. To find the best kernel function, it is
common practice to evaluate a range of potential settings and use cross-validation over
the training set to choose the best one [3]. This practice adds to the main limitation of
SVMs, their low speed of training.
SVM training is done by solving an N-dimensional QP problem, where N is the number
of examples in the training data set. Standard QP methods rely on large matrix
operations and time-consuming numerical computations that render them impractical
for large problems (computational complexity of O(n³)). A simpler algorithm known as
Sequential Minimal Optimization (SMO) addresses this: SMO can solve the SVM QP
problem without any extra matrix storage and without using numerical QP optimization
steps at all [58]. SMO is more efficient than standard QP methods, since its computation
is based only on the support vectors.
6.5 Hidden Markov Model
Morgan (2011) [1] identifies complex data types in data mining. Figure 11 summarizes
the identified complex data types. Sequence data are complex data types that require
unique learning methods. Examples include time-series data and symbolic sequences.
Time-series data is a sequence of numeric data recorded at equal time intervals (e.g.
temperature readings over time). Symbolic sequence data consists of events or nominal
data that are not observed at equal time intervals (e.g. customer shopping sequences
and web click streams). Most classification methods generate a model from feature
vectors. However, sequence classification is a challenging task, since the sequential
nature of the features is difficult to capture. ML sequence classification can be achieved
using: (1) feature-based classification, which transforms a sequence into a feature vector
and then applies conventional classification methods [1] (e.g. [9] [11]); (2) sequence
distance-based classification, where the distance function that measures the similarity
between sequences significantly determines the quality of the classification [1]; (3)
model-based classification, which uses statistical models such as HMM to classify
sequences; and (4) the recently proposed shapelets method, which achieves quality
classification results by using the time-series sub-sequences that can maximally
represent a class as features [1]. This section discusses a model-based classification method known
as Hidden Markov Model (HMM).
Most real-world processes generate observable outputs which can be described as signals.
A signal can take many forms, discrete or continuous, and can be stationary or
non-stationary (i.e. its properties may or may not change over time). A common
problem of interest is classifying such signals in terms of signal models. Signal models
have been used to realize important practical systems (e.g. prediction, recognition and
identification systems) in a highly efficient manner [59].

Figure 11: Complex data types for mining [1]

There are two broad classes of signal models, deterministic models and statistical
models [59]. Deterministic models usually exploit knowledge of some known specific
properties of the signal (e.g. the amplitude, frequency and phase of a sine wave, the
amplitudes and rates of exponentials, etc.). A statistical signal model instead tries to
characterize only the statistical properties of the signal. Gaussian processes, Poisson
processes, Markov processes and hidden Markov processes are all examples of statistical
signal models. The basic assumption of the statistical model is that the signal can be
well characterized as a parametric random process, and that the parameters of the
stochastic process can be estimated in a precise, well-defined manner [59].
HMM is a statistical/stochastic signal model in which the system being modeled is
assumed to be a Markov process with unobserved (hidden) states. To start, we shall
review the Markov process (Markov chain) and then extend the ideas to the class of HMMs.
6.5.1 Discrete Markov Model
Let’s consider a system which may be described at any time as being in one of a set D
of n distinct states, D = (X1, X2, ..., Xn). For such a system, the Markov assumption
states that the future is independent of the past, given the present. More generally, Xt
may depend on Xt−1, Xt−2, ..., Xt−m for a fixed m:

P(Xt | X1, ..., Xt−1) = P(Xt | Xt−1, Xt−2, ..., Xt−m)  (6)
Discrete random variables X1, ..., Xn form a discrete-time Markov chain if they respect
the graphical model in Figure 12.
Figure 12: A simple Markov chain
From the Markov assumption, Eqn (6), a First Order Markov model (m = 1) can be
written as

P(Xt | X1, ..., Xt−1) = P(Xt | Xt−1)  (7)

Figure 13 and Eqn (8), where m = 2, both represent the Second Order Markov model:

P(Xt | X1, ..., Xt−1) = P(Xt | Xt−1, Xt−2)  (8)
Figure 13: Second order Markov model
The limitation of a Markov chain is that it assumes the true state of the system is fully
observed. However, in real-world scenarios it is often difficult to perfectly observe and
represent the true state of a system (partially observable systems), for example because
of noise. HMM addresses this limitation by assuming there is hidden information that
represents the system, commonly known as latent or hidden variable(s). Figure 14
shows a partially observable system with its observable variables and latent variables.
Figure 14: Partially observable system
6.5.2 Hidden Markov Model
The Hidden Markov Model results from extending the Markov model: it is a doubly
embedded stochastic process with an underlying stochastic process that is not
observable (it is hidden) but can only be observed through another set of stochastic
processes that produce the sequence of observations. An HMM is characterized by the
following elements:
1. N , the number of states in the model.
2. T , length of the observation sequence.
3. A, the transition probability distribution, also called the transition probability
matrix, A = {a_ij}, where

a_ij = P(state X_j at time t+1 | state X_i at time t).

4. B, the observation (emission) symbol probability distribution/matrix,
B = {b_j(k_t)}, where

b_j(k_t) = P(observation k at time t | state X_j at time t).

5. π, the vector of initial state probabilities.
Given the values of the above elements N, T, A, B, π, the HMM can be used as a
generator to produce an observation sequence O = O1, O2, ..., OT, where each Ot is one
of the observation symbols. The complete parameter set of an HMM is denoted by
λ = (A, B, π).
There are three fundamental problems of HMMs: the evaluation problem, the decoding
problem and the learning problem [59] [60]. In the evaluation problem, given an HMM
λ = (A, B, π) and an observation sequence O, the task is to calculate P(O|λ) (i.e. the
probability that λ generated O). In the decoding problem, given the HMM λ = (A, B, π)
and the observation sequence O, the task is to find the most likely sequence of hidden
states that produced the observation sequence. In the learning problem, given some
training observation sequence O and the general structure of the HMM (i.e. the number
of hidden states and the set of visible observations), the task is to determine the HMM
parameters λ = (A, B, π) that best fit the observed data, i.e. the λ that maximizes
P(O|λ).
For illustration purposes, let’s consider a weather system with two states, Snowy and
Sunny. Assume the states of the system cannot be directly observed; instead we can
make three observations, walk, shop and ski, which depend on the state of the system.
Let the initial probabilities be π = [0.6, 0.4] for Snowy and Sunny respectively, the
transition probability matrix be

                  Snowy   Sunny
    A =  Snowy     0.7     0.3
         Sunny     0.4     0.6

and the observation probability matrix be

                  walk    shop    ski
    B =  Snowy     0.1     0.4     0.5
         Sunny     0.7     0.2     0.1
As can be noted, the given matrices are row stochastic. Assume we observe the sequence
(walk, shop, walk, ski), i.e. the observation sequence over time ti (i = 1, 2, 3, 4) is
O = [walk, shop, walk, ski]. How do we find the most likely true state sequence of the
weather system? In this problem, we have an HMM λ = (A, B, π) and a sequence of
observations O. Computing the likelihood of the observed sequence O being generated
by the HMM is an instance of the evaluation problem. Finding an optimal state
sequence for the underlying Markov process is an instance of the decoding problem. We
can find the most likely state of the system by computing the probability of all 2⁴ = 16
possible state sequences of length 4. Table 3 shows all possible state sequences, their
probabilities and a normalized probability column that sums to 1.
If π_{X0} is the probability of starting in state X0 (i.e. the initial probability), b_{X0}(O0)
is the probability of observing O0 in state X0, and a_{X0,X1} is the probability of
transiting from state X0 to state X1, the probability of a state sequence X is given by

P(X) = π_{X0} b_{X0}(O0) × a_{X0,X1} b_{X1}(O1) × ... × a_{X_{T−1},X_T} b_{X_T}(O_T)

Thus, for the given observation sequence (walk, shop, walk, ski), we can compute, say,

P(Snowy, Snowy, Sunny, Sunny) = 0.6(0.1) × 0.7(0.4) × 0.3(0.7) × 0.6(0.1) ≈ 0.000212
         State sequence                  Probability   Normalized probability
     1   Snowy, Snowy, Snowy, Snowy      0.000412      0.042787
     2   Snowy, Snowy, Snowy, Sunny      0.000035      0.003635
     3   Snowy, Snowy, Sunny, Snowy      0.000706      0.073320
     4   Snowy, Snowy, Sunny, Sunny      0.000212      0.022017
     5   Snowy, Sunny, Snowy, Snowy      0.000050      0.005193
     6   Snowy, Sunny, Snowy, Sunny      0.000004      0.000415
     7   Snowy, Sunny, Sunny, Snowy      0.000302      0.031364
     8   Snowy, Sunny, Sunny, Sunny      0.000091      0.009451
     9   Sunny, Snowy, Snowy, Snowy      0.001098      0.114031
    10   Sunny, Snowy, Snowy, Sunny      0.000094      0.009762
    11   Sunny, Snowy, Sunny, Snowy      0.001882      0.195451
    12   Sunny, Snowy, Sunny, Sunny      0.000564      0.058573
    13   Sunny, Sunny, Snowy, Snowy      0.000470      0.048811
    14   Sunny, Sunny, Snowy, Sunny      0.000040      0.004154
    15   Sunny, Sunny, Sunny, Snowy      0.002822      0.293073
    16   Sunny, Sunny, Sunny, Sunny      0.000847      0.087963

Table 3: State sequence probabilities
To find the optimal state sequence in the Dynamic Programming (DP) sense, we simply
choose the sequence with the highest probability [60], i.e. (Sunny, Sunny, Sunny,
Snowy). In the HMM sense, we instead choose the most probable state at each position
[60]. To achieve this, we sum the normalized probabilities in Table 3 that have Snowy in
the first position, and likewise the normalized probabilities that have Sunny in the first
position. This gives 0.188182 and 0.811818 respectively; hence HMM chooses Sunny as
the first state of the optimal state sequence. Repeating this for each element of the
sequence gives the results shown in Table 4, from which we find that the most probable
sequence in the HMM sense is (Sunny, Snowy, Sunny, Snowy).
                first element   second element   third element   fourth element
    P(Snowy)    0.188182        0.519576         0.228788        0.804029
    P(Sunny)    0.811818        0.480424         0.771212        0.195971

Table 4: HMM probabilities
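The brute-force computation behind Tables 3 and 4 can be sketched as below. The dictionaries encode the matrices of the example, with the emission row for Sunny taken as (0.7, 0.2, 0.1), the values consistent with the probabilities listed in Table 3:

```python
from itertools import product

states = ["Snowy", "Sunny"]
pi = {"Snowy": 0.6, "Sunny": 0.4}                      # initial distribution
A = {"Snowy": {"Snowy": 0.7, "Sunny": 0.3},            # transition matrix
     "Sunny": {"Snowy": 0.4, "Sunny": 0.6}}
B = {"Snowy": {"walk": 0.1, "shop": 0.4, "ski": 0.5},  # emission matrix
     "Sunny": {"walk": 0.7, "shop": 0.2, "ski": 0.1}}

O = ["walk", "shop", "walk", "ski"]                    # observation sequence

def seq_prob(X, O):
    """P(X) = pi_{X0} b_{X0}(O0) * prod_t a_{X(t-1),Xt} b_{Xt}(Ot)."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]
    return p

# Enumerate all 2^4 state sequences; the DP-optimal sequence is simply
# the one with the highest joint probability.
best = max(product(states, repeat=len(O)), key=lambda X: seq_prob(X, O))
assert best == ("Sunny", "Sunny", "Sunny", "Snowy")
```

For long sequences this enumeration grows exponentially; the Viterbi algorithm finds the same DP-optimal sequence in O(N²T) time.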
6.6 Comparing Supervised Learning Algorithms
In this section, we discuss the comparison of supervised learning algorithms presented
in [3]. The comparisons are based on accuracy, learning speed, classification speed,
tolerance to missing values, tolerance to redundant attributes and other factors. Note
that a detailed discussion of all the pros and cons of each algorithm, as well as their
empirical comparison, depends on the learning task under consideration. Table 5
summarizes the comparison.
In general, SVMs tend to attain better accuracy than other supervised learning
methods, especially when dealing with multi-dimensional and continuous features. For
discrete features, logic-based systems such as decision trees tend to perform better.
SVM usually requires a large sample size to achieve its maximum prediction accuracy,
while Naive Bayes requires a relatively small data set. When considering classification
speed, i.e. using a generated model for classification, kNN takes considerably longer,
since learning, which considers all instances, is delayed until a classification is required.
Algorithms such as kNN require complete instances in a data set to perform accurately;
hence kNN is not robust to missing values. Naive Bayes is naturally robust to missing
values, since missing values are simply not considered when computing the probabilities
and therefore have no impact on the final result. Irrelevant features are often present in
a data set. kNN is particularly sensitive to irrelevant attributes because of its
distance-based approach to classification. On the contrary, SVM is insensitive to the
number of dimensions and has a high tolerance to irrelevant features.
                                               Decision  Naive            Rule-
                                               Trees     Bayes  kNN  SVM  learners
    Accuracy in general                        **        *      **   **** **
    Speed of learning with respect to
      number of attributes and instances       ***       ****   **** *    **
    Speed of classification                    ****      ****   *    **** ****
    Tolerance to missing values                ***       ****   *    **   **
    Tolerance to irrelevant attributes         ***       **     **   **** **
    Tolerance to redundant attributes          **        *      **   ***  **
    Tolerance to highly interdependent
      attributes (e.g. parity problems)        **        *      *    ***  **
    Dealing with discrete/binary/
      continuous attributes                    ***       ***¹   ***² **³  ***⁴
    Tolerance to noise                         **        ***    *    **   *
    Dealing with danger of overfitting         **        ***    ***  **   **
    Attempts for incremental learning          **        ****   **** **   *
    Explanation ability/transparency of
      knowledge/classification                 ****      ****   **   *    ****
    Model parameter handling                   ***       ****   ***  *    ***

    ¹ not continuous  ² not directly discrete  ³ not discrete  ⁴ not directly continuous

Table 5: Comparing learning algorithms (**** represents the best and * the worst) [3]
The term bias measures the contribution to the error of the central tendency of the
classifier when trained on different data, while the term variance measures the
contribution to the error of deviations from that central tendency. Learning algorithms
with a high-bias profile usually generate simpler, highly constrained models that are
quite insensitive to data fluctuations, so that variance is low. Algorithms with a
high-variance profile usually generate more complex models that fit data variations
more readily. An example of a high-bias learner is Naive Bayes, while decision trees and
SVMs are examples of high-variance learners. Overfitting is a common problem with
high-variance model classes, but algorithms such as SVM avoid overfitting by adopting
techniques like regularization.
kNN is considered intolerant of noise because outliers can easily distort its similarity
calculation and hence lead to misclassification. On the contrary, decision tree learners
are resistant to noise because they employ pruning strategies that avoid overfitting
the data.
Naive Bayes requires little storage space during both the training and classification
stages: only the memory needed to store the prior and conditional probabilities. The
kNN technique requires large storage space for training, and its execution space is at
least as large as its training space. The execution space of non-lazy learners is usually
much smaller than their training space, since the output model is often a highly
condensed summary of the data set.
Online learning is a learning approach which allows incremental learning. Naive Bayes
and kNN can easily be used as incremental learners, while other methods require
greater effort to handle incremental learning tasks. Finally, a learner that requires
fewer runtime parameters to be tuned is considered easier to apply to a data set.
SVMs have more parameters than the other methods, while kNN has only one
parameter, k, which is easy to tune.
7 Unsupervised Learning
In this section, we briefly discuss clustering analysis and association analysis, which
are two common tasks in unsupervised learning. Under clustering analysis, we discuss
the classic k-means algorithm, and for association analysis, we review the Apriori
technique.
7.1 Clustering Analysis
Clustering analysis is the assignment of a set of observations into subsets (clusters) so
that the observations within the same cluster are similar according to some
pre-designated criterion or criteria. It is an unsupervised learning task that
automatically forms clusters of similar things and can be described as automatic
classification. The main difference is that in classification we know what we are looking
for, whereas in clustering we do not have such information. Since clustering produces
the same kind of result as classification but without predefined classes, it is sometimes
called unsupervised classification [23].
There are several methods employed for clustering analysis, including partitioning,
hierarchical, density-based and grid-based methods. The partitioning method involves
finding mutually exclusive clusters of spherical shape in a given data set. It is a
distance-based approach and often uses the mean (or medoid) to represent the cluster
center (i.e. the centroid). Its major drawback is inefficiency on large data sets. The
hierarchical method uses hierarchical decomposition (i.e. multiple levels) to find
clusters within a data set. This method is sensitive to merging and splitting, since it
cannot correct erroneous merges or splits. Density-based methods can find arbitrarily
shaped clusters: clusters are dense regions of objects in space separated by low-density
regions, and the density-based approach can filter out outliers. Grid-based methods use
a multiresolution grid data structure; they have the advantage of fast processing time,
which is typically independent of the number of data objects but dependent on grid size.
The k-means clustering algorithm belongs to the partitioning group of clustering
methods and is a common algorithm for unsupervised clustering tasks. It is called
k-means because it finds k unique clusters, and the center of each cluster is the mean of
the values in that cluster.
7.1.1 k-means clustering
The k-means algorithm finds k clusters in a given data set, where the number of
clusters k is user-defined. Each cluster is described by a single point called the centroid,
which lies at the center of all the points in the cluster. The algorithm starts by defining
the value of k; then k centroids are randomly initialized. Next, each point in the data
set is assigned to a cluster by finding the nearest centroid. The next step updates the
centroids: the mean of all the points in each cluster is computed and the centroid is
moved to that mean. These steps can be summarized as two main steps, data
assignment and relocation of the "means". Listing 3 shows the pseudo-code of the
k-means algorithm.
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment
    for every point in the data set:
        for every centroid:
            calculate the distance between the centroid and point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean

Listing 3: Pseudo-code for k-means clustering [23]
As seen from the pseudo-code, k-means is an easy-to-implement clustering method. The
algorithm converges when the cluster assignments no longer change. The complexity of
each iteration is O(N · k), where N is the number of instances in the data set, since this
is the required number of comparisons per iteration. A significant drawback of k-means
is that it can converge to a local minimum rather than the global minimum (i.e. it
sometimes gives a decent result but not necessarily the best one). It is also known to be
slow on large data sets [23]. Different distance measures can be employed for finding
the "closest" centroid, and the choice of distance measure also plays a role in the
performance of k-means. The problem of local minima can be mitigated by running the
algorithm multiple times with different starting centroids or by performing a limited
local search around the converged solution [31]. Despite its drawbacks, k-means
remains the most widely used partitioning clustering algorithm in practice. Its
advantages include simplicity, ease of understanding, reasonable scalability and the
ease with which it can be modified to handle streaming data [31].
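The pseudo-code in Listing 3 can be sketched in plain Python roughly as follows; the function name and the optional `init` parameter (which allows fixing the starting centroids instead of sampling them randomly) are our own additions for illustration:

```python
import random
from math import dist

def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster, until assignments stabilize."""
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # data assignment step
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                       # relocation of the "means"
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs; fixing init makes the run deterministic.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2, init=[(0, 0), (10, 10)])
```

Running this with several different random starts and keeping the solution with the lowest within-cluster distance is the standard remedy for the local-minima problem mentioned above.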
7.2 Association Analysis
Association analysis, or association rule learning, is a popular and well-researched
technique for discovering interesting relations between variables in large databases.
Association analysis is commonly applied in business-related areas such as marketing
promotions and customer relationship management. It is also applicable in the domains
of bioinformatics, web mining and scientific data analysis. The uncovered relationships
can be represented in two forms: association rules or sets of frequent items. An
association rule suggests that a strong relationship exists between two items, while a
frequent itemset is a list of items that commonly appear together.
Section 7 - Unsupervised Learning 45
    Action number   Items
    0               item1, item2
    1               item3, item4, item5, item6
    2               item1, item4, item5, item7
    3               item2, item1, item4, item5
    4               item2, item1, item4, item7

Table 6: A simple itemset
The two most essential concepts in association analysis are support and confidence. They
are used to select interesting itemsets. The support of an itemset is defined as the
percentage of the data set that contains the itemset. From Table 6, the support of
{item1} is 4/5 since it occurs in 4 of the 5 actions, and the support of {item1, item4}
is 3/5 because three of the five actions contain both item1 and item4. Support applies
to an itemset, so we can define a minimum support and filter out itemsets that do not
meet it. Confidence, on the other hand, is defined for an association rule such as
{item4} → {item5}. The confidence for this rule is defined as
support(item4, item5)/support(item4). The support of {item4, item5} is 3/5 and the
support of {item4} is 4/5, so the confidence of {item4} → {item5} is 3/4 = 0.75. This
means that the rule is correct for 75% of the actions in the data set that contain item4.
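The support and confidence calculations above are simple counting. Below is a minimal sketch over the data of Table 6 (the function names are our own, not a standard API):

```python
# The five "actions" of Table 6, each a set of items.
actions = [
    {"item1", "item2"},
    {"item3", "item4", "item5", "item6"},
    {"item1", "item4", "item5", "item7"},
    {"item2", "item1", "item4", "item5"},
    {"item2", "item1", "item4", "item7"},
]

def count(itemset, actions):
    """Number of actions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= a for a in actions)

def support(itemset, actions):
    """Fraction of the data set containing `itemset`."""
    return count(itemset, actions) / len(actions)

def confidence(antecedent, consequent, actions):
    """Confidence of the rule antecedent -> consequent."""
    return count(set(antecedent) | set(consequent), actions) / count(antecedent, actions)

print(support({"item1"}, actions))                # 0.8  (4/5)
print(support({"item1", "item4"}, actions))       # 0.6  (3/5)
print(confidence({"item4"}, {"item5"}, actions))  # 0.75 (3/4)
```

Working with raw counts and dividing once at the end keeps the 3/4 = 0.75 result exact.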
With support and confidence, we can quantify the success of the association analysis.
Assuming we want to find all sets of items with a support greater than 0.9, we could
generate a list of all combinations of items and then count how frequently each occurs.
This brute-force approach is unsuitable in most cases, especially for large data sets:
enumerating the different combinations of items would be time-consuming and highly
computationally expensive. Hence, this task requires an intelligent approach that finds
frequent itemsets in a reasonable amount of time. The Apriori principle allows us to
reduce the number of calculations we need to perform when learning association rules.
7.2.1 Apriori algorithm
Finding frequent itemsets (itemsets with frequency ≥ minimum support) is not trivial
because of combinatorial explosion [31]. Generating association rules with confidence
≥ a specified minimum confidence becomes an easy task once we have derived the frequent
itemsets. The Apriori principle helps us to reduce the number of possibly interesting
itemsets. Apriori is a seminal algorithm for finding frequent itemsets using candidate generation
[61]. The Apriori algorithm is based on the Apriori principle, which says that if an
itemset is frequent, then all of its subsets are frequent. In reverse, this means that
if an itemset is infrequent, then its supersets are also infrequent [23].
For example, consider Figure 15a: the diagram shows all possible combinations of items
from the set {0,1,2,3}. The first set is ∅, the null (empty) set. To calculate the
support for a given set, say {0,3}, we would go through all nodes, obtain the total
number of sets containing both 0 and 3, and divide that total by the number of actions.
Counting through all possible itemsets is not feasible for large data, as a data set
with N items generates 2^N − 1 possible itemsets. Even with N = 100 items, this
generates 1.26 × 10^30 possible itemsets, which is cumbersome on the processing side [23].
Figure 15: (A) All possible itemsets from the available set {0,1,2,3}. (B) Highlighted infrequent itemsets [23]
For each action a in the data set:
  For each candidate itemset c:
    Check whether c is a subset of a
    If so, increment the count of c
For each candidate itemset:
  If its support meets the minimum support, keep the itemset
Return the list of frequent itemsets
Listing 4: Pseudo-code for the Apriori algorithm [23]
Consider Figure 15b. Applying the reversed version of the Apriori principle, the shaded
itemset {2,3} is known to be infrequent. From this knowledge, we know that the itemsets
{0,2,3}, {1,2,3} and {0,1,2,3} are also infrequent. This implies that once we have derived
the support value for {2,3}, we do not have to compute the support of {0,2,3}, {1,2,3} and
{0,1,2,3}, since they cannot meet our requirements. This principle can be used to halt
the exponential growth of itemsets and speed up the computation time for listing frequent
itemsets [23].
The Apriori algorithm (Listing 4) requires a minimum support value and a data set as
inputs. The algorithm first generates a list of all candidate itemsets with one item.
It then scans the data set and calculates the support of each candidate itemset. The
itemsets that meet the minimum support requirement are then combined to make itemsets
with two elements. Again, the algorithm scans the data set and removes the itemsets
that do not meet the minimum support value. This procedure is repeated until no further
candidate itemsets can be formed.
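The scan-count-prune loop of Listing 4, together with the subset pruning justified by the Apriori principle, can be sketched as follows (a simplified illustration of our own, not the implementation from [23]; the transaction data repeats Table 6):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Scan the data set and keep candidates that meet min_support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune:
        # by the Apriori principle, every k-subset must itself be frequent.
        k += 1
        prev = list(level)
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

transactions = [{"item1", "item2"},
                {"item3", "item4", "item5", "item6"},
                {"item1", "item4", "item5", "item7"},
                {"item2", "item1", "item4", "item5"},
                {"item2", "item1", "item4", "item7"}]
result = apriori(transactions, min_support=0.6)   # 0.6 = "appears in >= 3 of 5"
```

On this data the loop stops at k = 3: both three-item candidates, {item1, item2, item4} and {item1, item4, item5}, are pruned because one of their two-item subsets is infrequent.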
8 Reinforcement Learning
The approach of reinforcement learning requires a system to learn from success and
failure, from reward and punishment. In the supervised learning approach, a system must
be told the correct move for each position it encounters, but such feedback is seldom
available [27]. In situations where the feedback is not available, such as in uncharted
territory, a system can learn a model from its own experience. The system also needs to
know the aftermath of its actions, whether good or bad. This aftermath, which is
a type of feedback to the learning system, is called reward or reinforcement [27].
Reinforcement learning is learning what to do (i.e. how to map situations to actions), so
as to maximize a numerical reward signal. Unlike other types of learning, a reinforcement
learner is not told which actions to take [20]. Supervised learning is an important form
of learning, but it is not adequate for learning from interaction and complex domains
[20][27]. For interactive problems, it is difficult to obtain examples of a desired behavior
that are both correct and representative of all the situations in which a system has to act
[20]. A reinforcement learning agent must be able to sense the state of the environment
(sensation), it must be able to take actions that affect the state of the environment
(action), and it must have a goal or goals relating to the state of the
environment [20].
A good example of reinforcement learning is an adaptive controller, which adjusts pa-
rameters of a petroleum refinery’s operation in real time. The controller optimizes the
yield/cost/quality trade-off based on specified marginal costs without sticking strictly
to the set points originally suggested by engineers. Another possible application of rein-
forcement learning can be seen in a mobile robot, which decides whether it should enter
a new room in search of more trash to collect or start trying to find its way back to its
battery recharging station. It makes its decision based on how quickly and easily it has
been able to find the recharger in the past. [20]. In general, reinforcement learning has
been successfully applied to various problems, including operation research, robotics,
telecommunications, games playing and economics/finance [20].
8.1 Elements of Reinforcement Learning
There are four main sub-elements to a reinforcement learning system: a policy, a reward
function, a value function, and, optionally, a model of the environment [20].
• A policy – defines the learning agent’s way of behaving at a given time. It is a
mapping from perceived states of the environment to the actions to be taken in
those states. A policy is essential to reinforcement learning since it alone is
sufficient to determine behavior.
• A reward (or reinforcement) function – defines the goal in a reinforcement-learning
problem. It maps perceived states (or state-action pairs) of an environment to a
single number, an immediate reward, indicating the underlying desirability of the
state. The main objective of a reinforcement learner is to maximize the total
reward it receives in the long run. The reward function helps a learning system
determine which events are good and which are bad.
• A value (or utility) function – unlike the reward function, which indicates what is
good in an immediate sense, a value function specifies what is good in the long run.
The value of a state is the total amount of reward a learner can expect to
accumulate over the future, starting from that state.
• A model of the environment – imitates the behavior of the environment. It helps
to predict the resultant next state and next reward given a state and an action.
From the elements of reinforcement learning above, we can define a learning model
consisting of:
1. The environment’s state st ∈ S, where S is the set of possible states.
2. An action at ∈ A(st), where A(st) is the set of possible actions available in state
st.
3. A rule that determines the learner’s action in each state, i.e. a policy, π(S) = a.
4. Rules of transition between states, i.e. a transition function, δ(S, a) = S′.
5. Rules that determine the scalar immediate reward of a transition from state S
given action a, i.e. a reward function, r(S, a).
6. Rules that determine the long-term reward (utility) of a state (or state-action
pair), i.e. a value function, Uπ(S).
These rules are usually stochastic. A reinforcement learner interacts with its environment
in discrete time steps. At each time t, the agent perceives or receives an observation
Ot, which includes the reward rt. The agent then chooses an action at from the set of
available actions, which is sent to the environment. The environment changes its
state from the current state st to a new state st+1, and the reward rt+1 associated with
the transition (st, at, st+1) is determined. As mentioned earlier, the goal of the learning
agent is to collect as much reward as possible. Action selection can be a function of
the history and can even be randomized. Figure 16 shows the relationship
between a learning agent and its environment.
Figure 16: Learning agent - environment
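The interaction cycle just described can be sketched as a minimal loop; the two-state environment and fixed policy below are purely illustrative:

```python
class ToyEnvironment:
    """Illustrative two-state environment: action 1 taken in state 0 pays off."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # r_{t+1} for the transition (s_t, a_t, s_{t+1})
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = (self.state + action) % 2   # s_t -> s_{t+1}
        return self.state, reward                # the agent's next observation

policy = {0: 1, 1: 0}                            # a fixed mapping pi(s) = a

env = ToyEnvironment()
state, total_reward = env.state, 0.0
for t in range(10):                              # discrete time steps
    action = policy[state]                       # choose a_t from A(s_t)
    state, reward = env.step(action)             # environment returns s_{t+1}, r_{t+1}
    total_reward += reward                       # goal: accumulate reward
```

The loop makes the division of labor concrete: the agent only maps states to actions, while the environment owns the transition and reward rules.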
Reinforcement learning is based on the Markov Decision Process (MDP), but differs in
the prior knowledge assumed about the model parameters. In an MDP, the set of
states S, the set of actions A, the transition probabilities T (S, a, S′) and the reward
function R(S, a, S′) are known, and the aim is to determine the optimal value function
and/or the optimal policy. This can be solved using dynamic programming methods such
as value iteration or policy iteration; these approaches are known as model-based
algorithms. In a true reinforcement learning problem, neither the transition model nor
the reward model is known in advance. In this case, algorithms such as temporal-difference
learning and Q-learning are employed.
8.1.1 Passive Reinforcement Learning
Passive learning involves learning with a fixed policy; the task here is to learn the
utilities of states (or state-action pairs), which could also involve learning a model of
the environment. Some of the methods used in learning the utilities of states are Direct
Utility Estimation (DUE), Adaptive Dynamic Programming (ADP) and temporal-difference
learning. DUE was invented in the area of adaptive control theory by Widrow and Hoff
[27]. The basic idea is that the utility of a state is the expected total reward from
that state onward (i.e. the expected reward-to-go), and each trial provides a sample of
this quantity for each state visited. From the observed samples, the algorithm calculates
the observed reward-to-go for each visited state and updates the estimated utility for
that state accordingly, by keeping a running average for each state in a table. DUE
succeeds in reducing the reinforcement learning problem to an inductive learning problem,
but it ignores the inter-dependency between states: the utility of each state equals its
own reward plus the expected utility of its successor states. Utility values should obey
the Bellman equations (9), but DUE does not exploit them. Hence, the algorithm often
converges extremely slowly and usually takes a large number of sequences to get close to
the correct values.
Uπ(S) = R(S) + γ ∑S′ T (S, π(S), S′) Uπ(S′)    (9)
ADP makes use of the Bellman equations to obtain the utilities of states Uπ(S). The ADP
approach takes advantage of the constraints among the utilities of states by learning the
transition model that connects them and solving the resulting Markov decision process
using a dynamic programming method. The transition model T (S, π(S), S′) and the reward
function R(S) are estimated from trials; the resulting learned models are subsequently
used in the Bellman equations (9) to calculate the utilities of the states. The Bellman
equations are linear, so they can be solved using any linear algebra method. It is also
possible to use an alternative approach – modified policy iteration – which uses a
simplified value iteration to update the utility estimates after each change to the
learned models T and R. This method converges quickly since the models T and R change
only slightly between updates. In both the DUE and ADP methods, the learning problem is
reduced to an MDP, which is then solved. Another way to bring the Bellman equations to
bear on the learning problem is to use the observed transitions to adjust the utilities
of the observed states so that they agree with the constraints of the Bellman equations
(9). This is the approach employed in temporal-difference learning. It uses the
constraints to adjust the utilities and does not solve the equations for all the states.
Uπ(S) ← (1 − α) Uπ(S) + α (R(S) + γ Uπ(S′))    (10)
Equation (10) presents the update rule for adjusting the utilities, where α is the
learning-rate parameter. The name temporal-difference is adopted because this update
rule uses the difference in utilities between successive states. The fundamental idea of
the temporal-difference method is to adjust the utility estimates towards the ideal
equilibrium that holds locally when the utility estimates are correct. Equation (10)
causes the agent to reach the equilibrium given by Equation (9); the difference, however,
is that the update involves only the observed successor S′, whereas the actual
equilibrium conditions involve all possible next states. Temporal-difference learning is
a model-free method since it does not need a transition model to perform its updates.
Temporal Difference (TD) and ADP are closely related: both try to make local adjustments
to the utility estimates in order to make each state ”agree” with its successors. A key
difference is that TD adjusts a state to agree with its observed successors, while ADP
adjusts the state to agree with all successors that might occur, weighted by their
probabilities. Another difference is that TD makes a single adjustment per observed
transition, whereas ADP makes as many as it needs to restore consistency between the
utility estimates and the environment model.
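The temporal-difference update of Equation (10) amounts to a one-line rule. Below is a minimal sketch of TD(0) policy evaluation on an illustrative three-state chain (the states, rewards and parameter values are our own choices, not from the source):

```python
# TD(0) policy evaluation on an illustrative chain 0 -> 1 -> 2 (terminal).
R = {0: 0.0, 1: 0.0, 2: 1.0}       # R(S): reward of each state
gamma, alpha = 0.9, 0.1            # discount factor and learning rate
U = {0: 0.0, 1: 0.0, 2: 0.0}       # utility estimates U(S) under the fixed policy

for episode in range(2000):
    for s, s_next in [(0, 1), (1, 2)]:   # the transitions observed under the policy
        # Equation (10): U(S) <- (1 - alpha) * U(S) + alpha * (R(S) + gamma * U(S'))
        U[s] = (1 - alpha) * U[s] + alpha * (R[s] + gamma * U[s_next])
    U[2] = R[2]                          # a terminal state's utility is its own reward

# The estimates approach the true utilities of Equation (9):
# U(1) -> R(1) + gamma * U(2) = 0.9 and U(0) -> R(0) + gamma * U(1) = 0.81
```

No transition model appears anywhere in the loop, which is exactly the model-free property discussed above: each update uses only one observed successor.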
Direct Utility Estimation (DUE) is a model-free approach that is well known for its
simplicity. It is easy to implement, and each update is fast, but it does not exploit
the Bellman constraints and therefore converges slowly. ADP is a model-based approach
and is harder to implement than DUE; the cost of each update is also high, since each
update is a full policy evaluation. It does, however, exploit the Bellman constraints
and converges faster in terms of the number of updates. TD is slower to converge than
ADP, but it uses less computation per observation and partially exploits the Bellman
constraints.
8.1.2 Active Reinforcement Learning
An active agent is free to decide what actions to take because it does not have a fixed
policy that determines its behavior. It updates its policy as it learns; it must also
consider what the outcomes of its actions may be and how they will affect the rewards it
receives. Adaptive dynamic programming can be used in this case, but it needs to be
modified to handle the freedom of an active learning agent. The same ADP learning
mechanism used for the passive agent can be employed to learn a complete model with
outcome probabilities for all actions. Taking into consideration that the agent has a
choice of actions, the utilities it needs to learn are those defined by the optimal
policy, and they obey the Bellman equations (11):
U(S) = R(S) + γ max_a ∑S′ P (S′ | S, a) U(S′)    (11)
Equation (11) can be solved using the value iteration or policy iteration algorithms to
obtain a utility function U that is optimal for the learned model [27]. The final issue is
what action to take at each step. An active learning agent can extract an optimal action
by a one-step look-ahead to maximize the expected utility; this is applicable if it uses
value iteration for Equation (11). If it uses policy iteration, the optimal policy is
already available. An ADP agent becomes a greedy agent when it simply follows the policy
suggested by the current value estimates. In such cases, the policy learned is not the
true optimal policy: choosing actions from it leads only to sub-optimal results, because
the learned model is not the same as the true environment (i.e. what is optimal in the
learned model can be sub-optimal in the true environment). Since the agent is not aware
of the true environment, it cannot obtain an optimal action for the true environment in
this way. To solve this problem, an agent must make a trade-off between exploitation, to
maximize its immediate reward, and exploration, to maximize its long-term reward. With
exploitation, we exploit our current knowledge to obtain a payoff; with exploration, we
gather more information about the world to discover actions that result in better
rewards. To explore, we typically need to take actions that do not seem best according to
our current model. Managing the trade-off between exploitation and exploration is a
critical issue in reinforcement learning; the basic intuition behind most approaches is
to explore more when knowledge is weak and to exploit more as knowledge grows. Approaches
that use pure exploitation often get stuck in bad policies, while those that use pure
exploration obtain better models but accumulate small rewards because of the cost of
exploring.
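A common, simple way to manage this trade-off is ε-greedy action selection: with probability ε the agent explores a random action, and otherwise it exploits its best-known action. A sketch (the value table is illustrative):

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Choose an action from {action: estimated value}; explore with prob. epsilon."""
    if rng.random() < epsilon:
        return rng.choice(list(values))          # explore: any action at random
    return max(values, key=values.get)           # exploit: best-known action

values = {"left": 0.2, "right": 0.7, "stay": 0.1}   # illustrative value estimates
random.seed(0)
picks = [epsilon_greedy(values, epsilon=0.1) for _ in range(1000)]
# With epsilon = 0.1, the greedy action "right" dominates the picks. Decaying
# epsilon over time implements "explore while knowledge is weak, then exploit".
```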
9 Performance Evaluation in Machine Learning
In this section, we present common evaluation metrics and methods used in ML. The
accuracy of supervised learning is easy to evaluate because of the nature of the learning
problems; for unsupervised learning, however, evaluating the accuracy of algorithms is
difficult. Jain et al. (1998) [62] support this claim: the authors state that the
validation of clustering structures is the most difficult and frustrating part of cluster
analysis, and suggest that it is best to rely on labeled data for model validation in
clustering (an unsupervised learning task).
It is crucial to evaluate the performance of learning algorithms, as this is a major
factor in the choice of algorithm. We first define some terms that will be used in this
section. In the following definitions, we speak in terms of positive examples, P
(examples of the main class of interest), and negative examples, N (all the other
examples apart from the main class of interest).
True positives, TP – the positive examples that were correctly predicted by the
classifier.
True negatives, TN – the negative examples that were correctly labeled by the
classifier.
False positives, FP – the negative examples that were incorrectly labeled as positive.
False negatives, FN – the positive examples that were mislabeled as negative.
9.1 Confusion Matrix
A confusion matrix is a table layout that allows visualization of the performance of an
algorithm (typically a supervised learning prediction algorithm). In unsupervised
learning, a similar layout is known as a matching matrix. In a confusion matrix, columns
represent the instances of a predicted class and rows represent the instances of an
actual class (see Table 7). The term confusion was adopted because the matrix makes it
easy to see whether the learning system is confusing two classes (i.e. commonly
mislabeling one as the other). It is a useful tool for analyzing how well a classifier
recognizes the examples of different classes by summarizing TP, TN, FP and FN. The TP
and TN values show where the classifier predicts correctly, and the FP and FN values
show where it predicts wrongly.
                          Predicted class
                          Yes     No      Total
Actual class   Yes        TP      FN      P
               No         FP      TN      N
               Total      P′      N′      P + N

Table 7: Confusion matrix with totals for positive and negative examples
For example, if a learning system has been trained to distinguish three classes A, B
and C, a confusion matrix summarizes the results of testing the algorithm for further
inspection. Consider a test set of 30 samples containing 5 samples of class A, 12
samples of class B and 13 samples of class C.
                            Predicted class
                            Class A   Class B   Class C   Total
Actual class   Class A      4         1         0         5
               Class B      5         6         1         12
               Class C      0         3         10        13
               Total        9         10        11        30

Table 8: Example of a confusion matrix with class totals
In this confusion matrix, of the 5 actual samples of class A, the system classifies 4
correctly and predicts 1 as class B. The matrix makes it easy to visually inspect the
prediction performance.
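Building such a matrix from actual and predicted labels is a simple counting exercise. The sketch below reconstructs Table 8 (the labels and the helper name are our own):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are the actual class, columns the predicted class (as in Table 7)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["A", "B", "C"]
# A test set matching Table 8: 5 samples of A, 12 of B, 13 of C.
actual = ["A"] * 5 + ["B"] * 12 + ["C"] * 13
predicted = (["A"] * 4 + ["B"] * 1 +              # class A: 4 correct, 1 as B
             ["A"] * 5 + ["B"] * 6 + ["C"] * 1 +  # class B: 5 as A, 6 correct, 1 as C
             ["B"] * 3 + ["C"] * 10)              # class C: 3 as B, 10 correct
matrix = confusion_matrix(actual, predicted, labels)
# matrix == [[4, 1, 0], [5, 6, 1], [0, 3, 10]], as in Table 8
```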
As seen in Table 7, the confusion matrix has an additional row and column that provide
the totals: P and N, as well as P′ and N′. P′ is the number of examples that were
labeled as positive (TP + FP), while N′ is the number of examples that were labeled as
negative (TN + FN). The total number of examples is TP + TN + FP + FN, which equals
P + N or P′ + N′.
9.2 Accuracy and Error rate
The accuracy, or recognition rate, of a classifier is the percentage of test set
examples that are correctly classified by the classifier [1], i.e.

Accuracy = (TP + TN) / (P + N)    (12)

The error rate, or misclassification rate, is the percentage of test set examples that
are misclassified. The error rate is simply 1 − accuracy(M), where accuracy(M) is the
accuracy of a model M. The error rate can also be computed as

error rate = (FP + FN) / (P + N)    (13)
To avoid misleading, overoptimistic estimates, it is best to perform the evaluation on a
different data set (the test set) from the one used to build the classifier (the training
set); this is the holdout method. The resubstitution error is an error rate estimated on
the training set instead of the test set. It is optimistic with respect to the true error
rate (and the corresponding accuracy estimate is likewise optimistic).
The distribution of classes within a data set may be uneven, with very few examples of
the class of interest (the positive class). In fraud detection systems, for example, the
positive class is fraud, which occurs very rarely. A classifier with 97% accuracy might
seem quite accurate, but what if only 3% of the training examples are actually fraud
cases? The classifier could be predicting all non-fraud activity correctly while
misclassifying the class of interest, fraud. This clearly makes accuracy an unacceptable
measure of the classifier’s performance in this case. The problem is referred to as the
class imbalance problem, and it requires measures that assess how well the classifier
performs on the positive examples and on the negative examples separately; the next
measures, sensitivity and specificity, can be used for these respectively [1].
9.3 Sensitivity and Specificity
Sensitivity and specificity are statistical measures of the performance of a
classification function. Sensitivity measures the proportion of actual positives that
are correctly identified as such (the TP recognition rate), while specificity measures
the proportion of negatives that are correctly identified as such (the TN recognition
rate).

Sensitivity = TP / P    (14)

Specificity = TN / N    (15)
It is easy to show that accuracy is a function of sensitivity and specificity:

Accuracy = sensitivity × P / (P + N) + specificity × N / (P + N)

Sensitivity and specificity are closely related to the concepts of type I and type II
errors in statistics [1]. A perfect classifier would have 100% sensitivity (i.e.
predicting all unknown class-A samples as class-A) and 100% specificity (i.e. not
predicting any non-class-A sample as class-A).
9.4 Precision, Recall and F-measure
Precision and recall are widely used measures for evaluating classification models.
Precision can be defined as a measure of exactness (i.e. the percentage of predicted
positive examples that are correct), while recall is a measure of completeness (i.e. the
percentage of positive examples that are predicted as such).

precision = TP / (TP + FP)

recall = TP / (TP + FN) = TP / P

There is somewhat of an inverse relationship between precision and recall. They are
typically used together, with recall values compared at a fixed value of precision, or
vice versa. In some situations, it is beneficial to combine precision and recall into a
single estimate; such combined measures are the F measure (also called the F1 score or
F-score) and the Fβ measure.

F = (2 × precision × recall) / (precision + recall)

Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

where β is a non-negative real number. The F measure is defined as the harmonic mean of
precision and recall, which gives equal weight to both. Similarly, the Fβ measure is a
weighted estimate of precision and recall, which assigns β² times as much weight to
recall as to precision. F2 and F0.5 are two commonly used Fβ measures [1].
We have discussed accuracy-based measures that are widely employed in estimating the
performance of a classifier. In addition to accuracy-based measures, further aspects can
be considered when comparing classifiers, e.g. speed, robustness and scalability (see
Sub-section 6.6).
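All of the measures in Sub-sections 9.2 to 9.4 follow directly from the four counts TP, TN, FP and FN. A compact sketch (the example counts are illustrative):

```python
def metrics(tp, tn, fp, fn, beta=1.0):
    """Accuracy-based measures computed from the four confusion-matrix counts."""
    p, n = tp + fn, tn + fp                     # actual positives / negatives
    precision = tp / (tp + fp)
    recall = tp / p                             # = sensitivity, Eq. (14)
    return {
        "accuracy": (tp + tn) / (p + n),        # Eq. (12)
        "error_rate": (fp + fn) / (p + n),      # Eq. (13)
        "sensitivity": recall,
        "specificity": tn / n,                  # Eq. (15)
        "precision": precision,
        "recall": recall,
        "f_beta": (1 + beta**2) * precision * recall
                  / (beta**2 * precision + recall),   # F for beta = 1
    }

m = metrics(tp=40, tn=45, fp=5, fn=10)
# accuracy = 85/100 = 0.85, recall = 40/50 = 0.8, precision = 40/45
```

With `beta=1.0` the `f_beta` entry reduces to the plain F measure, the harmonic mean of precision and recall.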
9.5 Methods for Obtaining Reliable Evaluation Measures
Using a set of data to derive a classifier and then estimating the performance of the
resulting model on the same data can produce fallaciously overoptimistic estimates, due
to overspecialization of the learning algorithm on the data. There are at least three
techniques that are used to estimate a classifier’s accuracy [3]. One simple method
involves splitting the data set into two-thirds for training and one-third for
performance evaluation. Another method is cross-validation, a statistical method widely
employed in ML for obtaining reliable classifier accuracy estimates. In n-fold
cross-validation, the data set is randomly partitioned into n mutually exclusive subsets,
or folds, D1, D2, . . . , Dn, each of approximately equal size. Training and testing are
performed n times. In iteration i, partition Di is reserved as the test set and the union
of the other partitions is used to build the classifier. For example, in the first round
of an n-fold validation, the subsets D2, . . . , Dn jointly serve as the training set for
the first model, which is tested on D1; the second round uses D1, D3, . . . , Dn to
obtain the second model, which is then tested on D2; and the remaining iterations
continue in the same fashion until the nth iteration. The average over all n rounds is
the final accuracy estimate. There are variants of cross-validation such as leave-one-out,
holdout and stratified cross-validation [8]. Leave-one-out validation uses a single
instance as the test sample while the algorithm is trained on all other instances. The
holdout method, also called 2-fold cross-validation, is the simplest variation of n-fold
cross-validation: the data set is randomly divided into two sets D1 and D2 of equal size;
while D1 is used for training, D2 is used for testing, and vice versa. The stratified
variant ensures that each fold contains approximately the same proportions of the class
labels. Stratified 10-fold cross-validation is recommended and often used for estimating
accuracy because of its relatively low bias and variance [1].
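The n-fold procedure described above needs only a random partition of the data and a train-and-score step per round. A sketch using the standard library (the placeholder "classifier" is purely illustrative):

```python
import random

def n_fold_indices(n_examples, n_folds, seed=0):
    """Randomly partition the example indices into n roughly equal folds."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(labels, train_and_score, n_folds=10):
    """Average score over n rounds; fold Di is held out as the test set in round i."""
    folds = n_fold_indices(len(labels), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / n_folds

labels = [x % 2 for x in range(100)]             # illustrative binary labels

def train_and_score(train_idx, test_idx):
    # Placeholder "classifier": always predicts the training majority class.
    majority = sum(labels[j] for j in train_idx) > len(train_idx) / 2
    return sum(labels[j] == majority for j in test_idx) / len(test_idx)

avg_accuracy = cross_validate(labels, train_and_score, n_folds=10)
```

Leave-one-out corresponds to `n_folds = len(labels)`, and the holdout method to `n_folds = 2`; a stratified variant would additionally shuffle within each class before slicing.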
10 Conclusions
Machine learning is a broad area that is applied in various fields. Its application in
areas such as data mining and AI, among many others, has led to numerous techniques and
algorithms. In general, these methods have aided many decision-making systems by learning
from data of different types. The recent interest in the Internet of Things and sensor
networks, which is expected to grow rapidly to 50 billion devices and beyond, has sparked
great interest in the application of ML in pervasive environments.
In view of this, this study is a rudimentary introduction to common ML techniques that
have been applied in the area of pervasive computing. The scope of this study is limited
to the methods that are identified as applicable, or that have potential application, in
the field of pervasive computing. These techniques are identified from previous works in
the field of pervasive environments as fundamental and essential methods. This report
presents the main types of ML – supervised, unsupervised and reinforcement learning –
with more focus on supervised learning, since it is the most commonly applied to learning
problems. The supervised learning algorithms discussed include Naive Bayes, kNN,
decision trees, SVM and HMM. Other methods include the unsupervised learning methods
k-means and the Apriori algorithm, used for clustering and association analysis
respectively, and finally reinforcement learning. Lastly, various performance and
accuracy metrics are highlighted, as well as methods for obtaining reliable evaluation
measures.
Bibliography
[1] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Tech-
niques. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, 3rd edition,
2011. ISBN 0123814790, 9780123814791.
[2] A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010. URL http:
//archive.ics.uci.edu/ml.
[3] S. B. Kotsiantis. Supervised machine learning: A review of classification tech-
niques. In Proceedings of the 2007 conference on Emerging Artificial Intelligence
Applications in Computer Engineering: Real Word AI Systems with Applications
in eHealth, HCI, Information Retrieval and Pervasive Technologies, pages 3–24,
Amsterdam, The Netherlands, The Netherlands, 2007. IOS Press. ISBN 978-1-
58603-780-2. URL http://dl.acm.org/citation.cfm?id=1566770.1566773.
[4] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective (Adap-
tive Computation and Machine Learning series). The MIT Press, aug 2012.
ISBN 0262018020. URL http://www.amazon.com/exec/obidos/redirect?tag=
citeulike07-20&path=ASIN/0262018020.
[5] Theodoros Anagnostopoulos, Christos B. Anagnostopoulos, Stathes Hadjiefthymi-
ades, Alexandros Kalousis, and Miltos Kyriakakos. Path prediction through data
mining. In Pervasive Services, IEEE International Conference on, pages 128–135,
2007. ID: 1.
[6] H. Kosorus, J. Honigl, and J. Kung. Using r, weka and rapidminer in time series
analysis of sensor data for structural health monitoring. In Database and Expert
Systems Applications (DEXA), 2011 22nd International Workshop on, pages 306–
310, 2011. ISBN 1529-4188. ID: 1.
[7] Teradata, 2013.
[8] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, Amsterdam, 3 edition, 2011.
ISBN 978-0-12-374856-0. URL http://www.sciencedirect.com/science/book/
9780123748560.
[9] T. Anagnostopoulos, C. Anagnostopoulos, and S. Hadjiefthymiades. Mobility pre-
diction based on machine learning. In Mobile Data Management (MDM), 2011 12th
IEEE International Conference on, volume 2, pages 27–30, 2011. ID: 1.
[10] M. Muhlenbrock, O. Brdiczka, D. Snowdon, and J. L Meunier. Learning to detect
user activity and availability from a variety of sensor data. In Pervasive Computing
and Communications, 2004. PerCom 2004. Proceedings of the Second IEEE Annual
Conference on, pages 13–22, 2004. ID: 1.
[11] Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. Activity recognition
using cell phone accelerometers. SIGKDD Explor.Newsl., 12(2):74–82, mar 2011.
URL http://doi.acm.org/10.1145/1964897.1964918.
[12] Ling Bao and Stephen S. Intille. Activity recognition from user-annotated acceler-
ation data. pages 1–17. Springer, 2004.
[13] Oliver Brdiczka, Patrick Reignier, and James L. Crowley. Supervised learning of an
abstract context model for an intelligent environment. In Proceedings of the 2005
joint conference on Smart objects and ambient intelligence: innovative context-
aware services: usages and technologies, sOc-EUSAI ’05, pages 259–264, New York,
NY, USA, 2005. ACM. ISBN 1-59593-304-2. URL http://doi.acm.org/10.1145/
1107548.1107612.
[14] Keith Worden and Graeme Manson. The application of machine learning to struc-
tural health monitoring. Philosophical Transactions of the Royal Society A: Mathe-
matical, Physical and Engineering Sciences, 365(1851):515–537, February 15 2007.
[15] J. M. Ko and Y. Q. Ni. Technology developments in structural health monitoring
of large-scale bridges. Engineering Structures, 27(12):1715–1725, 10 2005.
[16] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang. Semisupervised
learning of classifiers: theory, algorithms, and their application to human-computer
interaction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26
(12):1553–1566, 2004.
[17] B. D. Ziebart, D. Roth, R. H. Campbell, and A. K. Dey. Learning automation
policies for pervasive computing environments. In Autonomic Computing, 2005.
ICAC 2005. Proceedings. Second International Conference on, pages 193–203, 2005.
[18] L. Wu, G. Kaiser, D. Solomon, R. Winter, A. Boulanger, and R. Anderson. Im-
proving efficiency and reliability of building systems using machine learning and
automated online evaluation. In Systems, Applications and Technology Conference
(LISAT), 2012 IEEE Long Island, pages 1–6, 2012.
[19] Eric Horvitz, Jack Breese, David Heckerman, David Hovel, and Koos Rommelse.
The Lumière project: Bayesian user modeling for inferring the goals and
needs of software users. In Proceedings of the Fourteenth conference on Uncertainty
in artificial intelligence, UAI’98, pages 256–265, San Francisco, CA, USA, 1998.
Morgan Kaufmann Publishers Inc. ISBN 1-55860-555-X. URL http://dl.acm.
org/citation.cfm?id=2074094.2074124.
[20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduc-
tion. MIT Press, 1998. ISBN 0262193981. URL http://www.cs.ualberta.ca/
%7Esutton/book/ebook/the-book.html.
[21] D. J. Cook, M. Youngblood, E. O. Heierman III, K. Gopalratnam, S. Rao, A. Litvin,
and F. Khawaja. MavHome: an agent-based smart home. In Pervasive Computing
and Communications, 2003. (PerCom 2003). Proceedings of the First IEEE
International Conference on, pages 521–524, 2003.
[22] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc, New York, NY, USA,
1 edition, 1997. ISBN 0070428077, 9780070428072.
[23] Peter Harrington. Machine Learning in Action. Manning Publications Co, Green-
wich, CT, USA, 2012. ISBN 1617290181, 9781617290183.
[24] Tom M. Mitchell. Machine learning and data mining. Commun. ACM, 42(11):30–36,
November 1999. URL http://doi.acm.org/10.1145/319382.319388.
[25] H. Mannila. Data mining: machine learning, statistics, and databases. In Sci-
entific and Statistical Database Systems, 1996. Proceedings., Eighth International
Conference on, pages 2–9, 1996.
[26] Ming Xue and Changjun Zhu. A study and application on machine learning of
artificial intelligence. In Artificial Intelligence, 2009. JCAI ’09. International Joint
Conference on, pages 272–274, 2009.
[27] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 3rd edition, December 2009.
[28] M. Kholghi, H. Hassanzadeh, and M. R. Keyvanpour. Classification and evaluation
of data mining techniques for data stream requirements. In Computer Communica-
tion Control and Automation (3CA), 2010 International Symposium on, volume 1,
pages 474–478, 2010.
[29] Frances Y. Kuo and Ian H. Sloan. Lifting the curse of dimensionality. Notices of
the AMS, 52:1320–1329, 2005.
[30] Pat Langley and Herbert A. Simon. Applications of machine learning and rule
induction. Commun. ACM, 38(11):54–64, November 1995. URL http://doi.acm.
org/10.1145/219717.219768.
[31] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi
Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou,
Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data
mining. Knowl. Inf. Syst., 14(1):1–37, December 2007. URL http://dx.doi.org/
10.1007/s10115-007-0114-2.
[32] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining.
Applied Artificial Intelligence, 17:375–381, 2003.
[33] Gustavo E. A. P. A. Batista and Maria Carolina Monard. An analysis of four
missing data treatment methods for supervised learning. Applied Artificial Intelli-
gence, 17(5-6):519–533, 2003. URL http://dx.doi.org/10.1080/713827181.
[34] Aik Choon Tan and David Gilbert. An empirical comparison of supervised ma-
chine learning techniques in bioinformatics. In Proceedings of the First Asia-Pacific
bioinformatics conference on Bioinformatics 2003 - Volume 19, APBC ’03, pages
219–222, Darlinghurst, Australia, Australia, 2003. Australian Computer Society,
Inc. ISBN 0-909-92597-6. URL http://dl.acm.org/citation.cfm?id=820189.
820218.
[35] Kiri Wagstaff. Machine learning that matters. CoRR, abs/1206.4656, 2012.
[36] Theodoros Anagnostopoulos, Christos Anagnostopoulos, Stathes Hadjiefthymiades,
Miltos Kyriakakos, and Alexandros Kalousis. Predicting the location of mobile
users: a machine learning approach. In Proceedings of the 2009 international con-
ference on Pervasive services, ICPS ’09, pages 65–72, New York, NY, USA, 2009.
ACM. ISBN 978-1-60558-644-1. URL http://doi.acm.org/10.1145/1568199.
1568210.
[37] U. Weiss, P. Biber, S. Laible, K. Bohlmann, and A. Zell. Plant species classification
using a 3D LIDAR sensor and machine learning. In Machine Learning and Applications
(ICMLA), 2010 Ninth International Conference on, pages 339–345, 2010.
[38] A. M. Bidgoli and M. Boraghi. A language independent text segmentation technique
based on naive Bayes classifier. In Signal and Image Processing (ICSIP), 2010
International Conference on, pages 11–16, 2010.
[39] M. J. Meena and K. R. Chandran. Naive Bayes text classification with positive
features selected by statistical method. In Advanced Computing, 2009. ICAC 2009.
First International Conference on, pages 28–33, 2009.
[40] S. Viaene, R. A. Derrig, and G. Dedene. A case study of applying boosting naive
Bayes to claim fraud diagnosis. Knowledge and Data Engineering, IEEE Transac-
tions on, 16(5):612–620, 2004.
[41] David J. Hand and Keming Yu. Idiot’s Bayes—not so stupid after all? International
Statistical Review, 69(3):385–398, 2001.
[42] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian
classifier under zero-one loss. Mach. Learn., 29(2-3):103–130, November 1997. URL
http://dx.doi.org/10.1023/A:1007413511361.
[43] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian
classifier under zero-one loss. Machine Learning, 29(2-3):103–130, 1997. URL
http://dx.doi.org/10.1023/A%3A1007413511361.
[44] David W. Aha. Lazy learning. Kluwer Academic Publishers, Norwell, MA, USA,
1997. ISBN 0-7923-4584-3.
[45] Ramon López de Mántaras and Eva Armengol. Machine learning from examples:
Inductive and lazy methods. Data & Knowledge Engineering, 25(1-2):99–123, 1998.
[46] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Inf.
Theor., 13(1):21–27, September 2006. URL http://dx.doi.org/10.1109/TIT.
1967.1053964.
[47] Dietrich Wettschereck, David W. Aha, and Takao Mohri. A review and empirical
evaluation of feature weighting methods for a class of lazy learning algorithms.
Artificial Intelligence Review, 11:273–314, 1997.
[48] T. Cover and P. Hart. Nearest neighbor pattern classification. Information Theory,
IEEE Transactions on, 13(1):21–27, 1967.
[49] Gongde Guo, Hui Wang, David A. Bell, Yaxin Bi, and Kieran Greer. KNN model-
based approach in classification. In Robert Meersman, Zahir Tari, and Douglas C.
Schmidt, editors, On The Move to Meaningful Internet Systems 2003: CoopIS,
DOA, and ODBASE - OTM Confederated International Conferences, CoopIS,
DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003, volume
2888 of Lecture Notes in Computer Science, pages 986–996. Springer, 2003. ISBN
3-540-20498-9.
[50] Giovanni Seni and John Elder. Ensemble Methods in Data Mining: Improving
Accuracy Through Combining Predictions. Morgan and Claypool Publishers, 2010.
ISBN 1608452840, 9781608452842.
[51] J. Ross Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann
Series in Machine Learning). Morgan Kaufmann, 1 edition, October 1992. ISBN
9781558602380.
[52] E. B. Hunt, J. Martin, and P. Stone. Experiments in Induction. Academic Press,
New York, 1966.
[53] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
URL http://dx.doi.org/10.1007/BF00116251.
[54] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and
Regression Trees. Chapman & Hall, New York, 1984.
[55] Douglas M. Hawkins. The problem of overfitting. Journal of Chemical Information
and Computer Sciences, 44(1):1–12, 2004.
[56] K. Veropoulos, C. Campbell, and N. Cristianini. Controlling the sensitivity of
support vector machines. In Proceedings of the International Joint Conference on
Artificial Intelligence (IJCAI99), Workshop ML3, page 55, Stockholm, Sweden,
1999.
[57] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Advances in
kernel methods, chapter Kernel principal component analysis, pages 327–352. MIT
Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3. URL http://dl.acm.
org/citation.cfm?id=299094.299113.
[58] John C. Platt. Using analytic QP and sparseness to speed training of support vector
machines. In Proceedings of the 1998 conference on Advances in neural information
processing systems II, pages 557–563, Cambridge, MA, USA, 1999. MIT Press. ISBN
0-262-11245-0. URL http://dl.acm.org/citation.cfm?id=340534.340735.
[59] Lawrence R. Rabiner. Readings in speech recognition, chapter A tutorial on hidden
Markov models and selected applications in speech recognition, pages 267–296.
Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, 1990. ISBN 1-55860-
124-4. URL http://dl.acm.org/citation.cfm?id=108235.108253.
[60] Mark Stamp. A revealing introduction to hidden Markov models, 2004.
[61] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association
rules in large databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo,
editors, VLDB’94, Proceedings of 20th International Conference on Very Large Data
Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 487–499. Morgan
Kaufmann, 1994. ISBN 1-55860-153-8.
[62] Anil K. Jain and Richard C. Dubes. Algorithms for clustering data. Prentice-Hall,
Inc, Upper Saddle River, NJ, USA, 1988. ISBN 0-13-022278-X.