Machine Learning
Definitions and objectives
http://membres-liglab.imag.fr/bisson/cours/M2INFO-AIW-ML/
G.Bisson - 2012
What is learning ?
✦ For a living organism, the capacity to acquire new behaviors :
‣ Genetic mechanisms (Evolution)
‣ Chemical mechanisms (Neurons)
‣ Cultural mechanisms (Symbols).
✦ Some classical learning methods
‣ Association rules («stimuli - feedback»)
‣ Trial and error («generate and test» method)
‣ Explanation,
‣ Analogy, ...
[Figure: behaviors range from «Instinct» to «Learned»]
Learning is acquiring new knowledge, behaviors, skills and may involve synthesizing different types of information. Learning may occur as a result of habituation or classical conditioning or as a result of more
complex activities such as play or studies.
In the case of Machine Learning ...
✦ Holistic definitions
‣ « Learning is making useful changes in mind » (M. Minsky 1985)
‣ « Learning is the organization of experience » (Scott 1983)
✦ More reductionist definitions
‣ « Learning is constructing or modifying representations of what is being experienced » (R. Michalski)
‣ « Learning aims at increasing the performance of a system on a given task by using a set of experiments » (T. Mitchell)
✦ Two modes of learning
‣ «Batch» or «Off line» learning : can be used to analyse an existing dataset in order to generate a model
‣ «On line» or «incremental» learning : needed when a system must be able to adapt its behavior «rapidly»
[Figure: diagrams linking Data, Learning, model M and Task]
Example of learning tasks
‣ Supervised learning or discrimination
‣ Unsupervised learning or clustering
‣ Case based reasoning
‣ Reinforcement learning
[Figure: one learning diagram per task, with labels Learning, M, Concept and Descriptions]
Main Learning Tasks & Models
✦ Learning goal can be explicit or implicit:
‣ Supervised learning or discrimination
‣ Unsupervised learning or clustering
‣ Semi-supervised learning
✦ Learner interacts with its environment:
‣ Active learning (select examples)
‣ Reinforcement learning (actions)
✦ Combination of models: Meta-learning (Boosting)
✦ Transfer learning:
‣ Multi-task learning
‣ Analogical reasoning
✦ Case Based Reasoning (no model)
[Figure: Environment → Examples → Learning → Model → Predict, with Interact, Background knowledge and Meta-Model]
The simplest ML technique
✦ Compute «mean» and «standard deviation»
‣ We collect several examples

        Weight     Life span
E1      3.5        12
E2      4          15
E3      5.2        11
Mean    4.2±0.9    12.7±2.1

‣ Then we can define a «mean cat» ...
‣ ... and thus a model:
๏ Unsupervised : barycenter and sigma
๏ Supervised : IF Weight in [3.3, 5.1] and … THEN Cat
✦ Predict an unknown value
‣ Classical regression problem: find f() such that y = f(x)
[Figure: regression line y = 1.1x + 1.56]
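The «mean cat» model can be sketched in a few lines of Python. This is only an illustration: the function names `summarize` and `is_cat` are hypothetical, and the sketch assumes the sample standard deviation (dividing by n-1), which is what reproduces the 4.2±0.9 and 12.7±2.1 values of the slide.

```python
from statistics import mean, stdev

# The three examples of the slide (weight, life span).
weights = [3.5, 4.0, 5.2]
life_spans = [12, 15, 11]

def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return mean(values), stdev(values)

def is_cat(weight, center, spread):
    """Hypothetical supervised rule: accept a weight within center ± spread."""
    return center - spread <= weight <= center + spread

w_mean, w_std = summarize(weights)
l_mean, l_std = summarize(life_spans)
print(f"Weight: {w_mean:.1f}±{w_std:.1f}, Life span: {l_mean:.1f}±{l_std:.1f}")
```

With so few examples the interval is very rough; the point is only that a mean and a spread already form a (tiny) model.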
Statistics / Data analysis
✦ An ancestor of Machine Learning
‣ Very first documents: censuses, commodity prices, ...
‣ 18th century: statistics for insurance (death rate)
‣ Thomas Bayes (1702-1761): An Essay Towards Solving a Problem in the Doctrine of Chances
๏ Bayes' theorem : P(A|B) = P(B|A) P(A) / P(B)
๏ Allows computing the probability of the rule : If B=fever Then A=Disease
‣ Pierre-Simon de Laplace (1749-1827) : Essai philosophique sur les probabilités
‣ 1890 : first American census using a punch card system (H. Hollerith 1860-1929, whose company later became part of IBM)
‣ R. A. Fisher (1890-1962): Data Analysis
๏ Descriptive statistics → how to summarize data (mean, median, ...)
๏ Statistical inference → how to predict values (regression, ..., «data mining»)
Statistics is the study of how to collect, represent, analyze, explain ... datasets
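As an illustration of the rule «If B=fever Then A=Disease», Bayes' theorem P(A|B) = P(B|A) P(A) / P(B) can be evaluated as follows. The probabilities used here are invented for the example and do not come from the course; `posterior` is a hypothetical helper name.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' theorem, with P(B) expanded over A and not-A."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Assumed numbers: 1% of patients have the disease, 90% of sick patients
# have fever, and 10% of healthy patients have fever too.
p_disease_given_fever = posterior(0.01, 0.9, 0.1)
print(f"P(Disease | fever) = {p_disease_given_fever:.3f}")
```

Even with a symptom that is frequent among sick patients, the rule stays weak when the disease itself is rare.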
A short history of ML
✦ 1957 : Checkers (A. Samuel from IBM)
‣ Learns to weight the criteria of an evaluation function : E(C) = Σ Ki Ci
‣ First AI game to reach a good level of play
✦ 1962 : Perceptron (F. Rosenblatt)
‣ Application of the Hebbian theory (D. Hebb, 1949)
‣ Learning rule : Wj(t+1) = Wj(t) + α[y − f(x)]·xj
‣ Converges if the dataset is linearly separable
✦ 1970 : Meta-Dendral (B. Buchanan)
✦ 1980 : AM and Eurisko (D. Lenat)
✦ 1983 : ID3 (R. Quinlan), Induce, Cluster (R. Michalski), ...
✦ 1986 : Neural Network (backpropagation) (D.E. Rumelhart)
✦ 1990-2007 : the ML domain soars
‣ Many open-source and commercial tools in « Data Mining »
‣ Many approaches : Genetic, CBR, Meta-learning, Bayesian network, SVM, ...
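The perceptron learning rule Wj(t+1) = Wj(t) + α[y − f(x)]·xj can be sketched as below. This is a minimal illustration, not Rosenblatt's original formulation: the threshold activation, the bias handled as an extra constant input, and the toy AND dataset are assumptions made for the example.

```python
def predict(weights, x):
    """Threshold unit: f(x) = 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0

def train_perceptron(examples, alpha=0.5, max_epochs=100):
    """Apply Wj <- Wj + alpha * (y - f(x)) * xj until an epoch has no error."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)      # last weight is the bias
    for _ in range(max_epochs):
        errors = 0
        for x, y in examples:
            xb = list(x) + [1.0]          # constant input for the bias
            delta = y - predict(weights, xb)
            if delta != 0:
                errors += 1
                weights = [w + alpha * delta * v for w, v in zip(weights, xb)]
        if errors == 0:                   # every example is well classified
            break
    return weights

# Toy linearly separable dataset: logical AND.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND)
print([predict(weights, list(x) + [1.0]) for x, _ in AND])
```

On a dataset that is not linearly separable (XOR for instance) the loop would simply stop at `max_epochs` without reaching zero errors.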
A brief return to Expert Systems
✦ How it works
‣ Knowledge is expressed as rules
๏ IF set of premises THEN set of conclusions
‣ The set of rules forms a Knowledge Base (KB) or « Model »
✦ Theoretical advantages of using a Knowledge Base:
‣ Legible : the knowledge is written in the Expert’s language
‣ Extensible : rules can easily be added to or deleted from the KB
‣ Explainable : it is quite simple to keep track of the resolution process
‣ Generic : the same KB is able to work with any set of initial facts
✦ Two drawbacks in practice:
‣ Acquisition : the knowledge must be acquired from the human experts !
‣ Consistency : the content of the KB must be semantically coherent !
➡ These drawbacks led the approach to a (relative) failure.
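To make the «IF set of premises THEN set of conclusions» format concrete, here is a minimal forward-chaining sketch. The rules and facts are invented for the illustration, and real expert system shells are of course much richer (variables, conflict resolution, explanation traces).

```python
def forward_chain(rules, facts):
    """Fire rules (premises, conclusions) until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusions in rules:
            # A rule fires when all its premises are known facts.
            if premises <= known and not conclusions <= known:
                known |= conclusions
                changed = True
    return known

# Hypothetical knowledge base with two rules.
RULES = [
    ({"fever", "cough"}, {"flu"}),
    ({"flu"}, {"rest"}),
]
derived = forward_chain(RULES, {"fever", "cough"})
print(sorted(derived))
```

The same `RULES` list works with any set of initial facts, which is the «Generic» advantage listed for Knowledge Bases.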
The knowledge acquisition problem
✦ KB = Subject Matter Expert (SME) + Knowledge Engineer (KE)
‣ Experts are rare and very expensive
‣ Who likes to hand over his/her skills to a computer ?
‣ Communication problem :
๏ It is difficult to define a common vocabulary
๏ Analyses based on unconscious processes (vision) are hard to explain
๏ We need to stay within the domain of expertise (Knowing ≠ Explaining)
✦ Solution 1 : Knowledge acquisition methodologies (KADS, …)
✦ Solution 2 : Machine Learning :-)
‣ The SMEs «just» have to provide some illustrative examples of their work.
Acquisition ≠ Transfer ; Acquisition = Model
Advantages of Machine Learning
✦ In theory the expert (or end-user) just has to :
‣ Provide a set of «properties» (representation language)
‣ Provide a set of «examples» of what he/she wants to model
✦ A toy example in chemistry ...

      A-Cycles   Mass    pH    Carboxyl   Activity
M1    1          low     <5    false      null
M2    2          mean    <5    true       toxic
M3    0          mean    >8    true       toxic
M4    0          mean    <5    false      null
M5    1          heavy   ~7    false      null
M6    2          heavy   >8    false      toxic
M7    1          heavy   >8    false      toxic
M8    0          low     <5    true       toxic

Learned decision tree:
pH
‣ <5 : Carboxyl
๏ false : null (M1, M4)
๏ true : toxic (M2, M8)
‣ ~7 : null (M5)
‣ >8 : toxic (M3, M6, M7)

✦ Machine Learning allows:
‣ To build the model (here a Decision Tree)
‣ To explore rapidly different hypotheses (language, examples, …)
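The decision tree of the chemistry example (test on pH first, then on Carboxyl when pH < 5) can be written directly as nested tests. The sketch below only encodes the tree and checks it against the eight molecules; it does not learn the tree, and the names `predict_activity` and `MOLECULES` are introduced for the illustration.

```python
def predict_activity(ph, carboxyl):
    """Decision tree of the toy example: pH at the root, Carboxyl below '<5'."""
    if ph == "<5":
        return "toxic" if carboxyl else "null"
    if ph == "~7":
        return "null"
    if ph == ">8":
        return "toxic"
    raise ValueError(f"unknown pH value: {ph}")

# (pH, Carboxyl, Activity) for M1..M8, copied from the table.
MOLECULES = {
    "M1": ("<5", False, "null"),
    "M2": ("<5", True, "toxic"),
    "M3": (">8", True, "toxic"),
    "M4": ("<5", False, "null"),
    "M5": ("~7", False, "null"),
    "M6": (">8", False, "toxic"),
    "M7": (">8", False, "toxic"),
    "M8": ("<5", True, "toxic"),
}

misclassified = [name for name, (ph, carboxyl, activity) in MOLECULES.items()
                 if predict_activity(ph, carboxyl) != activity]
print("misclassified:", misclassified)
```

Note that the tree ignores the A-Cycles and Mass attributes entirely: two properties are enough to explain these eight examples.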
Some references ...
How to learn ?
The notion of «learning cycle»
The learning context
✦ Environment
‣ Kind of data:
๏ Labeled: S = {(xi, ui), ...}
๏ Unlabeled: S = {xi, ...}
‣ Type of the data:
๏ Numerical
๏ Complex (sequences, graphs, ...)
‣ Availability of the data:
๏ Database (batch learning)
๏ Incremental (on line learning)
๏ Selectable (active learning)
✦ Objectives
‣ Classification: ui = h(xi)
๏ Discrimination : h discrete
๏ Ranking : h ordered
๏ Regression : h continuous
‣ Discovery:
๏ Clustering : h(xi) → Cj (partition, hierarchy, ...)
๏ Association rules
๏ Grammatical inferences, ...
‣ Optimization:
๏ Reinforcement learning
๏ Planning, ...
✦ Model
‣ «Symbolical» : focus on understandability
๏ Decision tree, Horn clause, semantic network, ...
‣ «Numerical» : focus on efficiency
๏ Hyperplane parameters, neural network, Bayesian network, ...
[Figure: Dataset → Learning tool → Model]
The learning cycle
1) Build up the learning set
• Selection of the attributes
• Creation of the sample (training set)
• Background knowledge
2) Selection of the learning tool
• Instances language Lx
• Hypotheses language Lh
• Bias and control language Lb
3) Creation of the model
4) Validation of the model
• Empirical evaluation : is it accurate/predictive ?
- Validation done with a test set
• Semantical evaluation : does it make sense ?
- Validation done by the expert/end-user
5) Tuning of the input (revision step)
Empirical ≠ Semantical
↓
What’s a «good» model ?
Accuracy vs Plausibility
[Figure: models placed on two axes, Accuracy and Plausibility: Astrology, « Sirius », Ptolemy model, Copernicus model, Kepler laws, Newton’s theory, Titius-Bode law d = 0.4 + (0.3 × 2^n), Plate tectonics, Darwin’s theory, n-body problem, Balmer law, Quantum mechanics]
Quality of a model : some criteria
✦ The guessing game …
‣ What is the next number : 1, 2, 3, 5, ?
‣ Many possible (some relevant) answers
๏ 6 : the integers « except » 4
๏ 7 : the prime numbers (counting 1)
๏ 8 : the Fibonacci numbers, …
๏ … and we can imagine an infinite number of «models» (Wittgenstein)
✦ Occam's razor principle (14th century)
entia non sunt multiplicanda praeter necessitatem
«Entities should not be multiplied unnecessarily»
‣ If two competing theories make the same predictions, the simpler one is the better !
‣ A very good «heuristic» often used in science
‣ What does it mean in Machine Learning ?
Criteria to select the right model
✦ ML systems are characterized through 2 «languages»
‣ Lx : grammar of the instances
๏ Let X be the instance space, of which the training set is a sample
‣ Lh : grammar of the model (learning hypotheses)
๏ Let H be the hypotheses space, which contains the unknown target concept
✦ Learning can be seen as finding the model h∈H such that :
1) h is the best predictor of the training set → h(xi)=ui
2) h is the simplest model (syntactically speaking)
‣ In practice there are three main inductive criteria to search for h
๏ Empirical Risk Minimisation (ERM) in inductive learning
๏ Maximum likelihood estimation (MLE) used in the Bayesian approaches
๏ Minimum Description Length (MDL) which is a formalization of Occam's razor
From dataset to the model
Languages Lx and Lh
Knowledge representation
✦ Building of the learning set
‣ We need to describe:
๏ The «objects» of the domain
๏ Their properties
๏ Their relationship
✦ The main stages
‣ To choose the «data granularity»
๏ What knowledge must be modeled ? What is needed to learn ?
‣ To select the representation language (thus, the learning approach)
‣ To establish the mapping between the data and the language
The data encoding trap
✦ Beware of your hidden hypotheses
‣ For instance you want to summarize a collection of data
‣ You compute the mean and variance of this collection, using a Gaussian hypothesis
‣ What about a distribution like this one ?
[Figure: a distribution that does not follow the Gaussian hypothesis]
In ML as in the rest of Computer Science: «Garbage In, Garbage Out»
Visual Analysis of the data
✦ A classical example: Anscombe's quartet
‣ What can you tell about these four datasets ?

      I               II              III             IV
 X     Y         X     Y         X     Y         X     Y
 10    8.04      10    9.14      10    7.46      8     6.58
 8     6.95      8     8.14      8     6.77      8     5.76
 13    7.58      13    8.74      13    12.74     8     7.71
 9     8.81      9     8.77      9     7.11      8     8.84
 11    8.33      11    9.26      11    7.81      8     8.47
 14    9.96      14    8.10      14    8.84      8     7.04
 6     7.24      6     6.13      6     6.08      8     5.25
 4     4.26      4     3.10      4     5.39      19    12.50
 12    10.84     12    9.13      12    8.15      8     5.56
 7     4.82      7     7.26      7     6.42      8     7.91
 5     5.68      5     4.74      5     5.73      8     6.89

‣ From a statistical point of view they seem similar ...

Statistic                      Value
Mean of X                      9
Variance of X                  11
Mean of Y                      ~7.50
Variance of Y                  ~4.125
Correlation between X and Y    0.816
Linear regression              Y = 0.5X + 3

‣ ... but displaying the data provides a better insight !
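The summary statistics of Anscombe's quartet can be recomputed with the standard library. The sketch below uses datasets I and II only, assumes sample variances (dividing by n-1), and introduces the helper name `summary` for the illustration.

```python
from statistics import mean, variance

X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I and II
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summary(xs, ys):
    """Means, sample variances, correlation and least-squares line of (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    vx, vy = variance(xs), variance(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

for name, ys in (("I", Y1), ("II", Y2)):
    s = summary(X, ys)
    print(name, {k: round(v, 3) for k, v in s.items()})
```

Both datasets print nearly identical numbers, although a plot shows that one is roughly linear and the other is a curve.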
Two main families of KR
✦ Vectorial data
‣ Instances
๏ Numerical : table (rows are instances, columns are variables or attributes)
๏ Symbolical : propositional logic (conjunction, disjunction of attributes)
‣ Knowledge
๏ Numerical : vector of parameters, hyperplanes, probabilities, ...
๏ Symbolical : rules (knowledge based systems)
✦ Relational data
‣ Instances : graphs, predicate logic
‣ Knowledge : conceptual graphs, Horn clauses
✦ Intermediate representations : sequences, trees, ...
An example in chemistry
✦ Instances
‣ Vectorial data
๏ Numerical (N) : counts of fragments, C6H6 → 1, N-N → 0, C-CH → 3, C-N-O2 → 2, S-N → 0, ...
๏ Symbolical (S) : mass=167 ∧ number_cycle=1 ∧ contain_Br=no ∧ ...
‣ Relational data
๏ bond (m1, c1, Cl, single), bond (m1, c1, c2, single), (m1)
✦ Knowledge
‣ Vectorial data
๏ Numerical (N) : vector of parameters
๏ Symbolical (S) : IF (mass<500) ∧ (LogP>5) ∧ … THEN (potential_drug = true)
‣ Relational data
๏ mutagenic (M) :- bond (M, Atom1, Atom2, double), has_ring (M, R, 5), bond (M, R, Atom1, single), is (Atom1, Br), …
How to build a model : a simple example
✦ Consider the following training set (vectorial representation)

            A: Temperature   B: Dryness   Survival
Plant 1     2                2.4          +
Plant 2     4                3.5          -
Plant 3     8                1            +
Plant 4     8                7            -
•••         •••              •••          •••
Plant 19    3                9.5          -

✦ Objective: to learn a model predicting the survival according to A & B
Supervised learning / Classification
[Figure: scatter plot of the positive (+) and negative (-) examples; axes A: Temp and B: Dry, both graduated from 0 to 9]
• Let the dataset S = {(xi, ui), …} with Lx such that :
‣ xi : {v1, v2, …, vp} with vi ∈ ℝ
‣ ui : {+, -}
• The data are coordinates of points in ℝ^p (here p=2)
• The problem : we search for a function h(xi) expressed in the language Lh allowing to discriminate (then predict) the two «classes» :
‣ + : « positive » examples
‣ - : « negative » examples (also named «counter examples»)
• Remarks
‣ Any problem can be turned into a two-class problem.
‣ h(xi) defines hyperplanes
Selection of Lh
[Figure: same scatter plot (axes A: Temp and B: Dry), with the decision boundaries obtained with LH1, LH2 and LH3]
LH1
IF A < 3 THEN Class = +
IF B < 2.5 THEN Class = +
IF B > 8.5 THEN Class = +
LH2
IF A < 7-B THEN Class = +
IF A < B-4 THEN Class = +
LH3
IF A < -0.22×B² - 2.3×B + 8 THEN Class = +
Efficiency : LH1 < LH2 < LH3
Readability : LH3 < LH2 < LH1
Tradeoff Symbolic / Numeric
Unsupervised learning / Clustering
[Figure: same scatter plot (axes A: Temp and B: Dry), with the points grouped into four clusters C1, C2, C3 and C4]
• Let the dataset S = {xi, …} with Lx such that :
‣ xi : {v1, v2, …, vp} with vi ∈ ℝ
‣ No more labels
• The problem : we search for a discrete function h(xi) expressed with Lh associating each xi to a cluster Cj.
• We want to get:
‣ Contrasted clusters
‣ Homogeneous clusters
• Some possible Lh
‣ Threshold based
‣ Distance based
๏ Hypersphere
‣ Distribution based
‣ ...
Learning as a «Search Process»
How to efficiently explore H ?
From data to model
✦ Back to the learning criteria
‣ We would like to find the model h∈H that is:
๏ Accurate : highly correlated to the training set → h(xi)=ui for supervised learning
๏ Plausible : as simple as possible
‣ Example of criteria
๏ Let h be a boolean function
๏ Accurate : h must recognize the positive examples and reject the negative examples
๏ Plausible : h must be general (few conjunctions) and simple (few disjunctions)
✦ Learning : a «game» between two spaces
‣ X : set of the instances (we use a sample, the training set, described with Lx)
‣ H : set of all the hypotheses that can be described with Lh
X and H spaces
✦ Relationship between these 2 spaces
‣ Let C be the «target» concept (unknown)
๏ We want to learn the function h(xi)=ui
๏ We don’t even know if this function can be expressed with Lh
‣ Each hi of H can be associated with a part pi of X (the converse being false)
‣ Hypothesis hi is called a generalization of pi
[Figure: the instances space X with positive (+) and negative (-) examples, mapped to the hypotheses space H containing the target concept C and a hypothesis hi]
Size of H
✦ A huge space …
‣ For instance, if Lh contains N boolean attributes
๏ Conjunctive formula : 2^N − 1 possible hypotheses
๏ Disjunctive formula : 2^(2^N − 1) possible hypotheses
- with {A, B} we have : Ø, A, B, A∧B, A∨B, A∧B∨A, A∧B∨B, A∧B∨A∨B
- with N = 10 there are about 10^308 possible hypotheses (with a lot of redundancies)
✦ Some (extreme) examples …
‣ MasterMind game («active learning»)
๏ Lx = Lh contains all the combinations of 4x8 colors
๏ X : propositions of the computer/player
๏ H : contains the code to guess
‣ Find a Regular Expression to discriminate documents
๏ X : a set of paragraphs in natural language with 2 classes
๏ H : the set of all possible Regular Expressions (with less than n characters)
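The growth of H can be checked numerically. The sketch below assumes the counts are 2^N − 1 conjunctive formulas and 2^(2^N − 1) disjunctive formulas, a reading that agrees with the eight hypotheses listed for {A, B} and with the order of magnitude quoted for N = 10; the function names are introduced for the illustration.

```python
def n_conjunctions(n_attributes):
    """Number of non-empty conjunctions over N boolean attributes."""
    return 2 ** n_attributes - 1

def n_disjunctions(n_attributes):
    """Number of sets of conjunctions (disjunctive formulas, Ø included)."""
    return 2 ** n_conjunctions(n_attributes)

for n in (2, 3, 10):
    digits = len(str(n_disjunctions(n)))
    print(f"N={n}: {n_conjunctions(n)} conjunctions, "
          f"disjunctive space has {digits} decimal digits")
```

Python integers have arbitrary precision, so the N = 10 value is computed exactly rather than overflowing.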
Exploration of H
✦ Learning as a search process
[Figure: the instances space X with positive (+) and negative (-) examples, and the hypotheses space H with the target concept C and the hypotheses hi and h’j; here h’j is more specific than hi]
✦ Thus, two processes are involved during this search:
‣ Evaluating the «quality» of hi with respect to the training set in X
๏ Symbolical : measure the «cover» of hi : Fct(«#instances +», «#instances -»)
๏ Numerical : evaluate the error between hi(xk) and uk (true label)
‣ Searching for a better hypothesis h’j through operators Opn transforming hi
๏ Symbolical : generalization and/or specialization of the hypothesis hi
๏ Numerical : modification of the parameters of the model
Some classical «generic-algorithms»
✦ «Generate and Test » methods (ID3, ILP, …)
‣ Initialization of C, the concept to learn (e.g. C=Ø)
‣ While quality(C) < threshold
๏ Apply the refinement operators Opn on the current hi → C’= {h’1, …}
๏ For each h’j the system evaluates: quality (h’j, X)
๏ Update C with the selected hypothesis of C’
✦ «Optimization» methods (Perceptron, connectionism, …)
‣ Initialization of C, the concept to learn (e.g. randomly)
‣ While quality(C) < threshold
๏ Pick an example xi : {v1, …, vn, u} from the training set X
๏ Compute the predicted concept u’ = hi(xi)
๏ Apply the refinement operators on hi to decrease the distance Δ(u, u’)
Strategies to explore H
✦ Complete search !
‣ The simplest and the best …
‣ … when H is “small enough” and/or “well-organized” (ex: lattice)
‣ … but that’s often impossible, due to:
๏ The size of H
๏ The topology of H, which is too complex (too many local minima)
๏ The presence of “noise” in the training set (errors on data or labels)
✦ Partial (heuristic) exploration
‣ Requirement: H must have a good topology
๏ Discrete (i.e.: relation of generality between the hypotheses hi )
๏ Continuous (i.e.: relation of neighborhood between the hypotheses hi )
‣ If H is chaotic it would be impossible to learn anything …
Gradient search
✦ A very classic and efficient strategy
‣ Main principle:
๏ The system starts from a “random” hi
๏ We look at the operators Opn that can be applied
๏ We select the operator Opk maximizing quality(Opk(hi))
๏ Opk(hi) is the new h’j ; we continue until the stopping condition is verified
‣ Can be very efficient, or a total failure if there are too many local optima in H ...
[Figure: Quality curve over H, with a suboptimal (local) peak and the optimal peak]
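The greedy principle of gradient search (always apply the operator that gives the best quality, stop when nothing improves) can be sketched as follows. The one-dimensional `LANDSCAPE`, where a hypothesis is just an index and the operators move left or right, is an assumption made to keep the example tiny.

```python
def hill_climb(start, quality, operators, max_steps=1000):
    """Greedy search: follow the best operator until no neighbor is better."""
    current = start
    for _ in range(max_steps):
        best = max((op(current) for op in operators), key=quality)
        if quality(best) <= quality(current):
            return current              # local (maybe global) optimum
        current = best
    return current

# Hypothetical quality landscape with a local peak (index 3) and the
# global peak (index 9).
LANDSCAPE = [0, 1, 2, 3, 2, 1, 2, 4, 6, 8, 6, 3]

def quality(i):
    return LANDSCAPE[i]

OPERATORS = [
    lambda i: max(i - 1, 0),                    # move left
    lambda i: min(i + 1, len(LANDSCAPE) - 1),   # move right
]

stuck = hill_climb(0, quality, OPERATORS)    # starts near the local peak
lucky = hill_climb(6, quality, OPERATORS)    # starts in the good basin
print(stuck, quality(stuck), lucky, quality(lucky))
```

The two runs differ only by their starting point, which is why the result of a greedy search depends so much on the topology of H.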
Other strategies
✦ Beam search
‣ At each step we keep a collection of possible hypotheses {h1, …, hi …}
‣ These hypotheses are competing.
✦ Simulated annealing
‣ We select the Opn giving the best quality from the current state hi
‣ When there is no more improvement
๏ Select randomly another state h’j
๏ Proximity(hi, h’j) is a function of the “temperature” T
[Figure: temperature T decreasing over time, and Quality curve]
✦ Genetic algorithm
‣ The search is, in a sense, “random”
‣ We work with a population of hypotheses {h1, …, hi …}
‣ The learning process is based on two steps:
๏ Selection of the most relevant hypotheses (fitting with X)
๏ Reproduction/Mutation/Crossing-over of these hypotheses
Validation and Remediation
The art of learning
Empirical validation
✦ What is the accuracy of the model that has been learned?
‣ First, we need to decide what to measure and how to do it ...
‣ Supervised learning:
๏ We use a test set: i.e. a collection of examples not used during the learning step
๏ The idea is to measure the prediction accuracy: for each (xi, ui), do we have h(xi) = ui ?
- When h() is discrete: we measure the error rate
- When h() is continuous: we measure the «Mean Absolute Error» or «Mean Square Error»
‣ Unsupervised learning:
๏ No universal criteria → ask an expert of the domain (can be difficult)
๏ Or go back to the «supervised case» by using a set of classified data
- Measure the mapping between the real and learned clusters
- Example : NG20 Usenet groups in document clustering
✦ Two important criteria
‣ Learning error (Ea) : percentage of misclassified examples on the training set
‣ Generalization error (Eg) : percentage of misclassified examples on the test set
In practice ...
✦ Classical approaches
‣ The learning set is randomly split into several parts

                  2 parts   3 parts   Role
Training set      66 %      50 %      Used to learn the model
Validation set    -         25 %      Helps to tune the learning parameters (pruning, stability)
Test set          33 %      25 %      Measures the accuracy of the model

[Figure: learning set split into Learning (67 %) and Test (33 %)]
‣ All the subsets must be «Independent and Identically Distributed» (I.I.D. data)
‣ Drawbacks of creating these fixed datasets
๏ That’s OK if we have enough examples ...
๏ But otherwise:
- A part of the examples is lost for the learning step
- No information about the stability of the model
Cross validation
✦ Let P = {x1, …, xp} be the set of examples of the learning set
✦ «Leave one out» approach
‣ We learn « p » different models Mi using P-{xi} as a training set
‣ We test each Mi with the remaining example xi
‣ By summing we get the percentage of correct answers
→ Accurate but time consuming
✦ Generalization of the process with the « N-Fold » technique
‣ We split P into N subsets containing p/N examples
‣ We learn N models, each using N-1 subsets to learn and the last subset to test
‣ Fast (N~10) approach allowing to evaluate the variance of the result (stability)
‣ Example with N = 5 subsets N1 ... N5, for M1 : learning with N2+N3+N4+N5, test with N1
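The « N-Fold » splitting can be sketched as a small generator. This is an illustrative version only: it does not shuffle the examples (a real experiment normally would), and `k_fold_splits` is a name introduced here. «Leave one out» is the special case where the number of folds equals the number of examples.

```python
def k_fold_splits(examples, n_folds):
    """Yield (training set, test set) pairs, one per fold."""
    # Fold i takes every n_folds-th example, starting at position i.
    folds = [examples[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

P = list(range(10))                      # stand-in for the examples x1..xp

five_fold = list(k_fold_splits(P, 5))    # N-Fold with N = 5
leave_one_out = list(k_fold_splits(P, len(P)))

for train, test in five_fold:
    print("learn on", train, "test on", test)
```

Each example is used exactly once for testing, so the N accuracy values can be averaged and their variance gives an idea of the stability of the model.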
Contingency table
✦ With 2 classes:

                            Real label: class=+      Real label: class=-
Predicted label: class=+    A (True positives)       B (False positives)
Predicted label: class=-    C (False negatives)      D (True negatives)

Recognition rate = (A + D) / (A + B + C + D) = (1 − error rate)
Can be generalized to N classes
✦ But many other ways to evaluate
‣ IR domain
๏ Precision = A / (A + B)
๏ Recall = A / (A + C)
‣ Medical domain
๏ Sensitivity = A / (A + C) : probability that a test result will be positive when the disease is present
๏ Specificity = D / (B + D) : probability that a test result will be negative when the disease is not present
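The measures derived from the contingency table (A true positives, B false positives, C false negatives, D true negatives) can be computed as below. The counts used in the call are invented for the illustration, and `contingency_metrics` is a hypothetical helper name.

```python
def contingency_metrics(a, b, c, d):
    """Measures from A (TP), B (FP), C (FN) and D (TN)."""
    return {
        "recognition_rate": (a + d) / (a + b + c + d),
        "precision": a / (a + b),
        "recall": a / (a + c),
        "sensitivity": a / (a + c),   # same formula as recall
        "specificity": d / (b + d),
    }

# Assumed counts on a test set of 100 examples.
metrics = contingency_metrics(a=40, b=10, c=20, d=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall (IR domain) and sensitivity (medical domain) are two names for the same ratio; the sketch does not guard against empty rows or columns, which would cause a division by zero.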
In statistics we trust (but with care)
✦ Recognition rate is relevant if ...
‣ The numbers of examples in the different classes are «well-balanced»
๏ When there are very few positive examples → use ROC curves
‣ The «costs» of errors B & C are equivalent
๏ False in the Medical domain or in Risk Management
✦ Some classical traps
‣ Precision (M) = 87.678% (with 1000 test examples only)
‣ Interpretation of the diagrams
[Figure: Accuracy (axis from 60 % to 100 %) as a function of #Sample (1000 to 4000)]
Evaluation = statistics + a critical eye
Causes of failure
✦ Training set …
‣ The sample is too small to cover X significantly
‣ The number of attributes is too large with respect to the number of examples
๏ Nothing to do from the statistical point of view ...
๏ ... but any result can be significant for the end-user (feature selection for instance)
‣ The attributes are not able to express the target concept C
๏ Some knowledge is missing
๏ Some knowledge is incomplete (ordered values / continuous values)
‣ The dataset is “noisy” : false values or, even worse, false labels
‣ ...
✦ The learning algorithm …
‣ The parameters (bias) of the system are not correctly set
‣ The target concept cannot be learned in the current Lh
‣ ...
The «bias-variance» tradeoff
✦ Again, the choice of Lh is crucial
‣ The smaller H is, the stronger the learning bias is, thus…
๏ It is easier/faster to learn
๏ The concept that we can learn is “simple” (leading perhaps to a failure)
‣ Conversely, when H is large, we have a «weak» learning bias ...
๏ The system can learn (slowly) a complex problem …
๏ … but this can lead to the «overfitting» problem
[Figure: Error rate as the complexity of Lh increases; the learning error keeps decreasing while the generalization error decreases then rises again, as the variance of the predictions increases]
Overfitting problem
✦ A well-known case : regression
‣ Fitting a set of data with a polynomial
‣ Too much precision kills prediction ...
[Figure: data points in the (A, B) plane, both axes graduated from 0 to 9, fitted by three models: LH1 (too simple), LH2 and LH3 (too complex)]
In terms of learning error : Ea(LH1) > Ea(LH2) > Ea(LH3)
In terms of generalization error : Eg(LH1) ≈ Eg(LH3) > Eg(LH2)
Full data mining process
[Figure: the full data mining process, with labels Machine learning and Research]