
Machine Learning

Definitions and objectives

http://membres-liglab.imag.fr/bisson/cours/M2INFO-AIW-ML/

http://membres-timc.imag.fr/Gilles.Bisson/Cours/M2R-IAW-ML/


G.Bisson - 2012

What is learning ?

✦ For an organism (life) the capacity to acquire new behaviors :

‣ Genetic mechanisms (Evolution)

‣ Chemical mechanisms (Neurons)

‣ Cultural mechanisms (Symbols).

✦ Some classical learning methods

‣ Association rules («stimuli - feedback»)

‣ Trial and error («generate and test» method)

‣ Explanation,

‣ Analogy, ...

[Figure: scale with the labels «Instinct» and «Learned»]

Learning is acquiring new knowledge, behaviors, skills and may involve synthesizing different types of information. Learning may occur as a result of habituation or classical conditioning or as a result of more complex activities such as play or studies.

In the case of Machine Learning ...

✦ Holistic definitions
‣ « Learning is making useful changes in mind » (M. Minsky 1985)
‣ « Learning is the organization of experience » (Scott 1983)

✦ More reductionist definitions
‣ « Learning is constructing or modifying representations of what is being experienced » (R. Michalski)
‣ « Learning aims at increasing the performance of a system on a given task by using a set of experiences » (T. Mitchell)

‣ «Batch» or «Off line» learning: can be used to analyse an existing dataset in order to generate a model
‣ «On line» or «incremental» learning: needed when a system must be able to adapt its behavior «rapidly»

[Figure: two diagrams linking the Data, the Learning step, the model M and the Task]

Example of learning tasks

Supervised learning or discrimination
Unsupervised learning or clustering
Case based reasoning
Reinforcement learning

[Figure: one diagram per task, each built around a Learning step and a model M (labels: Concept, Descriptions)]

Main Learning Tasks & Models

Learning goal can be:
‣ Supervised learning or discrimination
‣ Unsupervised learning or clustering
‣ Semi-supervised learning

The model can be explicit or implicit:
‣ Case Based Reasoning (no model)

Learner interacts with its environment:
‣ Active learning (select examples)
‣ Reinforcement learning (actions)

Transfer learning:
‣ Multi-task learning
‣ Analogical reasoning

Combination of models: Meta-learning (Boosting)

[Figure: Environment → Examples → Learning (with background knowledge) → Model → Predict / Interact; several models can be combined into a meta-model]

The simplest ML technique

✦ Compute «mean» and «standard deviation»
‣ We collect several examples

      Weight     Life span
E1    3.5        12
E2    4          15
E3    5.2        11
      4.2±0.9    12.7±2.1

‣ Then we can define a «mean cat» ...
‣ ... and thus a model:
๏ Unsupervised: barycenter and sigma
๏ Supervised: IF Weight in [3.3, 5.1] and … THEN Cat

✦ Predict an unknown value
‣ Classical regression problem: find f() such that y = f(x)

[Figure: regression line y = 1.1x + 1.56]
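The «mean cat» model can be sketched in a few lines of Python. This is only an illustration: it uses the three examples E1 to E3 from the slide, assumes the ± values are sample standard deviations (which reproduces the slide's rounded figures), and the helper `looks_like_cat` is a hypothetical name, not part of the course.

```python
from statistics import mean, stdev

# Values copied from the slide's table (examples E1..E3).
weights = [3.5, 4.0, 5.2]
life_spans = [12, 15, 11]


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return mean(values), stdev(values)


w_mean, w_std = summarize(weights)
l_mean, l_std = summarize(life_spans)

# Supervised-style rule: accept a weight lying within one standard
# deviation of the mean (hypothetical helper, one attribute only).
low, high = w_mean - w_std, w_mean + w_std


def looks_like_cat(weight):
    return low <= weight <= high


print(f"Weight: {w_mean:.1f} ± {w_std:.1f}")
print(f"Life span: {l_mean:.1f} ± {l_std:.1f}")
```

The unrounded interval differs slightly from the slide's [3.3, 5.1], which is built from the rounded values 4.2 and 0.9.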

Statistics / Data analysis

Statistics is the study of how to collect, represent, analyze, explain ... datasets

✦ An ancestor of Machine Learning
‣ Very first documents: censuses, commodity prices, ...
‣ XVIIIth century: statistics for insurance (death rate)
‣ Thomas Bayes (1702-1761): Essay Towards Solving a Problem in the Doctrine of Chances
๏ Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)
๏ Allows computing the probability of the rule: IF B=fever THEN A=disease
‣ Pierre-Simon de Laplace (1749-1827): Essai philosophique sur les probabilités
‣ 1890: first American census using a punch card system (H. Hollerith, 1860-1929, whose company later became part of IBM)
‣ R.A. Fisher (1890-1962): data analysis
๏ Descriptive statistics → how to summarize data (mean, median, ...)
๏ Statistical inference → how to predict values (regression, ..., «data mining»)

A short history of ML

✦ 1957: Checkers (A. Samuel from IBM)
‣ Learns to weight the criteria of an evaluation function: E(C) = Σ Ki·Ci
‣ First AI game to reach a good level of play
✦ 1962: Perceptron (F. Rosenblatt)
‣ Application of the Hebbian theory (D. Hebb, 1949)
‣ Learning rule: Wj(t+1) = Wj(t) + α[y − f(x)]·xj
‣ Converges if the dataset is linearly separable
✦ 1970: Meta-Dendral (B. Buchanan)
✦ 1980: AM and Eurisko (D. Lenat)
✦ 1983: ID3 (R. Quinlan), Induce, Cluster (R. Michalski), ...
✦ 1986: Neural networks (backpropagation) (D.E. Rumelhart)
✦ 1990-2007: the ML domain soars
‣ Many open-source and commercial tools in « Data Mining »
‣ Many approaches: Genetic, CBR, Meta-learning, Bayesian network, SVM, ...
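As an illustration of the perceptron learning rule Wj(t+1) = Wj(t) + α[y − f(x)]·xj, here is a minimal Python sketch. Several points are assumptions rather than part of the slide: f is taken to be a step function, a bias term is added (it behaves like a weight on a constant input of 1), α is set to 1 to keep the arithmetic exact, and the training set is the logical AND, chosen because it is linearly separable.

```python
def predict(weights, bias, x):
    """Step activation: 1 if the weighted sum is strictly positive, else 0."""
    s = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train_perceptron(examples, alpha=1, max_epochs=100):
    """Apply W_j <- W_j + alpha * (y - f(x)) * x_j until an epoch has no error."""
    n_inputs = len(examples[0][0])
    weights, bias = [0] * n_inputs, 0
    for _ in range(max_epochs):
        errors = 0
        for x, y in examples:
            delta = y - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                weights = [w + alpha * delta * v for w, v in zip(weights, x)]
                bias += alpha * delta  # bias = weight of a constant input 1
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias


# Logical AND: a linearly separable toy dataset (assumption for the demo).
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_EXAMPLES]
print(weights, bias, predictions)
```

On a dataset that is not linearly separable (XOR, for instance) the loop would simply stop after `max_epochs` without reaching zero errors.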

A brief return to Expert Systems

✦ Example of operation
‣ Knowledge is expressed as rules
๏ IF set of premises THEN set of conclusions
‣ The set of rules forms a Knowledge Base (KB) or « Model »
✦ Theoretical advantages of using a Knowledge Base:
‣ Legible: the knowledge is written in the expert's language
‣ Evolutive: rules can easily be added to or deleted from the KB
‣ Explainable: it is quite simple to keep track of the resolution process
‣ Generic: the same KB is able to work with any set of initial facts
✦ Two drawbacks in practice:
‣ Acquisition: the knowledge must be acquired from the human experts!
‣ Consistency: the content of the KB must be semantically coherent!
➡ These drawbacks led the approach to a (relative) failure.
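To make the «IF premises THEN conclusions» mechanism concrete, here is a small forward-chaining sketch in Python. The rules and facts are hypothetical examples invented for the illustration, and forward chaining is only one possible inference strategy; the slides do not say which one is meant.

```python
# Hypothetical knowledge base: each rule is (set of premises, conclusion).
RULES = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "elderly"}, "see_doctor"),
]


def forward_chain(initial_facts, rules=RULES):
    """Fire rules until no new fact can be derived.

    Returns the final set of facts and the trace of fired rules, which
    is what makes the resolution process easy to explain.
    """
    facts = set(initial_facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return facts, trace


facts, trace = forward_chain({"fever", "cough", "elderly"})
for premises, conclusion in trace:
    print("IF", " AND ".join(premises), "THEN", conclusion)
```

The same rule base can be reused unchanged with any other set of initial facts, which is the «generic» property listed above.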

The knowledge acquisition problem

✦ KB = Subject Matter Expert (SME) + Knowledge Engineer (KE)

‣ Experts are rare and very expensive
‣ Who likes to hand over his/her skills to a computer?
‣ Communication problem:
๏ It is difficult to define a common vocabulary
๏ Analyses based on unconscious processes (vision) are hard to explain
๏ We need to stay in the expertise domain (Knowing ≠ Explaining)

✦ Solution 1: Knowledge acquisition methodologies (KADS, …)
✦ Solution 2: Machine Learning :-)
‣ The SMEs have «just» to provide some illustrative examples of their work.

Acquisition ≠ Transfer
Acquisition = Model

Advantages of Machine Learning

✦ In theory the expert (or end-user) just has to:
‣ Provide a set of «properties» (representation language)
‣ Provide a set of «examples» of what he/she wants to model
✦ A toy example in chemistry ...

      A-Cycles   Mass     Ph    Carboxyl   Activity
M1    1          low      <5    false      null
M2    2          medium   <5    true       toxic
M3    0          medium   >8    true       toxic
M4    0          medium   <5    false      null
M5    1          heavy    ~7    false      null
M6    2          heavy    >8    false      toxic
M7    1          heavy    >8    false      toxic
M8    0          low      <5    true       toxic

Learned decision tree:
PH
├─ <5 → Carboxyl
│    ├─ false → null (M1, M4)
│    └─ true → toxic (M2, M8)
├─ ~7 → null (M5)
└─ >8 → toxic (M3, M6, M7)

✦ Machine Learning allows:
‣ To build the model (here a Decision Tree)
‣ To explore rapidly different hypotheses (language, examples, …)
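The decision tree of this toy example can be written down and checked against the eight molecules. The sketch below hand-codes the tree from the slide (it does not learn it), and the names `MOLECULES` and `classify` are illustrative choices.

```python
# Toy dataset from the slide: (A-Cycles, Mass, Ph, Carboxyl, Activity).
MOLECULES = {
    "M1": (1, "low", "<5", False, "null"),
    "M2": (2, "medium", "<5", True, "toxic"),
    "M3": (0, "medium", ">8", True, "toxic"),
    "M4": (0, "medium", "<5", False, "null"),
    "M5": (1, "heavy", "~7", False, "null"),
    "M6": (2, "heavy", ">8", False, "toxic"),
    "M7": (1, "heavy", ">8", False, "toxic"),
    "M8": (0, "low", "<5", True, "toxic"),
}


def classify(ph, carboxyl):
    """Hand-coded version of the slide's tree: test Ph first, then Carboxyl."""
    if ph == "<5":
        return "toxic" if carboxyl else "null"
    if ph == "~7":
        return "null"
    return "toxic"  # remaining branch: ph == ">8"


# The tree only needs two of the four attributes to separate the classes.
misclassified = [
    name
    for name, (_, _, ph, carboxyl, activity) in MOLECULES.items()
    if classify(ph, carboxyl) != activity
]
print("Misclassified molecules:", misclassified)
```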

Some references ...

How to learn ?

The notion of «learning cycle»

The learning context

Environment
Kind of data:
• Labeled: S = {(xi, ui) ...}
• Unlabeled: S = {xi, ...}
Type of the data:
• Numerical
• Complex
‣ Sequences,
‣ Graphs, ...
Availability of the data:
• Database (batch learning)
• Incremental (on line learning)
• Selectable (active learning)

Objectives
Classification: ui = h(xi)
• Discrimination: h discrete
• Ranking: h ordered
• Regression: h continuous
Discovery:
• Clustering: h(xi) → Cj
‣ Partition,
‣ Hierarchy, ...
• Association rules
• Grammatical inference, ...
Optimization:
• Reinforcement learning
• Planning ...

Model
«Symbolical»: focus on understandability
• Decision tree,
• Horn clause,
• Semantic network,
• ...
«Numerical»: focus on efficiency
• Hyperplane parameters,
• Neural network,
• Bayesian network,
• ...

[Figure: Dataset → Learning tool → Model]

The learning cycle

1) Build up the learning set
• Selection of the attributes
• Creation of the sample (training set)
• Background knowledge
2) Selection of the learning tool
• Instances language Lx
• Hypotheses language Lh
• Bias and control language Lb
3) Creation of the model
4) Validation of the model
• Empirical evaluation: is it accurate/predictive?
- Validation done with a test set
• Semantical evaluation: does it make sense?
- Validation done by the expert/end-user
5) Tuning of the input (revision step)

Empirical ≠ Semantical
↓
What's a «good» model?

Accuracy vs Plausibility

[Figure: theories placed on two axes, Accuracy and Plausibility: Astrology, « Sirius », Ptolemy model, Copernicus model, Kepler laws, Newton's theory, Titius-Bode law d = 0.4 + 0.3 × 2^n, plate tectonics, Darwin's theory, n-body problem, Balmer law, quantum mechanics]

Quality of a model : some criteria

✦ The guessing game …

‣ What is the next number : 1, 2, 3, 5, ?

‣ Many possible (some relevant) answers

๏ 6: the integers «except» 4

๏ 7: the prime numbers (with 1 counted as a prime)

๏ 8: the Fibonacci numbers, …

๏ … and we can imagine an infinite number of «models» (Wittgenstein)
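The guessing game can be replayed in code: three different «models» that all reproduce the observed values 1, 2, 3, 5 and then disagree on the fifth one. This is a small Python sketch; the generator names are illustrative.

```python
from itertools import count, islice


def integers_except_four():
    """1, 2, 3, 5, 6, ... : every positive integer but 4."""
    return (n for n in count(1) if n != 4)


def one_then_primes():
    """1, 2, 3, 5, 7, ... : the primes, with 1 counted as a prime."""
    yield 1
    for n in count(2):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            yield n


def fibonacci_like():
    """1, 2, 3, 5, 8, ... : each term is the sum of the two previous ones."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b


MODELS = {
    "integers except 4": integers_except_four,
    "primes": one_then_primes,
    "Fibonacci": fibonacci_like,
}

observed = [1, 2, 3, 5]
predictions = {name: list(islice(model(), 5)) for name, model in MODELS.items()}

for name, values in predictions.items():
    fits = values[:4] == observed
    print(f"{name}: fits the data = {fits}, next value = {values[4]}")
```

All three models have zero error on the observations, so the data alone cannot choose between them.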

✦ Occam's razor principle (XIVth century)

‣ If two competing theories make the same predictions, the simpler one is better!

‣ A very good «heuristic», often used in science

‣ What does it mean in Machine Learning?

entia non sunt multiplicanda praeter necessitatem
«Entities should not be multiplied unnecessarily»

Criteria to select the right model

✦ ML systems are characterized through 2 «languages»

‣ Lx : grammar of the instances

๏ Let X be the instance space, of which the training set is a sample

‣ Lh : grammar of the model (learning hypotheses)

๏ Let H be the hypothesis space, in which the unknown target concept is searched for

✦ Learning can be seen as finding the model h∈H such that :

1) h is the best predictor of the training set → h(xi)=ui

2) h is the simplest model (syntactically speaking)

‣ In practice there are three main inductive criteria to search for h

๏ Minimisation of the Empirical Risk (ERM) in inductive learning

๏ Maximum likelihood estimation (MLE) used in the Bayesian approaches

๏ Minimum Description Length (MDL) which is a formalization of Occam
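As an illustration of the first criterion, Empirical Risk Minimisation can be sketched on a tiny hypothesis space. Everything in the example is an assumption made for the demo: the one-attribute sample, the candidate thresholds, and the choice of Lh as rules of the form «x < t».

```python
# Hypothetical labelled sample S = {(xi, ui)}: one numeric attribute,
# label True for the positive class.
SAMPLE = [(1, True), (2, True), (3, True), (6, False), (7, False), (8, False)]

# Hypothetical hypothesis space H: rules "x < t" for a few thresholds t.
THRESHOLDS = [2, 4.5, 7]


def make_hypothesis(t):
    """Return the hypothesis h(x) = (x < t)."""
    return lambda x: x < t


def empirical_risk(h, sample):
    """Fraction of the sample on which h(xi) differs from ui."""
    return sum(h(x) != u for x, u in sample) / len(sample)


# ERM: keep the hypothesis of H with the lowest risk on the training set.
risks = {t: empirical_risk(make_hypothesis(t), SAMPLE) for t in THRESHOLDS}
best_threshold = min(risks, key=risks.get)

print(risks)
print("Selected hypothesis: x <", best_threshold)
```

ERM only measures the fit to the training set; the simplicity criterion (point 2 above) has to be added separately, for instance through MDL.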


From dataset to the model

Languages Lx and Lh


Knowledge representation

✦ Building of the learning set

‣ We need to describe:

๏ The «objects» of the domain

๏ Their properties

๏ Their relationship

✦ The main stages

‣ To choose the «data granularity»

๏ What knowledge is to be modeled? What is needed to learn?

‣ To select the representation language (thus, the learning approach)

‣ To establish the mapping between the data and the language

The data encoding trap

✦ Take care of your hidden hypotheses
‣ For instance you want to summarize a collection of data
‣ You compute the mean and variance of this collection, using a Gaussian hypothesis

‣ What’s about a distribution like this one ?
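A quick numerical sketch of the trap, in Python. The values below are hypothetical and were chosen to form two well-separated groups; the original figure is not reproduced here, so a bimodal shape is only an assumed example of a distribution that breaks the Gaussian hypothesis.

```python
from statistics import mean, pstdev

# Hypothetical bimodal sample: one group around 2, another around 10.
data = [1, 1, 2, 2, 2, 3, 9, 10, 10, 10, 11, 11]

m = mean(data)
s = pstdev(data)

# Under a Gaussian hypothesis the mean is the most typical value.
# Here, no observation lies within 2 units of it.
near_mean = [v for v in data if abs(v - m) <= 2]

print(f"mean = {m}, standard deviation = {s:.2f}")
print("observations close to the mean:", near_mean)
```

The summary «6 ± 4.2» describes a value that never occurs in the data, which is exactly the kind of hidden hypothesis the slide warns about.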

In ML as in the rest of Computer Science: «Garbage In, Garbage Out»

Visual Analysis of the data

✦ A classical example: Anscombe's quartet

‣ What can you tell about these four datasets?

      I             II            III            IV
  X     Y       X     Y       X     Y       X     Y
 10    8.04    10    9.14    10    7.46     8    6.58
  8    6.95     8    8.14     8    6.77     8    5.76
 13    7.58    13    8.74    13   12.74     8    7.71
  9    8.81     9    8.77     9    7.11     8    8.84
 11    8.33    11    9.26    11    7.81     8    8.47
 14    9.96    14    8.10    14    8.84     8    7.04
  6    7.24     6    6.13     6    6.08     8    5.25
  4    4.26     4    3.10     4    5.39    19   12.50
 12   10.84    12    9.13    12    8.15     8    5.56
  7    4.82     7    7.26     7    6.42     8    7.91
  5    5.68     5    4.74     5    5.73     8    6.89

‣ From a statistical point of view they seem similar ...

Statistic                      Value
Mean of X                      9
Variance of X                  11
Mean of Y                      ~7.50
Variance of Y                  ~4.12
Correlation between X and Y    0.816
Linear regression              Y = 0.5X + 3

‣ ... but displaying the data provides a better insight!
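The summary statistics of Anscombe's quartet can be recomputed directly from the table. The sketch below uses only the Python standard library; the helper names `pearson` and `least_squares` are illustrative, and variances are sample variances.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


summary = {}
for name, (xs, ys) in QUARTET.items():
    slope, intercept = least_squares(xs, ys)
    summary[name] = {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
        "slope": slope,
        "intercept": intercept,
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

The four rows of output are nearly identical, which is the point: only a plot reveals that the datasets have very different shapes.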

Two main families of KR

Vectorial data
‣ Instances, numerical: table (rows are instances, columns are variables or attributes)
‣ Instances, symbolical: propositional logic (conjunction, disjunction of attributes)
‣ Knowledge, numerical: vector of parameters, hyperplanes, probabilities, ...
‣ Knowledge, symbolical: rules (knowledge-based systems)

Intermediate representations: sequences, trees, ...

Relational data
‣ Instances (numerical or symbolical): graphs, predicate logic
‣ Knowledge (numerical or symbolical): graphs, predicate logic, conceptual graphs, Horn clauses

An example in chemistry

Vectorial data
‣ Instances, numerical (N): C6H6 → 1, N-N → 0, C-CH → 3, C-N-O2 → 2, S-N → 0, ...
‣ Instances, symbolical (S): mass=167 ∧ number_cycle=1 ∧ contain_Br=no ∧ ...
‣ Knowledge, numerical (N): vector of parameters
‣ Knowledge, symbolical (S): IF (mass<500) ∧ (LogP>5) ∧ … THEN (potential_drug = true)

Relational data
‣ Instances (N and S): bond (m1, c1, Cl, single), bond (m1, c1, c2, single), (m1)
‣ Knowledge (N and S): mutagenic (M) :- bond (M, Atom1, Atom2, double), has_ring (M, R, 5), bond (M, R, Atom1, single), is (Atom1, Br), …

How to build a model: a simple example

✦ Consider the following training set (vectorial representation)

            A: Temperature   B: Dryness   Survival
Plant 1     2                2.4          +
Plant 2     4                3.5          -
Plant 3     8                1            +
Plant 4     8                7            -
•••         •••              •••          •••
Plant 19    3                9.5          -

✦ Objective: to learn a model predicting the survival according to A & B

Supervised learning / Classification

[Figure: scatter plot of the «+» and «-» examples; axes A: Temp and B: Dry, graduated from 0 to 9]

• Let the dataset S = {(xi, ui), …} with Lx such that:
‣ xi : {v1, v2, …, vp} with vi ∈ ℝ
‣ ui : {+, -}

• The data are coordinates of points in ℝ^p (here p=2)

• The problem: we search for a function h(xi), expressed in the language Lh, allowing to discriminate (and then predict) the two «classes»:
‣ + : « positive » examples
‣ - : « negative » examples (also named «counter examples»)

• Remarks
‣ Any problem can be turned into a two-class problem.
‣ h(xi) defines hyperplanes

Selection of Lh

[Figure: same scatter plot of the «+» and «-» examples (axes A: Temp and B: Dry, from 0 to 9), with the boundaries drawn by LH1, LH2 and LH3]

LH1
IF A < 3 THEN Class = +
IF B < 2.5 THEN Class = +
IF B > 8.5 THEN Class = +

LH2
IF A < 7-B THEN Class = +
IF A < B-4 THEN Class = +

LH3
IF A < -0.22×B² - 2.3×B + 8 THEN Class = +

Tradeoff Symbolic / Numeric
Efficiency : LH1 < LH2 < LH3
Readability : LH3 < LH2 < LH1

Unsupervised learning / Clustering

[Figure: same scatter plot (axes A: Temp and B: Dry, from 0 to 9), now grouped into four clusters C1, C2, C3 and C4]

• Let the dataset S = {xi, …} with Lx such that:
‣ xi : {v1, v2, …, vp} with vi ∈ ℝ
‣ No more labels

• The problem: we search for a discrete function h(xi), expressed with Lh, associating each xi to a cluster Cj.

• We want to get:
‣ Contrasted clusters
‣ Homogeneous clusters

• Some possible Lh
‣ Threshold based
‣ Distance based
๏ Hypersphere
‣ Distribution based
‣ ...
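A distance-based Lh can be sketched with a very small k-means loop, where h(xi) returns the index of the nearest cluster centre. This is an assumed example: the slides do not name k-means, and the points and initial centres below are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """h(xi): index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (kept as is if empty)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, centroids, max_iter=10):
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # stable partition: stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
POINTS = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
labels, centroids = k_means(POINTS, centroids=[(1, 1), (8, 8)])
print(labels, centroids)
```

The result depends on the initial centres; with poorly chosen ones the loop can settle on a less contrasted partition.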

Learning as a «Search Process»

How to efficiently explore H ?


From data to model

✦ Learning : a «game» between two spaces
‣ X : set of the instances (we use a sample, the training set, described with Lx)
‣ H : set of all the hypotheses that can be described with Lh

✦ Back to the learning criteria
‣ We would like to find the model h∈H that is:
๏ Accurate : highly correlated to the training set → h(xi)=ui for supervised learning
๏ Plausible : as simple as possible
‣ Example of criteria
๏ Let h be a boolean function

Accurate
- h must recognize the positive examples
- h must reject the negative examples

Plausible
- h must be general (few conjunctions)
- h must be simple (few disjunctions)

X and H spaces

[Figure: the instances space X, containing «+» and «-» examples, next to the hypotheses space H, containing the target concept C and a hypothesis hi]

✦ Relationship between these 2 spaces
‣ Let C be the «target» concept (unknown)
๏ We want to learn the function h(xi)=ui
๏ We don't even know if this function can be expressed with Lh
‣ Each hi of H can be associated with a part pi of X (the converse being false)
‣ Hypothesis hi is called a generalization of pi

Size of H

✦ A huge space …
‣ For instance, if Lh contains N boolean attributes
๏ Conjunctive formulas : 2^N − 1 possible hypotheses
๏ Disjunctive formulas : 2^(2^N − 1) possible hypotheses
- with {A, B} we have : Ø, A, B, A∧B, A∨B, A∧B∨A, A∧B∨B, A∧B∨A∨B
- with N = 10 there are about 10^308 possible hypotheses (with a lot of redundancies)

✦ Some (extreme) examples …
‣ MasterMind game («active learning»)
๏ Lx = Lh contains all the combinations of 4x8 colors
๏ X : propositions of the computer/player
๏ H : contains the code to guess
‣ Find a Regular Expression to discriminate documents
๏ X : a set of paragraphs in natural language with 2 classes
๏ H : the set of all possible Regular Expressions (with less than n characters)
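The two counts can be checked by brute-force enumeration for a small N. The sketch below follows the slide's convention: a conjunction is a non-empty subset of the attributes (2^N − 1 of them), and a disjunctive formula is any subset, possibly empty, of those conjunctions (2^(2^N − 1) of them). Function names are illustrative.

```python
from itertools import combinations


def subsets(items, min_size=0):
    """All subsets of `items` with at least `min_size` elements."""
    items = list(items)
    return [
        frozenset(c)
        for r in range(min_size, len(items) + 1)
        for c in combinations(items, r)
    ]


def conjunctions(attributes):
    """Non-empty sets of attributes, e.g. {A}, {B}, {A, B}."""
    return subsets(attributes, min_size=1)


def disjunctions(attributes):
    """Any set of conjunctions, the empty set standing for Ø."""
    return subsets(conjunctions(attributes))


n_conj = len(conjunctions("AB"))
n_disj = len(disjunctions("AB"))
print(n_conj, n_disj)

# For N = 10 the space is far too large to enumerate: only count it.
n_hypotheses = 2 ** (2 ** 10 - 1)
print("number of decimal digits:", len(str(n_hypotheses)))
```

Many of these formulas are logically equivalent (A∧B∨A is just A), which is the redundancy mentioned on the slide.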

Exploration of H

✦ Learning as a search process

[Figure: the spaces X and H, with the target concept C and two hypotheses; here h'j is more specific than hi]

✦ Thus, two processes are involved during this search:
‣ Evaluating the «quality» of hi with respect to the training set in X
๏ Symbolical : measure the «cover» of hi : Fct(«#instances +», «#instances -»)
๏ Numerical : evaluate the error between hi(xk) and uk (true label)
‣ Searching for a better hypothesis h'j through operators Opn transforming hi
๏ Symbolical : generalization and/or specialization of the hypothesis hi
๏ Numerical : modification of the parameters of the model

Some classical «generic algorithms»

✦ «Generate and Test» methods (ID3, ILP, …)
‣ Initialization of C, the concept to learn (e.g. C=Ø)
‣ While quality(C) < threshold
๏ Apply the refinement operators Opn on the current hi → C'= {h'1, …}
๏ For each h'j the system evaluates quality(h'j, X)
๏ Update C with the selected hypothesis of C'

✦ «Optimization» methods (Perceptron, connectionism, …)
‣ Initialization of C, the concept to learn (e.g. randomly)
‣ While quality(C) < threshold
๏ Pick an example xi : {v1, …, vn, u} from the training set X
๏ Compute the predicted concept u' = hi(xi)
๏ Apply the refinement operators on hi to decrease the distance Δ(u, u')

Strategies to explore H

✦ Complete search!
‣ The simplest and the best …
‣ … when H is “small enough” and/or “well-organized” (e.g. a lattice)
‣ … but that's often impossible, due to:
๏ The size of H
๏ The topology of H, which is too complex (too many local minima)
๏ The presence of “noise” in the training set (errors on data or labels)

✦ Partial (heuristic) exploration
‣ Requirement: H must have a good topology
๏ Discrete (i.e. a relation of generality between the hypotheses hi)
๏ Continuous (i.e. a relation of neighborhood between the hypotheses hi)
‣ If H is chaotic it would be impossible to learn something …

Gradient search

✦ A very classic and efficient strategy
‣ Main principle:
๏ The system starts from a “random” hi
๏ We look at the operators Opn that can be applied
๏ We select the operator Opk maximizing quality(Opk(hi))
๏ Opk(hi) is the new h'j ; we continue until the stopping condition is verified
‣ Can be very efficient, or a total failure if there are too many local optima in H ...

[Figure: two quality curves; one search stops on a sub-optimal peak, the other reaches the optimal one]
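The principle can be sketched as a greedy hill climbing over a discrete hypothesis space. The quality landscape below is hypothetical (hypotheses are just positions in a list, and the two operators are «move left» and «move right»); it is built to contain one sub-optimal peak and one optimal peak.

```python
# Hypothetical quality of nine hypotheses h0..h8; peaks at h2 and h7.
QUALITY = [1, 2, 3, 2, 1, 2, 4, 6, 5]


def neighbours(i):
    """Hypotheses reachable by one operator (one step left or right)."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(QUALITY)]


def hill_climb(start):
    """Repeatedly apply the operator that maximizes quality.

    Stops when no neighbour is strictly better than the current
    hypothesis, i.e. on a local optimum.
    """
    current = start
    while True:
        best = max(neighbours(current), key=QUALITY.__getitem__)
        if QUALITY[best] <= QUALITY[current]:
            return current
        current = best


for start in (0, 5):
    end = hill_climb(start)
    print(f"start h{start} -> stops at h{end} (quality {QUALITY[end]})")
```

Starting from h0 the search is trapped on the sub-optimal peak h2, while starting from h5 it reaches the best hypothesis h7: the outcome depends entirely on the starting point.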

Other strategies

✦ Beam search
‣ At each step we keep a collection of possible hypotheses {h1, …, hi, …}
‣ These hypotheses are competing.

✦ Simulated annealing
‣ We select the Opn maximizing the quality of the current state hi
‣ When there is no more improvement
๏ Select randomly another state h'j
๏ Proximity(hi, h'j) is a function of the “temperature” T

✦ Genetic algorithm
‣ The search is, in a sense, “random”
‣ We work with a population of hypotheses {h1, …, hi, …}
‣ The learning process is based on two steps:
๏ Selection of the most relevant hypotheses (fitting with X)
๏ Reproduction/Mutation/Crossing-over of these hypotheses

[Figure: temperature T plotted against time, and a quality curve]

Validation and Remediation

The art of learning


Empirical validation

✦ What is the accuracy of the model that has been learned?
‣ First, we need to decide what to measure and how to measure it ...
‣ Supervised learning:
๏ We use a test set, i.e. a collection of examples not used during the learning step
๏ The idea is to measure the prediction accuracy: for each (xi, ui), does h(xi) = ui ?
- When h() is discrete: we measure the error rate
- When h() is continuous: we measure the «Mean Absolute Error» or «Mean Square Error»
‣ Unsupervised learning:
๏ No universal criteria → ask an expert of the domain (can be difficult)
๏ Go back to the «supervised case» by using a set of classified data
- Measure the mapping between the real and learned clusters
- Example : NG20 Usenet groups in document clustering

✦ Two important criteria
‣ Learning error (Ea): percentage of misclassified examples on the training set
‣ Generalization error (Eg): percentage of misclassified examples on the test set
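The three measures mentioned above are short enough to write out. This Python sketch uses hypothetical labels and predictions; the same `error_rate` function gives Ea when applied to the training set and Eg when applied to the test set.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of misclassified examples (discrete h)."""
    wrong = sum(u != p for u, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mean_absolute_error(true_values, predicted_values):
    """Average of |u - h(x)| (continuous h)."""
    return sum(abs(u - p) for u, p in zip(true_values, predicted_values)) / len(
        true_values
    )


def mean_square_error(true_values, predicted_values):
    """Average of (u - h(x))^2 (continuous h); penalizes large errors more."""
    return sum((u - p) ** 2 for u, p in zip(true_values, predicted_values)) / len(
        true_values
    )


# Hypothetical test-set results.
eg = error_rate(["+", "-", "+", "+"], ["+", "+", "+", "-"])
mae = mean_absolute_error([1, 2, 3], [1, 3, 5])
mse = mean_square_error([1, 2, 3], [1, 3, 5])
print(eg, mae, mse)
```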

In practice ...

✦ Classical approaches

‣ The learning set is randomly split into several parts

                  2 Parts   3 Parts   Role
Training set      67 %      50 %      Used to learn the model
Validation set    -         25 %      Helps to tune the learning parameters (pruning, stability)
Test set          33 %      25 %      Measures the accuracy of the model

‣ All the subsets must be «Independent and Identically Distributed» (I.I.D. data)

‣ Drawbacks of creating these fixed datasets
๏ That's OK if we have enough examples ...
๏ But otherwise:
- A part of the examples is lost for the learning step
- No information about the stability of the model

Cross validation

✦ Let P = {x1, …, xp} be the set of examples of the learning set

✦ «Leave one out» approach
‣ We learn « p » different models Mi using P-{xi} as a training set
‣ We test each Mi with the remaining example xi
‣ By summing we get the percentage of correct answers
→ Accurate but time consuming

✦ Generalization of the process with the « N-Fold » technique
‣ We split P into N subsets containing p/N examples
‣ We learn N models, each using N-1 subsets to learn and the remaining subset to test
‣ Fast approach (N~10) allowing to evaluate the variance of the result (stability)
‣ Example for M1 with N=5: learning on N2+N3+N4+N5, test on N1
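N-fold cross validation can be sketched in plain Python. The learner below is a deliberately simple, hypothetical one (a threshold halfway between the class means of a single attribute), and the ten examples are made up; leave-one-out is obtained by setting N to the number of examples. A real experiment would shuffle the examples before splitting.

```python
from statistics import mean


def learn(train):
    """Hypothetical learner: threshold halfway between the two class means."""
    pos = [x for x, u in train if u == "+"]
    neg = [x for x, u in train if u == "-"]
    return (mean(pos) + mean(neg)) / 2


def predict(threshold, x):
    return "+" if x < threshold else "-"


def cross_validate(examples, n):
    """Return the accuracy obtained on each of the n folds."""
    folds = [examples[i::n] for i in range(n)]  # interleaved split
    scores = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = learn(train)
        correct = sum(predict(model, x) == u for x, u in test)
        scores.append(correct / len(test))
    return scores


# Hypothetical learning set: values below 5 are "+", the others "-".
EXAMPLES = [(x, "+" if x < 5 else "-") for x in range(10)]

scores = cross_validate(EXAMPLES, n=5)
loo_scores = cross_validate(EXAMPLES, n=len(EXAMPLES))  # leave one out
print("5-fold accuracies:", scores, "mean:", mean(scores))
print("leave-one-out accuracy:", mean(loo_scores))
```

The spread of the per-fold accuracies is what gives information about the stability of the model, which a single fixed split cannot provide.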

Contingency table

✦ With 2 classes:

                           Real label: class=+     Real label: class=-
Predicted label: class=+   A (True positives)      B (False positives)
Predicted label: class=-   C (False negatives)     D (True negatives)

Recognition rate = (A + D) / (A + B + C + D) = 1 − error rate

✦ But many other ways to evaluate

‣ Medical domain
๏ Sensitivity = A / (A + C): probability that a test result will be positive when the disease is present
๏ Specificity = D / (B + D): probability that a test result will be negative when the disease is not present

‣ IR domain
๏ Precision = A / (A + B)
๏ Recall = A / (A + C)

✦ Can be generalized to N classes
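The measures derived from the contingency table translate directly into code. In this sketch the function name and the example counts are hypothetical; A, B, C and D keep the meaning they have in the table.

```python
def contingency_metrics(a, b, c, d):
    """Metrics from a 2-class contingency table.

    a: true positives, b: false positives,
    c: false negatives, d: true negatives.
    """
    return {
        "recognition_rate": (a + d) / (a + b + c + d),
        "sensitivity": a / (a + c),  # same formula as recall
        "specificity": d / (b + d),
        "precision": a / (a + b),
        "recall": a / (a + c),
    }


# Hypothetical counts on a test set of 100 examples.
metrics = contingency_metrics(a=40, b=10, c=20, d=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and recall are the same quantity under two names, one coming from the medical domain and the other from information retrieval.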

In statistics we trust (but with care)

✦ Recognition rate is relevant if ...
‣ The numbers of examples in the different classes are «well-balanced»
๏ When there are very few positive examples → use ROC curves
‣ The «costs» of errors B & C are equivalent
๏ False in the medical domain or in risk management

✦ Some classical traps
‣ Precision(M) = 87.678% (with 1000 test examples only)
‣ Interpretation of the diagrams

[Figure: accuracy, on an axis running from 60 % to 100 %, plotted against #Sample from 1000 to 4000]

Evaluation = statistics + a critical look

Causes of failure

✦ Training set …
‣ The sample is too small to cover X significantly
‣ The number of attributes is too large with respect to the number of examples
๏ Nothing to do from the statistical point of view ...
๏ ... but any result can be significant for the end-user (feature selection for instance)
‣ The attributes are not able to express the target concept C
๏ Some knowledge is missing
๏ Some knowledge is incomplete (ordered values / continuous values)
‣ The dataset is “noisy”: false values or, even worse, false labels
‣ ...

✦ The learning algorithm …
‣ The parameters (bias) of the system are not correctly set
‣ The target concept cannot be learned in the current Lh
‣ ...

The «bias-variance» tradeoff

✦ Again, the choice of Lh is crucial

‣ The smaller H is, the stronger the learning bias is, thus…
๏ It is easier/faster to learn
๏ The concept that we can learn is “simple” (leading perhaps to a failure)

‣ Conversely, when H is large, we have a «weak» learning bias ...
๏ The system can learn (slowly) a complex problem …
๏ … but this can lead to the «overfitting» problem

[Figure: error rate against the increasing complexity of Lh, with one curve for the learning error and one for the generalization error; the variance of the predictions increases with complexity]

Overfitting problem

✦ A well-known case : regression
‣ Fitting a set of data with a polynomial
‣ Too much precision kills prediction ...

[Figure: points in the (A, B) plane, axes graduated from 0 to 9, fitted by three models: LH1 (too simple), LH2 and LH3 (too complex)]

In terms of learning error: Ea(LH1) > Ea(LH2) > Ea(LH3)

In terms of generalization error: Eg(LH1) ≡ Eg(LH3) > Eg(LH2)
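The effect can be reproduced with a few lines of Python. The data are hypothetical: six training points following y = x with an alternating noise of ±0.5, and one unseen point (6, 6). A straight line (a simple Lh) is compared with the degree-5 polynomial that passes through every training point (a complex Lh), written here in Lagrange form.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line, returned as a function of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs)-1 through all points (Lagrange form)."""

    def polynomial(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return polynomial


def mean_square_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = x plus an alternating noise of +0.5 / -0.5.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
test_x, test_y = [6], [6]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in (("line", line), ("degree-5 polynomial", poly)):
    ea = mean_square_error(model, train_x, train_y)
    eg = mean_square_error(model, test_x, test_y)
    print(f"{name}: learning error = {ea:.3f}, generalization error = {eg:.3f}")
```

The polynomial fits the training set perfectly because it also fits the noise, and its prediction on the unseen point is far worse than that of the line.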

Full data mining process
