Università di Milano-Bicocca - Laurea Magistrale in Informatica
1
Università di Milano-Bicocca - Laurea Magistrale in Informatica
Course: APPRENDIMENTO E APPROSSIMAZIONE
Prof. Giancarlo Mauri
Lecture 4 - Computational Learning Theory
2
Computational models of cognitive phenomena
Computing capabilities: computability theory
Reasoning/deduction: formal logic
Learning/induction: computational learning theory
3
A theory of the learnable (Valiant '84)
[…] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn. […] Learning machines must have all three of the following properties:
the machines can provably learn whole classes of concepts; these classes can be characterized
the classes of concepts are appropriate and nontrivial for general-purpose knowledge
the computational process by which the machine builds the desired programs requires a "feasible" (i.e. polynomial) number of steps
4
A theory of the learnable
We seek general laws that constrain inductive learning, relating:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
5
Probably approximately correct learning
A formal computational model which aims to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms.
6
What we want to learn
That is:
to determine uniformly good approximations of an unknown function from its values at some sample points
(interpolation, pattern matching, concept learning)
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms, starting from: - examples - incomplete specifications
7
What's new in PAC learning
Accuracy of results and running time for learning algorithms are explicitly quantified and related.
A general problem:
use of resources (time, space, …) by computations: COMPLEXITY THEORY
Example:
Sorting: n·log n time (polynomial, feasible)
Boolean satisfiability: 2ⁿ time (exponential, intractable)
8
Learning from examples
[Diagram: a LEARNER receives EXAMPLES drawn from a DOMAIN containing a concept and outputs A REPRESENTATION OF A CONCEPT]
CONCEPT: subset of the domain
EXAMPLES: elements of the concept (positive)
REPRESENTATION: domain → expressions
GOOD LEARNER, EFFICIENT LEARNER
9
The PAC model
A domain X (e.g. {0,1}ⁿ, Rⁿ)
A concept: a subset of X, f ⊆ X, or f: X → {0,1}
A class of concepts F ⊆ 2^X
A probability distribution P on X
Example 1
X ≡ a square, F ≡ triangles in the square
10
The PAC model
Example 2
X ≡ {0,1}ⁿ, F ≡ a family of boolean functions
f_r(x1, …, xn) = 1 if there are at least r ones in (x1, …, xn), 0 otherwise
P: a probability distribution on X (uniform or non-uniform)
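As a concrete illustration of Example 2, a minimal Python sketch of the threshold concepts f_r (the function name and the sample inputs are purely illustrative):

def f_r(x, r):
    # Concept of Example 2: 1 iff the bit vector x contains at least r ones.
    return 1 if sum(x) >= r else 0

# On X = {0,1}^4, f_2 labels (1,0,1,0) positively and (0,0,1,0) negatively.
print(f_r((1, 0, 1, 0), r=2))  # 1
print(f_r((0, 0, 1, 0), r=2))  # 0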
11
The PAC model
The learning process
Labeled sample: ((x0, f(x0)), (x1, f(x1)), …, (xn, f(xn)))
Hypothesis: a function h consistent with the sample (i.e. h(xi) = f(xi) for all i)
Error probability: P_err = P{x ∈ X : h(x) ≠ f(x)}
12
[Diagram: a TEACHER / examples generator with probability distribution P draws t examples from X for a target concept f ∈ F and passes the labeled t-sample (x1, f(x1)), …, (xt, f(xt)) to the LEARNER, an inference procedure A that outputs a hypothesis h (implicit representation of a concept)]
The learning algorithm A is good if the hypothesis h is "ALMOST ALWAYS" "CLOSE TO" the target concept c
The PAC model
13
"CLOSE TO": METRIC, given P: d_P(f,h) = P_err = P{x : f(x) ≠ h(x)}
Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation of f if d_P(f,h) ≤ ε
"ALMOST ALWAYS": confidence parameter δ (0 < δ ≤ 1)
The "measure" of the sequences of examples, randomly chosen according to P, such that h is an ε-approximation of f is at least 1-δ
The PAC model
14
[Diagram: generator of examples → Learner → h]
F: a concept class; S: a set of labeled samples from a concept in F; A: S → F such that:
I) A(S) is consistent with S
II) P(P_err < ε) > 1-δ
for all 0 < ε, δ < 1, for all f ∈ F, there exists m ∈ N such that the above holds for every S with |S| ≥ m
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1: a concept class F = ∪_{n≥1} F_n is statistically PAC learnable if there is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded by some polynomial function in n, 1/ε, 1/δ
Look for algorithms which use a "reasonable" amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC ⇒ STATISTICAL PAC
DEF 2: a concept class F = ∪_{n≥1} F_n is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ
17
B_n = {f : {0,1}ⁿ → {0,1}}: the set of boolean functions in n variables
F_n ⊆ B_n: a class of concepts
Example 1: F_n = clauses with literals in {x1, x̄1, …, xn, x̄n}
Example 2: F_n = linearly separable functions in n variables, f(x) = HS(Σ_k w_k x_k − λ)
REPRESENTATION:
- TRUTH TABLE (EXPLICIT)
- BOOLEAN CIRCUITS (IMPLICIT)
[Diagram: BOOLEAN CIRCUITS → BOOLEAN FUNCTIONS]
Learning boolean functions
18
• BASIC OPERATIONS: ∧, ∨, ¬
• COMPOSITION: [f(g1, …, gm)](x) = f(g1(x), …, gm(x)), with f in m variables and g1, …, gm in n variables
CIRCUIT: finite acyclic directed graph with input nodes (the variables), internal nodes labeled by basic operations, and an output node
[Diagram: example circuit with inputs x1, x2, x3 and ∧/∨ gates]
Given an assignment x1, …, xn ∈ {0,1} to the input variables, the output node computes the corresponding value
Boolean functions and circuits
19
F_n ⊆ B_n
C_n: a class of circuits which compute all and only the functions in F_n
F = ∪_{n=1}^∞ F_n, C = ∪_{n=1}^∞ C_n
Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes C = A_n(S)
• Output: C (C = representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on {0,1}ⁿ, and outputs A_n(S) = A(n, S)
Boolean functions and circuits
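A minimal Python sketch of this learning protocol, with the teacher modelled as a function that draws labelled examples; the names (pac_learn, draw_example, memorise) and the placeholder sample-size function are illustrative, not part of the slides:

import random

def pac_learn(n, eps, delta, sample_size, draw_example, infer):
    # Protocol of the slide above: compute t, request a labelled t-sample, run A_n on it.
    t = sample_size(n, eps, delta)           # t = t(n, 1/eps, 1/delta)
    S = [draw_example() for _ in range(t)]   # labelled t-sample obtained from the teacher
    return infer(n, S)                       # C = A_n(S): representation of the hypothesis

# Toy usage: target concept "at least two ones" over {0,1}^3.
def target(x):
    return 1 if sum(x) >= 2 else 0

def draw_example():
    x = tuple(random.randint(0, 1) for _ in range(3))
    return x, target(x)

def memorise(n, S):
    table = dict(S)                          # trivial learner: memorise the labelled sample
    return lambda x: table.get(x, 0)

h = pac_learn(n=3, eps=0.1, delta=0.05,
              sample_size=lambda n, e, d: 50,   # placeholder; real bounds come on later slides
              draw_example=draw_example, infer=memorise)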
20
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪_{n=1}^∞ F_n, using the class of representations C = ∪_{m=1}^∞ C_m, if for all n ≥ 1, for all f ∈ F_n, for all 0 < ε, δ < 1 and for every probability distribution P over {0,1}ⁿ the following holds:
if the inference procedure A_n receives as input a t-sample, it outputs a representation c ∈ C_n of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the function g inferred satisfies
P{x : f(x) ≠ g(x)} ≤ ε
g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f
NOTE: distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF: An inference procedure A_n for the class F_n is consistent if, given the target function f ∈ F_n, for every t-sample S = (<x1,b1>, …, <xt,bt>), A_n(S) is a representation of a function g "consistent" with S, i.e. g(x1) = b1, …, g(xt) = bt
DEF: A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM: Estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ)
Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM: t(n, 1/ε, 1/δ) ≤ (1/ε)(ln|F_n| + ln(1/δ))
PROOF: Prob{(x1, …, xt) : ∃g ε-bad with g(x1)=f(x1), …, g(xt)=f(xt)} ≤
≤ Σ_{g ε-bad} Prob{g(x1) = f(x1), …, g(xt) = f(xt)}     [since P(A∪B) ≤ P(A)+P(B)]
≤ Σ_{g ε-bad} ∏_{i=1,…,t} Prob{g(xi) = f(xi)}     [independent events]
≤ Σ_{g ε-bad} (1-ε)^t ≤ |F_n|(1-ε)^t ≤ |F_n| e^(-εt)     [g is ε-bad]
Impose |F_n| e^(-εt) ≤ δ.
NOTE: |F_n| must be finite
A simple upper bound
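The theorem translates directly into a sample-size routine; a sketch in Python, assuming |F_n| is known and using natural logarithms as in the proof (the function name and the numeric example are illustrative):

from math import ceil, log

def sample_size_simple(card_Fn, eps, delta):
    # Simple upper bound above: t >= (1/eps) * (ln|F_n| + ln(1/delta)).
    return ceil((log(card_Fn) + log(1.0 / delta)) / eps)

# Example: a finite class with |F_n| = 3**10 (e.g. monomials over 10 variables), eps=0.1, delta=0.01.
print(sample_size_simple(3 ** 10, eps=0.1, delta=0.01))  # 156 examples suffice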
23
X: domain; F ⊆ 2^X: class of concepts; S = (x1, …, xt): t-sample
f ≈_S g iff f(xi) = g(xi) for all xi ∈ S: f and g are indistinguishable by S
Π_F(S) = |F / ≈_S|: index of F with respect to S
Problem: uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
m_F(t) = max{Π_F(S) : S is a t-sample}: growth function
24
A general upper bound
THEOREM: Prob{(x1, …, xt) : ∃g ε-bad with g(x1) = f(x1), …, g(xt) = f(xt)} ≤ 2 m_F(2t) e^(-εt/2)
FACT:
m_F(t) ≤ 2^t
m_F(t) ≤ |F| (this condition immediately gives the simple upper bound)
m_F(t) = 2^t ⟹ for all j < t, m_F(j) = 2^j
25
DEFINITION: d = VCdim(F) = max{t : m_F(t) = 2^t}
FUNDAMENTAL PROPERTY:
m_F(t) = 2^t for t ≤ d
m_F(t) ≤ Σ_{k=0}^{d} (t choose k) for t > d: BOUNDED BY A POLYNOMIAL IN t (of degree d)
[Figure: graph of the growth function m_F(t), equal to 2^t up to t = d and then growing polynomially towards m_F(∞) ≤ |F|]
Graph of the growth function
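The fundamental property can be checked numerically; a small sketch comparing 2^t with the binomial sum for an illustrative VC dimension d = 3:

from math import comb

def growth_bound(t, d):
    # Bound above: for t > d, m_F(t) <= sum_{k=0}^{d} C(t, k).
    return sum(comb(t, k) for k in range(d + 1))

d = 3
for t in (3, 5, 10, 20):
    print(t, 2 ** t, growth_bound(t, d))
# For t <= d the bound equals 2^t; beyond d it grows only polynomially (degree d) in t.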
26
THEOREM: If d_n = VCdim(F_n),
then t(n, 1/ε, 1/δ) ≤ max{(4/ε) log(2/δ), (8 d_n/ε) log(13/ε)}
PROOF: Impose 2 m_{F_n}(2t) e^(-εt/2) ≤ δ
A lower bound on t(n, 1/ε, 1/δ): number of examples which are necessary for arbitrary algorithms
THEOREM: For 0 < ε ≤ 1/8 and δ ≤ 1/100,
t(n, 1/ε, 1/δ) ≥ max{((1-ε)/ε) ln(1/δ), (d_n-1)/(32ε)}
Upper and lower bounds
27
Π_F(S) = {f^(-1)(1) ∩ {x1, …, xt} : f ∈ F}
i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If Π_F(S) = 2^S, we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F
An equivalent definition of VCdim
28
VCdim(F) = 3, ε = 0.01, δ = 0.001
t(n, 1/ε, 1/δ) ≤ max{400·log(2000), 800·3·log(1300)} ≈ 24000: sufficient
t(n, 1/ε, 1/δ) ≥ max{100·(1-1/100)·ln(1000), (3-1)/(32·(1/100))} ≈ 690: necessary
Learn the family F of circles contained in the square
Example 1
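These figures can be reproduced by plugging ε, δ and VCdim(F) into the bounds of the previous slides; a sketch assuming base-2 logarithms in the upper bound (the slide does not state the base, so the numbers match only approximately):

from math import log, log2

eps, delta, d = 0.01, 0.001, 3   # accuracy, confidence, VCdim of circles in the square

sufficient = max((4 / eps) * log2(2 / delta),
                 (8 * d / eps) * log2(13 / eps))     # upper bound, consistent algorithms
necessary = max(((1 - eps) / eps) * log(1 / delta),
                (d - 1) / (32 * eps))                # lower bound, arbitrary algorithms

print(round(sufficient))   # about 24800 (the slide rounds this to ~24000 sufficient)
print(round(necessary))    # about 684 (the slide reports ~690 necessary)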
29
f ∈ L_n ⟹ there exist w1, …, wn, λ such that f(x1, …, xn) = HS(Σ_k w_k x_k − λ)
HS(x) = 1 if x ≥ 0, 0 otherwise
VCdim(L_n) = n + 1, |L_n| ≤ 2^(n²)
SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)(n² + ln(1/δ))
UPPER BOUND USING VCdim(L_n): t(n, 1/ε, 1/δ) ≤ max{(4/ε) log(2/δ), (8(n+1)/ε) log(13/ε)}: GROWS LINEARLY WITH n
Learn the family of linearly separable boolean functions in n variables, L_n
Example 2
30
Consider the class L2 of linearly separable functions in two variables
VCdim(L_n) = n + 1, hence VCdim(L2) ≥ 3
VCdim(L2) < 4:
[Figure] The green point cannot be separated from the other three: no straight line can separate the green from the red points
Example 2
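The figure's argument can be verified directly: if one of four points lies inside the triangle formed by the other three, no straight line gives it a different label from all of them, so that 4-point set is not shattered by L2. A minimal sketch with illustrative coordinates:

def in_triangle(p, a, b, c):
    # True iff point p lies inside (or on) triangle abc, via signs of cross products.
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

red = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]   # three "red" points
green = (2.0, 1.0)                           # the "green" point, inside their triangle

# green is in the convex hull of the red points, so the labelling green=1 / red=0
# cannot be realised by any halfplane: this 4-point set is not shattered by L2.
print(in_triangle(green, *red))   # True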
31
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj (mj monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj (cj clauses)
k-DNF: at most k literals in each monomial
k-term-DNF: at most k monomials
k-CNF: at most k literals in each clause
k-clause-CNF: at most k clauses
Monotone formulas: contain no negated literals
m-formulas: each variable appears at most once
32
Th. (Valiant): Monomials are learnable from positive examples with (2/ε)·(n + log(1/δ)) examples (ε = tolerated error), setting g ≡ the conjunction of the literals xi such that xi = 1 in all the positive examples and the literals x̄i such that xi = 0 in all the positive examples.
begin
  H := x1 x̄1 x2 x̄2 … xn x̄n
  for i := 1 to B do
    generate a positive example e
    for j := 1 to n do
      if e(j) = 1 then delete x̄j from H
      else delete xj from H
end
N.B.: Learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable.
Th.: Monomials are not learnable from negative examples.
The results
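A Python sketch of the deletion algorithm above, assuming positive examples are 0/1 tuples; the hypothesis is kept as two sets of variable indices, one for the surviving positive literals and one for the surviving negated literals (names are illustrative):

def learn_monomial(positive_examples, n):
    # Valiant's algorithm: start from x1 x~1 ... xn x~n and delete every literal
    # contradicted by some positive example; negative examples are never used.
    pos_lits = set(range(n))   # j such that the literal  x_j  is still in H
    neg_lits = set(range(n))   # j such that the literal ~x_j  is still in H
    for e in positive_examples:
        for j in range(n):
            if e[j] == 1:
                neg_lits.discard(j)   # e(j)=1: delete ~x_j from H
            else:
                pos_lits.discard(j)   # e(j)=0: delete x_j from H
    return pos_lits, neg_lits

# Target x1 AND ~x3 over n=3: every positive example has e[0]=1 and e[2]=0.
print(learn_monomial([(1, 0, 0), (1, 1, 0)], n=3))   # ({0}, {2}) i.e. x1 AND ~x3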
33
For all K:
1) K-CNF formulas are learnable from positive examples only
1b) K-DNF formulas are learnable from negative examples only
2) (K-DNF ∧ K-CNF) and (K-DNF ∨ K-CNF) are learnable from positive and negative examples
3) the class of K-decision lists is learnable
K-DL ≡ ((m1, b1), …, (mj, bj)) with mi monomials with at most k literals and bi ∈ {0,1}; for a boolean vector v, C(v) = b_{min{i : mi(v)=1}}, and C(v) = 0 if no such i exists
Th.: Every K-DNF (or K-CNF) formula can be represented by a small K-DL
Positive results
34
If RP ≠ NP (in the distribution-free sense):
1) m-formulas are not learnable
2) Threshold boolean functions are not learnable
3) For K ≥ 2, K-term-DNF formulas are not learnable
Negative results
35
Mistake bound model
So far: how many examples are needed to learn?
What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
Instances drawn at random from X according to distribution D
Learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?
36
Mistake bound model
Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis
I.e. learning takes place during the use of the system, not off-line
Ex.: prediction of fraudulent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
For each positive training instance x: remove from h any literal not satisfied by x
Output h
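A minimal Python sketch of FIND-S run on-line, which also counts mistakes (this ties in with the bound on the next slide); h is represented as the set of literals (j, v) meaning "x_j = v", and all names are illustrative:

def find_s_online(stream, n):
    # Start from the most specific hypothesis x1 ^ ~x1 ^ ... ^ xn ^ ~xn, predict with the
    # current h before seeing each label, and drop literals violated by positive instances.
    h = {(j, v) for j in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in stream:
        prediction = 1 if all(x[j] == v for (j, v) in h) else 0
        if prediction != label:
            mistakes += 1
        if label == 1:                     # negative instances never change h
            h = {(j, v) for (j, v) in h if x[j] == v}
    return h, mistakes

stream = [((1, 0, 1), 1), ((0, 0, 0), 0), ((1, 1, 1), 1)]
h, m = find_s_online(stream, n=3)
print(sorted(h), m)   # [(0, 1), (2, 1)] i.e. x1 ^ x3, with 2 mistakes (both on positives)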
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H? (only positive examples can be misclassified)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal ⟹ #mistakes ≤ n+1 (worst case for the "total" concept: ∀x, c(x)=1)
39
Mistake bound for Halving
A version space is maintained and refined (e.g. by Candidate-elimination)
Prediction is based on a majority vote among the hypotheses in the current version space
"Wrong" hypotheses are removed (even if x is exactly classified)
How many errors to exactly learn c ∈ H? (H finite)
Mistake: when the majority of the hypotheses misclassifies x; these hypotheses are removed
For each mistake, the version space is at least halved
At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining)
Note: learning without mistakes is possible
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H=C?
Formally, for any learning algorithm A and any target concept c:
M_A(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences
M_A(C) = max_{c∈C} M_A(c)
Note: M_Find-S(C) = n+1, M_Halving(C) ≤ log2(|C|)
Opt(C) = min_A M_A(C)
i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)
There exist concept classes for which VC(C) = Opt(C) = M_Halving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < M_Halving(C)
42
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a pool of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data
43
Weighted majority algorithm
∀i: wi := 1
∀ training example (x, c(x)):
  q0 := q1 := 0
  ∀ prediction algorithm ai:
    if ai(x)=0 then q0 := q0 + wi
    if ai(x)=1 then q1 := q1 + wi
  if q1 > q0 then predict c(x)=1
  if q1 < q0 then predict c(x)=0
  if q1 = q0 then predict c(x)=0 or 1 at random
  ∀ prediction algorithm ai do: if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
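A direct transcription of this pseudocode into Python; the pool of prediction algorithms is passed as a list of functions and beta is the weight-decay parameter (all names and the toy pool are illustrative):

import random

def weighted_majority(stream, pool, beta=0.5):
    # Weighted vote over the pool; every algorithm that predicted wrongly has its
    # weight multiplied by beta (beta = 0 recovers the Halving algorithm).
    w = [1.0] * len(pool)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for ai, wi in zip(pool, w) if ai(x) == 0)
        q1 = sum(wi for ai, wi in zip(pool, w) if ai(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q1 < q0 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        w = [wi * beta if ai(x) != label else wi for ai, wi in zip(pool, w)]
    return w, mistakes

# Toy pool of three fixed predictors on {0,1}^2 inputs.
pool = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[0]]
stream = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1)]
print(weighted_majority(stream, pool))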
44
Weighted majority algorithm (WM)
Coincides with Halving for β=0
Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A for D, and β=1/2. Then WM makes at most
2.4·(k + log2 n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k. The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.
The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D.
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n(3/4)^M
from which
M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4·(k + log2 n)
I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically with the size of the pool.
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
2
Computational models of cognitive phenomena
Computing capabilities Computability theory
Reasoningdeduction Formal logic
Learninginduction
3
A theory of the learnable (Valiant lsquo84)
[hellip] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn [hellip] Learning machines must have all 3 of the following properties
the machines can provably learn whole classes of concepts these classes can be characterized
the classes of concepts are appropriate and nontrivial for general-purpose knowledge
the computational process by which the machine builds the desired programs requires a ldquofeasiblerdquo (ie polynomial) number of steps
4
A theory of the learnable
We seek general laws that constrain inductive learning relating
Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented
5
Probably approximately correct learning
formal computational model which want shed
light on the limits of what can be
learned by a machine analysing the
computational cost of learning algorithms
6
What we want to learn
That is
to determine uniformly good approximations of an unknown function from its value in some sample points
interpolation pattern matching concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
3
A theory of the learnable (Valiant lsquo84)
[hellip] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn [hellip] Learning machines must have all 3 of the following properties
the machines can provably learn whole classes of concepts these classes can be characterized
the classes of concepts are appropriate and nontrivial for general-purpose knowledge
the computational process by which the machine builds the desired programs requires a ldquofeasiblerdquo (ie polynomial) number of steps
4
A theory of the learnable
We seek general laws that constrain inductive learning relating
Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented
5
Probably approximately correct learning
formal computational model which want shed
light on the limits of what can be
learned by a machine analysing the
computational cost of learning algorithms
6
What we want to learn
That is
to determine uniformly good approximations of an unknown function from its value in some sample points
interpolation pattern matching concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
4
A theory of the learnable
We seek general laws that constrain inductive learning relating
Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented
5
Probably approximately correct learning
formal computational model which want shed
light on the limits of what can be
learned by a machine analysing the
computational cost of learning algorithms
6
What we want to learn
That is
to determine uniformly good approximations of an unknown function from its value in some sample points
interpolation pattern matching concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT:
• m_F(t) ≤ 2^t
• m_F(t) ≤ |F| (this condition gives immediately the simple upper bound)
• if m_F(t) = 2^t then for every j < t, m_F(j) = 2^j
THEOREM:
Prob{(x1, …, xt) : ∃g (g ε-bad, g(x1) = f(x1), …, g(xt) = f(xt))} ≤ 2·m_F(2t)·e^{−εt/2}
A general upper bound
25
[Figure: graph of the growth function m_F(t) — the curve follows 2^t up to t = d, then bends and stays below |F| ≥ m_F(∞), growing only polynomially]
DEFINITION: d = VCdim(F) = max{ t : m_F(t) = 2^t }
FUNDAMENTAL PROPERTY:
  m_F(t) = 2^t                                 for t ≤ d
  m_F(t) ≤ Σ_{k=0}^{d} C(t, k) ≤ t^d + 1       for t > d   (BOUNDED BY A POLYNOMIAL IN t)
Graph of the growth function
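A quick numerical illustration of the fundamental property (names and the chosen values of d and t are illustrative):

```python
from math import comb

def growth_bound(t, d):
    """Upper bound on m_F(t) for a class of VC dimension d:
    2^t for t <= d, sum_{k=0}^{d} C(t, k) otherwise."""
    if t <= d:
        return 2 ** t
    return sum(comb(t, k) for k in range(d + 1))

d = 3
for t in (3, 5, 10, 20):
    print(t, growth_bound(t, d), 2 ** t)
# For t > d the bound grows like t^d (polynomial), while 2^t grows exponentially.
```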
26
THEOREM: If d_n = VCdim(F_n),
then t(n, 1/ε, 1/δ) ≤ max( (4/ε)·log2(2/δ), (8·d_n/ε)·log2(13/ε) ).
PROOF: Impose 2·m_{F_n}(2t)·e^{−εt/2} ≤ δ.
A lower bound on t(n, 1/ε, 1/δ): number of examples which are necessary for arbitrary algorithms.
THEOREM: For 0 ≤ ε ≤ 1 and δ ≤ 1/100,
t(n, 1/ε, 1/δ) ≥ max( ((1−ε)/ε)·ln(1/δ), (d_n − 1)/(32·ε) ).
Upper and lower bounds
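The two bounds can be packaged as small helpers; a sketch assuming, as in Blumer et al., base-2 logarithms in the upper bound and the natural logarithm in the lower bound:

```python
import math

def sufficient_examples(d, eps, delta):
    """Upper bound of the theorem: max(4/eps * log2(2/delta),
    8*d/eps * log2(13/eps)) examples suffice for a consistent learner."""
    return max((4 / eps) * math.log2(2 / delta),
               (8 * d / eps) * math.log2(13 / eps))

def necessary_examples(d, eps, delta):
    """Lower bound of the theorem: max((1-eps)/eps * ln(1/delta),
    (d-1)/(32*eps)) examples are necessary for any learner."""
    return max(((1 - eps) / eps) * math.log(1 / delta),
               (d - 1) / (32 * eps))
```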
27
Π_F(S) = |{ f⁻¹(1) ∩ {x1, …, xt} : f ∈ F }|
i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F.
If Π_F(S) = 2^|S| we say that S is shattered by F.
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F.
An equivalent definition of VCdim
28
Learn the family F of circles contained in the square.
VCdim(F) = 3,  ε = 0.01,  δ = 0.001
t(n, 1/ε, 1/δ) ≤ max( 400·log2(2000), 2400·log2(1300) ) ≈ 24000   Sufficient
t(n, 1/ε, 1/δ) ≥ max( 100·ln(1000), (3−1)/(32·0.01) ) ≈ 690       Necessary
Example 1
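Using the helper functions sketched after the previous theorem, the figures on this slide can be reproduced up to rounding (VCdim = 3 and the values ε = 0.01, δ = 0.001 are as reconstructed above):

```python
# Circles contained in the square: VCdim = 3.
print(sufficient_examples(3, 0.01, 0.001))  # about 24800, the "24000" of the slide
print(necessary_examples(3, 0.01, 0.001))   # about 684, the "690" of the slide
```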
29
Learn the family L_n of linearly separable boolean functions in n variables:
f ∈ L_n ⟺ ∃ w1, …, wn, λ such that f(x1, …, xn) = HS( Σ_{k=1..n} wk·xk − λ ),
where HS(x) = 1 if x ≥ 0, 0 otherwise.
VCdim(L_n) = n + 1,   |L_n| ≤ 2^(n²)
SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)·( n²·ln 2 + ln(1/δ) )
UPPER BOUND USING VCdim(L_n): t(n, 1/ε, 1/δ) ≤ max( (4/ε)·log2(2/δ), (8(n+1)/ε)·log2(13/ε) )   GROWS LINEARLY WITH n
Example 2
30
Consider the class L2 of linearly separable functions in two variables
VCdim(L_n) = n + 1
VCdim(L_2) = 3:
  VCdim(L_2) ≥ 3 — three points in general position can be shattered
  VCdim(L_2) < 4 — no set of four points can be shattered:
[Figure] The green point cannot be separated from the other three.
[Figure] No straight line can separate the green from the red points.
Example 2
31
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj (the mi are monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj (the ci are clauses)
k-DNF: at most k literals in each monomial
k-term-DNF: at most k monomials
k-CNF: at most k literals in each clause
k-clause-CNF: at most k clauses
Monotone formulas: contain no negated literals
m-formulas: each variable appears at most once
32
Th. (Valiant): Monomials are learnable from positive examples with 2ε⁻¹·(n + log ε⁻¹) examples (ε = tolerated error), setting
g ≡ Π_i πi, where πi = xi if xi = 1 in all the examples, and πi = x̄i if xi = 0 in all the examples.
N.B. Learnability is non-monotone: from A ⊆ B and B learnable it does not follow that A is learnable.
Algorithm:
  H ← x1·x̄1·x2·x̄2·…·xn·x̄n
  for i ← 1 to B do
  begin
    generate a positive example e
    for j ← 1 to n do
      if e(j) = 1 then delete x̄j from H
      else delete xj from H
  end
Th.: Monomials are not learnable from negative examples.
The results
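A runnable version of the algorithm above; the encoding of literals as (index, value) pairs and the function name are illustrative choices.

```python
def learn_monomial(positive_examples, n):
    """Valiant-style learning of a monomial (conjunction of literals) from
    positive examples only: start from the conjunction of all 2n literals and
    delete every literal contradicted by some positive example.
    A literal is encoded as (i, 1) for x_i and (i, 0) for not-x_i."""
    hypothesis = {(i, 1) for i in range(n)} | {(i, 0) for i in range(n)}
    for example in positive_examples:        # each example is a 0/1 vector of length n
        for i, bit in enumerate(example):
            hypothesis.discard((i, 1 - bit))  # drop the literal falsified by this example
    return hypothesis

# Target: x0 AND not-x2 over n = 3 variables; two positive examples.
pos = [(1, 0, 0), (1, 1, 0)]
print(sorted(learn_monomial(pos, 3)))  # [(0, 1), (2, 0)], i.e. x0 AND not-x2
```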
33
1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from positive and negative examples, for every k
3) the class of k-decision lists is learnable
k-DL ≡ ((m1, b1), …, (mj, bj)), with the mi monomials of at most k literals and bi ∈ {0,1};
for a boolean vector v, C(v) = b_i where i = min{ i : mi(v) = 1 } (C(v) = 0 if no such i exists)
Th.: Every k-DNF (or k-CNF) formula can be represented by a small k-DL.
Positive results
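Evaluating a k-decision list is a simple first-match rule; a minimal sketch (the literal encoding mirrors the monomial example above and is an assumption for illustration):

```python
def eval_k_dl(decision_list, x):
    """Evaluate a k-decision list ((m1,b1), ..., (mj,bj)) on the boolean vector x:
    return b_i for the first monomial m_i satisfied by x, and 0 if none is."""
    for monomial, b in decision_list:
        if all(x[i] == v for i, v in monomial):   # monomial = set of literals (i, v)
            return b
    return 0

# Example 2-DL: (x0 AND x1 -> 1), (not-x2 -> 0), (empty monomial -> 1)
dl = [({(0, 1), (1, 1)}, 1), ({(2, 0)}, 0), (set(), 1)]
print(eval_k_dl(dl, (1, 1, 0)))  # 1
print(eval_k_dl(dl, (0, 1, 0)))  # 0
print(eval_k_dl(dl, (0, 0, 1)))  # 1
```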
34
If RP ≠ NP (in the distribution-free sense):
1) m-formulas are not learnable
2) threshold boolean functions are not learnable
3) for k ≥ 2, k-term-DNF formulas are not learnable
Negative results
35
Mistake bound model
So far: how many examples are needed to learn? What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
- instances are drawn at random from X according to distribution D,
- the learner must classify each instance before receiving the correct classification from the teacher.
Can we bound the number of mistakes the learner makes before converging?
36
Mistake bound model
The learner:
- receives a sequence of training examples x,
- predicts the target value f(x),
- receives the correct target value from the trainer,
- is evaluated by the total number of mistakes it makes before converging to the correct hypothesis.
I.e. learning takes place during the use of the system, not off-line. Example: prediction of fraudulent use of credit cards.
37
Mistake bound for Find-S
Consider Find-S when H = conjunctions of boolean literals.
FIND-S:
- Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
- For each positive training instance x: remove from h any literal not satisfied by x
- Output h
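A runnable sketch of FIND-S run online, counting mistakes as discussed on the next slide (literals are encoded as (index, value) pairs; names are illustrative assumptions):

```python
def find_s_online(stream, n):
    """FIND-S run online over a stream of (x, c(x)) pairs, counting mistakes.
    The hypothesis h is a set of literals (i, v) meaning x_i must equal v;
    initially it contains all 2n literals (the most specific hypothesis)."""
    h = {(i, 1) for i in range(n)} | {(i, 0) for i in range(n)}
    mistakes = 0
    for x, label in stream:
        prediction = int(all(x[i] == v for i, v in h))
        if prediction != label:
            mistakes += 1
        if label == 1:
            # generalize: drop every literal not satisfied by the positive example
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h, mistakes
```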
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H? (Only positive examples can be misclassified.)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
Each subsequent error eliminates at least one literal ⇒ #mistakes ≤ n + 1 (worst case: the "total" concept, ∀x c(x) = 1).
39
Mistake bound for Halving
A version space is maintained and refined (e.g. by Candidate-Elimination).
Prediction is based on a majority vote among the hypotheses in the current version space.
"Wrong" hypotheses are removed (even if x is correctly classified).
How many errors to exactly learn c ∈ H (H finite)?
A mistake occurs when the majority of hypotheses misclassifies x; these hypotheses are removed.
For each mistake, the version space is at least halved.
At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining).
Note: learning without mistakes is possible!
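A brute-force sketch of the Halving algorithm over an explicitly enumerated finite hypothesis class (keeping the version space as a plain list is an illustrative simplification):

```python
def halving(stream, hypotheses):
    """Halving algorithm: hypotheses is a list of functions h(x) -> 0/1.
    Predict by majority vote of the current version space, then remove every
    hypothesis inconsistent with the revealed label.
    Returns the final version space and the number of mistakes."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = int(votes * 2 > len(version_space))   # majority vote
        if prediction != label:
            mistakes += 1                # on a mistake the version space at least halves
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes
```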
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
MA(c) = maximum number of mistakes made by A to exactly learn c, over all possible training sequences
MA(C) = max_{c∈C} MA(c)
Note: MFind-S(C) = n + 1, MHalving(C) ≤ log2(|C|)
Opt(C) = min_A MA(C)
i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm.
41
Optimal mistake bound
Theorem (Littlestone, 1987):
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|).
There exist concept classes for which VC(C) < Opt(C) < MHalving(C).
42
Weighted majority algorithm
Generalizes Halving. Makes predictions by taking a weighted vote among a pool of prediction algorithms.
Learns by altering the weight associated with each prediction algorithm.
It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, and so it is able to accommodate inconsistent training data.
43
Weighted majority algorithm
∀i: wi ← 1
For each training example (x, c(x)):
  q0 ← 0, q1 ← 0
  For each prediction algorithm ai:
    if ai(x) = 0 then q0 ← q0 + wi
    if ai(x) = 1 then q1 ← q1 + wi
  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai do: if ai(x) ≠ c(x) then wi ← β·wi   (0 ≤ β < 1)
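A runnable sketch of the procedure above; the pool of predictors is passed as plain functions and β defaults to 1/2 (illustrative choices):

```python
import random

def weighted_majority(stream, predictors, beta=0.5):
    """Weighted Majority: predictors is a list of functions a_i(x) -> 0/1.
    Predict by weighted vote, then multiply by beta the weight of every
    predictor that errs on the revealed label (beta = 0 recovers Halving)."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for w, a in zip(weights, predictors):
            q[a(x)] += w
        prediction = 1 if q[1] > q[0] else 0 if q[0] > q[1] else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for w, a in zip(weights, predictors)]
    return weights, mistakes
```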
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0.
Theorem: Let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A over D, and β = 1/2. Then WM makes at most
2.4·(k + log2 n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k.
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)·W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.
The final total weight W is at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n·(3/4)^M
Taking logarithms (base 2): −k ≤ log2 n + M·log2(3/4), from which
M ≤ (k + log2 n) / (−log2(3/4)) ≈ 2.4·(k + log2 n)
I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
5
Probably approximately correct learning
formal computational model which want shed
light on the limits of what can be
learned by a machine analysing the
computational cost of learning algorithms
6
What we want to learn
That is
to determine uniformly good approximations of an unknown function from its value in some sample points
interpolation pattern matching concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
6
What we want to learn
That is
to determine uniformly good approximations of an unknown function from its value in some sample points
interpolation pattern matching concept learning
CONCEPT = recognizing algorithm
LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
7
Whatrsquos new in pac learning
Accuracy of results and running time for learning
algorithms
are explicitly quantified and related
A general problem
use of resources (time spacehellip) by computations COMPLEXITY THEORY
Example
Sorting nlogn time (polynomial feasible)
Bool satisfiability 2ⁿ time (exponential intractable)
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
8
Learning from examples
DOMAIN
ConceptLEARNER
EXAMPLES
A REPRESENTATION OF A CONCEPTCONCEPT subset of domain
EXAMPLES elements of concept (positive)
REPRESENTATION domainrarrexpressions GOOD LEARNER
EFFICIENT LEARNER
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC ⇒ STATISTICAL PAC
DEF 2: a concept class F = ∪_{n≥1} Fn is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ.
17
{f : {0,1}ⁿ → {0,1}}: the set of boolean functions in n variables.
Fn: a class of concepts (a subset of the boolean functions in n variables).
Example 1: Fn = clauses with literals in {x1, ¬x1, ..., xn, ¬xn}, e.g. x3 ∨ ¬x2 ∨ x1 ∨ ... ∨ xk
Example 2: Fn = linearly separable functions in n variables, f(x) = HS(Σ_k wk·xk - λ)
REPRESENTATION:
- TRUTH TABLE (explicit)
- BOOLEAN CIRCUITS (implicit): each circuit computes a boolean function
Learning boolean functions
18
• BASIC OPERATIONS: ∧, ∨, ¬
• COMPOSITION: a function f in m variables composed with g1, ..., gm in n variables: [f(g1, ..., gm)](x) = f(g1(x), ..., gm(x))
CIRCUIT: finite acyclic directed graph with input nodes (the variables x1, ..., xn), internal nodes labelled by the basic operations, and one output node. Given an assignment x1, ..., xn ∈ {0,1} to the input variables, the output node computes the corresponding value.
[Diagram: a small circuit over x1, x2, x3 with ∨ and ∧ gates feeding the output node]
Boolean functions and circuits
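As a small illustration (not from the slides), a circuit can be encoded as nested tuples and evaluated recursively; the encoding below is an assumption made only for the example.

```python
from typing import List, Tuple, Union

# A node is either an input index, or a (gate, children) pair, gate in {"AND", "OR", "NOT"}.
Node = Union[int, Tuple[str, list]]

def eval_circuit(node: Node, x: List[int]) -> int:
    """Evaluate an acyclic boolean circuit built from AND/OR/NOT on input x."""
    if isinstance(node, int):                 # input node x_i
        return x[node]
    gate, children = node
    vals = [eval_circuit(c, x) for c in children]
    if gate == "AND":
        return int(all(vals))
    if gate == "OR":
        return int(any(vals))
    if gate == "NOT":
        return 1 - vals[0]
    raise ValueError(f"unknown gate {gate}")

# (x1 AND NOT x2) OR x3, with variables indexed from 0:
circuit = ("OR", [("AND", [0, ("NOT", [1])]), 2])
print(eval_circuit(circuit, [1, 0, 0]))       # -> 1
```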
19
Fn: a class of boolean functions in n variables; Cn: class of circuits which compute all and only the functions in Fn.
F = ∪_{n≥1} Fn, C = ∪_{n≥1} Cn
Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes c = An(S)
• Output: c (c = representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on {0,1}ⁿ, and outputs An(S) = A(n, S).
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪_{n≥1} Fn, using the class of representations C = ∪_{m≥1} Cm, if for all n ≥ 1, for all f ∈ Fn, for all 0 < ε, δ < 1 and for every probability distribution p over {0,1}ⁿ the following holds:
if the inference procedure An receives as input a t-sample, it outputs a representation c ∈ Cn of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the inferred function g satisfies P{x : f(x) ≠ g(x)} ≤ ε.
g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f.
NOTE: distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF: an inference procedure An for the class Fn is consistent if, given the target function f ∈ Fn, for every t-sample S = (⟨x1,b1⟩, ..., ⟨xt,bt⟩), An(S) is a representation of a function g "consistent" with S, i.e. g(x1) = b1, ..., g(xt) = bt.
DEF: a learning algorithm A is consistent if its inference procedure is consistent.
PROBLEM: estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ).
Upper bounds will be given for consistent algorithms.
Lower bounds will be given for arbitrary algorithms
22
THEOREM: t(n, 1/ε, 1/δ) ≤ (1/ε)·(ln|Fn| + ln(1/δ))
PROOF:
Prob{(x1, ..., xt) : ∃g ε-bad with g(x1) = f(x1), ..., g(xt) = f(xt)} ≤
≤ Σ_{g ε-bad} Prob{g(x1) = f(x1), ..., g(xt) = f(xt)}        (P(A∪B) ≤ P(A) + P(B))
≤ Σ_{g ε-bad} Π_{i=1,...,t} Prob{g(xi) = f(xi)}              (independent events)
≤ Σ_{g ε-bad} (1-ε)^t ≤ |Fn|·(1-ε)^t ≤ |Fn|·e^{-εt}          (g is ε-bad)
Impose |Fn|·e^{-εt} ≤ δ and solve for t.
NOTE: |Fn| must be finite.
A simple upper bound
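A quick numeric companion to this bound; the class cardinality and the values of ε and δ below are made-up, and natural logarithms are used, matching the e^{-εt} step.

```python
from math import ceil, log

def sample_size_finite_class(card_Fn: int, eps: float, delta: float) -> int:
    """Sample size sufficient for any consistent learner over a finite class:
    t >= (1/eps) * (ln|F_n| + ln(1/delta))."""
    return ceil((log(card_Fn) + log(1.0 / delta)) / eps)

# Hypothetical: monomials over n = 20 variables (|F_n| <= 3^20, since each
# variable is positive, negated, or absent), eps = delta = 0.05
print(sample_size_finite_class(3 ** 20, 0.05, 0.05))   # ~ 500 examples
```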
23
X domain, F ⊆ 2^X class of concepts, S = (x1, ..., xt) a t-sample.
f ~_S g iff f(xi) = g(xi) ∀xi ∈ S (f and g are indistinguishable on S).
π_F(S) = |F / ~_S|: index of F with respect to S.
Problem: uniform convergence of relative frequencies to their probabilities.
Vapnik-Chervonenkis approach (1971)
m_F(t) = max{π_F(S) : S is a t-sample}: the growth function.
24
FACT:
m_F(t) ≤ 2^t
m_F(t) ≤ |F| (this condition gives immediately the simple upper bound)
if m_F(t) = 2^t then m_F(j) = 2^j for every j < t
THEOREM: Prob{(x1, ..., xt) : ∃g ε-bad with g(x1) = f(x1), ..., g(xt) = f(xt)} ≤ 2·m_F(2t)·e^{-εt/2}
A general upper bound
25
DEFINITION and FUNDAMENTAL PROPERTY of the growth function:
m_F(t) = 2^t for t ≤ d, and m_F(t) ≤ Σ_{k=0,...,d} C(t, k) for t > d,
i.e. beyond d the growth function is bounded by a polynomial in t of degree d.
[Plot: m_F(t) versus t, growing as 2^t up to t = d and polynomially afterwards, levelling off at m_F(∞) ≤ |F|]
Graph of the growth function
d = VCdim(F) = max{t : m_F(t) = 2^t}
26
THEOREM: if dn = VCdim(Fn), then t(n, 1/ε, 1/δ) ≤ max{(4/ε)·log(2/δ), (8dn/ε)·log(13/ε)}.
PROOF: impose 2·m_Fn(2t)·e^{-εt/2} ≤ δ.
A lower bound on t(n, 1/ε, 1/δ): number of examples which are necessary for arbitrary algorithms.
THEOREM: for 0 ≤ ε ≤ 1 and δ ≤ 1/100,
t(n, 1/ε, 1/δ) ≥ max{((1-ε)/ε)·ln(1/δ), (dn-1)/(32ε)}
Upper and lower bounds
27
π_F(S) = |{f⁻¹(1) ∩ {x1, ..., xt} : f ∈ F}|, i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F.
If π_F(S) = 2^|S|, we say that S is shattered by F.
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F.
An equivalent definition of VCdim
28
VCdim(F) = 3, ε = 0.01, δ = 0.001
Sufficient: t ≤ max{(4/ε)·log(2/δ), (8·3/ε)·log(13/ε)} = max{400·log(2000), 2400·log(1300)} ≈ 24000
Necessary: t ≥ max{((1-ε)/ε)·ln(1/δ), (3-1)/(32ε)} ≈ 690
Learn the family F of circles contained in the square.
Example 1
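The two bounds of the previous slide can be evaluated mechanically; the sketch below reproduces the order of magnitude of the numbers quoted for the circles example (VCdim = 3, ε = 0.01, δ = 0.001). The slide does not state the base of its logarithms; base 2 is assumed here.

```python
from math import ceil, log, log2

def vc_sufficient(d: int, eps: float, delta: float) -> int:
    """Sufficient sample size: max{(4/eps)*log2(2/delta), (8d/eps)*log2(13/eps)}."""
    return ceil(max(4 / eps * log2(2 / delta), 8 * d / eps * log2(13 / eps)))

def vc_necessary(d: int, eps: float, delta: float) -> int:
    """Necessary sample size: max{((1-eps)/eps)*ln(1/delta), (d-1)/(32*eps)}."""
    return ceil(max((1 - eps) / eps * log(1 / delta), (d - 1) / (32 * eps)))

print(vc_sufficient(3, 0.01, 0.001))   # ~ 25000 (the slide rounds to 24000)
print(vc_necessary(3, 0.01, 0.001))    # ~ 690
```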
29
f ∈ Ln ⟺ ∃ w1, ..., wn, λ such that f(x1, ..., xn) = HS(Σ_k wk·xk - λ), where HS(x) = 1 if x ≥ 0 and 0 otherwise.
VCdim(Ln) = n + 1; |Ln| ≤ 2^(n²)
SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)·(n² + ln(1/δ))
UPPER BOUND USING VCdim(Ln): t(n, 1/ε, 1/δ) ≤ max{(4/ε)·log(2/δ), (8(n+1)/ε)·log(13/ε)}, which GROWS LINEARLY WITH n
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
VCdim(L2) ≥ 3 and VCdim(L2) < 4, hence VCdim(L2) = 3 (= n + 1).
The green point cannot be separated from the other three.
No straight line can separate the green from the red points.
Example 2
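The shattering argument can be checked by brute force for tiny point sets; the crude grid search over (w1, w2, θ) below is only an illustration of the definition, under assumed toy coordinates, not a real separator-finding routine.

```python
from itertools import product

def separable(points, labels, grid=None):
    """True if some half-plane w1*x + w2*y >= theta realizes the given labeling.
    Crude grid search over weights and threshold; fine for tiny examples only."""
    grid = grid or [i / 4 for i in range(-8, 9)]
    for w1, w2, theta in product(grid, grid, grid):
        if all((w1 * x + w2 * y >= theta) == bool(lab)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

def shattered(points):
    """True if every {0,1}-labeling of the points is linearly separable."""
    return all(separable(points, labs) for labs in product([0, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))           # True: 3 points are shattered
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # False: the XOR labeling fails
```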
31
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ ... ∧ xk
DNF: m1 ∨ m2 ∨ ... ∨ mj (the mj are monomials)
Clauses: x1 ∨ x2 ∨ ... ∨ xk
CNF: c1 ∧ c2 ∧ ... ∧ cj (the cj are clauses)
k-DNF: at most k literals in each monomial
k-term-DNF: at most k monomials
k-CNF: at most k literals in each clause
k-clause-CNF: at most k clauses
Monotone formulas: contain no negated literals
m-formulas: each variable appears at most once
32
Th. (Valiant): monomials are learnable from positive examples only, with 2(n + log …) examples (… being the tolerated error), keeping a literal in the hypothesis only if it is satisfied in all the positive examples seen.
N.B.: learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable.
The elimination algorithm:
  H := x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ ... ∧ xn ∧ ¬xn
  for i := 1 to B do
  begin
    es := generate a positive example
    for j := 1 to n do
      if es(j) = 1 then delete ¬xj from H
      else delete xj from H
  end
Th.: monomials are not learnable from negative examples only.
The results
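A runnable version of the elimination procedure sketched above; the encoding of literals as (index, value) pairs and the example data are assumptions made for this illustration.

```python
def learn_monomial(positive_examples, n):
    """Learn a monomial from positive examples by elimination: start with all
    2n literals and drop every literal contradicted by some positive example.
    Literal (j, 1) stands for x_j, (j, 0) for NOT x_j."""
    H = {(j, b) for j in range(n) for b in (0, 1)}
    for ex in positive_examples:
        for j in range(n):
            H.discard((j, 1 - ex[j]))     # the literal falsified by ex
    return H

def monomial(H, x):
    """Evaluate the conjunction of the literals in H on assignment x."""
    return int(all(x[j] == b for (j, b) in H))

# Hypothetical target x0 AND NOT x2 over n = 4 variables:
pos = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1)]
H = learn_monomial(pos, 4)
print(sorted(H))                  # [(0, 1), (2, 0)]  i.e. x0 AND NOT x2
print(monomial(H, (1, 0, 0, 0)))  # 1
```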
33
For every k:
1) k-CNF formulas are learnable from positive examples only
1b) k-DNF formulas are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from positive and negative examples
3) the class of k-decision lists is learnable
k-DL ≡ ((m1, b1), ..., (mj, bj)), where the mi are monomials with at most k literals and bi ∈ {0,1}; for a boolean vector v, C(v) = bi for the smallest i with mi(v) = 1 (C(v) = 0 if no such i exists).
Th.: every k-DNF (or k-CNF) formula can be represented by a small k-DL.
Positive results
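To make the k-DL definition concrete, here is a minimal evaluator; the list, its encoding and the sample inputs are hypothetical.

```python
def eval_k_dl(dl, x):
    """Evaluate a decision list ((m1, b1), ..., (mj, bj)) on assignment x:
    return b_i for the first monomial m_i satisfied by x, and 0 if none is.
    Monomials are sets of literals (index, value)."""
    for monomial, b in dl:
        if all(x[j] == v for (j, v) in monomial):
            return b
    return 0

# Hypothetical 2-decision list over 3 variables:
# (x0 AND NOT x1 -> 1), (x2 -> 0), default 0
dl = [({(0, 1), (1, 0)}, 1), ({(2, 1)}, 0)]
print(eval_k_dl(dl, (1, 0, 1)))   # first rule fires -> 1
print(eval_k_dl(dl, (0, 0, 1)))   # second rule fires -> 0
```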
34
(if RP ≠ NP, in the distribution-free sense)
1) m-formulas are not learnable
2) threshold boolean functions are not learnable
3) for k ≥ 2, k-term-DNF formulas are not learnable
Negative results
35
Mistake bound model
So far: how many examples are needed to learn? What about how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
instances are drawn at random from X according to a distribution D;
the learner must classify each instance before receiving the correct classification from the teacher.
Can we bound the number of mistakes the learner makes before converging?
36
Mistake bound model
The learner: receives a sequence of training examples x; predicts the target value f(x); receives the correct target value from the trainer; is evaluated by the total number of mistakes it makes before converging to the correct hypothesis.
I.e., learning takes place during the use of the system, not off-line. Ex.: prediction of fraudulent use of credit cards.
37
Mistake bound for Find-S
Consider Find-S when H = conjunctions of boolean literals.
FIND-S:
Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ ... ∧ xn ∧ ¬xn
For each positive training instance x: remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H? (Only positive examples can be misclassified.)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
Each subsequent error eliminates at least one literal, so the number of mistakes is ≤ n+1 (worst case: the "total" concept ∀x, c(x) = 1).
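An online rendering of Find-S that also counts mistakes; with noise-free data labeled by a monomial it should exhibit the ≤ n + 1 bound. The data-stream format and the toy target are assumptions of this sketch.

```python
def find_s_online(stream, n):
    """Online FIND-S over conjunctions of boolean literals, with mistake count.
    Starts from the most specific hypothesis (all 2n literals) and generalizes
    on positive examples; only positive examples can be misclassified."""
    h = {(j, b) for j in range(n) for b in (0, 1)}   # x1 & ~x1 & ... & xn & ~xn
    mistakes = 0
    for x, label in stream:
        pred = int(all(x[j] == b for (j, b) in h))
        if pred != label:
            mistakes += 1
        if label == 1:                               # generalize h
            h = {(j, b) for (j, b) in h if x[j] == b}
    return h, mistakes

# Hypothetical stream labeled by the target x0 AND NOT x2 (n = 4):
target = lambda x: int(x[0] == 1 and x[2] == 0)
xs = [(1, 0, 0, 1), (0, 1, 1, 0), (1, 1, 0, 0), (1, 1, 0, 1)]
h, m = find_s_online([(x, target(x)) for x in xs], 4)
print(h, m)    # the mistake count stays <= n + 1
```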
39
Mistake bound for Halving
A version space is maintained and refined (e.g., Candidate-elimination).
Prediction is based on a majority vote among the hypotheses in the current version space.
"Wrong" hypotheses are removed (even if x is correctly classified).
How many errors to exactly learn c ∈ H (H finite)?
Mistake when the majority of the hypotheses misclassifies x: these hypotheses are removed, so for each mistake the version space is at least halved.
At most log2(|H|) mistakes before exact learning (e.g., a single hypothesis remaining).
Note: learning without mistakes is possible.
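A direct sketch of the Halving algorithm on an explicit finite hypothesis class; the toy class (threshold functions on the number of ones) and the stream are assumptions made for the example.

```python
def halving(hypotheses, stream):
    """Halving: predict by majority vote over the current version space and
    remove every hypothesis that misclassifies x (even when the vote was right).
    Mistakes are at most log2 |H| when the target is in H."""
    vs = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        pred = int(2 * votes > len(vs))            # majority vote (ties -> 0)
        if pred != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]      # shrink the version space
    return vs, mistakes

# Toy class: f_r(x) = 1 iff x has at least r ones (n = 4, r = 0..4)
H = [lambda x, r=r: int(sum(x) >= r) for r in range(5)]
target = H[2]
xs = [(1, 1, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0), (0, 0, 0, 0)]
print(halving(H, [(x, target(x)) for x in xs])[1], "mistakes")
```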
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
M_A(c) = the maximum number of mistakes made by A to exactly learn c, over all possible training sequences
M_A(C) = max_{c∈C} M_A(c)
Note: M_Find-S(C) = n+1, M_Halving(C) ≤ log2(|C|).
Opt(C) = min_A M_A(C),
i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm.
41
Optimal mistake bound
Theorem (Littlestone, 1987): VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|).
There exist concept classes for which VC(C) = Opt(C) = M_Halving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|).
There exist concept classes for which VC(C) < Opt(C) < M_Halving(C).
42
Weighted majority algorithm
Generalizes Halving: makes predictions by taking a weighted vote among a pool of prediction algorithms.
Learns by altering the weight associated with each prediction algorithm.
It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data.
43
Weighted majority algorithm
∀i: wi := 1
For each training example (x, c(x)):
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x) = 0 then q0 := q0 + wi
    if ai(x) = 1 then q1 := q1 + wi
  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai do: if ai(x) ≠ c(x) then wi := β·wi (0 ≤ β < 1)
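A compact version of the pseudocode above; the pool of predictors, the stream and β = 1/2 are assumed inputs of this sketch. With β = 1/2 the returned mistake count should respect the 2.4(k + log2 n) bound stated on the next slide.

```python
import random

def weighted_majority(pool, stream, beta=0.5):
    """Weighted Majority: weighted vote over a pool of prediction algorithms;
    every algorithm that errs has its weight multiplied by beta (0 <= beta < 1).
    beta = 0 recovers the Halving algorithm."""
    w = [1.0] * len(pool)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for a, wi in zip(pool, w) if a(x) == 0)
        q1 = sum(wi for a, wi in zip(pool, w) if a(x) == 1)
        pred = 1 if q1 > q0 else 0 if q1 < q0 else random.choice((0, 1))
        if pred != label:
            mistakes += 1
        for i, a in enumerate(pool):
            if a(x) != label:
                w[i] *= beta
    return w, mistakes
```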
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0.
Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, and β = 1/2. Then WM makes at most 2.4(k + log2 n) mistakes over D.
45
Weighted majority algorithm (WM)
Proof: since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k. The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.
The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D.
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence (1/2)^k ≤ n(3/4)^M, from which
M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4(k + log2 n).
I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically with the size of the pool.
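The constant 2.4 quoted above is just 1/(-log2(3/4)) ≈ 2.41, as a one-line check shows; the values of k and n in the example are arbitrary.

```python
from math import log2

print(1 / -log2(3 / 4))                     # 2.409..., the constant in the bound
bound = lambda k, n: (k + log2(n)) / -log2(3 / 4)
print(bound(10, 16))                        # e.g. k = 10 mistakes, pool of n = 16
```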
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
9
The PAC model
A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X
A probability distribution P on X
Example 1
X equiv a square F equiv triangles in the square
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
10
The PAC model
Example 2
Xequiv01ⁿ F equiv family of boolean functions
1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =
0 otherwise
P a probability distribution on X
Uniform Non uniform
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
11
The PAC model
The learning process
Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))
Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)
Error probability Perr(h(x)nef(x) xX)
12
LEARNERExamples generatorwith probabilitydistribution p
Inference procedure A
t examples
Hypothesis h (implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c
TEACHER
The PAC model
X fF X F
(x1f(x1)) hellip (xtf(xt)))
13
f h
x random choice
Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le
ldquoALMOST ALWAYSrdquo
Confidence parameter
(0 lt le 1)
The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-
ldquoCLOSE TOrdquo
METRIC given P
dp(fh) = Perr = Px f(x)neh(x)
The PAC model
14
Generator ofexamples
Learner h
F concept classS set of labeled samples from a concept in F A S F such that
I) A(S) consistent with S
II) P(Perrlt ) gt 1-
0ltlt1 fF mN S st |S|gem
Learning algorithm
15
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)
DEF 1 a concept class F = n=1F n is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1
Look for algorithms which use ldquoreasonablerdquo amount of computational resources
The efficiency issue
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
16
The efficiency issue
POLYNOMIAL PAC STATISTICAL PAC
DEF 2 a concept class F = n=1F n is polynomially PAC learnable
if there is a learning algorithm with running time bounded by some polynomial function in n 1 1
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj (mi monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj (ci clauses)
k-DNF: at most k literals in each monomial
k-term-DNF: at most k monomials
k-CNF: at most k literals in each clause
k-clause-CNF: at most k clauses
Monotone formulas: contain no negated literals
m-formulas: each variable appears at most once
32
Th. (Valiant): Monomials are learnable from positive examples only, with 2·(1/ε)·(n + log(1/ε)) examples (ε = tolerated error), keeping the literal xi if xi = 1 in all positive examples and the literal ¬xi if xi = 0 in all positive examples, and taking as hypothesis the conjunction of the surviving literals.
N.B. Learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable.
Algorithm:
H := x1 ¬x1 x2 ¬x2 … xn ¬xn
for i := 1 to B do
begin
  generate a positive example es
  for j := 1 to n do
    if es(j) = 1 then delete ¬xj from H
    else delete xj from H
end
Th.: Monomials are not learnable from negative examples only.
The results
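A minimal Python sketch of this deletion rule (the literal encoding and the target used below are my own illustration, not from the slides):

```python
def learn_monomial(n, positive_examples):
    """Valiant's rule, sketched: start from the conjunction of all 2n literals
    and delete every literal contradicted by some positive example."""
    # hypothesis = set of literals; (i, True) stands for x_i, (i, False) for not-x_i
    h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for ex in positive_examples:                 # ex is a tuple of 0/1 values
        for i, bit in enumerate(ex):
            h.discard((i, False) if bit == 1 else (i, True))
    return h

# Positive examples of the target monomial x1 AND not-x3 (0-based: x0 AND not-x2)
examples = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1)]
print(sorted(learn_monomial(4, examples)))       # [(0, True), (2, False)]
```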
33
1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from positive and negative examples
(all of the above hold for every k)
3) the class of k-decision lists is learnable
k-DL ≡ ((m1, b1), …, (mj, bj)), where the mi are monomials with at most k literals and bi ∈ {0, 1};
for a boolean vector v, C(v) = bi with i = min{i : mi(v) = 1} (C(v) = 0 if no such i exists)
Th.: Every k-DNF (or k-CNF) formula can be represented by a small k-DL
Positive results
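For concreteness, a small sketch (mine, not from the slides) of how a k-decision list classifies a boolean vector, with each monomial given as a list of (index, required value) pairs:

```python
def eval_decision_list(dl, v, default=0):
    """Evaluate a k-decision list ((m1,b1),...,(mj,bj)) on boolean vector v:
    output the bit of the first monomial that is satisfied, else the default."""
    for monomial, bit in dl:
        if all(v[i] == wanted for i, wanted in monomial):
            return bit
    return default

# Illustrative 2-DL: "if x0 and not x2 then 1, elif x1 then 0, else 0"
dl = [([(0, 1), (2, 0)], 1), ([(1, 1)], 0)]
print(eval_decision_list(dl, (1, 1, 0)))   # first rule fires: 1
print(eval_decision_list(dl, (0, 1, 1)))   # second rule fires: 0
```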
34
If RP ≠ NP (in the distribution-free sense):
1) m-formulas are not learnable
2) threshold boolean functions are not learnable
3) for k ≥ 2, k-term-DNF formulas are not learnable
Negative results
35
Mistake bound model
So far: how many examples are needed to learn? What about how many mistakes are made before convergence?
Let's consider a setting similar to PAC learning:
Instances are drawn at random from X according to distribution D
The learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?
36
Mistake bound model
Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis
I.e. learning takes place during the use of the system, not off-line.
Ex.: prediction of fraudulent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
For each positive training instance x: remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H? (only positive examples can be misclassified)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
Each subsequent error eliminates at least one literal, so the number of mistakes is ≤ n + 1 (worst case for the "total" concept: ∀x, c(x) = 1)
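A small simulation of this count (the target concept and the example stream are my own illustration): run Find-S online, predict with the current hypothesis, and count mistakes; only positive examples get misclassified, and the total stays within n + 1.

```python
def find_s_online(n, stream):
    """Find-S run online on (x, label) pairs, counting prediction mistakes."""
    h = {(i, b) for i in range(n) for b in (0, 1)}   # most specific hypothesis
    mistakes = 0
    for x, label in stream:
        pred = int(all(x[i] == b for i, b in h))
        if pred != label:
            mistakes += 1
        if label == 1:                               # generalize only on positives
            h = {(i, b) for i, b in h if x[i] == b}
    return mistakes

# Target concept x1 (0-based x0) over n = 3 variables
stream = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 1, 0), 0), ((1, 1, 0), 1), ((1, 0, 0), 1)]
print(find_s_online(3, stream))   # 3 mistakes, within the bound n + 1 = 4
```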
39
Mistake bound for Halving
A version space is maintained and refined (e.g. Candidate-elimination)
Prediction is based on majority vote among the hypotheses in the current version space
"Wrong" hypotheses are removed (even if x is correctly classified)
How many errors to exactly learn c ∈ H (H finite)?
Mistake: the majority of the hypotheses misclassifies x; these hypotheses are removed, so for each mistake the version space is at least halved
At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining)
Note: learning without mistakes is possible
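A minimal sketch of the voting-and-filtering loop (the threshold hypotheses and the stream below are my own illustration, not from the slides); note that, as the last remark says, it may well finish with zero mistakes:

```python
def halving(hypotheses, stream):
    """Halving: predict by majority vote over the version space, then drop
    every hypothesis that disagrees with the revealed label."""
    vs = list(hypotheses)                    # current version space
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        pred = int(2 * votes > len(vs))      # majority vote (ties predict 0)
        if pred != label:
            mistakes += 1                    # then at least half of vs was wrong
        vs = [h for h in vs if h(x) == label]
    return mistakes, vs

# Thresholds on {0,...,4}; the target concept is "x >= 3"
hyps = [lambda x, r=r: int(x >= r) for r in range(5)]
m, vs = halving(hyps, [(4, 1), (1, 0), (3, 1), (2, 0)])
print(m, len(vs))   # 0 mistakes here; in general at most log2(|H|)
```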
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
MA(c) = maximum number of mistakes made by A to exactly learn c, over all possible training sequences
MA(C) = max over c ∈ C of MA(c)
Note: MFind-S(C) = n + 1, MHalving(C) ≤ log2(|C|)
Opt(C) = min over A of MA(C)
i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone, 1987):
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|),
e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < MHalving(C)
42
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a pool of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data
43
Weighted majority algorithm
∀i: wi := 1
For each training example (x, c(x)):
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x) = 0 then q0 := q0 + wi
    if ai(x) = 1 then q1 := q1 + wi
  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai: if ai(x) ≠ c(x) then wi := β·wi (with 0 ≤ β < 1)
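A compact Python sketch of this update rule (the pool of "experts" in the usage lines is my own illustration; β = 1/2 as in the theorem on the next slide):

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    """Weighted Majority: weighted vote over a pool of predictors; every
    predictor that errs has its weight multiplied by beta (beta = 0 is Halving)."""
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for wi, a in zip(w, predictors):
            q[a(x)] += wi                    # a(x) must be 0 or 1
        pred = 1 if q[1] > q[0] else 0 if q[0] > q[1] else random.randint(0, 1)
        if pred != label:
            mistakes += 1
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, predictors)]
    return mistakes

experts = [lambda x: x % 2, lambda x: 1, lambda x: 0]    # illustrative pool
print(weighted_majority(experts, [(3, 1), (4, 0), (7, 1)]))
```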
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0
Theorem: Let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A over D, and β = 1/2. Then WM makes at most
2.4·(k + log2 n)
mistakes over D.
45
Weighted majority algorithm (WM)
Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k.
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)·W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.
The final total weight W is therefore at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n·(3/4)^M
from which
M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4·(k + log2 n)
I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
17
n = f 0 1n 0 1 The set of boolean functions in n
variables
Fn n A class of conceptsExample 1Fn = clauses with literals in
Example 2Fn = linearly separable functions in n variables
nn xxxx 11
nk xxxxxx ororororor 2123
( ) sum minus λkkXWHS
REPRESENTATION
- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEANCIRCUITS
BOOLEANFUNCTIONS
Learning boolean functions
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
18
bull BASIC OPERATIONSbull COMPOSITION
( )minusorand
in m variables in n variables
CIRCUIT Finite acyclic directed graph
or
Output node
Basic operations
Input nodes
Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value
oror
orand
1X 2X 3X
or
[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))
Boolean functions and circuits
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
19
Fn n
Cn class of circuits which compute all and only the functions in Fn
Uinfin
=
=1n
nFF Uinfin
=
=1n
nCC
Algorithm A to learn F by C bull INPUT (nεδ)
bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample
bull The learner receives the t-sample S and computes C = An(S)
bull Output C (C= representation of the hypothesis)
Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)
Boolean functions and circuits
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
20
An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds
nn FUF 1=infin= mm CUC 1=
infin=
If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies
Px f(x)neg(x) le
g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f
NOTE distribution free
Boolean functions and circuits
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
21
Statistical PAC learning
DEF An inference procedure An for the class F n is consistent if
given the target function fF n for every t-sample
S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function
g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt
DEF A learning algorithm A is consistent if its inference procedure is consistent
PROBLEM Estimate upper and lower bounds on the sample size
t = t(n 1 1)Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Example 2

Consider the class L2 of linearly separable functions in two variables.
VCdim(Ln) = n + 1, so VCdim(L2) = 3: indeed VCdim(L2) ≥ 3 and VCdim(L2) < 4.

[Figure: four points in the plane]
The green point cannot be separated from the other three:
no straight line can separate the green from the red points.
31
Classes of boolean formulas

Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj   (the mi are monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj   (the ci are clauses)
k-DNF: at most k literals in each monomial
k-term-DNF: at most k monomials
k-CNF: at most k literals in each clause
k-clause-CNF: at most k clauses
Monotone formulas: contain no negated literals
m-formulas: each variable appears at most once
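To make these classes concrete, here is a small illustrative Python sketch (not part of the original lecture): a monomial is a set of literals, a literal is an assumed encoding (variable index, required value), and a DNF/CNF is a list of monomials/clauses.

def eval_monomial(mono, x):
    # mono: set of literals (i, s) -- conjunction: every x_i must equal s
    return all(x[i] == s for (i, s) in mono)

def eval_clause(clause, x):
    # clause: set of literals -- disjunction
    return any(x[i] == s for (i, s) in clause)

def eval_dnf(dnf, x):
    # DNF: disjunction of monomials
    return any(eval_monomial(m, x) for m in dnf)

def eval_cnf(cnf, x):
    # CNF: conjunction of clauses
    return all(eval_clause(c, x) for c in cnf)

# a 2-DNF over x0..x2: (x0 AND not x1) OR (x2)
dnf = [{(0, 1), (1, 0)}, {(2, 1)}]
print(eval_dnf(dnf, [1, 0, 0]), eval_dnf(dnf, [0, 0, 0]))  # True False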
32
The results

Th. (Valiant): monomials are learnable from positive examples with (2/ε)·(n + log(1/δ)) examples (ε = tolerated error), by setting
g ≡ Π{xi : xi = 1 in all the examples} · Π{x̄i : xi = 0 in all the examples}

begin
  H := x1·x̄1·x2·x̄2· … ·xn·x̄n
  for i := 1 to B do
  begin
    generate a positive example es
    for j := 1 to n do
      if es(j) = 1 then delete x̄j from H else delete xj from H
  end
end

N.B.: learnability is non-monotone: A ⊆ B with B learnable does not imply that A is learnable.

Th.: monomials are not learnable from negative examples.
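A minimal Python sketch of this deletion strategy (an illustration of the pseudocode above, not Valiant's original formulation): literals are pairs (variable index, sign), and every positive example removes the literals it falsifies. The target x0 ∧ ¬x3 in the usage example is hypothetical.

def learn_monomial(positive_examples, n):
    # start with all 2n literals: (i, 1) means x_i, (i, 0) means not x_i
    h = {(i, s) for i in range(n) for s in (0, 1)}
    for x in positive_examples:           # only positive examples are used
        for i in range(n):
            h.discard((i, 1 - x[i]))      # delete the literal falsified by x
    return h

def predict(h, x):
    return all(x[i] == s for (i, s) in h)

# assumed target: x0 AND not x3, over 4 variables
pos = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
h = learn_monomial(pos, 4)
print(sorted(h))                       # [(0, 1), (3, 0)], i.e. x0 AND not x3
print(predict(h, [1, 0, 0, 0]), predict(h, [1, 1, 1, 1]))  # True False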
33
Positive results

1) k-CNF formulas are learnable from positive examples only
1b) k-DNF formulas are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from positive and negative examples
3) the class of k-decision lists is learnable, ∀k

k-DL ≡ ((m1, b1), …, (mj, bj)), where the mi are monomials with at most k literals and bi ∈ {0, 1};
for a boolean vector v: C(v) = bi with i = min{i : mi(v) = 1}, and C(v) = 0 if no such i exists

Th.: every k-DNF (or k-CNF) formula can be represented by a small k-DL.
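The following Python sketch (illustrative, not from the slides) evaluates a k-decision list represented exactly as above: an ordered list of (monomial, bit) pairs with a default output of 0; literals use the same assumed (index, value) encoding as before.

def eval_k_dl(dl, x, default=0):
    # dl: list of (monomial, bit) pairs; monomial = set of literals (i, s)
    # output the bit of the first monomial satisfied by x, else the default (0)
    for mono, b in dl:
        if all(x[i] == s for (i, s) in mono):
            return b
    return default

# a 1-decision list: if x1 then 0, else if x0 then 1, else 0
dl = [({(1, 1)}, 0), ({(0, 1)}, 1)]
print(eval_k_dl(dl, [1, 0]), eval_k_dl(dl, [1, 1]), eval_k_dl(dl, [0, 0]))  # 1 0 0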
34
Negative results

If RP ≠ NP then, in the distribution-free sense:
1) m-formulas are not learnable
2) Threshold boolean functions are not learnable
3) For k ≥ 2, k-term-DNF formulas are not learnable
35
Mistake bound model
So far: how many examples are needed to learn?
What about: how many mistakes before convergence?
Let's consider a setting similar to PAC learning:
Instances drawn at random from X according to distribution D
The learner must classify each instance before receiving the correct classification from the teacher
Can we bound the number of mistakes the learner makes before converging?
36
Mistake bound model
Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis

I.e., learning takes place during the use of the system, not off-line.
Ex.: prediction of fraudulent use of credit cards.
37
Mistake bound for Find-S
Consider Find-S when H = conjunctions of boolean literals

FIND-S:
Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
For each positive training instance x: remove from h any literal not satisfied by x
Output h
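A runnable Python sketch of Find-S in the on-line setting (an illustration, with literals encoded as assumed (index, value) pairs); negative examples leave h unchanged, and the sketch also counts the mistakes, which by the argument on the next slide are at most n + 1. The target x0 ∧ x1 in the usage example is hypothetical.

def find_s(stream, n):
    # h as a set of literals (i, s); initially all 2n literals (most specific: always predicts 0)
    h = {(i, s) for i in range(n) for s in (0, 1)}
    mistakes = 0
    for x, label in stream:
        pred = 1 if all(x[i] == s for (i, s) in h) else 0
        if pred != label:
            mistakes += 1
        if label == 1:
            for i in range(n):
                h.discard((i, 1 - x[i]))   # remove literals not satisfied by x
    return h, mistakes

# assumed target concept: x0 AND x1, over 3 variables
stream = [([1, 1, 0], 1), ([0, 1, 1], 0), ([1, 1, 1], 1), ([1, 0, 1], 0)]
h, m = find_s(stream, 3)
print(sorted(h), m)   # [(0, 1), (1, 1)] with 2 mistakes (bound: n + 1 = 4)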
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.
How many errors to learn c ∈ H? (only positive examples can be misclassified)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
Each subsequent error eliminates at least one further literal ⟹ #mistakes ≤ n + 1 (worst case: the "total" concept ∀x c(x) = 1)
39
Mistake bound for Halving
A version space is maintained and refined (e.g., by Candidate-elimination)
Prediction is based on a majority vote among the hypotheses in the current version space
"Wrong" hypotheses are removed (even if x is correctly classified)
How many errors to exactly learn c ∈ H (H finite)?
Mistake when the majority of the hypotheses misclassifies x: these hypotheses are removed
For each mistake, the version space is therefore at least halved
At most log2(|H|) mistakes before exact learning (e.g., a single hypothesis remaining)
Note: learning without mistakes is possible!
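A small Python sketch of Halving (illustrative only: the hypothesis class is listed explicitly as truth tables, which is feasible only for tiny domains). Hypotheses inconsistent with the observed label are discarded whether or not the majority prediction was correct, and the number of mistakes stays within log2(|H|).

import math
from itertools import product

def halving(H, stream):
    # H: list of hypotheses, each a dict mapping instances to 0/1 (the version space)
    V = list(H)
    mistakes = 0
    for x, label in stream:
        votes1 = sum(h[x] for h in V)
        pred = 1 if votes1 > len(V) - votes1 else 0   # majority vote (ties -> 0)
        if pred != label:
            mistakes += 1
        V = [h for h in V if h[x] == label]           # keep only consistent hypotheses
    return V, mistakes

# toy class: all 16 boolean functions over the 4 instances of {0,1}^2; target is AND
X = list(product([0, 1], repeat=2))
H = [dict(zip(X, bits)) for bits in product([0, 1], repeat=4)]
target = {x: int(x[0] and x[1]) for x in X}
stream = [(x, target[x]) for x in X]
V, m = halving(H, stream)
print(len(V), m, math.log2(len(H)))   # 1 remaining hypothesis; mistakes <= log2(16) = 4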
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
M_A(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences
M_A(C) = max_{c∈C} M_A(c)
Note: M_Find-S(C) = n + 1, M_Halving(C) ≤ log2(|C|)
Opt(C) = min_A M_A(C)
i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm.
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)
There exist concept classes for which VC(C) = Opt(C) = M_Halving(C) = log2(|C|),
e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < M_Halving(C)
42
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a pool of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data
43
Weighted majority algorithm
∀i: wi := 1
For each training example ⟨x, c(x)⟩:
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x) = 0 then q0 := q0 + wi
    if ai(x) = 1 then q1 := q1 + wi
  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai do: if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
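A compact Python sketch of this procedure (not from the slides; ties are broken deterministically towards 0 rather than at random, and the "prediction algorithms" are plain functions). The three toy experts in the usage example are assumptions for illustration.

def weighted_majority(predictors, stream, beta=0.5):
    # predictors: list of functions x -> {0,1}; beta in [0,1) is the penalty factor
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        preds = [a(x) for a in predictors]
        for wi, p in zip(w, preds):
            q[p] += wi                                  # weighted vote
        pred = 1 if q[1] > q[0] else 0                  # ties broken towards 0 here
        if pred != label:
            mistakes += 1
        w = [wi * beta if p != label else wi for wi, p in zip(w, preds)]
    return w, mistakes

# toy pool over 1-bit inputs; the target is the identity, so the first expert never errs (k = 0)
experts = [lambda x: x, lambda x: 1 - x, lambda x: 0]
stream = [(b, b) for b in [0, 1, 1, 0, 1, 0, 1, 1]]
w, m = weighted_majority(experts, stream)
print(w, m)   # 1 mistake, within the bound 2.4*(0 + log2(3)) ~ 3.8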
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0.
Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, and β = 1/2. Then WM makes at most
2.4·(k + log2 n)
mistakes over D.
45
Weighted majority algorithm (WM)
Proof: since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k.
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)·W, because the "wrong" algorithms hold at least 1/2 of the total weight, and that half is reduced by a factor of 1/2.
The final total weight W is therefore at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n·(3/4)^M
from which
M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4·(k + log2 n)
I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically with the size of the pool.
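A short numeric check (not from the slides): the constant 1/(−log2(3/4)) evaluates to about 2.41, which the slide rounds to 2.4; the values of k and n below are hypothetical.

import math

c = 1 / -math.log2(3 / 4)        # = 1 / log2(4/3)
print(round(c, 2))               # 2.41
k, n = 5, 16                     # hypothetical pool: best expert makes 5 mistakes, 16 predictors
print(c * (k + math.log2(n)))    # explicit bound on the number of mistakes M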
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
22
THEOREM t(n 1 1) le -1ln(F n) +ln(1)
PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le
le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le
Impose F n e-t le
Independent events
g is ε-bad
P(AUB)leP(A)+P(B)
g ε-bad
NOTE - F n must be finite
le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t
le (1-)t le F n(1-)t le F ne-t g ε-bad
A simple upper bound
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
23
X domainF 2X class of conceptsS = (x1 hellip xt) t-sample
f S g iff f(xi) = g(xi) xi S undistinguishable by S
F (S) = (F S) index of F wrt S
Problem uniform convergence of relative frequencies to their probabilities
Vapnik-Chervonenkis approach (1971)
S1 S2
MF (t) = maxF (S) S is a t-sample growth function
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
24
FACT
THEOREM
A general upper bound
Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2
mF (t) le 2t
mF (t) le F (this condition gives immediately
the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
25
d t
)(tmF
F
)(infinFm
t2
DEFINITION
FUNDAMENTAL PROPERTY
=)(tmF1
2
0
minusle⎟⎠⎞⎜⎝
⎛le⎟⎠⎞⎜⎝
⎛le
le
=
sum
Kk
t
tt
t
t
BOUNDED BY APOLYNOMIAL IN t
Graph of the growth function
d = VCdim(F ) = max t mF(t) = 2t
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
26
THEOREMIf dn = VCdim(Fn)
then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF
Impose 2mFn2te-et2 le
A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms
THEOREMFor 0lele1 and le1100
t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)
Upper and lower bounds
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
27
Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F
If F (S) = 2S we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F
An equivalent definition of VCdim
F (S) = (f-1(1)(x1 hellip xt) | fF )
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis
How many errors to learn c ∈ H? (only positive examples can be misclassified)
The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal, so #mistakes ≤ n + 1 (worst case for the "total" concept ∀x c(x) = 1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
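A sketch of the Halving learner over a finite hypothesis class, as described above; the toy threshold class is illustrative:

def halving_learn(hypotheses, labeled_stream):
    # predict by majority vote of the current version space (ties broken as 0),
    # then drop every hypothesis that disagrees with the revealed label
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in labeled_stream:
        votes = sum(h(x) for h in version_space)
        prediction = int(votes * 2 > len(version_space))
        if prediction != y:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == y]
    return version_space, mistakes

# toy class: thresholds "x >= t" for t in 0..7; the target is t = 5
hyps = [lambda x, t=t: int(x >= t) for t in range(8)]
vs, m = halving_learn(hyps, [(6, 1), (2, 0), (4, 0), (5, 1)])
print(len(vs), m)  # 1 1  (the bound guarantees at most log2(8) = 3 mistakes)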
40
Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?
Formally, for any learning algorithm A and any target concept c:
MA(c) = maximum number of mistakes made by A to exactly learn c, over all possible training sequences
MA(C) = max over c ∈ C of MA(c)
Note: MFind-S(C) = n + 1, MHalving(C) ≤ log2(|C|)
Opt(C) = min over A of MA(C)
i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone, 1987):
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < MHalving(C)
42
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a pool of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data
43
Weighted majority algorithm
∀i: wi := 1
For each training example (x, c(x)):
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x) = 0 then q0 := q0 + wi
    if ai(x) = 1 then q1 := q1 + wi
  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai do:
    if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
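A runnable sketch of the procedure above; the pool of predictors, the data stream and β = 1/2 are illustrative choices:

import random

def weighted_majority(predictors, stream, beta=0.5):
    # weighted vote over the pool; multiply the weight of every wrong predictor by beta
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, predictors) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, predictors) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q1 < q0 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, predictors)]
    return w, mistakes

# pool: parity of x, "always 1", "x >= 3"; labels follow parity, so the best predictor makes k = 0 mistakes
pool = [lambda x: x % 2, lambda x: 1, lambda x: int(x >= 3)]
stream = [(x, x % 2) for x in range(10)]
w, m = weighted_majority(pool, stream)
print(m)  # 0, within the theorem's bound 2.4*(0 + log2(3)) ~ 3.8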
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0
Theorem - D: any sequence of training examples; A: any set of n prediction algorithms; k: minimum number of mistakes made by any aj ∈ A on D; β = 1/2. Then WM makes at most
2.4 (k + log2 n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2
The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n·(3/4)^M
Taking log2 of both sides, −k ≤ log2 n + M·log2(3/4), from which
M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4 (k + log2 n)
I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool
28
1300log80032000log400)11(
00100103)(
sdotle
===
MAXnt
FVC DIM
δε
δε
Sufficient 24000
⎭⎬⎫
⎩⎨⎧ sdot
minusge 100
32131000ln100)11( MAXnt δε
690 Necessary
Learn the family f of circles contained in the square
Example 1
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
29
otherwiseXif
XWHSXXfTHATSUCHWWLf
nkkkn
nn
001
)()()(
11
1
ge
minus=
rArrisin
sum=
λ
λ
HS(x)=
22
1)(n
n
nDIM
L
nLVC
le
+=
SIMPLE UPPER BOUND
))1ln((1)11( 2 +le nnt
UPPER BOUND USING
⎭⎬⎫
⎩⎨⎧ sdot
+le
13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n
)( nDIM LVC
Learn the family of linearly separable boolean functions in n variables Ln
Example 2
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
30
Consider the class L2 of linearly separable functions in two variables
3)(3)(
1)(
2
2
ge=
+=
LVCLVC
nLVC
IM
IM
nIM
4)( 2 ltLVC IM
The green point cannot beseparated from the other three
No straight line can separatethe green from the red points
Example 2
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
31
Classi di formule booleane
Monomi x1x2 hellip xk
DNF m1m2 hellip mj (mj monomi)
Clausole x1x2 hellip xk
CNF c1c2 hellip cj (cj clausole)
k-DNF le k letterali nei monomi
k-term-DNF le k monomi
k-CNF le k letterali nelle clausole
k-clause-CNF le k clausole
Formule monotone non contengono letterali negati
m-formule ogni variabile appare al piugrave una volta
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
32
Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix
xxgii 01 ==
sdotequiv ππin tutti gli es in tutti gli es
NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube
endHdaxcancella
elseHdaxcancella
thenjesifdontojfor
generaesbegin
doBtoiforxxxxxxH
j
j
nn
0)(1
)(
1 2211
==
=
==
Th i monomi non sono apprendibili da esempi negativi
I risultati
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
33
1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand
or
Kforall
3) la classe delle K-decision lists egrave apprendibile
)0()(
1)min(|min
10||)))((( 11
esistenonisebvCalloravni
booleanovettorevDLKCbkmmonomiom
conbmbmDLK
i
iii
jj
===
minusisinle
equivminus
Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola
Risultati positivi
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
34
)()(
freeondistributisensoinNPRPse
minusne
1) Le m-formule non sono apprendibili
2) Le funzioni booleane a soglia non sono apprendibili
3) Per K ge 2 le formule K-term-DNF non sono apprendibili
ge
Risultati negativi
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
eg the power set 2X of X for which it holds
VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)
42
Weighted majority algorithm
Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data
43
Weighted majority algorithm
i wi = 1
training example (x c(x))
q0 = q1 = 0
prediction algorithm ai
If ai(x)=0 then q0 = q0 + wi
If ai(x)=1 then q1 = q1 + wi
if q1 gt q0 then predict c(x)=1
if q1 lt q0 then predict c(x)=0
if q1 gt q0 then predict c(x)=0 or 1 at random
prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)
44
Weighted majority algorithm (WM)
Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most
24(k+log2n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof Since aj makes k mistakes (best in A) its final
weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12
The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the
final total weight W hence(12)k le n(34)M
from which
M le le 24(k+log2n)
Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool
(k+log2 n)-log2 (34)
- Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
- Computational models of cognitive phenomena
- A theory of the learnable (Valiant lsquo84)
- A theory of the learnable
- Probably approximately correct learning
- What we want to learn
- Whatrsquos new in pac learning
- Learning from examples
- The PAC model
- The PAC model
- Slide 11
- Slide 12
- Slide 13
- Learning algorithm
- The efficiency issue
- Slide 16
- Learning boolean functions
- Boolean functions and circuits
- Slide 19
- Slide 20
- Statistical PAC learning
- A simple upper bound
- Vapnik-Chervonenkis approach (1971)
- A general upper bound
- Graph of the growth function
- Upper and lower bounds
- An equivalent definition of VCdim
- Example 1
- Example 2
- Slide 30
- Classi di formule booleane
- I risultati
- Risultati positivi
- Risultati negativi
- Mistake bound model
- Slide 36
- Mistake bound for Find-S
- Slide 38
- Mistake bound for Halving
- Optimal mistake bound
- Slide 41
- Weighted majority algorithm
- Slide 43
- Weighted majority algorithm (WM)
- Slide 45
- Slide 46
-
35
Mistake bound model
So far how many examples needed to learn What about how many mistakes before
convergence Letrsquos consider similar setting to PAC learning
Instances drawn at random from X according to
distribution D Learner must classify each instance before receiving
correct classification from teacher Can we bound the number of mistakes learner makes
before converging
36
Mistake bound model
Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes
before converging to the correct hypothesis
Ie Learning takes place during the use of the system
not off-line Ex prediction of fraudolent use of credit cards
37
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean literals
FIND-S
Initialize h to the most specific hypothesis in
Hx1x1x2x2 hellip xnxn
For each positive training instance x Remove from h any literal not satisfied by x
Output h
38
Mistake bound for Find-S
If C H and training data noise free Find-S converges to an exact hypothesis
How many errors to learn cH (only positive examples can be misclassified)
The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)
39
Mistake bound for Halving A version space is maintained and refined (eg
Candidate-elimination) Prediction is based on majority vote among the
hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is
exactly classified) How many errors to exactly learn cH (H finite)
Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg
single hypothesis remaining) Note learning without mistakes possible
40
Optimal mistake bound
Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C
Formally for any learning algorithm A and any target concept c
MA(c) = max mistakes made by A to exactly learn c over all
possible training sequences MA(C) = maxcC MA(c)
Note Mfind-S(C) = n+1
MHalving(C) le log2(|C|) Opt(C) = minA MA(C)
ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm
41
Optimal mistake bound
Theorem (Littlestone, 1987)
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
e.g., the power set 2^X of X, for which it holds that
VC(2^X) = |X| = log2(|2^X|)
There exist concept classes for which VC(C) < Opt(C) < MHalving(C)
42
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a pool of prediction algorithms
Learns by altering the weight associated with each prediction algorithm
It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data
43
Weighted majority algorithm
∀i: wi ← 1
For each training example (x, c(x)):
  q0 ← 0, q1 ← 0
  For each prediction algorithm ai:
    If ai(x) = 0 then q0 ← q0 + wi
    If ai(x) = 1 then q1 ← q1 + wi
  If q1 > q0 then predict c(x) = 1
  If q1 < q0 then predict c(x) = 0
  If q1 = q0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm ai do: if ai(x) ≠ c(x) then wi ← β·wi (0 ≤ β < 1)
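A minimal Python sketch of the Weighted Majority loop above, assuming the pool is given as a list of prediction functions; weighted_majority and the toy pool are illustrative, and ties are broken deterministically toward 0 rather than at random.

```python
# Minimal sketch (not from the slides): Weighted Majority over a pool of
# prediction algorithms, each a function x -> {0, 1}.
# beta is the weight-decay factor (0 <= beta < 1); beta = 0 recovers Halving.

def weighted_majority(stream, pool, beta=0.5):
    weights = [1.0] * len(pool)
    mistakes = 0
    for x, label in stream:
        q0 = sum(w for a, w in zip(pool, weights) if a(x) == 0)
        q1 = sum(w for a, w in zip(pool, weights) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0          # ties broken toward 0 here
        if prediction != label:
            mistakes += 1
        # decay the weight of every algorithm that predicted wrongly
        weights = [w * beta if a(x) != label else w
                   for a, w in zip(pool, weights)]
    return mistakes, weights

# Toy pool: "always 0", "always 1", and "parity of x".
pool = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(x, x % 2) for x in range(8)]           # the third expert is perfect (k = 0)
print(weighted_majority(stream, pool))
```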
44
Weighted majority algorithm (WM)
Coincides with Halving for β = 0
Theorem - D any sequence of training examples, A any set of n prediction algorithms, k the minimum # of mistakes made by any aj ∈ A over D, β = 1/2. Then WM makes at most
2.4 (k + log2 n)
mistakes over D
45
Weighted majority algorithm (WM)
Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k
The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is then reduced by a factor of 1/2
The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D
46
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total weight W, hence
(1/2)^k ≤ n (3/4)^M
from which
M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4 (k + log2 n)
I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool
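Spelling out the algebra behind that last step (taking base-2 logarithms of both sides, as the slide implies):

```latex
% From (1/2)^k <= n (3/4)^M, take log_2 of both sides and solve for M:
-k \;\le\; \log_2 n + M \log_2 \tfrac{3}{4}
\;\Longrightarrow\;
M \;\le\; \frac{k + \log_2 n}{-\log_2 \tfrac{3}{4}} \;\approx\; 2.41\,(k + \log_2 n).
```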