Theoretical Approaches to Machine Learning


TRANSCRIPT

Page 1: Theoretical Approaches to Machine Learning

Theoretical Approaches to Machine Learning

Early work (e.g., Gold) ignored efficiency
• Only considers computability
• “Learning in the limit”

Later work considers tractable inductive learning
• With high probability, approximately learn
• Polynomial runtime, polynomial # of examples needed
• Results (usually) independent of the probability distribution for the examples

© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)

Page 2: Theoretical Approaches to Machine Learning

Identification in the Limit

Definition: After some finite number of examples, the learner will have learned the correct concept (though it might not even know it!). Correct means agrees with the target concept on labels for all data.

Example: Consider noise-free learning from the class {f | f(n) = a*n mod b}, where a and b are natural numbers.

General Technique: “Innocent Until Proven Guilty”
Enumerate all possible answers; search for the simplest answer consistent with the training examples seen so far. Sooner or later this will hit the solution.

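Below is a minimal Python sketch (my illustration, not from the slides) of the “innocent until proven guilty” enumeration for the class {f | f(n) = a*n mod b}: candidate (a, b) pairs are enumerated in a fixed dovetailed order, and the learner keeps the first candidate consistent with everything seen so far.

```python
from itertools import count

def candidates():
    """Enumerate every (a, b) pair of natural numbers in a dovetailed order,
    so each pair appears after finitely many steps (an assumed ordering)."""
    for s in count(2):                 # s = a + b
        for a in range(1, s):
            yield a, s - a

def guess(examples):
    """Return the first (earliest-enumerated) hypothesis consistent with the
    labeled examples, given as (n, f_n) pairs."""
    for a, b in candidates():
        if all(a * n % b == fn for n, fn in examples):
            return a, b

# As examples of the hidden target f(n) = 3*n mod 7 stream in, the guesses
# eventually stabilize on a hypothesis that agrees with the target everywhere.
examples = []
for n in range(1, 10):
    examples.append((n, 3 * n % 7))
    print(n, guess(examples))
```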

Page 3: Theoretical Approaches to Machine Learning

Some Results (Gold)

• Computable languages (Turing machines) can be learned in the limit using inference by enumeration.
• If the data set is limited to positive examples only, then only finite languages can be learned in the limit.


Page 4: Theoretical Approaches to Machine Learning

Solution for {f | f(n) = a*n mod b}

[Figure: a grid of candidate values a = 1, 2, 3, 4, … (rows) by b = 1, 2, 3, 4, … (columns) to be enumerated, alongside an observed data column n, f(n) with the example n = 9, f(n) = 17.]


Page 5: Theoretical Approaches to Machine Learning

The Mistake-Bound Model (Littlestone)

Framework
• Teacher shows input I
• ML algorithm guesses output O
• Teacher shows the correct answer
• Can we upper bound the number of errors the learner will make?


Page 6: Theoretical Approaches to Machine Learning

The Mistake-Bound Model

Example: Learn a conjunct from N predicates and their negations
• Initial h = p1 ∧ ¬p1 ∧ … ∧ pN ∧ ¬pN
• For each + example, remove the remaining terms that do not match


Page 7: Theoretical Approaches to Machine Learning

The Mistake-Bound Model

Worst case # of mistakes? 1 + N
• The first + example will remove N terms from h_initial
• Each subsequent error on a + example will remove at least one more term (the learner never makes a mistake on − examples)

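A minimal Python sketch (an assumed implementation, not the course's code) of the elimination algorithm above: start with all 2N literals, and on each mistake on a positive example drop every literal that example falsifies.

```python
from itertools import product

def predict(h, x):
    """h is a set of literals (i, v) meaning 'feature i must equal v'.
    Predict positive iff x satisfies every literal in h."""
    return all(x[i] == v for i, v in h)

def learn_conjunct(stream, n):
    """Mistake-bound elimination: h starts as p1 & ~p1 & ... & pn & ~pn.
    stream yields (x, label) pairs; returns (h, number of mistakes made)."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in stream:
        if predict(h, x) != label:
            mistakes += 1
            if label:  # for a conjunctive target, mistakes occur only on + examples
                h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes

# Usage with a hypothetical target p0 AND NOT p2 over n = 4 features:
n = 4
target = lambda x: x[0] and not x[2]
data = [(x, target(x)) for x in product([False, True], repeat=n)]
h, mistakes = learn_conjunct(data, n)
print(sorted(h), mistakes)   # mistakes is at most 1 + n, as argued above
```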

Page 8: Theoretical Approaches to Machine Learning

Equivalence Query Model (Angluin)

Framework
• ML algorithm guesses a concept: is the target equivalent to this guess?
• Teacher either says “yes” or returns a counterexample (an example labeled differently by target and guess)
• Can we upper bound the number of errors the learner will make?
• Time to compute the next guess is bounded by poly(|data seen so far|)

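A generic sketch of the protocol in Python (illustrative interfaces I am assuming, not part of the slides): the learner proposes hypotheses, and the teacher either accepts one or returns a counterexample that is folded into the next guess.

```python
def eq_learn(propose, equivalence_query, max_rounds=1000):
    """Generic equivalence-query loop.

    propose(counterexamples)  -> a hypothesis consistent with the counterexamples
    equivalence_query(h)      -> None if h is equivalent to the target, otherwise
                                 a counterexample (x, correct_label)
    """
    counterexamples = []
    for _ in range(max_rounds):
        h = propose(counterexamples)
        ce = equivalence_query(h)
        if ce is None:
            return h, len(counterexamples)   # number of errors (queries answered "no")
        counterexamples.append(ce)
    raise RuntimeError("query bound exceeded")
```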

Page 9: Theoretical Approaches to Machine Learning

Probably Approximately Correct (PAC) Learning

PAC learning (Valiant ’84)

Given
  X: domain of possible examples
  C: class of possible concepts to label X
  c ∈ C: target concept
  ε, δ: correctness bounds


Page 10: Theoretical Approaches to Machine Learning

Probably Approximately Correct (PAC) Learning

• For any target c in C and any distribution D on X
• Given at least N = poly(|c|, 1/ε, 1/δ) examples drawn randomly, independently from X
• Do: with probability 1 − δ, return an h in C whose accuracy is at least 1 − ε
• In other words: Prob[error(h, c) > ε] < δ
• In time polynomial in |data|

[Figure: target concept c and hypothesis h drawn as overlapping regions; the shaded regions where they disagree are where errors occur.]


Page 11: Theoretical Approaches to Machine Learning

Relationships Among Models of Tractable Learning

• Poly mistake-bounded with poly update time = EQ-learning
• EQ-learning implies PAC-learning
  • Simulate the teacher by a poly-sized random sample; if all examples are labeled correctly, say “yes”; otherwise, return an incorrectly labeled example
  • On each query, increase the sample size based on a Bonferroni correction


Page 12: Theoretical Approaches to Machine Learning

To Prove a Concept Class is PAC-learnable

1. Show it’s EQ-learnable, OR
2. Show the following:
  • There exists an efficient algorithm for the consistency problem (find a hypothesis consistent with a data set in time poly in the size of the data set)
  • A poly-sized sample is sufficient to give us our accuracy guarantees


Page 13: Theoretical Approaches to Machine Learning

Useful Trick

A Maclaurin series, for −1 < x ≤ 1:

  ln(1 + x) = x − x^2/2 + x^3/3 − x^4/4 + …

We only care about −1 < x < 0. Rewriting −x as ε, we can derive that for 0 < ε < 1: ln(1 − ε) ≤ −ε.

Because ε is positive, we can put both sides of this last inequality into an exponent to obtain:

  1 − ε ≤ e^(−ε)

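The same argument written out as a math block (my transcription; the series line restores the formula that appeared only as an image on the original slide):

```latex
\begin{align*}
\ln(1+x) &= x - \tfrac{x^2}{2} + \tfrac{x^3}{3} - \tfrac{x^4}{4} + \cdots
  && (-1 < x \le 1) \\
\ln(1-\varepsilon) &= -\varepsilon - \tfrac{\varepsilon^2}{2} - \tfrac{\varepsilon^3}{3} - \cdots
  \;\le\; -\varepsilon
  && (0 < \varepsilon < 1;\ \text{every dropped term is negative}) \\
1-\varepsilon &\le e^{-\varepsilon}
  && (\text{exponentiate both sides; } e^{x} \text{ is increasing})
\end{align*}
```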

Page 14: Theoretical Approaches to Machine Learning

Assume EQ Query Bound (or Mistake Bound) M

• On each equivalence query, draw (1/ε) ln(M/δ) examples
• What is the probability that on any of the at most M rounds, we accept a “bad” hypothesis?
• At most M (1 − ε)^((1/ε) ln(M/δ)) ≤ M e^(−ln(M/δ)) (using the useful trick) = δ


Page 15: Theoretical Approaches to Machine Learning

If the Algorithm Doesn’t Know M in Advance (recall |c|):

• On the ith equivalence query, draw (1/ε) (ln(1/δ) + i ln 2) examples
• What is the probability that we accept a “bad” hypothesis at one query?
• (1 − ε)^((1/ε)(ln(1/δ) + i ln 2)) ≤ e^(−(ln(1/δ) + i ln 2)) (using the useful trick) = δ/2^i
• So the probability that we ever accept a “bad” hypothesis is at most δ (the per-query bounds δ/2^i sum to at most δ over all queries).

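A small Python helper (an illustrative sketch, not from the slides) that computes the per-query sample sizes from this slide and the previous one, and numerically checks that the total probability of ever accepting a bad hypothesis stays below δ.

```python
import math

def samples_known_m(eps, delta, m_bound):
    """Examples per equivalence query when the query/mistake bound M is known."""
    return math.ceil((1 / eps) * math.log(m_bound / delta))

def samples_unknown_m(eps, delta, i):
    """Examples on the i-th query (i = 1, 2, ...) when M is not known in advance."""
    return math.ceil((1 / eps) * (math.log(1 / delta) + i * math.log(2)))

def failure_bound_unknown_m(eps, delta, rounds):
    """Union bound on ever accepting a hypothesis with error > eps."""
    return sum((1 - eps) ** samples_unknown_m(eps, delta, i)
               for i in range(1, rounds + 1))

eps, delta = 0.1, 0.05
print(samples_known_m(eps, delta, m_bound=50))      # per-query sample size, M known
print(failure_bound_unknown_m(eps, delta, 1000))    # stays below delta = 0.05
```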

Page 16: Theoretical Approaches to Machine Learning

Using the First Method: kDNF

• Write down the disjunction of all conjunctions of at most k literals (features or negated features)
• Any counterexample will be an actual negative
• Repeat until correct: given a counterexample, delete the disjuncts that cover it (are consistent with it)

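A Python sketch of this kDNF learner (the set-of-terms representation is my assumption, not from the slides): the hypothesis is the disjunction of all surviving terms, and each counterexample, which is always a true negative, removes every term it satisfies.

```python
from itertools import combinations, product

def all_terms(n, k):
    """All conjunctions of at most k literals over features 0..n-1; a term is a
    frozenset of literals (i, v) meaning 'feature i == v'."""
    terms = set()
    for size in range(1, k + 1):
        for feats in combinations(range(n), size):
            for vals in product([False, True], repeat=size):
                terms.add(frozenset(zip(feats, vals)))
    return terms

def satisfies(term, x):
    return all(x[i] == v for i, v in term)

def predict(hypothesis, x):
    """kDNF: predict positive iff some surviving term is satisfied."""
    return any(satisfies(t, x) for t in hypothesis)

def update(hypothesis, counterexample):
    """Delete every disjunct that covers (is consistent with) the counterexample."""
    return {t for t in hypothesis if not satisfies(t, counterexample)}

# Usage sketch: start from all terms, then repeatedly ask the teacher an
# equivalence query and fold in each returned counterexample until it says "yes".
h = all_terms(n=4, k=2)
h = update(h, (True, False, True, False))   # a hypothetical counterexample
print(len(h))
```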

Page 17: Theoretical Approaches to Machine Learning

Using the Second Method

• If the hypothesis space is finite, can show a poly-sized sample is sufficient
• If the hypothesis space is parameterized by n, and grows only exponentially in n, can show a poly-sized sample is sufficient


Page 18: Theoretical Approaches to Machine Learning

How Many Examples Are Needed to be PAC?

• Consider finite hypothesis spaces
• Let Hbad ≡ {h1, …, hz}, the set of hypotheses whose (“test set”) error is > ε
• Goal: eliminate all items in Hbad via (noise-free) training examples


Page 19: Theoretical Approaches to Machine Learning

How Many Examples Are Needed to be PAC?

How can an h look bad, even though it is correct on all the training examples?
• If we never see any examples in the shaded regions
• We’ll compute an N such that the odds of this are sufficiently low (recall, N = number of examples)

[Figure: target c and hypothesis h as overlapping regions; the shaded regions are where they disagree.]


Page 20: Theoretical Approaches to Machine Learning

Hbad

• Consider H1 ∈ Hbad and ex ∈ N
• What is the probability that H1 is consistent with ex?

  Prob[consistent_A(ex, H1)] ≤ 1 − ε
  (since H1 is bad, its error rate is at least ε)


Page 21: Theoretical Approaches to Machine Learning

Hbad (cont.)

What is the probability that H1 is consistent with all N examples?

  Prob[consistent_B(N, H1)] ≤ (1 − ε)^|N|
  (by the iid assumption)


Page 22: Theoretical Approaches to Machine Learning

Hbad (cont.)

What is the probability that some member of Hbad is consistent with the examples in N?

  Prob[consistent_C(N, Hbad)] = Prob[consistent_B(N, H1) ∨ … ∨ consistent_B(N, Hz)]
    ≤ |Hbad| × (1 − ε)^|N|   // P(A ∨ B) = P(A) + P(B) − P(A ∧ B); ignore the P(A ∧ B) term in the upper-bound calc
    ≤ |H| × (1 − ε)^|N|      // Hbad ⊆ H

Page 23: Theoretical Approaches to Machine Learning

Solving for |N|

We have

  Prob[consistent_C(N, Hbad)] ≤ |H| × (1 − ε)^|N| < δ

Recall that we want the probability of a bad concept surviving to be less than δ, our bound on learning a poor concept.

Assume that if many consistent hypotheses survive, we get unlucky and choose a bad one (we’re doing a worst-case analysis).


Page 24: Theoretical Approaches to Machine Learning

Solving for |N| (the number of examples needed to be confident of getting a good model)

Solving,

  |N| > [ln(1/δ) + ln(|H|)] / −ln(1 − ε)

Since ε ≤ −ln(1 − ε) over [0, 1), we get

  |N| > [ln(1/δ) + ln(|H|)] / ε

(Aside: notice that this calculation assumed we could always find a hypothesis that fits the training data.)

Notice we made NO assumptions about the probability distribution of the data (other than that it does not change).


Page 25: Theoretical Approaches to Machine Learning

Example: Number of Instances Needed

Assume
  F = 100 binary features
  H = all (pure) conjuncts
    [3^F possibilities (∀i: use f_i, use ¬f_i, or ignore f_i), so ln|H| = F × ln 3 ≈ F]
  ε = 0.01
  δ = 0.01

N = [ln(1/δ) + ln(|H|)] / ε = 100 × [ln(100) + 100] ≈ 10^4

But how many real-world concepts are pure conjuncts with noise-free training data?

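A short Python check (illustrative, not from the slides) of the bound |N| > [ln(1/δ) + ln|H|] / ε, reproducing the rough 10^4 figure above for pure conjuncts over 100 binary features:

```python
import math

def pac_sample_size(ln_h, eps, delta):
    """Sufficient sample size for a finite hypothesis space with log-size ln_h."""
    return math.ceil((math.log(1 / delta) + ln_h) / eps)

F = 100                    # binary features
ln_h = F * math.log(3)     # pure conjuncts: |H| = 3^F, so ln|H| = F ln 3 ≈ F
eps = delta = 0.01
print(pac_sample_size(ln_h, eps, delta))   # about 1.1 * 10^4, matching the slide
```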

Page 26: Theoretical Approaches to Machine Learning

Two Senses of Complexity

Sample complexity (number of examples needed)

vs.

Time complexity (time needed to find an h ∈ H that is consistent with the training examples)


Page 27: Theoretical Approaches to Machine Learning

Complexity (cont.)

• Some concepts require a polynomial number of examples but an exponential amount of time (in the worst case)
• E.g., training neural networks is NP-hard (recall BP is a “greedy” algorithm that finds a local min)


Page 28: Theoretical Approaches to Machine Learning

Dealing with Infinite Hypothesis Spaces

• Can use the Vapnik-Chervonenkis (’71) dimension (VC-dim)
• Provides a measure of the capacity of a hypothesis space

VC-dim: given a hypothesis space H, the VC-dim is the size of the largest set of examples that can be completely fit by H, no matter how the examples are labeled.

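A brute-force Python sketch of the definition (illustrative only, and feasible just for tiny cases): a point set is shattered by H when every possible labeling is realized exactly by some hypothesis.

```python
from itertools import product

def shatters(hypotheses, points):
    """hypotheses: a finite collection of functions point -> bool.
    True iff every labeling of `points` is matched by some hypothesis."""
    hyps = list(hypotheses)
    for labeling in product([False, True], repeat=len(points)):
        if not any(all(h(p) == y for p, y in zip(points, labeling)) for h in hyps):
            return False
    return True

# Usage: threshold classifiers on the line, h_t(x) = (x >= t). They can fit any
# single point either way, but no pair (a smaller point cannot be + while a larger one is -).
thresholds = [lambda x, t=t: x >= t for t in (-1, 0.5, 2, 10)]
print(shatters(thresholds, [1.0]))         # True
print(shatters(thresholds, [1.0, 3.0]))    # False
```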

Page 29: Theoretical Approaches to Machine Learning

VC-dim Impact

• If the number of examples << VC-dim, then memorizing the training set is trivial and generalization is likely to be poor
• If the number of examples >> VC-dim, then the algorithm must generalize to do well on the training set and will likely do well in the future


Page 30: Theoretical Approaches to Machine Learning

Samples of VC-dim

Finite H:

  VC-dim ≤ log2 |H|

(if d examples, 2^d different labelings are possible, and we must have 2^d ≤ |H| if all of those labeling functions are to be in H)


Page 31: Theoretical Approaches to Machine Learning

An Infinite Hypothesis Space with a Finite VC-Dim

H is the set of lines in 2D

Can cover 1 example no matter how it is labeled
[Figure: a single point, labeled 1.]


Page 32: Theoretical Approaches to Machine Learning

Example 2 (cont.)

Can cover 2 examples no matter how they are labeled
[Figure: two points, labeled 1 and 2.]


Page 33: Theoretical Approaches to Machine Learning

Example 2 (cont.)

Can cover 3 examples no matter how they are labeled
[Figure: three points, labeled 1, 2, and 3.]


Page 34: Theoretical Approaches to Machine Learning

Example 2 (cont.)

Cannot cover/separate if 1 and 4 are +, but 2 and 3 are − (our old friend, ex-or)
[Figure: four points arranged in a square, labeled 1, 2 on top and 3, 4 below.]

Notice |H| = ∞ but VC-dim = 3.
For N dimensions and (N−1)-dimensional hyperplanes, VC-dim = N + 1.

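A quick numerical illustration (a sketch that assumes scipy is available; not from the slides): a feasibility LP confirms that a line can realize the top-vs-bottom labeling of the four square corners but not the ex-or labeling, which is why the VC-dim of halfplanes in 2D is 3 rather than 4.

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether some w, b satisfies y_i * (w . x_i + b) >= 1 for all i
    (labels in {+1, -1}), by solving the corresponding feasibility LP."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])   # rows: -y_i * (x_i, 1)
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

square = [(0, 1), (1, 1), (0, 0), (1, 0)]             # points 1, 2, 3, 4 above
print(linearly_separable(square, [+1, +1, -1, -1]))   # True: top vs. bottom
print(linearly_separable(square, [+1, -1, -1, +1]))   # False: 1 and 4 are +, 2 and 3 are -
```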

Page 35: Theoretical Approaches to Machine Learning

More on “Shattering”

What about collinear points? (Three collinear points cannot be shattered by lines, but that does not matter: the definition only requires that some set of d examples be shattered.)

If there exists some set of d examples that H can fully fit for all 2^d labelings of these d examples, then VC(H) ≥ d.


Page 36: Theoretical Approaches to Machine Learning

Some VC-Dim Theorems

Theorem: H is PAC-learnable iff its VC-dim is finite.

Theorem: Sufficient to be PAC to have # of examples > (1/ε) max[4 ln(2/δ), 8 ln(13/ε) VC-dim(H)]

Theorem: Any PAC algorithm needs at least Ω((1/ε) [ln(1/δ) + VC-dim(H)]) examples.

[No need to memorize these for the exam]
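A tiny Python helper (illustrative only) that evaluates the sufficient-sample-size expression from the second theorem above:

```python
import math

def pac_sample_size_vc(vc_dim, eps, delta):
    """Examples sufficient for PAC learning, per the VC-based bound above."""
    return math.ceil((1 / eps) * max(4 * math.log(2 / delta),
                                     8 * math.log(13 / eps) * vc_dim))

# E.g., halfplanes in 2D have VC-dim = 3 (previous slides):
print(pac_sample_size_vc(vc_dim=3, eps=0.1, delta=0.05))
```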

Page 37: Theoretical Approaches to Machine Learning

To Show a Concept is NOT PAC-learnable

• Show the consistency problem is NP-hard (hardness assumes P ≠ NP), OR
• Show the VC-dimension grows at a rate not bounded by any polynomial


Page 38: Theoretical Approaches to Machine Learning

Be Careful

• It can be the case that consistency is hard for a concept class, but not for a larger class
  • Consistency is NP-hard for k-term DNF
  • Consistency is easy for DNF (PAC still an open question)
• More robust negative results are for PAC-prediction
  • Hypothesis space not constrained to equal the concept class
  • Hardness results based on cryptographic assumptions, such as assuming efficient factoring is impossible


Page 39: Theoretical Approaches to Machine Learning

Some Variations in Definitions

• The original definition of PAC-learning required:
  • Run-time to be polynomial
  • The hypothesis to be within the concept class (otherwise, called PAC-prediction)
• Now this version is often called polynomial-time, proper PAC-learning
• Membership queries: can ask for labels of specific examples


Page 40: Theoretical Approaches to Machine Learning

Some Results…

• Can PAC-learn k-DNF (exponential in k, but k is now a constant)
• Can NOT proper-PAC-learn k-term DNF, but can PAC-learn it by k-CNF
• Can NOT PAC-learn Boolean formulae unless one can crack RSA (can factor fast)
• Can PAC-learn decision trees with the addition of membership queries
• Unknown whether one can PAC-learn DNF or DTs (the “holy grail” questions of COLT)


Page 41: Theoretical Approaches to Machine Learning

Some Other COLT Topics

• COLT + clustering, + k-NN, + RL, + EBL (ch. 11 of Mitchell), + SVMs, + ILP, + ANNs, etc.
• Average-case analysis (vs. worst case)
• Learnability of natural languages (is language innate?)
• Learnability in parallel


Page 42: Theoretical Approaches to Machine Learning

Summary of COLT

Strengths
• Formalizes the learning task
• Allows for imperfections (e.g., ε and δ in PAC)
• Work on boosting (later) is an excellent case of ML theory influencing ML practice
• Shows which concepts are intrinsically hard to learn (e.g., k-term DNF)


Page 43: Theoretical Approaches to Machine Learning

Summary of COLT

Weaknesses
• Most analyses are worst case
• Use of “prior knowledge” not captured very well yet
