TRANSCRIPT
Machine Learning Fall 2017
Computational Learning Theory: Occam’s Razor
Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others
This lecture: Computational Learning Theory
• The Theory of Generalization
• Probably Approximately Correct (PAC) learning
• Positive and negative learnability results
• Agnostic Learning
• Shattering and the VC dimension
Where are we?
• The Theory of Generalization
  – When can we trust the learning algorithm?
  – What functions can be learned?
  – Batch Learning
• Probably Approximately Correct (PAC) learning
• Positive and negative learnability results
• Agnostic Learning
• Shattering and the VC dimension
This section
✓ Analyze a simple algorithm for learning conjunctions
✓ Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor
Occam’s Razor
Named after William of Occam (AD 1300s).
Prefer simpler explanations over more complex ones.
“Numquam ponenda est pluralitas sine necessitate”
(Never posit plurality without necessity.)
Historically, a widely prevalent idea across different schools of philosophy.
Towards formalizing Occam’s Razor

Claim: The probability that there is a hypothesis h ∈ H that
1. is consistent with m examples, and
2. has err_D(h) > ε
is less than |H|(1 − ε)^m. (Assuming consistency: the guarantee concerns hypotheses that fit the training set perfectly.)

Proof: Let h be such a bad hypothesis, one with error greater than ε (that is, consistent yet bad). The probability that h is consistent with one example is Pr[f(x) = h(x)] < 1 − ε. The training set consists of m examples drawn independently, so the probability that h is consistent with all m examples is less than (1 − ε)^m. By the union bound over H, the probability that some bad hypothesis in H is consistent with m examples is less than |H|(1 − ε)^m. ∎
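The bound is easy to check numerically. Below is a minimal Monte Carlo sketch (not from the lecture; eps, m, and trials are arbitrary illustrative values): it estimates the chance that one fixed hypothesis with true error exactly ε stays consistent with m i.i.d. examples, and compares it to (1 − ε)^m.

```python
import random

# One "bad" hypothesis with true error exactly eps agrees with a random
# example with probability 1 - eps, so it survives m i.i.d. examples
# with probability (1 - eps)^m.
eps, m, trials = 0.1, 30, 200_000

survived = sum(
    all(random.random() > eps for _ in range(m))  # consistent on all m draws
    for _ in range(trials)
)

print(f"empirical Pr[consistent with all m examples] = {survived / trials:.5f}")
print(f"(1 - eps)^m                                  = {(1 - eps) ** m:.5f}")
# The claim then applies a union bound: the probability that *some* bad
# hypothesis in H survives is at most |H| * (1 - eps)^m.
```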
Occam’s Razor

The probability that there is a hypothesis h ∈ H that is
1. consistent with m examples, and
2. has err_D(h) > ε
is less than |H|(1 − ε)^m.

Just like before, we want to make this probability small, say smaller than δ:
|H|(1 − ε)^m < δ
Taking logarithms:
ln(|H|) + m ln(1 − ε) < ln δ
We know that ln(1 − ε) < −ε. Let’s use it to get a safer (more conservative) condition on m:
m > (1/ε)(ln |H| + ln(1/δ))
That is, if m > (1/ε)(ln |H| + ln(1/δ)), then the probability of getting a bad hypothesis is small.
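For completeness, here is the algebra connecting the two displayed inequalities, written out in LaTeX (a routine fill-in using ln(1 − ε) ≤ −ε for ε ∈ (0, 1)):

```latex
\begin{align*}
|H|(1-\epsilon)^m < \delta
  \;&\Longleftrightarrow\; \ln|H| + m\ln(1-\epsilon) < \ln\delta .\\
\intertext{Since $\ln(1-\epsilon) \le -\epsilon$ for $\epsilon \in (0,1)$,
it suffices (a ``safer'', more conservative requirement) that}
\ln|H| - m\epsilon \;&<\; \ln\delta
  \;\Longleftrightarrow\;
  m \;>\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
\end{align*}
```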
Occam’s Razor

Let H be any hypothesis space. With probability 1 − δ, a hypothesis h ∈ H that is consistent with a training set of size m will have an error < ε on future examples if
m > (1/ε)(ln |H| + ln(1/δ))

This is called Occam’s Razor because it expresses a preference towards smaller hypothesis spaces. It shows when an m-consistent hypothesis generalizes well (i.e., error < ε). Complicated/larger hypothesis spaces are not necessarily bad, but simpler ones are unlikely to fool us by being consistent with many examples!

Reading the bound (see the numeric sketch below):
1. Expecting lower error increases sample complexity (i.e., more examples are needed for the guarantee).
2. A larger hypothesis space makes learning harder (i.e., higher sample complexity).
3. Wanting higher confidence in the classifier we produce also raises the sample complexity.
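To see these three effects concretely, here is a short numeric sketch. The function name and the test values are illustrative; the count |H| = 3^n for conjunctions over n boolean variables (each variable appears positive, negated, or not at all) is assumed to match the conjunction class from earlier in the lecture.

```python
from math import ceil, log

def occam_sample_size(h_size: int, eps: float, delta: float) -> int:
    """Smallest integer m with m > (1/eps) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1.0 / delta)) / eps)

# Baseline: conjunctions over n = 10 variables, eps = 0.1, delta = 0.05.
print(occam_sample_size(3 ** 10, eps=0.1,  delta=0.05))   # ~140
print(occam_sample_size(3 ** 10, eps=0.01, delta=0.05))   # 1. lower error  -> ~1399
print(occam_sample_size(3 ** 20, eps=0.1,  delta=0.05))   # 2. larger H     -> ~250
print(occam_sample_size(3 ** 10, eps=0.1,  delta=0.001))  # 3. higher conf. -> ~179
```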
Consistent Learners and Occam’s Razor

From the definition, we get the following general scheme for PAC learning.

Given a sample D of m examples:
• Find some h ∈ H that is consistent with all m examples.
• If m is large enough, a consistent hypothesis must be close enough to f.
• Check that m does not have to be too large (i.e., polynomial in the relevant parameters): we showed that the “closeness” guarantee requires that
  m > (1/ε)(ln |H| + ln(1/δ))
• Show that the consistent hypothesis h ∈ H can be computed efficiently.

We worked out the details for conjunctions:
• The Elimination algorithm finds a hypothesis h that is consistent with the training set (easy to compute); see the sketch below.
• We showed directly that if we have sufficiently many examples (polynomial in the parameters), then h is close to the target function.
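The slide names the Elimination algorithm without restating it; below is a minimal sketch of the standard version, assuming examples are 0/1 feature tuples with 0/1 labels (the function names and the toy data are illustrative, not from the lecture). Starting from the conjunction of all 2n literals, each positive example deletes every literal it falsifies.

```python
def elimination(examples):
    """Learn a conjunction of literals from labeled boolean examples.

    `examples` is a list of (x, y) pairs, where x is a tuple of 0/1
    feature values and y is the 0/1 label. A literal (i, True) means
    x_i must be 1; (i, False) means x_i must be 0.
    """
    n = len(examples[0][0])
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, y in examples:
        if y == 1:  # negative examples are ignored by this algorithm
            # Drop every literal that this positive example falsifies.
            literals -= {(i, v) for (i, v) in literals if bool(x[i]) != v}
    return literals

def predict(literals, x):
    """h(x) = 1 iff x satisfies every remaining literal."""
    return all(bool(x[i]) == v for (i, v) in literals)

# Toy usage: the (hypothetical) target is x0 AND NOT x2.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = elimination(data)
print(sorted(h))               # [(0, True), (2, False)]
print(predict(h, (1, 0, 0)))   # True
print(predict(h, (1, 1, 1)))   # False
```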
Exercises

We have seen the decision tree learning algorithm. Suppose our problem has n binary features. What is the size of the hypothesis space?

Are decision trees efficiently PAC learnable?