Page 1:

10/29/19


Machine Learning for Intelligent Systems

Instructors: Nika Haghtalab (this time) and Thorsten Joachims

Lecture 17: Statistical Learning Theory 1

Reading: UML 4

Prelim

Curved: $\text{CurvedGrade} = 100 - 0.75\,(95 - \text{RawPoints})$ (worked example below)

Harder time with the following concepts:
1. Perceptron update bound
2. Leave-one-out error of kernelized SVM
3. Neural network construction
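As a quick worked example of the curve formula above: a raw score of 85 becomes $100 - 0.75\,(95 - 85) = 92.5$.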

xkcd comics


Replication Crisis in Science

Convince me of your Psychic Abilities?

Game
• I'm thinking of $m$ bits (0,1).
• If somebody in the class guesses my bit sequence, that person clearly has telepathic abilities – right?
• Think of a 6-digit 0,1 sequence, e.g., 101010.

Question:
• If at least one of $|H|$ players guesses the bit sequence correctly, is there any significant evidence that they have telepathic abilities?
• How large would $m$ and $|H|$ have to be for us to trust this test?

Testing for psychic power

Setup:
• $|H|$ students, $H = \{h_1, \dots, h_{|H|}\}$
• $m$ bits (length of sequence)
• $p = 0.5$ probability of error on a single bit, if you're not psychic.

Prob. that student $i$ guesses my code without being psychic?
$P(h_i \text{ correct} \mid h_i \text{ not psychic}) = (1 - p)^m$

Prob. that at least one student guesses my code, without anyone being psychic?
$P(h_1 \text{ correct} \vee \dots \vee h_{|H|} \text{ correct} \mid \text{nobody is psychic}) = 1 - \big(1 - (1 - p)^m\big)^{|H|}$

How long should the sequence be, so we are $1 - \delta$ confident?
$m > \log_{1-p}\!\big(1 - (1 - \delta)^{1/|H|}\big)$
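To make the bound concrete, here is a minimal Python sketch (my own; the function name and defaults are illustrative, not from the lecture) that computes the smallest $m$ for a given class size and confidence level:

```python
import math

def min_sequence_length(num_students: int, p: float = 0.5, delta: float = 0.05) -> int:
    """Smallest m such that P(some non-psychic student guesses all m bits) <= delta.

    Solves m > log_{1-p}(1 - (1 - delta)^(1/|H|)).
    """
    threshold = 1.0 - (1.0 - delta) ** (1.0 / num_students)
    # log base (1-p) via a ratio of logs; both logs are negative, so the ratio is positive.
    return math.ceil(math.log(threshold) / math.log(1.0 - p))

print(min_sequence_length(100))  # -> 11: a class of 100 students needs only 11 bits
```

Sanity check: with $m = 11$ and $|H| = 100$, the probability that at least one non-psychic student guesses the sequence is $1 - (1 - 2^{-11})^{100} \approx 0.048 \le 0.05$.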

Page 2:


Useful Formulas

• Binomial distribution: the prob. of observing $k$ heads in $m$ independent coin tosses, where each toss is heads with prob. $p$, is
$\Pr(X = k \mid p, m) = \frac{m!}{k!\,(m-k)!}\, p^k (1-p)^{m-k}.$

• Hoeffding's inequality: in the above binomial distribution,
$\Pr\!\left(\left|\frac{k}{m} - p\right| > \epsilon\right) \le 2\exp(-2m\epsilon^2).$

• Union bound: for any events $E_i$,
$\Pr(E_1 \vee E_2 \vee \dots \vee E_k) \le \sum_{i=1}^{k} \Pr(E_i).$

• No-name lemma: $1 - \epsilon \le e^{-\epsilon}.$

[Figures: plot of $\exp(-x)$ vs. $1 - x$, illustrating the no-name lemma; Venn diagram of events $E_1$, $E_2$, illustrating $\Pr(E_1 \vee E_2) \le \Pr(E_1) + \Pr(E_2)$.]
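These inequalities are easy to sanity-check by simulation; here is a small sketch (mine, not from the slides) that estimates the deviation probability of a Binomial($m$, $p$) sample mean and compares it to Hoeffding's bound:

```python
import math
import random

def deviation_prob(m: int, p: float, eps: float, trials: int = 100_000) -> float:
    """Monte Carlo estimate of Pr(|k/m - p| > eps) for k ~ Binomial(m, p)."""
    hits = 0
    for _ in range(trials):
        k = sum(random.random() < p for _ in range(m))
        if abs(k / m - p) > eps:
            hits += 1
    return hits / trials

m, p, eps = 100, 0.5, 0.1
print(deviation_prob(m, p, eps))         # empirically ~0.035
print(2 * math.exp(-2 * m * eps ** 2))   # Hoeffding bound: ~0.271
```

As expected, the bound holds (and is quite loose at these parameters).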


Fundamental Questions

Questions in statistical learning theory:
• Trying to learn a classifier from $H$?
• How good is the learned rule after $m$ examples?
• How many examples are needed for the learned rule to be accurate?
• What can be learned and what cannot?
• Is there a universally best learning algorithm?

In particular, we will address:
• What kind of a guarantee on the true error of a classifier can I get if I know its training error?

Recall: Prediction as Learning

Sample & Generalization Errors

Sample (Empirical) Error
The sample error of hypothesis $h$ on sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, denoted by $err_S(h)$, is
$err_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\big(h(x_i) \neq y_i\big).$

Generalization (Prediction/True) Error
The generalization error of hypothesis $h$ on distribution $P(X, Y)$, denoted by $err_P(h)$, is
$err_P(h) = \Pr_{(x, y) \sim P}\big(h(x) \neq y\big) = \sum_{(x, y)} \mathbb{1}\big(h(x) \neq y\big) \cdot P(X = x, Y = y).$
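Here is a toy illustration (my own, not from the lecture) that computes both quantities for a fixed hypothesis: the generalization error exactly from a small discrete $P(X, Y)$, and the sample error from i.i.d. draws:

```python
import random

# A toy discrete distribution P(X, Y): x in {0, 1, 2}, y in {-1, +1}.
# (The probabilities are made up for illustration.)
P = {(0, -1): 0.30, (0, +1): 0.10, (1, -1): 0.10,
     (1, +1): 0.20, (2, -1): 0.05, (2, +1): 0.25}

h = lambda x: -1 if x == 0 else +1  # a fixed hypothesis

# Generalization error: total P-mass of the (x, y) pairs that h misclassifies.
err_P = sum(prob for (x, y), prob in P.items() if h(x) != y)

# Sample error: average 0/1 loss on m i.i.d. samples drawn from P.
m = 1000
pairs, weights = zip(*P.items())
S = random.choices(pairs, weights=weights, k=m)
err_S = sum(h(x) != y for x, y in S) / m

print(f"err_P(h) = {err_P:.3f}, err_S(h) ~ {err_S:.3f}")  # 0.250 vs. ~0.25
```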


Prediction as Learning

[Diagram: the real-world process $P(X, Y)$ generates a training dataset $S_{train} = \{(x_1, y_1), \dots, (x_m, y_m)\}$ and a testing dataset $S_{test} = \{(x_{m+1}, y_{m+1}), \dots\}$, both drawn i.i.d.; the Learner maps $S_{train}$ to a hypothesis $h_{S_{train}}$.]

Goal: Find $h$ with small prediction error $err_P(h)$ on $P(X, Y)$.
Strategy: Find an $h \in H$ with small sample error $err_{S_{train}}(h)$ on the training dataset $S_{train}$. Test the learned $h$ to measure its test error $err_{S_{test}}(h)$ on a separate testing dataset $S_{test}$.

Let's come back

Page 3:


Generalization Error Bounds

What kind of a guarantee on the true error of a classifier can I get if I know its training error?

Today's plan:
• Zero empirical error: if the rule I learned from $H$ achieves zero error on the samples ($err_S(h) = 0$), how large is $err_P(h)$?
• Non-zero empirical error: how good is the true error of a hypothesis from $H$ that performs well on samples?

Today's assumption: the hypothesis set $H$ is finite.

Zero Empirical Error

If the hypothesis I learned from $H$ achieves zero error on the samples ($err_S(h) = 0$), how large is $err_P(h)$?
• Assume $H$ is finite.
• Assume realizability: there is a consistent classifier, i.e., there is always one $h \in H$ with $err_P(h) = 0$ – one person is psychic.

Algorithm $\mathcal{L}$ takes a set $S$ of $m$ samples from $P$ and picks an $h_S$ that has 0 empirical error. What's the bound on $err_P(h_S)$?
1. Fix a hypothesis $h \in H$ before seeing $S$. What's the probability that $err_P(h) > \epsilon$, but $err_S(h) = 0$?
2. What's the probability that $err_P(h_S) > \epsilon$, but $err_S(h_S) = 0$?
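The slide leaves these two questions as an exercise; the standard argument (my filling-in, consistent with the theorem on the next slide) goes as follows. (1) For a fixed $h$ with $err_P(h) > \epsilon$, each i.i.d. sample is classified correctly with probability less than $1 - \epsilon$, so
$\Pr[err_S(h) = 0] < (1 - \epsilon)^m \le e^{-\epsilon m},$
where the last step uses the no-name lemma. (2) The learned $h_S$ is chosen after seeing $S$ and could be any hypothesis in $H$, so we apply the union bound over all of $H$:
$\Pr[err_P(h_S) > \epsilon \text{ and } err_S(h_S) = 0] \le \sum_{h \in H} \Pr[err_P(h) > \epsilon \text{ and } err_S(h) = 0] \le |H|\, e^{-\epsilon m}.$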

Sample Complexity – 0 Empirical Error

Theorem
For any instance space $X$, set of labels $Y = \{-1, 1\}$, and distribution $P$ on $X \times Y$, and for a set $S$ of $m$ i.i.d. samples from $P$, we have
$\Pr_{S \sim P^m}\big[\exists\, h \in H \text{ such that } err_S(h) = 0 \text{, but } err_P(h) > \epsilon\big] \le |H|\, e^{-\epsilon m}.$

Theorem: Sample Complexity (zero empirical error)
Let $m \ge \frac{1}{\epsilon}\big(\ln|H| + \ln\frac{1}{\delta}\big)$. For any instance space $X$, labels $Y = \{-1, 1\}$, and distribution $P$ on $X \times Y$, with probability $1 - \delta$ over i.i.d. draws of a set $S$ of $m$ samples, we have:
Any $h \in H$ that has 0 empirical error has true error $err_P(h) \le \epsilon$.
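The sample-complexity form follows from the first theorem by bounding the failure probability by $\delta$ and solving for $m$:
$|H|\, e^{-\epsilon m} \le \delta \iff \epsilon m \ge \ln|H| + \ln\tfrac{1}{\delta} \iff m \ge \tfrac{1}{\epsilon}\big(\ln|H| + \ln\tfrac{1}{\delta}\big).$
For example, with $\epsilon = 0.1$, $\delta = 0.05$, and $|H| = 1000$, it suffices to have $m \ge 10\,(\ln 1000 + \ln 20) \approx 100$ samples.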

Learning Algorithm: Given a sample set $S$ and hypothesis class $H$, if there is an $h_S \in H$ that is consistent with $S$, return $h_S$. (Equivalently: return an $h_S$ in the version space $VS(H, S)$.)
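A minimal sketch of this consistent-learner rule over a finite class (my own illustration; the threshold hypotheses and data are made up):

```python
from typing import Callable, Optional

Hypothesis = Callable[[float], int]

def consistent_learner(H: list, S: list) -> Optional[Hypothesis]:
    """Return any h in the version space VS(H, S), i.e., any h consistent with S."""
    for h in H:
        if all(h(x) == y for x, y in S):
            return h
    return None  # realizability violated: no consistent hypothesis in H

# Finite class of threshold classifiers h_t(x) = +1 iff x >= t, for t = 0.0, 0.1, ..., 1.0.
H = [(lambda x, t=t: 1 if x >= t else -1) for t in [i / 10 for i in range(11)]]
S = [(0.9, 1), (0.2, -1), (0.75, 1), (0.4, -1)]
h = consistent_learner(H, S)  # returns the t = 0.5 threshold (any t in (0.4, 0.75] works)
```

Under the realizability assumption, this learner always returns some $h_S$ with $err_S(h_S) = 0$, and the theorem above bounds its true error.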