Machine Learning & Data Mining - Yisong Yue


Machine Learning & Data Mining (CMS/CS/CNS/EE155)

Lecture 2: Perceptron & Stochastic Gradient Descent

Recap: Basic Recipe (supervised)

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {-1, +1}
• Model Class: linear models, f(x | w, b) = w^T x - b
• Loss Function: squared loss, L(a, b) = (a - b)^2
• Learning Objective (optimization problem):
  argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))

Recap: Bias-Variance Trade-off

[Figure: three pairs of plots illustrating the bias-variance trade-off, each pair labeled with its Bias and Variance.]

Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
Model Class(es): f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))
Cross Validation & Model Selection → Profit!

Today

• Two Basic Learning Algorithms
  - Perceptron Algorithm
  - (Stochastic) Gradient Descent - aka, actually solving the optimization problem

The Perceptron

• One of the earliest learning algorithms
  - 1957, by Frank Rosenblatt
• Still a great algorithm
  - Fast
  - Clean analysis
  - Precursor to neural networks

[Photo: Frank Rosenblatt with the Mark I Perceptron machine]

Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, -1}
Model: f(x | w) = sign(w^T x - b)
Go through the training set in arbitrary order (e.g., randomly).

• w_1 = 0, b_1 = 0
• For t = 1, ...
  - Receive example (x, y)
  - If f(x | w_t, b_t) = y:  [w_{t+1}, b_{t+1}] = [w_t, b_t]
  - Else:  w_{t+1} = w_t + y x,  b_{t+1} = b_t - y
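The update rule above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's reference implementation: the toy dataset, the `perceptron` function name, and the convention that sign(0) is treated as -1 are assumptions made here.

```python
# Minimal perceptron sketch: update on mistakes, do nothing otherwise.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(S, passes=10):
    """S is a list of (x, y) pairs with y in {+1, -1}."""
    w = [0.0] * len(S[0][0])
    b = 0.0
    for _ in range(passes):
        for x, y in S:
            # f(x | w, b) = sign(w^T x - b); sign(0) treated as -1 here
            pred = 1 if dot(w, x) - b > 0 else -1
            if pred != y:  # mistake: w <- w + y x, b <- b - y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b - y
    return w, b

# Hypothetical linearly separable dataset
S = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w, b = perceptron(S)
print(all((1 if dot(w, x) - b > 0 else -1) == y for x, y in S))  # True: all separated
```

Because the data are separable, the loop stops making updates once every example is classified correctly, matching the behavior shown in the slides.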

Aside: Hyperplane Distance

• A line is 1-D, a plane is 2-D; a hyperplane is many-D (includes lines and planes)
• Defined by (w, b)
• Distance from x to the hyperplane: |w^T x - b| / ||w||
• Signed distance: (w^T x - b) / ||w||
• Distance from the origin to the hyperplane: b / ||w||
• So the linear model f(x | w, b) = w^T x - b is the un-normalized signed distance!

Perceptron Learning (worked example)

[Figure sequence: the algorithm cycles through the training set; each misclassified example ("Misclassified!") triggers an update ("Update!"), while correctly classified examples ("Correct!") leave the model unchanged. The pass ends when all training examples are correctly classified. Starting again from a different ordering ("Start Again"), the same update-on-mistake process again terminates with all training examples correctly classified.]

Recap: Perceptron Learning Algorithm (Linear Classification Model)

Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, -1}
Model: f(x | w) = sign(w^T x - b)
Go through the training set in arbitrary order (e.g., randomly).

• w_1 = 0, b_1 = 0
• For t = 1, ...
  - Receive example (x, y)
  - If f(x | w_t, b_t) = y:  [w_{t+1}, b_{t+1}] = [w_t, b_t]
  - Else:  w_{t+1} = w_t + y x,  b_{t+1} = b_t - y

Comparing the Two Models

[Figure: the two models learned in the runs above, shown side by side.]

Convergence to mistake-free = linearly separable!

Linear Separability

• A classification problem is linearly separable if there exists a w with perfect classification accuracy.
• Separable with margin γ:
  γ = max_w min_{(x,y)} y (w^T x) / ||w||
• Linearly separable: γ > 0

Perceptron Mistake Bound

If the data are linearly separable with margin γ, the number of mistakes is bounded by:

  R^2 / γ^2,  where R = max_x ||x|| is the "radius" of the feature space.

Holds for any ordering of the training examples!

More details: http://www.cs.nyu.edu/~mohri/pub/pmb.pdf
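The bound can be spot-checked empirically. The sketch below is a hypothetical example (perceptron through the origin, no bias term): it uses a known separating direction `u`, whose margin lower-bounds γ, so R^2 divided by that margin squared is still a valid mistake bound.

```python
import math

# Hypothetical separable dataset, separated by the direction u = (1, 0).
S = [([2.0, 0.5], 1), ([1.0, 1.0], 1), ([-1.5, -0.5], -1), ([-1.0, -2.0], -1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 0.0]
gamma = min(y * dot(u, x) / math.hypot(*u) for x, y in S)  # margin of u
R = max(math.hypot(*x) for x, y in S)                      # "radius" of feature space

w = [0.0, 0.0]
mistakes = 0
for _ in range(20):  # enough passes to converge on this data
    for x, y in S:
        if y * dot(w, x) <= 0:  # mistake (counting the boundary as wrong)
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]

print(mistakes, "<=", R * R / gamma ** 2)
```

On this dataset the total number of updates stays below R^2 / γ^2 regardless of the ordering, as the bound promises.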

In the Real World...

• Most problems are NOT linearly separable!
• So the perceptron may never converge...
• What to do? Use a validation set!

Early Stopping via Validation

• Run perceptron learning on the training set
• Evaluate the current model on the validation set
• Terminate when validation accuracy stops improving

https://en.wikipedia.org/wiki/Early_stopping
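The three steps above can be sketched as a loop; this is one possible arrangement, with a hypothetical train/validation split and a `patience` parameter (not from the lecture) controlling how long to wait for improvement.

```python
# Early-stopping sketch for the perceptron, keeping the best model so far.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def accuracy(w, b, data):
    return sum(1 for x, y in data
               if (1 if dot(w, x) - b > 0 else -1) == y) / len(data)

def perceptron_early_stopping(train, val, max_passes=50, patience=3):
    w, b = [0.0] * len(train[0][0]), 0.0
    best, best_acc, stall = (list(w), b), -1.0, 0
    for _ in range(max_passes):
        for x, y in train:  # one pass of perceptron updates
            if (1 if dot(w, x) - b > 0 else -1) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b -= y
        acc = accuracy(w, b, val)  # evaluate on the validation set
        if acc > best_acc:
            best_acc, best, stall = acc, (list(w), b), 0
        else:
            stall += 1
            if stall >= patience:  # validation accuracy stopped improving
                break
    return best

train = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.0], -1), ([-2.0, -1.5], -1)]
val = [([1.5, 1.5], 1), ([-1.5, -1.0], -1)]
w, b = perceptron_early_stopping(train, val)
print(accuracy(w, b, val))
```

Returning the best model seen so far (rather than the last one) matters on non-separable data, where the current model can get worse between passes.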

Online Learning vs. Batch Learning

• Online learning:
  - Receive a stream of data (x, y)
  - Make incremental updates (typically)
  - Perceptron learning is an instance of online learning
• Batch learning:
  - Given all the data up front
  - Can use online learning algorithms for batch learning, e.g., stream the data to the learning algorithm

https://en.wikipedia.org/wiki/Online_machine_learning

Recap: Perceptron

• One of the first machine learning algorithms
• Benefits:
  - Simple and fast
  - Clean analysis
• Drawbacks:
  - Might not converge to a very good model
  - What is the objective function?

(Stochastic) Gradient Descent

Back to Optimizing Objective Functions

• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {-1, +1}
• Model Class: linear models, f(x | w, b) = w^T x - b
• Loss Function: squared loss, L(a, b) = (a - b)^2
• Learning Objective (optimization problem):
  argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))

Back to Optimizing Objective Functions

• Typically requires an optimization algorithm.
• Simplest: gradient descent.
• This lecture: stick with the squared loss. We'll talk about various loss functions next lecture.

  argmin_{w,b} L(w, b),  where L(w, b) ≡ Σ_{i=1}^N L(y_i, f(x_i | w, b))

Gradient Review for Squared Loss

With L(a, b) = (a - b)^2 and f(x | w, b) = w^T x - b:

  ∂_w L(w, b) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
              = Σ_{i=1}^N ∂_w L(y_i, f(x_i | w, b))                    (linearity of differentiation)
              = Σ_{i=1}^N -2 (y_i - f(x_i | w, b)) ∂_w f(x_i | w, b)   (chain rule)
              = Σ_{i=1}^N -2 (y_i - f(x_i | w, b)) x_i
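The derivation can be sanity-checked numerically: the analytic gradient should match a central finite difference. The 1-D dataset below is hypothetical, chosen only for this check.

```python
# Check dL/dw = sum_i -2 (y_i - (w*x_i - b)) * x_i against finite differences.

S = [(1.0, 1.0), (2.0, 1.0), (-1.0, -1.0)]  # hypothetical (x_i, y_i) pairs

def loss(w, b):
    return sum((y - (w * x - b)) ** 2 for x, y in S)

def grad_w(w, b):
    return sum(-2 * (y - (w * x - b)) * x for x, y in S)

w, b, eps = 0.3, 0.1, 1e-6
numeric = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
print(abs(grad_w(w, b) - numeric) < 1e-4)  # the two should agree closely
```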

Gradient Descent

• Initialize: w_1 = 0, b_1 = 0
• For t = 1, ...
  w_{t+1} = w_t - η_{t+1} ∂_w L(w_t, b_t)
  b_{t+1} = b_t - η_{t+1} ∂_b L(w_t, b_t)

  where η is the "step size".
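A minimal sketch of this loop for the squared loss, on a hypothetical 1-D dataset where the optimum is known (y = x + 1, so w = 1 and b = -1 under f(x | w, b) = w x - b). The fixed step size is an assumption for illustration.

```python
# Batch gradient descent on the squared loss, matching the update on the slide.

S = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # y = x + 1

def grads(w, b):
    gw = sum(-2 * (y - (w * x - b)) * x for x, y in S)  # dL/dw
    gb = sum(2 * (y - (w * x - b)) for x, y in S)       # dL/db
    return gw, gb

w, b, eta = 0.0, 0.0, 0.01
for t in range(5000):
    gw, gb = grads(w, b)
    w, b = w - eta * gw, b - eta * gb

print(round(w, 3), round(b, 3))  # should approach w = 1, b = -1
```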

How to Choose Step Size?

[Figure sequence: gradient descent on L(w) = (1 - w)^2, with ∂_w L(w) = -2 (1 - w).]

• η = 1: the iterates oscillate infinitely!
• η = 0.0001: takes a really long time!

[Figure: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. The absolute scale is not meaningful; focus on the relative magnitude differences.]

• Answer: as large as possible (without diverging)!
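The two failure modes in the figures can be reproduced directly on L(w) = (1 - w)^2; the moderate step size 0.4 below is an arbitrary choice that happens to converge.

```python
# Step-size behavior for L(w) = (1 - w)^2, gradient dL/dw = -2 (1 - w).

def run(eta, steps, w=0.0):
    for _ in range(steps):
        w = w - eta * (-2 * (1 - w))
    return w

print(run(1.0, 100))     # eta = 1: w bounces between 0 and 2 forever
print(run(0.0001, 100))  # eta = 0.0001: barely moves toward w = 1
print(run(0.4, 100))     # a moderate eta converges to w = 1
```

With η = 1 the update is w ← 2 - w, a pure reflection about the optimum, which is exactly the infinite oscillation on the slide.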

Being Scale Invariant

• Consider the following two gradient updates:
  w_{t+1} = w_t - η_{t+1} ∂_w L(w_t, b_t)
  w_{t+1} = w_t - η̂_{t+1} ∂_w L̂(w_t, b_t)
• Suppose L̂ = 1000 L. How are the two step sizes related?
  η̂_{t+1} = η_{t+1} / 1000
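The relationship can be verified on the same toy objective: scaling the loss by 1000 and the step size by 1/1000 produces the same iterates (up to floating-point rounding).

```python
# Scale invariance: eta_hat = eta / 1000 compensates for L_hat = 1000 * L.

def grad(w):       # gradient of L(w) = (1 - w)^2
    return -2 * (1 - w)

def grad_hat(w):   # gradient of L_hat = 1000 * L
    return 1000 * grad(w)

eta = 0.1
w1 = w2 = 0.0
for _ in range(50):
    w1 = w1 - eta * grad(w1)
    w2 = w2 - (eta / 1000) * grad_hat(w2)

print(abs(w1 - w2) < 1e-9)  # the two trajectories coincide
```

This is why quoting a step size without also fixing the scale of the loss (e.g., whether it is divided by N) is meaningless.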

Practical Rules of Thumb

• Divide the loss function by the number of examples:
  w_{t+1} = w_t - (η_{t+1} / N) ∂_w L(w_t, b_t)
• Start with a large step size.
  - If the loss plateaus, divide the step size by 2.
  - (Can also use advanced optimization methods.)
  - (The step size must decrease over time to guarantee convergence to the global optimum.)

Aside: Convexity

[Figure: a convex function, contrasted with a non-convex one. Image source: http://en.wikipedia.org/wiki/Convex_function]

• Easy to find global optima!
• Strictly convex if the second derivative is always > 0.

Aside: Convexity

  L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 - x_1)

The function is always above the locally linear extrapolation.

[Figure: a convex curve lying above its tangent line.]
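The inequality can be spot-checked for the convex 1-D function used earlier, L(w) = (1 - w)^2, at a few arbitrary points (a numeric check at sample points, not a proof).

```python
# Spot-check L(x2) >= L(x1) + L'(x1) * (x2 - x1) for L(w) = (1 - w)^2.

def L(w):
    return (1 - w) ** 2

def dL(w):
    return -2 * (1 - w)

points = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ok = all(L(x2) >= L(x1) + dL(x1) * (x2 - x1)
         for x1 in points for x2 in points)
print(ok)  # True: the curve lies above every tangent line
```

For this quadratic, the gap L(x2) minus the linear extrapolation equals (x2 - x1)^2, which is why the inequality holds everywhere.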

Aside: Convexity

• All local optima are global optima: gradient descent will find an optimum (assuming the step size is chosen safely).
• Strictly convex: unique global optimum.
• Almost all standard objectives are (strictly) convex:
  - Squared loss, SVMs, logistic regression, ridge, lasso
  - We will see non-convex objectives later (e.g., deep learning)

Convergence

• Assume L is convex. How many iterations to achieve L(w) - L(w*) ≤ ε?
• If |L(a) - L(b)| ≤ ρ ||a - b||  (L is "ρ-Lipschitz"):
  then O(1/ε^2) iterations.
• If ||∇L(a) - ∇L(b)|| ≤ ρ ||a - b||  (L is "ρ-smooth"):
  then O(1/ε) iterations.
• If L(a) ≥ L(b) + ∇L(b)^T (a - b) + (ρ/2) ||a - b||^2  (L is "ρ-strongly convex"):
  then O(log(1/ε)) iterations.

More details: Bubeck textbook, Chapter 3.

Convergence

• In general, it takes infinite time to reach the global optimum.
• But in general, we don't care, as long as we're close enough to the global optimum.

[Figure: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5. How do we know if we're near the optimum, and not still far away?]

When to Stop?

• Convergence analyses give worst-case upper bounds. What to do in practice?
• Stop when progress is sufficiently small (e.g., relative reduction less than 0.001).
• Stop after a pre-specified number of iterations (e.g., 100,000).
• Stop when validation error stops going down.

Limitation of Gradient Descent

• Requires a full pass over the training set per iteration:
  ∂_w L(w, b | S) = ∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
• Very expensive if the training set is huge.
• Do we need to do a full pass over the data?

Stochastic Gradient Descent

• Suppose the loss function decomposes additively:
  L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[ L_i(w, b) ]
  where each L_i corresponds to a single data point, e.g., L_i(w, b) ≡ (y_i - f(x_i | w, b))^2.
• The gradient is the expected gradient of the sub-functions:
  ∂_w L(w, b) = ∂_w E_i[ L_i(w, b) ] = E_i[ ∂_w L_i(w, b) ]

Stochastic Gradient Descent

• It suffices to take a random gradient update, so long as it matches the true gradient in expectation.
• Each iteration t: choose i at random, then
  w_{t+1} = w_t - η_{t+1} ∂_w L_i(w_t, b_t)
  b_{t+1} = b_t - η_{t+1} ∂_b L_i(w_t, b_t)
  (The expected value of the update direction is ∂_w L(w, b).)
• SGD is an online learning algorithm!
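The per-example update above can be sketched as follows; the dataset, seed, step size, and iteration count are all assumptions for illustration (the data are noiseless, so SGD with a fixed step size still settles at the optimum here).

```python
# SGD for the squared loss: one random data point per update.
import random

random.seed(0)
S = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # y = x + 1, so optimum is w = 1, b = -1

w, b, eta = 0.0, 0.0, 0.01
for t in range(20000):
    x, y = random.choice(S)        # choose i at random
    r = y - (w * x - b)            # residual y_i - f(x_i | w, b)
    w = w - eta * (-2 * r * x)     # w <- w - eta * dL_i/dw
    b = b - eta * (2 * r)          # b <- b - eta * dL_i/db
print(round(w, 2), round(b, 2))    # should approach w = 1, b = -1
```

Each step costs O(1) in the dataset size, which is the whole point relative to batch gradient descent.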

Mini-Batch SGD

• Each L_i is a small batch of training examples (e.g., 500-1000 examples).
  - Can leverage vector operations
  - Decreases the volatility of gradient updates
• Industry state of the art:
  - Everyone uses mini-batch SGD
  - Often parallelized (e.g., different cores work on different mini-batches)
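The same sketch extends to mini-batches by averaging gradients over a small random batch per update; the tiny batch size of 4 (rather than 500-1000) and the dataset are assumptions chosen so the example runs instantly.

```python
# Mini-batch SGD: average the gradient over a small random batch per step.
import random

random.seed(1)
S = [(float(x), 2.0 * float(x) - 1.0) for x in range(-5, 6)]  # y = 2x - 1

w, b, eta, batch = 0.0, 0.0, 0.02, 4
for t in range(5000):
    B = random.sample(S, batch)  # one mini-batch
    gw = sum(-2 * (y - (w * x - b)) * x for x, y in B) / batch
    gb = sum(2 * (y - (w * x - b)) for x, y in B) / batch
    w, b = w - eta * gw, b - eta * gb
print(round(w, 2), round(b, 2))  # should approach w = 2, b = 1
```

Averaging over the batch reduces the variance of each update compared to single-example SGD, at the cost of more work per step.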

Checking for Convergence

• How to check for convergence?
  - Evaluating the loss on the entire training set seems expensive...

[Figure: loss vs. iterations for step sizes 0.001, 0.01, 0.1, and 0.5.]

Checking for Convergence

• How to check for convergence?
  - Evaluating the loss on the entire training set seems expensive...
• Don't check after every iteration.
  - E.g., check every 1000 iterations
• Evaluate the loss on a subset of the training data.
  - E.g., the previous 5000 examples

Recap: Stochastic Gradient Descent

• Conceptually:
  - Decompose the loss function additively
  - Choose a component randomly
  - Take a gradient update
• Benefits:
  - Avoids iterating over the entire dataset for every update
  - The gradient update is consistent (in expectation)
• Industry standard.

Perceptron Revisited (What is the Objective Function?)

Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, -1}
Model: f(x | w) = sign(w^T x - b)
Go through the training set in arbitrary order (e.g., randomly).

• w_1 = 0, b_1 = 0
• For t = 1, ...
  - Receive example (x, y)
  - If f(x | w_t, b_t) = y:  [w_{t+1}, b_{t+1}] = [w_t, b_t]
  - Else:  w_{t+1} = w_t + y x,  b_{t+1} = b_t - y

Perceptron (Implicit) Objective

  L_i(w, b) = max{ 0, -y_i f(x_i | w, b) }

[Figure: the loss plotted against y f(x): zero when y f(x) ≥ 0, growing linearly as y f(x) becomes more negative.]
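The connection can be made concrete: on a misclassified example, an SGD step with step size 1 on this L_i has (sub)gradient -y x with respect to w and y with respect to b, which reproduces the perceptron update exactly. The single test point below is hypothetical, and the boundary case y f(x) = 0 is treated as correct in both versions for simplicity.

```python
# One SGD step (eta = 1) on L_i = max(0, -y (w^T x - b)) vs. one perceptron step.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_step(w, b, x, y, eta=1.0):
    if y * (dot(w, x) - b) < 0:  # misclassified: nonzero subgradient
        w = [wi - eta * (-y * xi) for wi, xi in zip(w, x)]  # dL_i/dw = -y x
        b = b - eta * y                                     # dL_i/db = y
    return w, b

def perceptron_step(w, b, x, y):
    if y * (dot(w, x) - b) < 0:  # the Else-branch of the perceptron
        w = [wi + y * xi for wi, xi in zip(w, x)]
        b = b - y
    return w, b

x, y = [2.0, -1.0], 1
print(sgd_step([0.0, 0.0], 1.0, x, y) == perceptron_step([0.0, 0.0], 1.0, x, y))
```

So the perceptron is (implicitly) SGD on this piecewise-linear loss, which answers the "what is the objective function?" question from the recap.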

Recap: Complete Pipeline

Training Data: S = {(x_i, y_i)}_{i=1}^N
Model Class(es): f(x | w, b) = w^T x - b
Loss Function: L(a, b) = (a - b)^2
Optimization: argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) - use SGD!
Cross Validation & Model Selection → Profit!

Next Week

• Different loss functions
  - Hinge loss (SVM)
  - Log loss (logistic regression)
• Non-linear model classes
  - Neural nets
• Regularization
• Next Thursday recitation: linear algebra & calculus