TRANSCRIPT
Reminder: Linear Classifiers
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
(see the sketch below)
[Diagram: features $f_1, f_2, f_3$ weighted by $w_1, w_2, w_3$, summed into an activation $\Sigma$, compared against 0]
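A minimal sketch of this decision rule (the feature values and weights below are made-up numbers):

```python
# Minimal sketch of a linear classifier: weighted sum of features, thresholded at 0.
def classify(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))  # w . f(x)
    return +1 if activation > 0 else -1

print(classify([0.5, -1.0, 2.0], [1.0, 3.0, 1.0]))  # activation = 0.5 - 3.0 + 2.0 = -0.5 -> -1
```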
How to get probabilistic decisions?
§ Activation: $z = w \cdot f(x)$
  § If $z$ is very positive → want probability going to 1
  § If $z$ is very negative → want probability going to 0
§ Sigmoid function:
$$\phi(z) = \frac{1}{1 + e^{-z}}$$
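A quick numerical check of this squashing behavior, as a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Very positive activation -> probability near 1; very negative -> near 0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx [4.5e-05, 0.5, 0.99995]
```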
Best w?
§ Maximum likelihood estimation:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
with:
$$P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$
$$P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$
= Logistic Regression
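A minimal sketch of this objective; since the labels are $\pm 1$, both cases collapse to $\sigma(y \, w \cdot f(x))$ (the toy data and weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # ll(w) = sum_i log P(y_i | x_i; w), with labels y_i in {+1, -1}.
    # P(+1|x;w) = sigmoid(w.f(x)) and P(-1|x;w) = sigmoid(-w.f(x)),
    # so both cases collapse to sigmoid(y * w.f(x)).
    return np.sum(np.log(sigmoid(y * (X @ w))))

# Made-up toy data: 3 examples, 2 features each.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 1.0]])
y = np.array([+1, -1, +1])
w = np.array([0.3, -0.2])
print(log_likelihood(w, X, y))
```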
Multiclass Logistic Regression
§ Multi-class linear classification
§ A weight vector for each class: $w_y$
§ Score (activation) of a class $y$: $w_y \cdot f(x)$
§ Prediction with highest score wins: $\arg\max_y w_y \cdot f(x)$
§ How to make the scores into probabilities?
$$z_1, z_2, z_3 \;\rightarrow\; \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}},\quad \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}},\quad \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
(original activations → softmax activations)
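A minimal sketch of this mapping, with the usual max-subtraction for numerical stability (which leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow and does not change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, -1.0])   # original activations (made-up numbers)
print(softmax(z))                # softmax activations
print(softmax(z).sum())          # 1.0
```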
Best w?
§ Maximum likelihood estimation:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
with:
$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$
= Multi-Class Logistic Regression
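Combining the softmax with the objective, a minimal sketch (one weight vector per class stacked into a matrix W; the toy data are made up):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def log_likelihood(W, X, y):
    # ll(W) = sum_i log P(y_i | x_i; W), where P is the softmax over class scores w_y . f(x_i).
    total = 0.0
    for features, label in zip(X, y):
        probs = softmax(W @ features)   # one score per class
        total += np.log(probs[label])
    return total

# Made-up toy problem: 3 classes, 2 features.
W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([0, 2])                    # class indices
print(log_likelihood(W, X, y))
```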
This Lecture
§ Optimization
§ i.e., how do we solve:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
Hill Climbing
§ Recall from CSPs lecture: simple, general idea
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit
§ What's particularly tricky when hill-climbing for multiclass logistic regression?
  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?
1-D Optimization
§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
  § Then step in best direction
§ Or, evaluate derivative:
$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$
§ Tells which direction to step into
[Plot: $g(w)$ versus $w$, with $w_0$ marked]
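Both options side by side, as a minimal sketch on a made-up 1-D objective:

```python
def g(w):
    # Made-up 1-D objective with a single maximum at w = 3.
    return -(w - 3.0) ** 2

w0, h = 1.0, 1e-5

# Option 1: evaluate g on both sides and step toward the larger value.
step_right = g(w0 + h) > g(w0 - h)

# Option 2: central-difference approximation of the derivative; its sign gives the direction.
derivative = (g(w0 + h) - g(w0 - h)) / (2 * h)

print(step_right, derivative)   # True, approx 4.0 -> step in the positive direction
```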
2-D Optimization
Source: offconvex.org
Gradient Ascent
§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative) the bigger the step for that coordinate
§ E.g., consider: $g(w_1, w_2)$
§ Updates:
$$w_1 \leftarrow w_1 + \alpha \, \frac{\partial g}{\partial w_1}(w_1, w_2)$$
$$w_2 \leftarrow w_2 + \alpha \, \frac{\partial g}{\partial w_2}(w_1, w_2)$$
§ Updates in vector notation:
$$w \leftarrow w + \alpha \, \nabla_w g(w)$$
with:
$$\nabla_w g(w) = \begin{bmatrix} \dfrac{\partial g}{\partial w_1}(w) \\[4pt] \dfrac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$
§ Idea:
  § Start somewhere
  § Repeat: Take a step in the gradient direction
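To see the coordinate-wise updates in action, a minimal sketch on a made-up concave $g(w_1, w_2)$ whose maximum sits at $(1, -2)$ (the learning rate $\alpha = 0.1$ is also made up):

```python
# Made-up objective: g(w1, w2) = -(w1 - 1)^2 - (w2 + 2)^2, maximized at (1, -2).
def grad_g(w1, w2):
    return (-2 * (w1 - 1), -2 * (w2 + 2))   # (dg/dw1, dg/dw2)

w1, w2, alpha = 0.0, 0.0, 0.1
for _ in range(100):
    d1, d2 = grad_g(w1, w2)
    w1 = w1 + alpha * d1    # steeper slope -> bigger step for that coordinate
    w2 = w2 + alpha * d2
print(w1, w2)               # approaches (1.0, -2.0)
```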
Gradient Ascent
Figure source: Mathworks
What is the Steepest Direction?
§ First-Order Taylor Expansion:
$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$
§ Steepest Ascent Direction:
$$\max_{\Delta:\; \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta:\; \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$
§ Recall: $\max_{\Delta:\; \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \dfrac{a}{\|a\|}$
§ Hence, solution:
$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|}, \qquad \nabla g = \begin{bmatrix} \dfrac{\partial g}{\partial w_1} \\[4pt] \dfrac{\partial g}{\partial w_2} \end{bmatrix}$$
Gradient direction = steepest direction!
Gradient in n dimensions
$$\nabla g = \begin{bmatrix} \dfrac{\partial g}{\partial w_1} \\[4pt] \dfrac{\partial g}{\partial w_2} \\[4pt] \vdots \\[4pt] \dfrac{\partial g}{\partial w_n} \end{bmatrix}$$
Optimization Procedure: Gradient Ascent
§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \, \nabla g(w)$$
§ $\alpha$: learning rate, a tweaking parameter that needs to be chosen carefully
§ How? Try multiple choices (see the sketch below)
  § Crude rule of thumb: each update changes $w$ by about 0.1 – 1%
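A minimal sketch of the whole procedure, including the "try multiple choices" advice for $\alpha$; the objective and the learning rates below are made up:

```python
import numpy as np

def gradient_ascent(grad, w_init, alpha, num_iters=100):
    # init w; for iter = 1, 2, ...: w <- w + alpha * grad(w)
    w = np.array(w_init, dtype=float)
    for _ in range(num_iters):
        w = w + alpha * grad(w)
    return w

# Made-up concave objective g(w) = -||w - [1, -2]||^2 with gradient -2 (w - [1, -2]).
grad = lambda w: -2 * (w - np.array([1.0, -2.0]))

# Trying multiple choices of the learning rate alpha:
for alpha in [0.001, 0.01, 0.1, 0.6]:
    print(alpha, gradient_ascent(grad, [0.0, 0.0], alpha))
# Very small alpha moves slowly; around alpha >= 1.0 (for this objective) it stops converging.
```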
Batch Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \underbrace{\sum_i \log P(y^{(i)} \mid x^{(i)}; w)}_{g(w)}$$
§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
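A minimal sketch of this batch loop for multi-class logistic regression. It relies on the standard softmax-regression gradient $\nabla_{w_c} \log P(y^{(i)} \mid x^{(i)}; w) = \big(\mathbf{1}[y^{(i)} = c] - P(c \mid x^{(i)}; w)\big) f(x^{(i)})$ (not derived on these slides), and the toy data are made up:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def grad_log_prob(W, x, y):
    # Gradient of log P(y | x; W) for softmax regression:
    # d/dw_c = (1[y == c] - P(c | x; W)) * f(x), stacked into a matrix shaped like W.
    probs = softmax(W @ x)
    indicator = np.zeros(len(probs))
    indicator[y] = 1.0
    return np.outer(indicator - probs, x)

# Made-up toy data: 4 examples, 2 features, 3 classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([0, 0, 1, 2])
W = np.zeros((3, 2))
alpha = 0.1

for it in range(500):                        # for iter = 1, 2, ...
    W = W + alpha * sum(grad_log_prob(W, x_i, y_i) for x_i, y_i in zip(X, y))

print(np.argmax(X @ W.T, axis=1))            # should recover [0, 0, 1, 2] on this toy set
```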
Stochastic Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ init $w$
§ for iter = 1, 2, …
  § pick random j
$$w \leftarrow w + \alpha \, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one
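Only the inner update changes relative to the batch version; a minimal sketch, reusing grad_log_prob, X, and y from the batch sketch above:

```python
import numpy as np

# Assumes grad_log_prob, X, and y from the batch gradient ascent sketch are in scope.
rng = np.random.default_rng(0)
W = np.zeros((3, 2))
alpha = 0.1

for it in range(2000):                               # for iter = 1, 2, ...
    j = rng.integers(len(X))                         # pick random j
    W = W + alpha * grad_log_prob(W, X[j], y[j])     # single-example gradient step
```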
Mini-Batch Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ init $w$
§ for iter = 1, 2, …
  § pick random subset of training examples J
$$w \leftarrow w + \alpha \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of a single one
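Again only the inner update changes; a minimal sketch, reusing grad_log_prob, X, and y from the batch sketch above:

```python
import numpy as np

# Assumes grad_log_prob, X, and y from the batch gradient ascent sketch are in scope.
rng = np.random.default_rng(0)
W = np.zeros((3, 2))
alpha, batch_size = 0.1, 2

for it in range(1000):                                        # for iter = 1, 2, ...
    J = rng.choice(len(X), size=batch_size, replace=False)    # pick random subset J
    W = W + alpha * sum(grad_log_prob(W, X[j], y[j]) for j in J)
```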
How about computing all the derivatives?
§ We'll talk about that once we've covered neural networks, which are a generalization of logistic regression
Neural Networks
Multi-class Logistic Regression
§ = special case of neural network
[Diagram: features $f_1(x), f_2(x), f_3(x), \ldots, f_K(x)$ feed into activations $z_1, z_2, z_3$, followed by a softmax]
$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
Deep Neural Network = Also learn the features!
[Same diagram as above: features $f_1(x), \ldots, f_K(x)$, activations $z_1, z_2, z_3$, and softmax output; the next slides replace the hand-designed features with learned layers]
Deep Neural Network = Also learn the features!
[Diagram: inputs $x_1, x_2, x_3, \ldots, x_L$ feed through hidden layers $z^{(1)}, z^{(2)}, \ldots, z^{(n-1)}$ (layer $k$ has units $z^{(k)}_1, \ldots, z^{(k)}_{K^{(k)}}$) into output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$, followed by a softmax]
$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j}\, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
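A minimal sketch of this forward computation with made-up layer sizes, choosing $g$ to be a ReLU (any nonlinearity would do) and finishing with the softmax:

```python
import numpy as np

def relu(z):
    # One possible choice of nonlinear activation function g.
    return np.maximum(0.0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def forward(x, weight_matrices):
    # z^(k)_i = g( sum_j W^(k-1,k)_{i,j} z^(k-1)_j ), layer by layer,
    # then a softmax over the final (output) activations.
    z = x
    for W in weight_matrices[:-1]:
        z = relu(W @ z)
    return softmax(weight_matrices[-1] @ z)

# Made-up layer sizes: 4 inputs -> 5 hidden -> 5 hidden -> 3 output classes.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(3, 5))]
x = np.array([1.0, -0.5, 2.0, 0.3])
print(forward(x, weights))   # a probability distribution over the 3 classes
```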
Deep Neural Network = Also learn the features!
[Same diagram, now with the final hidden layer $z^{(n)}$ (units $z^{(n)}_1, \ldots, z^{(n)}_{K^{(n)}}$) shown before the output activations and softmax]
$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j}\, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
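The referenced figure is not reproduced here; as a rough sketch, three activations such figures typically show (sigmoid, tanh, ReLU):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity for positive

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```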
Deep Neural Network: Also Learn the Features!
§ Training the deep neural network is just like logistic regression:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
just $w$ tends to be a much, much larger vector :)
→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
§ Practical considerations
  § Can be seen as learning the features
  § Large number of neurons
    § Danger for overfitting
    § (hence early stopping!)
Universal Function Approximation Theorem*
§ In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
Cybenko (1989) "Approximations by superpositions of sigmoidal functions"
Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
How about computing all the derivatives?
§ Derivatives tables:
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
How about computing all the derivatives?
§ But neural net f is never one of those?
§ No problem: CHAIN RULE:
If $f(x) = g(h(x))$
Then $f'(x) = g'(h(x))\, h'(x)$
→ Derivatives can be computed by following well-defined procedures
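For example, the derivative needed for the (binary) logistic-regression objective follows from exactly this rule, using $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$:
$$\frac{\partial}{\partial w}\log \sigma\big(w \cdot f(x)\big) = \frac{\sigma'\big(w \cdot f(x)\big)}{\sigma\big(w \cdot f(x)\big)}\, f(x) = \big(1 - \sigma(w \cdot f(x))\big)\, f(x)$$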
§ Automatic differentiation software
  § e.g. Theano, TensorFlow, PyTorch, Chainer
  § Only need to program the function g(x, y, w)
  § Can automatically compute all derivatives w.r.t. all entries in w
  § This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"
§ Autodiff / Backpropagation can often be done at computational cost comparable to the forward pass
§ Need to know this exists
§ How this is done?
Automatic Differentiation
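As a minimal sketch of what such software does, here is the same logistic-regression term differentiated with PyTorch's autograd (the input numbers are made up):

```python
import torch

# Made-up differentiable function g(x, y, w); we only program the forward computation.
x = torch.tensor([1.0, 2.0])
y = torch.tensor(1.0)
w = torch.tensor([0.5, -0.3], requires_grad=True)    # ask autograd to track w

g = y * torch.log(torch.sigmoid(w @ x))              # forward pass (values are cached)
g.backward()                                          # backward pass = backpropagation

print(w.grad)   # dg/dw for every entry of w, computed automatically
```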
Summary of Key Ideas
§ Optimize probability of label given input
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ Continuous optimization
  § Gradient ascent:
    § Compute steepest uphill direction = gradient (= just vector of partial derivatives)
    § Take step in the gradient direction
    § Repeat (until held-out data accuracy starts to drop = "early stopping")
§ Deep neural nets
  § Last layer = still logistic regression
  § Now also many more layers before this last layer
    § = computing the features
    § → the features are learned rather than hand-designed
§ Universal function approximation theorem
  § If neural net is large enough
  § Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
  § But remember: need to avoid overfitting / memorizing the training data → early stopping!
§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)