TRANSCRIPT
Reminder: Linear Classifiers
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
(see the sketch below)
[Diagram: features $f_1, f_2, f_3$ weighted by $w_1, w_2, w_3$, summed into an activation $\Sigma$, compared against 0]
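A minimal sketch of this decision rule (the feature values and weights below are made-up numbers):

```python
# Minimal sketch of a linear classifier: weighted sum of features, thresholded at 0.
def classify(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))  # w . f(x)
    return +1 if activation > 0 else -1

print(classify([0.5, -1.0, 2.0], [1.0, 3.0, 1.0]))  # activation = 0.5 - 3.0 + 2.0 = -0.5 -> -1
```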
How to get probabilistic decisions?
§ Activation: $z = w \cdot f(x)$
  § If $z$ is very positive → want probability going to 1
  § If $z$ is very negative → want probability going to 0
§ Sigmoid function:
$$\phi(z) = \frac{1}{1 + e^{-z}}$$
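A quick numerical check of this squashing behavior, as a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Very positive activation -> probability near 1; very negative -> near 0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx [4.5e-05, 0.5, 0.99995]
```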
Best w?
§ Maximum likelihood estimation:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
with:
$$P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$
$$P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$
= Logistic Regression
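A minimal sketch of this objective; since the labels are $\pm 1$, both cases collapse to $\sigma(y \, w \cdot f(x))$ (the toy data and weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # ll(w) = sum_i log P(y_i | x_i; w), with labels y_i in {+1, -1}.
    # P(+1|x;w) = sigmoid(w.f(x)) and P(-1|x;w) = sigmoid(-w.f(x)),
    # so both cases collapse to sigmoid(y * w.f(x)).
    return np.sum(np.log(sigmoid(y * (X @ w))))

# Made-up toy data: 3 examples, 2 features each.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 1.0]])
y = np.array([+1, -1, +1])
w = np.array([0.3, -0.2])
print(log_likelihood(w, X, y))
```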
Multiclass Logistic Regression
§ Multi-class linear classification
§ A weight vector for each class: $w_y$
§ Score (activation) of a class $y$: $w_y \cdot f(x)$
§ Prediction with highest score wins: $\arg\max_y w_y \cdot f(x)$
§ How to make the scores into probabilities?
$$z_1, z_2, z_3 \;\rightarrow\; \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}},\quad \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}},\quad \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
(original activations → softmax activations)
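A minimal sketch of this mapping, with the usual max-subtraction for numerical stability (which leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow and does not change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, -1.0])   # original activations (made-up numbers)
print(softmax(z))                # softmax activations
print(softmax(z).sum())          # 1.0
```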
Best w?
§ Maximum likelihood estimation:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
with:
$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$
= Multi-Class Logistic Regression
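Combining the softmax with the objective, a minimal sketch (one weight vector per class stacked into a matrix W; the toy data are made up):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def log_likelihood(W, X, y):
    # ll(W) = sum_i log P(y_i | x_i; W), where P is the softmax over class scores w_y . f(x_i).
    total = 0.0
    for features, label in zip(X, y):
        probs = softmax(W @ features)   # one score per class
        total += np.log(probs[label])
    return total

# Made-up toy problem: 3 classes, 2 features.
W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([0, 2])                    # class indices
print(log_likelihood(W, X, y))
```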
This Lecture
§ Optimization
§ i.e., how do we solve:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
Hill Climbing
§ Recall from CSPs lecture: simple, general idea
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit
§ What's particularly tricky when hill-climbing for multiclass logistic regression?
  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?
1-D Optimization
§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
  § Then step in best direction
§ Or, evaluate derivative:
$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$
§ Tells which direction to step into
[Plot: $g(w)$ versus $w$, with $w_0$ marked]
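Both options side by side, as a minimal sketch on a made-up 1-D objective:

```python
def g(w):
    # Made-up 1-D objective with a single maximum at w = 3.
    return -(w - 3.0) ** 2

w0, h = 1.0, 1e-5

# Option 1: evaluate g on both sides and step toward the larger value.
step_right = g(w0 + h) > g(w0 - h)

# Option 2: central-difference approximation of the derivative; its sign gives the direction.
derivative = (g(w0 + h) - g(w0 - h)) / (2 * h)

print(step_right, derivative)   # True, approx 4.0 -> step in the positive direction
```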
2-D Optimization
Source: offconvex.org
Gradient Ascent
§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative) the bigger the step for that coordinate
§ E.g., consider: $g(w_1, w_2)$
§ Updates:
$$w_1 \leftarrow w_1 + \alpha \, \frac{\partial g}{\partial w_1}(w_1, w_2)$$
$$w_2 \leftarrow w_2 + \alpha \, \frac{\partial g}{\partial w_2}(w_1, w_2)$$
§ Updates in vector notation:
$$w \leftarrow w + \alpha \, \nabla_w g(w)$$
with:
$$\nabla_w g(w) = \begin{bmatrix} \dfrac{\partial g}{\partial w_1}(w) \\[4pt] \dfrac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$
§ Idea:
  § Start somewhere
  § Repeat: Take a step in the gradient direction
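To see the coordinate-wise updates in action, a minimal sketch on a made-up concave $g(w_1, w_2)$ whose maximum sits at $(1, -2)$ (the learning rate $\alpha = 0.1$ is also made up):

```python
# Made-up objective: g(w1, w2) = -(w1 - 1)^2 - (w2 + 2)^2, maximized at (1, -2).
def grad_g(w1, w2):
    return (-2 * (w1 - 1), -2 * (w2 + 2))   # (dg/dw1, dg/dw2)

w1, w2, alpha = 0.0, 0.0, 0.1
for _ in range(100):
    d1, d2 = grad_g(w1, w2)
    w1 = w1 + alpha * d1    # steeper slope -> bigger step for that coordinate
    w2 = w2 + alpha * d2
print(w1, w2)               # approaches (1.0, -2.0)
```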
Gradient Ascent
Figure source: Mathworks
What is the Steepest Direction?
§ First-Order Taylor Expansion:
$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$
§ Steepest Ascent Direction:
$$\max_{\Delta:\; \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta:\; \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$
§ Recall: $\max_{\Delta:\; \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \dfrac{a}{\|a\|}$
§ Hence, solution:
$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|}, \qquad \nabla g = \begin{bmatrix} \dfrac{\partial g}{\partial w_1} \\[4pt] \dfrac{\partial g}{\partial w_2} \end{bmatrix}$$
Gradient direction = steepest direction!
Gradient in n dimensions
$$\nabla g = \begin{bmatrix} \dfrac{\partial g}{\partial w_1} \\[4pt] \dfrac{\partial g}{\partial w_2} \\[4pt] \vdots \\[4pt] \dfrac{\partial g}{\partial w_n} \end{bmatrix}$$
Optimization Procedure: Gradient Ascent
§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \, \nabla g(w)$$
§ $\alpha$: learning rate, a tweaking parameter that needs to be chosen carefully
§ How? Try multiple choices (see the sketch below)
  § Crude rule of thumb: each update changes $w$ by about 0.1 – 1%
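A minimal sketch of the whole procedure, including the "try multiple choices" advice for $\alpha$; the objective and the learning rates below are made up:

```python
import numpy as np

def gradient_ascent(grad, w_init, alpha, num_iters=100):
    # init w; for iter = 1, 2, ...: w <- w + alpha * grad(w)
    w = np.array(w_init, dtype=float)
    for _ in range(num_iters):
        w = w + alpha * grad(w)
    return w

# Made-up concave objective g(w) = -||w - [1, -2]||^2 with gradient -2 (w - [1, -2]).
grad = lambda w: -2 * (w - np.array([1.0, -2.0]))

# Trying multiple choices of the learning rate alpha:
for alpha in [0.001, 0.01, 0.1, 0.6]:
    print(alpha, gradient_ascent(grad, [0.0, 0.0], alpha))
# Very small alpha moves slowly; around alpha >= 1.0 (for this objective) it stops converging.
```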
Batch Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \underbrace{\sum_i \log P(y^{(i)} \mid x^{(i)}; w)}_{g(w)}$$
§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
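A minimal sketch of this batch loop for multi-class logistic regression. It relies on the standard softmax-regression gradient $\nabla_{w_c} \log P(y^{(i)} \mid x^{(i)}; w) = \big(\mathbf{1}[y^{(i)} = c] - P(c \mid x^{(i)}; w)\big) f(x^{(i)})$ (not derived on these slides), and the toy data are made up:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def grad_log_prob(W, x, y):
    # Gradient of log P(y | x; W) for softmax regression:
    # d/dw_c = (1[y == c] - P(c | x; W)) * f(x), stacked into a matrix shaped like W.
    probs = softmax(W @ x)
    indicator = np.zeros(len(probs))
    indicator[y] = 1.0
    return np.outer(indicator - probs, x)

# Made-up toy data: 4 examples, 2 features, 3 classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([0, 0, 1, 2])
W = np.zeros((3, 2))
alpha = 0.1

for it in range(500):                        # for iter = 1, 2, ...
    W = W + alpha * sum(grad_log_prob(W, x_i, y_i) for x_i, y_i in zip(X, y))

print(np.argmax(X @ W.T, axis=1))            # should recover [0, 0, 1, 2] on this toy set
```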
Stochastic Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ init $w$
§ for iter = 1, 2, …
  § pick random j
$$w \leftarrow w + \alpha \, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one
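Only the inner update changes relative to the batch version; a minimal sketch, reusing grad_log_prob, X, and y from the batch sketch above:

```python
import numpy as np

# Assumes grad_log_prob, X, and y from the batch gradient ascent sketch are in scope.
rng = np.random.default_rng(0)
W = np.zeros((3, 2))
alpha = 0.1

for it in range(2000):                               # for iter = 1, 2, ...
    j = rng.integers(len(X))                         # pick random j
    W = W + alpha * grad_log_prob(W, X[j], y[j])     # single-example gradient step
```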
Mini-Batch Gradient Ascent on the Log Likelihood Objective
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ init $w$
§ for iter = 1, 2, …
  § pick random subset of training examples J
$$w \leftarrow w + \alpha \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of a single one
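Again only the inner update changes; a minimal sketch, reusing grad_log_prob, X, and y from the batch sketch above:

```python
import numpy as np

# Assumes grad_log_prob, X, and y from the batch gradient ascent sketch are in scope.
rng = np.random.default_rng(0)
W = np.zeros((3, 2))
alpha, batch_size = 0.1, 2

for it in range(1000):                                        # for iter = 1, 2, ...
    J = rng.choice(len(X), size=batch_size, replace=False)    # pick random subset J
    W = W + alpha * sum(grad_log_prob(W, X[j], y[j]) for j in J)
```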
How about computing all the derivatives?
§ We'll talk about that once we've covered neural networks, which are a generalization of logistic regression
Neural Networks
Multi-class Logistic Regression
§ = special case of neural network
[Diagram: features $f_1(x), f_2(x), f_3(x), \ldots, f_K(x)$ feed into activations $z_1, z_2, z_3$, followed by a softmax]
$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
Deep Neural Network = Also learn the features!
[Same diagram as above: features $f_1(x), \ldots, f_K(x)$, activations $z_1, z_2, z_3$, and softmax output; the next slides replace the hand-designed features with learned layers]
Deep Neural Network = Also learn the features!
[Diagram: inputs $x_1, x_2, x_3, \ldots, x_L$ feed through hidden layers $z^{(1)}, z^{(2)}, \ldots, z^{(n-1)}$ (layer $k$ has units $z^{(k)}_1, \ldots, z^{(k)}_{K^{(k)}}$) into output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$, followed by a softmax]
$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j}\, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
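A minimal sketch of this forward computation with made-up layer sizes, choosing $g$ to be a ReLU (any nonlinearity would do) and finishing with the softmax:

```python
import numpy as np

def relu(z):
    # One possible choice of nonlinear activation function g.
    return np.maximum(0.0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def forward(x, weight_matrices):
    # z^(k)_i = g( sum_j W^(k-1,k)_{i,j} z^(k-1)_j ), layer by layer,
    # then a softmax over the final (output) activations.
    z = x
    for W in weight_matrices[:-1]:
        z = relu(W @ z)
    return softmax(weight_matrices[-1] @ z)

# Made-up layer sizes: 4 inputs -> 5 hidden -> 5 hidden -> 3 output classes.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(3, 5))]
x = np.array([1.0, -0.5, 2.0, 0.3])
print(forward(x, weights))   # a probability distribution over the 3 classes
```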
Deep Neural Network = Also learn the features!
[Same diagram, now with the final hidden layer $z^{(n)}$ (units $z^{(n)}_1, \ldots, z^{(n)}_{K^{(n)}}$) shown before the output activations and softmax]
$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j}\, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
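The referenced figure is not reproduced here; as a rough sketch, three activations such figures typically show (sigmoid, tanh, ReLU):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity for positive

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```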
Deep Neural Network: Also Learn the Features!
§ Training the deep neural network is just like logistic regression:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
just $w$ tends to be a much, much larger vector :)
→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
§ Practical considerations
  § Can be seen as learning the features
  § Large number of neurons
    § Danger for overfitting
    § (hence early stopping!)
Universal Function Approximation Theorem*
§ In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
Cybenko (1989) "Approximations by superpositions of sigmoidal functions"
Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
How about computing all the derivatives?
§ Derivatives tables:
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
How about computing all the derivatives?
§ But neural net f is never one of those?
§ No problem: CHAIN RULE:
If $f(x) = g(h(x))$
Then $f'(x) = g'(h(x))\, h'(x)$
→ Derivatives can be computed by following well-defined procedures
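For example, the derivative needed for the (binary) logistic-regression objective follows from exactly this rule, using $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$:
$$\frac{\partial}{\partial w}\log \sigma\big(w \cdot f(x)\big) = \frac{\sigma'\big(w \cdot f(x)\big)}{\sigma\big(w \cdot f(x)\big)}\, f(x) = \big(1 - \sigma(w \cdot f(x))\big)\, f(x)$$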
§ Automatic differentiation software
  § e.g. Theano, TensorFlow, PyTorch, Chainer
  § Only need to program the function g(x, y, w)
  § Can automatically compute all derivatives w.r.t. all entries in w
  § This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"
§ Autodiff / Backpropagation can often be done at computational cost comparable to the forward pass
§ Need to know this exists
§ How this is done?
Automatic Differentiation
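As a minimal sketch of what such software does, here is the same logistic-regression term differentiated with PyTorch's autograd (the input numbers are made up):

```python
import torch

# Made-up differentiable function g(x, y, w); we only program the forward computation.
x = torch.tensor([1.0, 2.0])
y = torch.tensor(1.0)
w = torch.tensor([0.5, -0.3], requires_grad=True)    # ask autograd to track w

g = y * torch.log(torch.sigmoid(w @ x))              # forward pass (values are cached)
g.backward()                                          # backward pass = backpropagation

print(w.grad)   # dg/dw for every entry of w, computed automatically
```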
Summary of Key Ideas
§ Optimize probability of label given input
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$
§ Continuous optimization
  § Gradient ascent:
    § Compute steepest uphill direction = gradient (= just vector of partial derivatives)
    § Take step in the gradient direction
    § Repeat (until held-out data accuracy starts to drop = "early stopping")
§ Deep neural nets
  § Last layer = still logistic regression
  § Now also many more layers before this last layer
    § = computing the features
    § → the features are learned rather than hand-designed
§ Universal function approximation theorem
  § If neural net is large enough
  § Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
  § But remember: need to avoid overfitting / memorizing the training data → early stopping!
§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)