Optimization and Neural Nets - Berkeley AI Materials
TRANSCRIPT
Source: inst.cs.berkeley.edu/~cs188/su19/assets/slides/
CS188: Artificial Intelligence
Optimization and Neural Nets
Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, and Sergey Levine. All CS188 materials are at http://ai.berkeley.edu.]
Logistic Regression: How to Learn?
§ Maximum likelihood estimation
§ Maximum conditional likelihood estimation
Best w?
§ Maximum likelihood estimation:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

with:

$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$

= Multi-Class Logistic Regression
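As a concrete sketch of this objective, the conditional log likelihood can be computed with NumPy; the function name, toy weights, and data below are illustrative, not from the slides:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum of log P(y^(i) | x^(i); w) under the softmax model.

    w: (num_classes, num_features) weight matrix, one weight vector per class
    X: (num_examples, num_features) feature vectors f(x^(i))
    y: (num_examples,) integer class labels y^(i)
    """
    scores = X @ w.T                      # z_y = w_y . f(x^(i)) for every class
    # log of the normalizer, with max subtraction for numerical stability
    m = scores.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float((scores[np.arange(len(y)), y] - log_norm).sum())

# With zero weights every class is equally likely, so each example
# contributes log(1/num_classes) to the log likelihood.
w0 = np.zeros((2, 2))
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
ll0 = log_likelihood(w0, X, y)
```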
Hill Climbing
§ Recall from CSPs lecture: simple, general idea
§ Start wherever
§ Repeat: move to the best neighboring state
§ If no neighbors better than current, quit

§ What's particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
1-D Optimization
§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
§ Then step in best direction
§ Or, evaluate derivative:

$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$

§ Tells which direction to step into
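The symmetric difference quotient above translates directly into code; the helper name is an illustrative choice:

```python
def central_difference(g, w0, h=1e-5):
    """Numerical estimate of dg/dw at w0 via the symmetric quotient (g(w0+h) - g(w0-h)) / 2h."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# For g(w) = w^2 the true derivative at w0 = 3 is 6
slope = central_difference(lambda w: w * w, 3.0)
```

For a quadratic the symmetric quotient is exact up to floating-point rounding, which is why it is usually preferred over the one-sided quotient.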
2-D Optimization
Source: offconvex.org
Gradient Ascent
§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative) the bigger the step for that coordinate

§ E.g., consider: $g(w_1, w_2)$

§ Updates:

$$w_1 \leftarrow w_1 + \alpha \cdot \frac{\partial g}{\partial w_1}(w_1, w_2)$$
$$w_2 \leftarrow w_2 + \alpha \cdot \frac{\partial g}{\partial w_2}(w_1, w_2)$$

§ Updates in vector notation:

$$w \leftarrow w + \alpha \cdot \nabla_w g(w)$$

with:

$$\nabla_w g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_1}(w) \\[2pt] \frac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$
Gradient Ascent
§ Idea:
§ Start somewhere
§ Repeat: Take a step in the gradient direction

Figure source: Mathworks
What is the Steepest Direction?
§ First-Order Taylor Expansion:

$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Steepest Direction:

$$\max_{\Delta :\, \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta :\, \Delta_1^2 + \Delta_2^2 \le \varepsilon} \; g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Recall: $\max_{\Delta :\, \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \frac{a}{\|a\|}$

§ Hence, solution:

$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|} \qquad \text{with} \qquad \nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[2pt] \frac{\partial g}{\partial w_2} \end{bmatrix}$$

Gradient direction = steepest direction!
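This claim is easy to check numerically: sample many directions of the same small length and see which one increases $g$ the most. The quadratic test function below is an illustrative assumption:

```python
import numpy as np

def g(w):
    # a smooth toy objective with a single maximum at (1, -0.5)
    return -(w[0] - 1.0) ** 2 - 2.0 * (w[1] + 0.5) ** 2

def grad_g(w):
    return np.array([-2.0 * (w[0] - 1.0), -4.0 * (w[1] + 0.5)])

w = np.array([0.0, 0.0])
eps = 1e-3

# Try 360 directions of length eps, one per degree, and record the increase each produces
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
increases = [g(w + eps * np.array([np.cos(a), np.sin(a)])) - g(w) for a in angles]
best_angle = angles[int(np.argmax(increases))]
best_dir = np.array([np.cos(best_angle), np.sin(best_angle)])

# The best sampled direction should (nearly) match the normalized gradient
grad_dir = grad_g(w) / np.linalg.norm(grad_g(w))
```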
Gradient in n dimensions

$$\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[2pt] \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$$
Optimization Procedure: Gradient Ascent
§ init $w$
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \cdot \nabla g(w)$$

§ $\alpha$: learning rate --- tweaking parameter that needs to be chosen carefully
§ How? Try multiple choices
§ Crude rule of thumb: update changes $w$ about 0.1 – 1%
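The procedure above, as a minimal sketch (function names and the toy objective are illustrative assumptions):

```python
def gradient_ascent(grad, w, alpha=0.1, iters=100):
    """Generic gradient ascent: repeat w <- w + alpha * grad(w)."""
    for _ in range(iters):
        w = w + alpha * grad(w)
    return w

# Maximize g(w) = -(w - 3)^2, whose gradient is -2(w - 3); the maximum is at w = 3.
# With alpha = 0.1 the error (w - 3) shrinks by a factor 0.8 per step.
w_star = gradient_ascent(lambda w: -2.0 * (w - 3.0), w=0.0, alpha=0.1, iters=100)
```

Too large an $\alpha$ here (e.g. $\alpha > 1$) makes the iterates diverge, which is why the learning rate needs care.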
Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

(the objective being maximized is $g(w) = ll(w)$)

§ init $w$
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \cdot \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
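A sketch of this batch update for softmax regression; the gradient of $\log P(y^{(i)} \mid x^{(i)}; w)$ with respect to the class-$c$ weight vector is $(\mathbf{1}\{c = y^{(i)}\} - P(c \mid x^{(i)}))\, f(x^{(i)})$, a standard fact not derived on these slides. Function names and toy data are illustrative:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def batch_gradient_ascent(X, y, num_classes, alpha=0.5, iters=200):
    """Batch gradient ascent on sum_i log P(y^(i) | x^(i); w), all examples per step."""
    n, d = X.shape
    w = np.zeros((num_classes, d))
    for _ in range(iters):
        probs = softmax(X @ w.T)            # (n, num_classes)
        indicator = np.eye(num_classes)[y]  # one-hot labels
        # summed per-example gradients: (1{c=y_i} - P(c|x_i)) f(x_i)
        w = w + alpha * (indicator - probs).T @ X
    return w

# Toy separable data: class 0 near x1 = -1, class 1 near x1 = +1 (second feature = bias)
X = np.array([[-1.0, 1.0], [-0.8, 1.0], [1.0, 1.0], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
w = batch_gradient_ascent(X, y, num_classes=2)
preds = softmax(X @ w.T).argmax(axis=1)
```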
Stochastic Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one

§ init $w$
§ for iter = 1, 2, …
§ pick random $j$

$$w \leftarrow w + \alpha \cdot \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
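The single-example update can be sketched as follows; helper name, seed, and toy data are illustrative assumptions:

```python
import numpy as np

def sgd_step(w, x, label, num_classes, alpha):
    """One stochastic update: gradient of log P(y^(j) | x^(j); w) for a single example."""
    scores = w @ x
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()
    indicator = np.eye(num_classes)[label]
    return w + alpha * np.outer(indicator - probs, x)

rng = np.random.default_rng(0)
X = np.array([[-1.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1])
w = np.zeros((2, 2))
for _ in range(500):
    j = rng.integers(len(X))                          # pick random j
    w = sgd_step(w, X[j], y[j], num_classes=2, alpha=0.1)
preds = (X @ w.T).argmax(axis=1)
```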
Mini-Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

Observation: gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of a single one

§ init $w$
§ for iter = 1, 2, …
§ pick random subset of training examples $J$

$$w \leftarrow w + \alpha \cdot \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$
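The mini-batch variant sums per-example gradients over a random subset $J$; the helper name, batch size, and toy data below are illustrative assumptions:

```python
import numpy as np

def minibatch_update(w, X, y, batch_idx, num_classes, alpha):
    """One mini-batch update: summed gradients over the subset J of training examples."""
    Xb, yb = X[batch_idx], y[batch_idx]
    scores = Xb @ w.T
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    indicator = np.eye(num_classes)[yb]
    return w + alpha * (indicator - probs).T @ Xb

rng = np.random.default_rng(1)
X = np.array([[-1.0, 1.0], [-0.9, 1.0], [1.0, 1.0], [0.8, 1.0]])
y = np.array([0, 0, 1, 1])
w = np.zeros((2, 2))
for _ in range(300):
    J = rng.choice(len(X), size=2, replace=False)     # pick random subset J
    w = minibatch_update(w, X, y, J, num_classes=2, alpha=0.1)
preds = (X @ w.T).argmax(axis=1)
```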
Gradient for Logistic Regression
§ Recall perceptron:
§ Classify with current weights
§ If correct (i.e., y = y*), no change!
§ If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
Neural Networks
Multi-class Logistic Regression
§ = special case of neural network

[Diagram: features $f_1(x), f_2(x), f_3(x), \dots, f_K(x)$ feed into $z_1, z_2, z_3$, followed by a softmax]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \quad \dots$$
Deep Neural Network = Also learn the features!

[Diagram: same network as on the previous slide --- features $f_1(x), \dots, f_K(x)$ into $z_1, z_2, z_3$ and a softmax producing the same probabilities $P(y_i \mid x; w)$]
Deep Neural Network = Also learn the features!

[Diagram: inputs $x_1, x_2, x_3, \dots, x_L$ feed through hidden layers $z^{(1)}, z^{(2)}, \dots, z^{(n-1)}$ (layer $k$ has $K^{(k)}$ units) to outputs $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$, followed by the same softmax producing $P(y_i \mid x; w)$]

$$z^{(k)}_i = g\!\left(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\right)$$

$g$ = nonlinear activation function
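The layer rule above can be sketched as a forward pass; ReLU as the activation $g$, the function names, and the random toy weights are illustrative assumptions:

```python
import numpy as np

def relu(z):
    # one common choice for the nonlinear activation g
    return np.maximum(z, 0.0)

def forward(x, weights):
    """Forward pass: z^(k) = g(W^(k-1,k) z^(k-1)) per layer, softmax on the output scores."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)               # hidden layers apply the nonlinearity g
    scores = weights[-1] @ z          # output layer: linear scores z^(OUT)
    e = np.exp(scores - scores.max())
    return e / e.sum()                # softmax gives P(y_i | x; w)

# A 3-input, two-hidden-layer, 3-class toy network with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(3, 4))]
probs = forward(np.array([1.0, -0.5, 2.0]), weights)
```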
Deep Neural Network = Also learn the features!

[Diagram: as on the previous slide, now with the last hidden layer $z^{(n)}_1, \dots, z^{(n)}_{K^{(n)}}$ shown explicitly before the output layer; same layer rule $z^{(k)}_i = g\left(\sum_j W^{(k-1,k)}_{i,j} z^{(k-1)}_j\right)$ with $g$ = nonlinear activation function]
Common Activation Functions
[source: MIT 6.S191 introtodeeplearning.com]
Deep Neural Network: Also Learn the Features!
§ Training the deep neural network is just like logistic regression:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

just $w$ tends to be a much, much larger vector :)

→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger for overfitting
§ (hence early stopping!)
Neural Net Demo!
https://playground.tensorflow.org/
How about computing all the derivatives?
§ Derivatives tables:
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
How about computing all the derivatives?
§ But neural net f is never one of those?
§ No problem: CHAIN RULE:

If $f(x) = g(h(x))$
Then $f'(x) = g'(h(x)) \, h'(x)$

→ Derivatives can be computed by following well-defined procedures
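A small check of the chain rule on a concrete composition (the choice $g(u) = \sin u$, $h(x) = x^2$ is illustrative):

```python
import math

def f(x):
    # f(x) = g(h(x)) with g(u) = sin(u) and h(x) = x^2
    return math.sin(x ** 2)

def f_prime(x):
    # chain rule: f'(x) = g'(h(x)) * h'(x) = cos(x^2) * 2x
    return math.cos(x ** 2) * 2 * x

# Compare against a symmetric difference quotient at x0 = 0.7
x0 = 0.7
numeric = (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6
```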
Automatic Differentiation
§ Automatic differentiation software
§ e.g. Theano, TensorFlow, PyTorch, Chainer
§ Only need to program the function g(x, y, w)
§ Can automatically compute all derivatives w.r.t. all entries in w
§ This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"

§ Autodiff / backpropagation can often be done at computational cost comparable to the forward pass

§ Need to know this exists
§ How this is done? -- outside of scope of CS188
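To make the "cache on the forward pass, accumulate on the backward pass" idea concrete, here is a deliberately minimal reverse-mode sketch; it is an illustration of the principle, not how PyTorch or TensorFlow are implemented:

```python
class Var:
    """Minimal reverse-mode autodiff node: caches local derivatives forward, sums gradients backward."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (input Var, local derivative) pairs cached during the forward pass
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b.value, d(a*b)/db = a.value
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # chain rule: each path's contribution is accumulated into .grad
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

# f(w1, w2) = w1*w2 + w2, so df/dw1 = w2 and df/dw2 = w1 + 1
w1, w2 = Var(3.0), Var(5.0)
out = w1 * w2 + w2
out.backward()
```

Real frameworks topologically sort the graph instead of recursing per path, which is what keeps the backward pass comparable in cost to the forward pass.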
Summary of Key Ideas
§ Optimize probability of label given input

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ Continuous optimization
§ Gradient ascent:
§ Compute steepest uphill direction = gradient (= just vector of partial derivatives)
§ Take step in the gradient direction
§ Repeat (until held-out data accuracy starts to drop = "early stopping")

§ Deep neural nets
§ Last layer = still logistic regression
§ Now also many more layers before this last layer
§ = computing the features
§ → the features are learned rather than hand-designed

§ Universal function approximation theorem
§ If neural net is large enough
§ Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
§ But remember: need to avoid overfitting / memorizing the training data → early stopping!

§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)