NEURAL NETWORKS · 2019. 4. 15.
TRANSCRIPT
NEURAL NETWORKS
GML ch. 10
Why neural networks?
[Diagram: a feed-forward network. INPUT activations $x_0$ on the left, intermediate layers of neurons ($\circ$) in the middle, OUTPUT $y$ on the right, with weights $w$ on the edges and a BIAS unit feeding each layer.]
The circles are neurons: they fire if a combination of their inputs is large enough, in the first formulations of neural networks.
$x_{1,0} = \tanh(w_{1,0,0}\,x_{0,0} + w_{1,0,1}\,x_{0,1} + w_{1,0,2}\,x_{0,2})$
$x_{1,1} = \tanh(w_{1,1,0}\,x_{0,0} + w_{1,1,1}\,x_{0,1} + w_{1,1,2}\,x_{0,2})$
$\;\;\vdots$
$x_{1,n} = \tanh(w_{1,n,0}\,x_{0,0} + w_{1,n,1}\,x_{0,1} + w_{1,n,2}\,x_{0,2})$
$x_\ell$ is the vector of activations at layer $\ell$; $W_1$ is the weight matrix at layer 1; tanh (or relu) is the link function, or non-linearity.
Neural networks originally had a biological inspiration; it turns out that which link fn you use is less important than having a link fn at all. Why?
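As a minimal sketch (assuming NumPy; the dimensions and random weights below are made up for illustration), one layer is just a matrix multiply followed by the link function:

```python
import numpy as np

def layer(W, x):
    """One layer: weight matrix times activations, then the tanh link fn."""
    return np.tanh(W @ x)

x0 = np.array([0.5, -1.0, 2.0])  # activations at layer 0 (made up)
W1 = np.random.randn(4, 3)       # weight matrix at layer 1: 4 neurons, 3 inputs
x1 = layer(W1, x0)               # activations at layer 1
```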
Thm: 2-layer networks are universal. We'll prove a much weaker thm: an $n$-layer network can emit XOR of $2^n$ inputs.
BASE CASE: $a_1 \oplus a_2 = (a_1 \wedge \neg a_2) \vee (\neg a_1 \wedge a_2)$, so for a large constant $D$ (with firing $\approx +1$ and not firing $\approx -1$):
$b_1 = \tanh\!\big(D(a_1 - a_2 - \tfrac12)\big) \approx a_1 \wedge \neg a_2$
$b_2 = \tanh\!\big(D(a_2 - a_1 - \tfrac12)\big) \approx \neg a_1 \wedge a_2$
$o = \tanh\!\big(D(b_1 + b_2 + 1)\big) \approx b_1 \vee b_2 = a_1 \oplus a_2$
INDUCTION STEP: the next level computes $b_1 = a_1 \oplus a_2$, $b_2 = a_3 \oplus a_4$, etc., halving the number of values until one remains.
(That a 2-layer network can emit XOR is also easy.)
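The base-case weights above are one choice that works; here is a quick numerical check of the gadget (a sketch, assuming NumPy and bit inputs in {0, 1}):

```python
import numpy as np

D = 10.0  # large gain so tanh saturates to roughly +1 or -1

def xor2(a1, a2):
    """2-layer tanh network computing a1 XOR a2; output > 0 means True."""
    b1 = np.tanh(D * (a1 - a2 - 0.5))    # ~ +1 iff a1 AND NOT a2, else ~ -1
    b2 = np.tanh(D * (a2 - a1 - 0.5))    # ~ +1 iff a2 AND NOT a1, else ~ -1
    return np.tanh(D * (b1 + b2 + 1.0))  # ~ +1 iff b1 OR b2, else ~ -1

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, xor2(a1, a2) > 0)  # False, True, True, False
```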
Why NNs? Because they can approximate any function. Just guess the architecture / non-linearities. Easy!
Deep Learning: the dark art of using gradient descent to find the weights of a multi-layer neural network that will minimize some loss.
Problem: we will need gradients of complicated functions. E.g., what's the gradient of $o$ with respect to the weights, where
$o = \tanh(w_{3,1}\,b_1 + w_{3,2}\,b_2)$
$b_1 = \tanh(w_{1,1}\,a_1 + w_{1,2}\,a_2)$
$b_2 = \tanh(w_{2,1}\,a_1 + w_{2,2}\,a_2)$
(using $\frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2$)?
Ugh.
COMPUTING WITH DERIVATIVES
How do we get computers to evaluate derivatives for us?
Numerical Differentiation: $f'(x) \approx \frac{f(x+\varepsilon) - f(x)}{\varepsilon}$
Problem 1: Subtracting two large values similar to one another is a terrible idea when using floating-point numbers: catastrophic cancellation.
Problem 2: Evaluating a gradient of $f: \mathbb{R}^n \to \mathbb{R}$ requires $n$ evaluations.
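Both problems are easy to see in code; a quick sketch (the test function and step sizes are arbitrary):

```python
import numpy as np

def numeric_diff(f, x, eps):
    # Forward difference: (f(x + eps) - f(x)) / eps
    return (f(x + eps) - f(x)) / eps

f, df = np.sin, np.cos  # pick a function whose true derivative we know
x = 1.0
for eps in (1e-4, 1e-8, 1e-12, 1e-16):
    print(f"eps={eps:.0e}  error={abs(numeric_diff(f, x, eps) - df(x)):.2e}")
# The error first shrinks, then blows up: f(x+eps) and f(x) agree in almost
# all their floating-point digits, so the subtraction cancels catastrophically.
# And for f: R^n -> R, each partial derivative needs its own evaluation pair.
```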
Symbolic Differentiation
Idea: The chain rule is great; let's write a compiler that reads programs that evaluate $f$ and outputs programs that evaluate $f'$.
$d(x + y) = dx + dy$
$d(\log x) = \tfrac{1}{x}\,dx$
$d(\sin x) = \cos x\,dx$
Note that the expressions inside $d(\cdot)$ on the right are simpler than those on the left, so the process terminates.
$d\,\sin(\sin x) = \cos(\sin x)\,d(\sin x) = \cos(\sin x)\cos x\,dx$
$d\,\sin(\sin(\sin x)) = \cos(\sin(\sin x))\cos(\sin x)\cos x\,dx$
PROBLEM: Evaluating the symbolic derivative might take much longer.
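A toy version of that compiler, sketched over nested tuples (this expression representation and rule set are made up for illustration):

```python
# Expressions: a variable name ('x'), a constant, or a tuple such as
# ('+', e1, e2), ('*', e1, e2), ('sin', e).

def d(e, x):
    """Return a symbolic expression for the derivative of e wrt x."""
    if isinstance(e, str):
        return 1.0 if e == x else 0.0
    if isinstance(e, (int, float)):
        return 0.0
    op = e[0]
    if op == '+':
        return ('+', d(e[1], x), d(e[2], x))
    if op == '*':  # product rule
        return ('+', ('*', d(e[1], x), e[2]), ('*', e[1], d(e[2], x)))
    if op == 'sin':  # chain rule; the inner expression is simpler, so we terminate
        return ('*', ('cos', e[1]), d(e[1], x))
    raise ValueError(f"unknown op {op}")

# Each extra nesting of sin multiplies in another factor: expression swell.
print(d(('sin', ('sin', 'x')), 'x'))
# ('*', ('cos', ('sin', 'x')), ('*', ('cos', 'x'), 1.0))
```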
AUTOMATIC DIFFERENTIATION
Idea: Evaluate the function's value and its derivative at the same time: DUAL NUMBERS.
1. Decide which variable you're taking the derivative of.
2. Evaluate the expression and its derivative.
$f(x, y) = xy + \sin x$. Let's compute $\frac{\partial f}{\partial x}$ at $(3, 0)$.
How do we usually evaluate functions? Bottom-up.
$f(3,0)$: evaluate $xy + \sin x$ at $(3,0)$ and $\frac{\partial f}{\partial x} = \frac{\partial (xy)}{\partial x} + \frac{\partial \sin x}{\partial x}$ at $(3,0)$:
$x\big|_{(3,0)} = 3$, $\frac{\partial x}{\partial x}\big|_{(3,0)} = 1$
$y\big|_{(3,0)} = 0$, $\frac{\partial y}{\partial x}\big|_{(3,0)} = 0$
$xy\big|_{(3,0)} = 0$, $\frac{\partial (xy)}{\partial x}\big|_{(3,0)} = x\frac{\partial y}{\partial x} + y\frac{\partial x}{\partial x} = 0$
Same for $\sin x$.
How do we implement this on computers?
1) Operator overloading
2) Or an interpreter
This is forward mode AD
PROBLEM: Still needs $O(n)$ evaluations for a gradient.
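A minimal operator-overloading sketch of forward mode with dual numbers, on the running example $f(x,y) = xy + \sin x$ at $(3,0)$ (class and method names are my own):

```python
import math

class Dual:
    """A value paired with its derivative wrt one chosen input variable."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def f(x, y):
    return x * y + sin(x)

# df/dx at (3, 0): seed dx/dx = 1 and dy/dx = 0
out = f(Dual(3.0, 1.0), Dual(0.0, 0.0))
print(out.val, out.dot)  # sin(3) and cos(3): the value and df/dx together
```

Computing $\frac{\partial f}{\partial y}$ needs a second pass with the seeds flipped, which is exactly the $O(n)$ cost above.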
Next class: Reverse-mode AD.
REVERSE MODE AD
Let's state the chain rule slightly differently, so it works for multiple variables:
$\frac{\partial f}{\partial s} = \sum_i \frac{\partial f}{\partial u_i}\,\frac{\partial u_i}{\partial s}$
In other words, if $f$ depends on $s$ through multiple variables $u_i$, the chain rule sums over all of them.
Now let's write $f(x,y) = \sin x + xy$ as a computation graph:
$a = \sin x$, $da = \cos x\,dx$
$b = xy$, $db = y\,dx + x\,dy$
$c = a + b$, $dc = da + db$
$f(x,y) = c$
[Graph: $x$ and $y$ at the top feed $a$ and $b$, which feed $c$ at the bottom.]
Forward-mode AD recap: we choose to take the derivative wrt $x$.
[Graph, each cell holding its value and its derivative wrt $x$: $x$ with $\frac{dx}{dx} = 1$, $y$ with $\frac{dy}{dx} = 0$, then $a$ with $\frac{da}{dx}$, $b$ with $\frac{db}{dx}$, and $c$ with $\frac{dc}{dx}$.]
If we change the variable to $y$, we must recompute the derivative values:
[Same graph, now holding $\frac{dx}{dy} = 0$, $\frac{dy}{dy} = 1$, $\frac{da}{dy}$, $\frac{db}{dy}$, $\frac{dc}{dy}$.]
Note that the only values we wanted were $\frac{dc}{dx}$ and $\frac{dc}{dy}$.
In forward-mode AD we choose the denominator of the derivative and set the numerator to be the differential of the cell. What if we flipped this, choosing the numerator and setting the denominator to be the differential of the cell?
FORWARD MODE vs. REVERSE MODE
[Two annotated copies of the graph: in forward mode each cell $u$ holds $\frac{du}{dx}$; in reverse mode each cell $u$ holds $\frac{dc}{du}$.]
What's our variable of interest in this case? $c$.
[Graph annotated with $\frac{dc}{dx}$, $\frac{dc}{dy}$, $\frac{dc}{da}$, $\frac{dc}{db}$, and $\frac{dc}{dc}$ at the corresponding cells.]
$a = \sin x$, $da = \cos x\,dx$
$b = xy$, $db = y\,dx + x\,dy$
$c = a + b$, $dc = da + db$
$f(x,y) = c$
Now we proceed from $c$ upwards, or backwards: back-propagation.
[Backward pass on the graph at $(x,y) = (3,0)$:]
Start at the bottom: $\frac{dc}{dc} = 1$.
$c = a + b$: $\frac{dc}{da} = 1$, $\frac{dc}{db} = 1$.
$a = \sin x$: add $\frac{dc}{da}\cos x = \cos 3$ to $\frac{dc}{dx}$.
$b = xy$: add $\frac{dc}{db}\,y = 0$ to $\frac{dc}{dx}$ and $\frac{dc}{db}\,x = 3$ to $\frac{dc}{dy}$.
Slightly different algorithm: the bottom node propagates upwards the terms of the chain rule it is responsible for.
[Same graph at $(x,y) = (3,0)$, each cell accumulating its $\frac{dc}{d\cdot}$ as terms arrive from below.]
$dc = da + db$: add $1$ to $\frac{dc}{da}$ and $1$ to $\frac{dc}{db}$.
$da = \cos x\,dx + 0\,dy$: add $\cos x$ to $\frac{dc}{dx}$ and $0$ to $\frac{dc}{dy}$.
$db = x\,dy + y\,dx$: add $x = 3$ to $\frac{dc}{dy}$ and $y = 0$ to $\frac{dc}{dx}$.
The two terms of the chain rule for $\frac{dc}{dx}$ are added at different times.
Note that the value you send up is multiplied by the derivative of the cell it just passed through; those values were $1$ in this example.
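A minimal sketch of this algorithm with a Var class (names and structure are my own; real frameworks process nodes in reverse topological order, but a plain stack is enough for this tree-shaped example):

```python
import math

class Var:
    """Graph node: a value, the accumulated dc/d(this node), and the
    parents it must send chain-rule terms up to."""
    def __init__(self, val, parents=()):
        self.val, self.grad, self.parents = val, 0.0, list(parents)

def add(a, b):   # c = a + b, dc = 1*da + 1*db
    return Var(a.val + b.val, [(a, 1.0), (b, 1.0)])

def mul(a, b):   # c = a*b, dc = b*da + a*db
    return Var(a.val * b.val, [(a, b.val), (b, a.val)])

def sin(a):      # c = sin a, dc = cos(a)*da
    return Var(math.sin(a.val), [(a, math.cos(a.val))])

def backprop(c):
    c.grad = 1.0            # dc/dc = 1; start at the bottom
    stack = [c]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.grad += local * node.grad  # one chain-rule term at a time
            stack.append(parent)

x, y = Var(3.0), Var(0.0)
c = add(sin(x), mul(x, y))
backprop(c)
print(x.grad, y.grad)  # cos(3) + y = cos(3), and x = 3
```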
Exercise: Evaluate the gradient of
$f(x_1, x_2, y_1, y_2) = \log\!\big(1 + \exp(x_1 y_1 + x_2 y_2)\big)$
at $x_1 = 1$, $y_1 = 0$, $x_2 = 1$, $y_2 = 0$.
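One way to check your hand-derived answer is to extend the Var sketch above with exp and log and run backprop (again, just a sketch):

```python
def exp(a):
    v = math.exp(a.val)
    return Var(v, [(a, v)])                           # d exp(a) = exp(a) da

def log(a):
    return Var(math.log(a.val), [(a, 1.0 / a.val)])   # d log(a) = da / a

x1, x2, y1, y2 = Var(1.0), Var(1.0), Var(0.0), Var(0.0)
one = Var(1.0)
f = log(add(one, exp(add(mul(x1, y1), mul(x2, y2)))))
backprop(f)
print(x1.grad, x2.grad, y1.grad, y2.grad)  # should match your derivation
```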