Neural Networks
Wei Xu (many slides from Greg Durrett and Philipp Koehn)
Recap: Loss Functions
[Figure: loss as a function of w^T x for the 0-1 (ideal), hinge (SVM), logistic, and perceptron losses]
Recap: Logistic Regression
‣ To learn weights: maximize discriminative log likelihood of data P(y|x)
P(y = +|x) = logistic(w^T x) = exp(Σ_{i=1..n} w_i x_i) / (1 + exp(Σ_{i=1..n} w_i x_i))
L(x_j, y_j = +) = log P(y_j = +|x_j) = Σ_{i=1..n} w_i x_{ji} − log(1 + exp(Σ_{i=1..n} w_i x_{ji}))   [sum over features]
Recap: Multiclass Logistic Regression
P_w(y|x) = exp(w^T f(x, y)) / Σ_{y'∈Y} exp(w^T f(x, y'))   [sum over output space to normalize]
‣ Why? Interpret raw classifier scores as probabilities.
Example: "too many drug trials, too few patients"
w^T f(x, y):  Health: +2.2   Sports: +3.1   Science: -0.6
exp → unnormalized probabilities (must be >= 0):  6.05   22.2   0.55
normalize → probabilities (must sum to 1):  0.21   0.77   0.02
compare to correct (gold) probabilities:  1.00   0.00   0.00
Recap: Multiclass Logistic Regression
P_w(y|x) = exp(w^T f(x, y)) / Σ_{y'∈Y} exp(w^T f(x, y'))   [sum over output space to normalize]
‣ Training: maximize L(x, y) = Σ_{j=1..n} log P(y*_j | x_j)
  = Σ_{j=1..n} [ w^T f(x_j, y*_j) − log Σ_y exp(w^T f(x_j, y)) ]
e.g., for the gold label above: log(0.21) = -1.56
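A minimal Python sketch of this score-exponentiate-normalize pipeline (the label names and raw scores follow the example above; treat the exact output numbers as illustrative):

import math

def softmax(scores):
    # exponentiate (probabilities must be >= 0), then normalize (must sum to 1)
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# raw scores w^T f(x, y) for each label, as in the example above
scores = {"Health": 2.2, "Sports": 3.1, "Science": -0.6}
probs = dict(zip(scores, softmax(list(scores.values()))))

gold = "Health"                          # the correct (gold) label in the example
log_likelihood = math.log(probs[gold])   # training maximizes this, summed over examples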
Administrivia
‣ Homework 1 graded (on Carmen)
‣ Homework 2 will be released soon.
‣ Reading: Eisenstein 3.1-3.3, Jurafsky + Martin 7.1-7.4
This Lecture
‣ Feedforward neural networks + backpropagation
‣ Neural network basics
‣ Applications
‣ Neural network history
‣ Implementing neural networks (if time)
History: NN "dark ages"
‣ ConvNets: applied to MNIST by LeCun in 1998
‣ LSTMs: Hochreiter and Schmidhuber (1997)
‣ Henderson (2003): neural shift-reduce parser, not SOTA
https://www.youtube.com/watch?v=FwFduRA_L6Q&feature=youtu.be
https://www.andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning/
2008-2013: A glimmer of light…
‣ Collobert and Weston 2011: "NLP (almost) from scratch"
‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")
‣ The 2008 version was marred by bad experiments and claimed SOTA but wasn't; the 2011 version tied SOTA
‣ Socher 2011-2014: tree-structured RNNs working okay
‣ Krizhevsky et al. (2012): AlexNet for vision
2014: Stuff starts working
‣ Sutskever et al. (2014) + Bahdanau et al. (2015): seq2seq + attention for neural MT (LSTMs work for NLP?)
‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)
‣ Chen and Manning (2014): transition-based dependency parser (even feedforward networks work well for NLP?)
‣ 2015: explosion of neural nets for everything under the sun
Why didn't they work before?
‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and you really need a lot more)
‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad/Adadelta/Adam) work best out of the box
‣ Regularization: dropout is pretty helpful
‣ Inputs: need word representations to have the right continuous semantics
‣ Computers not big enough: can't run for enough iterations
Neural Net Basics

Neural Networks: motivation
‣ Linear classification: argmax_y w^T f(x, y)
‣ How can we do nonlinear classification? Kernels are too slow…
‣ Want to learn intermediate conjunctive features of the input
  the movie was not all that good
  I[contains not & contains good]
Neural Networks: XOR
‣ Let's see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x1, x2 (generally x = (x1, …, xm))
‣ Output: y (generally y = (y1, …, yn)); here y = x1 XOR x2

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0
Neural Networks: XOR
‣ y = a1 x1 + a2 x2   ✗ (no linear function of x1 and x2 can fit the table above)
‣ y = a1 x1 + a2 x2 + a3 tanh(x1 + x2)   (the tanh term acts like an "or"; tanh looks like an action potential in a neuron)
‣ For example: y = −x1 − x2 + 2 tanh(x1 + x2)
[Figure: this function plotted over the (x1, x2) plane]
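A quick check (not from the slides) that this function behaves like XOR on the four binary inputs:

import math

def y(x1, x2):
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(y(x1, x2), 3))
# prints approximately 0.0, 0.523, 0.523, -0.072:
# the output is positive exactly when x1 XOR x2 is true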
Neural Networks: XOR
‣ With x1 = I[not] and x2 = I[good]: y = −2 x1 − x2 + 2 tanh(x1 + x2)
  the movie was not all that good
[Figure: the resulting decision surface over the (x1, x2) plane]
Neural Networks
‣ Linear model: y = w · x + b
‣ Neural network: y = g(w · x + b), or with a full layer, y = g(Wx + b), where g is a nonlinearity such as tanh
‣ Warp space, shift, nonlinear transformation
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Neural Networks
‣ Linear classifier vs. neural network: separation is possible because we transformed the space!
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Deep Neural Networks
y = g(Wx + b)   (output of first layer)
z = g(Vy + c)   (second layer, taking the first layer's output as input)
z = g(V g(Wx + b) + c)
‣ "Feedforward" computation (not recurrent)
‣ Check: what happens if there is no nonlinearity, i.e. z = V(Wx + b) + c? Is this more powerful than basic linear models?
Adopted from Chris Dyer
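One way to answer the check (a short derivation, not spelled out on the slide): without the nonlinearity, the two layers collapse into a single affine map,

z = V(Wx + b) + c = (VW)x + (Vb + c),

so the stacked network is no more powerful than a basic linear model; the nonlinearity g is what makes depth useful.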
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Feedforward Networks, Backpropagation
Administrivia
‣ Homework 2 is released, due on February 18 (start early!).
‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8
Simple Neural Network
‣ Assumes that the labels y are indexed and associated with coordinates in a vector space
[Figure: a small feedforward network with two input units and a bias unit feeding two hidden units (weights 3.7, 3.7 and 2.9, 2.9; bias weights −1.5 and −4.5), which together with a hidden-layer bias unit feed one output unit (weights 4.5 and −5.2; bias weight −2.0)]
‣ One innovation: bias units (no inputs, always value 1)
Simple Neural Network
‣ Try out two input values: x0 = 1.0, x1 = 0.0
‣ Hidden unit computation:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^1.6) = 0.17
Simple Neural Network
‣ Output unit computation:
sigmoid(0.90 × 4.5 + 0.17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^−1.17) = 0.76
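A minimal numpy sketch of this forward pass, using the weights from the example network (the variable names are mine, not the slides'):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([1.0, 0.0])        # the two input values tried above
W_h = np.array([[3.7, 3.7],     # weights into hidden unit 1
                [2.9, 2.9]])    # weights into hidden unit 2
b_h = np.array([-1.5, -4.5])    # bias-unit weights into the hidden units
w_o = np.array([4.5, -5.2])     # weights from hidden units to the output
b_o = -2.0                      # bias-unit weight into the output

h = sigmoid(W_h @ x + b_h)      # approx [0.90, 0.17]
y = sigmoid(w_o @ h + b_o)      # approx 0.76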
Simple Neural Network
‣ Network implements XOR: hidden node h0 acts like OR, hidden node h1 acts like AND, and the final layer combines them (roughly h0 AND NOT h1, given the positive weight on h0 and negative weight on h1)
‣ Output for all binary inputs:

Input x0  Input x1  Hidden h0  Hidden h1  Output y
0         0         0.12       0.02       0.18 → 0
0         1         0.88       0.27       0.74 → 1
1         0         0.73       0.12       0.74 → 1
1         1         0.99       0.73       0.33 → 0

‣ Power of deep neural networks: chaining of processing steps, just as more Boolean circuits allow more complex computations
Error
‣ Computed output: y = 0.76
‣ Correct output: t = 1.0
‣ Q: how do we adjust the weights?
Gradient Descent
[Figure: error(λ) as a function of a single parameter λ; the gradient at the current λ points the way toward the optimal λ]
[Figure: with two weights, the gradient for w1 and the gradient for w2 combine into a single gradient pointing from the current point toward the optimum]
Derivative of Sigmoid
‣ Sigmoid function: sigmoid(x) = 1 / (1 + e^−x)
‣ Reminder: quotient rule: (f(x)/g(x))' = (g(x) f'(x) − f(x) g'(x)) / g(x)^2
‣ Derivative:
d sigmoid(x)/dx = d/dx [1 / (1 + e^−x)]
  = (0 × (1 + e^−x) − (−e^−x)) / (1 + e^−x)^2
  = (1 / (1 + e^−x)) × (e^−x / (1 + e^−x))
  = (1 / (1 + e^−x)) × (1 − 1 / (1 + e^−x))
  = sigmoid(x)(1 − sigmoid(x))
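A quick numerical sanity check of this identity (not from the slides):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 2.2, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite-difference slope
analytic = sigmoid(x) * (1 - sigmoid(x))                     # closed form from above
# both are approximately 0.0898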
Final Layer Update
‣ Linear combination of weights: s = Σ_k w_k h_k
‣ Activation function: y = sigmoid(s)
‣ Error (L2 norm): E = 1/2 (t − y)^2
‣ Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy · dy/ds · ds/dw_k
Final Layer Update (1)
‣ Linear combination of weights: s = Σ_k w_k h_k; activation function: y = sigmoid(s)
‣ Error (L2 norm): E = 1/2 (t − y)^2
‣ Derivative of error with regard to one weight w_k: dE/dw_k = dE/dy · dy/ds · ds/dw_k
‣ Error E is defined with respect to y:
dE/dy = d/dy [1/2 (t − y)^2] = −(t − y)
Final Layer Update (2)
‣ Linear combination of weights: s = Σ_k w_k h_k; activation function: y = sigmoid(s)
‣ Error (L2 norm): E = 1/2 (t − y)^2
‣ Derivative of error with regard to one weight w_k: dE/dw_k = dE/dy · dy/ds · ds/dw_k
‣ y with respect to s is sigmoid(s):
dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
Final Layer Update (3)
‣ Linear combination of weights: s = Σ_k w_k h_k; activation function: y = sigmoid(s)
‣ Error (L2 norm): E = 1/2 (t − y)^2
‣ Derivative of error with regard to one weight w_k: dE/dw_k = dE/dy · dy/ds · ds/dw_k
‣ s is a weighted linear combination of the hidden node values h_k:
ds/dw_k = d/dw_k [Σ_k w_k h_k] = h_k
Putting it All Together
‣ Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy · dy/ds · ds/dw_k = −(t − y) · y(1 − y) · h_k
  (error term, derivative of sigmoid y', hidden node value)
‣ Weight adjustment will be scaled by a fixed learning rate μ:
Δw_k = μ (t − y) y' h_k
Multiple Output Nodes
‣ Our example only had one output node; typically neural networks have multiple output nodes
‣ Error is computed over all j output nodes:
E = Σ_j 1/2 (t_j − y_j)^2
‣ Weights w_{j←k} are adjusted according to the node they point to:
Δw_{j←k} = μ (t_j − y_j) y'_j h_k
Hidden Layer Update
‣ In a hidden layer, we do not have a target output value
‣ But we can compute how much each node contributed to downstream error
‣ Definition of the error term of each output node:
δ_j = (t_j − y_j) y'_j
‣ Back-propagate the error term (why this way? there is math to back it up…):
δ_i = (Σ_j w_{j←i} δ_j) y'_i
‣ Universal update formula:
Δw_{j←k} = μ δ_j h_k
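A minimal numpy sketch of these update rules for one hidden layer of sigmoid units (the shapes, values, and learning rate are illustrative, not taken from the slides):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

mu = 0.1                                   # learning rate (illustrative)
x = np.array([1.0, 0.0, 1.0])              # two inputs plus a bias unit fixed at 1
W_hid = np.random.randn(2, 3)              # (inputs + bias) -> hidden weights
W_out = np.random.randn(1, 3)              # (hidden + bias) -> output weights
t = np.array([1.0])                        # target output(s)

# forward pass
h = sigmoid(W_hid @ x)                     # hidden node values
h_b = np.append(h, 1.0)                    # append the hidden-layer bias unit
y = sigmoid(W_out @ h_b)                   # output node values

# error terms and updates, following the formulas above
delta_out = (t - y) * y * (1 - y)                         # delta_j = (t_j - y_j) y'_j
delta_hid = (W_out[:, :-1].T @ delta_out) * h * (1 - h)   # delta_i = (sum_j w_ji delta_j) y'_i
W_out += mu * np.outer(delta_out, h_b)                    # delta w_jk = mu * delta_j * h_k
W_hid += mu * np.outer(delta_hid, x)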
Our Example
‣ Computed output: y = 0.76
‣ Correct output: t = 1.0
‣ Final layer weight updates (learning rate μ = 10), where A, B, C are the input and input-bias nodes, D, E the hidden nodes, F the hidden-layer bias node, and G the output node:
  δ_G = (t − y) y' = (1 − 0.76) × 0.181 = 0.0434
  Δw_GD = μ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391
  Δw_GE = μ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
  Δw_GF = μ δ_G h_F = 10 × 0.0434 × 1 = 0.434
Our Example
‣ Applying these updates gives new final-layer weights: w_GD = 4.5 + 0.391 = 4.891, w_GE = −5.2 + 0.074 = −5.126, w_GF = −2.0 + 0.434 = −1.566
Hidden Layer Updates
‣ Hidden node D:
  δ_D = (Σ_j w_{j←i} δ_j) y'_D = w_GD δ_G y'_D = 4.5 × 0.0434 × 0.0898 = 0.0175
  Δw_DA = μ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
  Δw_DB = μ δ_D h_B = 10 × 0.0175 × 0.0 = 0
  Δw_DC = μ δ_D h_C = 10 × 0.0175 × 1 = 0.175
‣ Hidden node E:
  δ_E = (Σ_j w_{j←i} δ_j) y'_E = w_GE δ_G y'_E = −5.2 × 0.0434 × 0.2055 = −0.0464
  Δw_EA = μ δ_E h_A = 10 × −0.0464 × 1.0 = −0.464
  etc.
Logistic Regression with NNs
P(y|x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))   ‣ Single scalar probability
P(y|x) = softmax([w^T f(x, y)]_{y∈Y})   ‣ Compute scores for all possible labels at once (returns a vector)
softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})   ‣ softmax: exps and normalizes a given vector
P(y|x) = softmax(W f(x))   ‣ Weight vector per class; W is a [num classes × num feats] matrix
P(y|x) = softmax(W g(V f(x)))   ‣ Now one hidden layer

Neural Networks for Classification
P(y|x) = softmax(W g(V f(x)))
[Figure: f(x) (n features) → V (d × n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes × d matrix) → softmax → P(y|x) (num_classes probabilities)]
We can think of a neural network classifier with one hidden layer as building a vector z, a hidden-layer representation of the input, and then running standard logistic regression on the features that the network develops in z.
Training Neural Networks
‣ Maximize log likelihood of training data
z = g(V f(x))    P(y|x) = softmax(Wz)
L(x, i*) = log P(y = i*|x) = log(softmax(Wz) · e_i*)
         = Wz · e_i* − log Σ_j exp(Wz) · e_j
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere; dotting by this selects the ith index
Computing Gradients
L(x, i*) = Wz · e_i* − log Σ_j exp(Wz) · e_j
‣ Gradient with respect to W (i indexes the output space Y, j indexes the vector z; W is a num_classes × d matrix):
∂L(x, i*)/∂W_ij = z_j − P(y = i|x) z_j   if i = i*
                = −P(y = i|x) z_j        otherwise
‣ Looks like logistic regression with z as the features!
Neural Networks for Classification
P(y|x) = softmax(W g(V f(x)))
[Figure: the f(x) → V → g → z → W → softmax → P(y|x) pipeline, with the gradient ∂L/∂W flowing into W]
Computing Gradients
‣ Gradient with respect to z:
L(x, i*) = Wz · e_i* − log Σ_j exp(Wz) · e_j
∂L(x, i*)/∂z = W_i* − Σ_j P(y = j|x) W_j   [some math…]
Define err(root) = e_i* − P(y|x)   (dim = num_classes); then
∂L(x, i*)/∂z = err(z) = W^T err(root)   (dim = d)
Backpropagation: Picture
P(y|x) = softmax(W g(V f(x)))
[Figure: the f(x) → V → g → z → W → softmax → P(y|x) pipeline, with err(root) flowing back through W to give ∂L/∂W and err(z)]
‣ Can forget everything after z, treat it as the output, and keep backpropping
Computing Gradients: Backpropagation
z = g(V f(x))   (activations at the hidden layer)
L(x, i*) = Wz · e_i* − log Σ_j exp(Wz) · e_j
err(root) = e_i* − P(y|x)   (dim = num_classes)
∂L(x, i*)/∂z = err(z) = W^T err(root)   (dim = d)
‣ Gradient with respect to V: apply the chain rule
∂L(x, i*)/∂V_ij = ∂L(x, i*)/∂z · ∂z/∂V_ij   [some math…]
Backpropagation: Picture
P(y|x) = softmax(W g(V f(x)))
[Figure: the same pipeline, now with err(z) flowing back through V to give ∂z/∂V, and on toward f(x)]
Backpropagation
P(y|x) = softmax(W g(V f(x)))
‣ Step 1: compute err(root) = e_i* − P(y|x)   (vector)
‣ Step 2: compute derivatives of W using err(root)   (matrix)
‣ Step 3: compute ∂L(x, i*)/∂z = err(z) = W^T err(root)   (vector)
‣ Step 4: compute derivatives of V using err(z)   (matrix)
‣ Step 5+: continue backpropagation (compute err(f(x)) if necessary…)
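A minimal numpy sketch of Steps 1-5 for P(y|x) = softmax(W g(V f(x))) with g = tanh (shapes and values are illustrative; the gradients here are of the log likelihood, so a training step moves in this direction):

import numpy as np

f_x = np.random.randn(6)                   # feature vector f(x)
V = np.random.randn(4, 6)                  # d x n first-layer matrix
W = np.random.randn(3, 4)                  # num_classes x d output matrix
i_star = 1                                 # index of the gold label

a = V @ f_x
z = np.tanh(a)                             # hidden layer z = g(V f(x))
scores = W @ z
probs = np.exp(scores) / np.sum(np.exp(scores))   # softmax

e_star = np.zeros_like(probs); e_star[i_star] = 1.0
err_root = e_star - probs                  # Step 1: err(root) = e_i* - P(y|x)
dL_dW = np.outer(err_root, z)              # Step 2: derivatives of W
err_z = W.T @ err_root                     # Step 3: err(z) = W^T err(root)
dL_dV = np.outer(err_z * (1 - z**2), f_x)  # Step 4: chain rule through g (tanh' = 1 - z^2)
err_fx = V.T @ (err_z * (1 - z**2))        # Step 5: continue to err(f(x)) if needed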
Backpropagation: Takeaways
‣ Gradients of the output weights W are easy to compute: it looks like logistic regression with the hidden layer z as the feature vector
‣ Can compute the derivative of the loss with respect to z to form an "error signal" for backpropagation
‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation
‣ Need to remember the values from the forward computation
Applications

NLP with Feedforward Networks
‣ Part-of-speech tagging with FFNNs
  Fed raises interest rates in order to …
  f(x): emb(raises), emb(interest), emb(rates)   (previous word, current word, next word; other words, feats, etc.)
‣ Word embeddings for each word form the input
‣ ~1000 features here: a smaller feature vector than in sparse models, but every feature fires on every example
‣ Weight matrix learns position-dependent processing of the words
Botha et al. (2017)
NLP with Feedforward Networks
‣ Hidden layer mixes these different signals and learns feature conjunctions
Botha et al. (2017)
NLP with Feedforward Networks
‣ Multilingual tagging results (see Botha et al., 2017)
‣ Gillick used LSTMs; this system is smaller, faster, and better
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on the average of word embeddings from the input
Iyyer et al. (2015)

Sentiment Analysis
[Table: results grouped into bag-of-words models and Tree RNNs / CNNs / LSTMs; Wang and Manning (2012), Kim (2014), Iyyer et al. (2015)]
Coreference Resolution
‣ Feedforward networks identify coreference arcs
  President Obama signed…
  He later gave a speech…
  (does "He" refer back to "President Obama"?)
Clark and Manning (2015), Wiseman et al. (2015)
Implementation Details

Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives
  y = x * x    →(code gen)→    (y, dy) = (x * x, 2 * x * dx)
‣ Computation is now something we need to reason about symbolically
‣ Use a library like Pytorch or Tensorflow. This class: Pytorch
Computation Graphs in Pytorch
P(y|x) = softmax(W g(V f(x)))
‣ Define the forward pass for the network:

import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))
Computation Graphs in Pytorch
P(y|x) = softmax(W g(V f(x)))
‣ e_i*: one-hot vector of the gold label (e.g., [0, 1, 0])

ffnn = FFNN(inp, hid, out)   # input, hidden, and output sizes, as above

def make_update(input, gold_label):
    ffnn.zero_grad()   # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()   # optimizer (e.g., from torch.optim) defined elsewhere
Training a Model
Define a computation graph
For each epoch:
    For each batch of data:
        Compute loss on batch
        Autograd to compute gradients and take step
Decode test set
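A hypothetical end-to-end sketch tying these steps together with the FFNN class defined earlier (the sizes, toy data, and the choice of the Adam optimizer are illustrative assumptions, not prescribed by the slides):

import torch

num_feats, hidden_size, num_classes = 10, 25, 2
ffnn = FFNN(num_feats, hidden_size, num_classes)          # class from the earlier slide
optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)

# toy data: one (feature vector, one-hot gold label) pair
train_data = [(torch.rand(num_feats), torch.tensor([1.0, 0.0]))]

for epoch in range(10):
    for x, gold_label in train_data:
        ffnn.zero_grad()                                  # clear gradient variables
        probs = ffnn.forward(x)
        loss = torch.neg(torch.log(probs)).dot(gold_label)
        loss.backward()                                   # autograd computes gradients
        optimizer.step()                                  # take a gradient step
# after training, decode the test set with ffnn.forward(...)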
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
‣ Batch sizes from 1-100 often work well

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)   # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # elementwise product replaces .dot in the batched (2D) case
    ...
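A short batched sketch continuing the training-loop example above; it assumes the network's softmax is built with dim=-1 so that each row of the [batch_size, num_classes] output is normalized separately (the single-example version above used dim=0):

batch_size = 32
batch_x = torch.rand(batch_size, num_feats)            # [batch_size, num_feats]
batch_gold = torch.zeros(batch_size, num_classes)      # one-hot rows
batch_gold[:, 0] = 1.0

ffnn.zero_grad()
probs = ffnn.forward(batch_x)                          # [batch_size, num_classes]
loss = torch.sum(torch.neg(torch.log(probs)) * batch_gold)   # sum of per-example losses
loss.backward()
optimizer.step()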