CS7015 (Deep Learning) : Lecture 14
Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and Exploding Gradients, Truncated BPTT
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
Module 14.1: Sequence Learning Problems
In feedforward and convolutional neural networks the size of the input was always fixed
For example, we fed fixed-size (32 × 32) images to convolutional neural networks for image classification
Similarly, in word2vec, we fed a fixed window (k) of words to the network
Further, each input to the network was independent of the previous or future inputs
For example, the computations, outputs and decisions for two successive images are completely independent of each other
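The fixed-size requirement can be made concrete: a feedforward layer's weight matrix bakes the input dimension into its shape. A minimal NumPy sketch (the dimensions are our illustrative choices, not from the lecture):

```python
import numpy as np

# A feedforward layer's weight matrix fixes the input dimension once and
# for all: W has shape (hidden_dim, input_dim), so every input must be a
# vector of exactly input_dim numbers.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 1024, 5   # e.g. a flattened 32x32 image -> 5 scores
W = rng.standard_normal((hidden_dim, input_dim))

x_ok = rng.standard_normal(input_dim)        # fixed-size input: works
y = W @ x_ok
print(y.shape)                                # (5,)

x_bad = rng.standard_normal(input_dim + 100)  # wrong size: shape mismatch
try:
    W @ x_bad
except ValueError as e:
    print("shape mismatch:", e)
```

A sequence of arbitrary length simply cannot be fed through such a layer without padding or truncation, which is the limitation this lecture addresses.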
[Slide figures: a network classifying a fixed-size 32 × 32 image into labels (apple, bus, car, ...), and a word2vec network taking the fixed window "he sat" through Wcontext to output P(he | sat, he), P(chair | sat, he), P(man | sat, he), P(on | sat, he)]
[Figure: auto-completing "deep" — inputs d, e, e, p produce predictions e, e, p, 〈stop〉]
In many applications the input is not of a fixed size
Further, successive inputs may not be independent of each other
For example, consider the task of auto-completion
Given the first character 'd' you want to predict the next character 'e', and so on
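The auto-completion task above turns one word into a sequence of (current character → next character) training pairs, ending in a stop symbol. A minimal sketch (the `<stop>` token spelling is our choice):

```python
# Build per-time-step (input, target) pairs for next-character prediction:
# each character's target is the character that follows it, and the last
# character's target is a stop symbol.
def make_pairs(word, stop="<stop>"):
    targets = list(word[1:]) + [stop]
    return list(zip(word, targets))

print(make_pairs("deep"))
# [('d', 'e'), ('e', 'e'), ('e', 'p'), ('p', '<stop>')]
print(len(make_pairs("learn")))   # 5 pairs: a different word, a different length
```

Note how "deep" and "learn" yield different numbers of pairs, which is exactly the variable-length property the lecture highlights.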
Notice a few things
First, successive inputs are no longer independent (while predicting 'e' you would want to know what the previous input was, in addition to the current input)
Second, the length of the inputs and the number of predictions you need to make is not fixed (for example, "learn", "deep", "machine" have different numbers of characters)
Third, each network (orange-blue-green structure) is performing the same task (input : character, output : character)
These are known as sequence learning problems
We need to look at a sequence of (dependent) inputs and produce an output (or outputs)
Each input corresponds to one time step
Let us look at some more examples of such problems
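The "same task at every time step" idea can be sketched as one function with one set of parameters applied repeatedly, carrying a state forward. This is a bare-bones recurrent cell in NumPy; the dimensions and the tanh nonlinearity are our illustrative choices, not the lecture's model:

```python
import numpy as np

# One set of parameters (W, U) is reused at every time step, so the same
# cell handles input sequences of any length.
rng = np.random.default_rng(1)
in_dim, state_dim = 3, 4
W = rng.standard_normal((state_dim, in_dim)) * 0.1     # input  -> state
U = rng.standard_normal((state_dim, state_dim)) * 0.1  # state  -> state

def run(sequence):
    s = np.zeros(state_dim)          # initial state
    for x in sequence:               # one iteration per time step
        s = np.tanh(W @ x + U @ s)   # same W, U reused every step
    return s

short = [rng.standard_normal(in_dim) for _ in range(2)]
long_ = [rng.standard_normal(in_dim) for _ in range(7)]
print(run(short).shape, run(long_).shape)   # both (4,): length-independent
```

Because the loop body never changes, a 2-step and a 7-step sequence go through the identical network, which is what lets one model serve all input lengths.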
[Figure: man/noun   is/verb   a/article   social/adjective   animal/noun]
Consider the task of predicting the part-of-speech tag (noun, adverb, adjective, verb) of each word in a sentence
Once we see an adjective (social) we are almost sure that the next word should be a noun (man)
Thus the current output (noun) depends on the current input as well as the previous input
Further, the size of the input is not fixed (sentences could have an arbitrary number of words)
Notice that here we are interested in producing an output at each time step
Each network is performing the same task (input : word, output : tag)
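For tagging we want one output per time step: the same cell updates a state, and the same output layer maps each state to tag scores. A self-contained sketch with made-up dimensions (5 tags; not the lecture's actual model):

```python
import numpy as np

# Many-to-many: a shared recurrent update plus a shared output projection
# emit one prediction per input word.
rng = np.random.default_rng(2)
in_dim, state_dim, n_tags = 3, 4, 5
W = rng.standard_normal((state_dim, in_dim)) * 0.1
U = rng.standard_normal((state_dim, state_dim)) * 0.1
V = rng.standard_normal((n_tags, state_dim)) * 0.1   # state -> tag scores

def tag(sequence):
    s, outputs = np.zeros(state_dim), []
    for x in sequence:
        s = np.tanh(W @ x + U @ s)   # state carries earlier words forward
        outputs.append(V @ s)        # one prediction per input word
    return outputs

sent = [rng.standard_normal(in_dim) for _ in range(5)]  # a 5-word sentence
print(len(tag(sent)))   # 5 outputs for 5 inputs
```

Because each output is computed from the running state, the tag for "man" can depend on having just seen "social", matching the dependence described above.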
[Figure: The/don't care   movie/don't care   was/don't care   boring/don't care   and/don't care   long/+/−]
Sometimes we may not be interested in producing an output at every stage
Instead we would look at the full sequence and then produce an output
For example, consider the task of predicting the polarity of a movie review
The prediction clearly does not depend only on the last word but also on some words which appear before it
Here again we could think that the network is performing the same task at each step (input : word, output : +/−) but it's just that we don't care about intermediate outputs
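In this many-to-one setting the per-step outputs are "don't care": the state is updated at every word, but only the state after the whole review has been read is scored. A sketch with arbitrary dimensions (a single +/− score; not the lecture's model):

```python
import numpy as np

# Many-to-one: run the shared recurrent update over the full sequence and
# classify only the final state.
rng = np.random.default_rng(3)
in_dim, state_dim = 3, 4
W = rng.standard_normal((state_dim, in_dim)) * 0.1
U = rng.standard_normal((state_dim, state_dim)) * 0.1
v = rng.standard_normal(state_dim) * 0.1   # final state -> polarity score

def polarity(review):
    s = np.zeros(state_dim)
    for x in review:                 # the state is updated at every word,
        s = np.tanh(W @ x + U @ s)   # but no output is read off here...
    return v @ s                     # ...only the last state is scored

review = [rng.standard_normal(in_dim) for _ in range(6)]  # a 6-word review
score = polarity(review)
print(float(score) > 0)   # sign of the score: positive (+) or negative (−)
```

Since the final state has seen every word, early words like "boring" can still influence the single prediction made at the end.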
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 35: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/35.jpg)
8/41
The
don’tcare
movie
don’tcare
was
don’tcare
boring
don’tcare
and
don’tcare
long
+/−
Sometimes we may not be interestedin producing an output at every stage
Instead we would look at the full se-quence and then produce an output
For example, consider the task of pre-dicting the polarity of a movie review
The prediction clearly does not de-pend only on the last word but alsoon some words which appear before
Here again we could think that thenetwork is performing the same taskat each step (input : word, output :+/−) but it’s just that we don’t careabout intermediate outputs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 36: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/36.jpg)
8/41
The
don’tcare
movie
don’tcare
was
don’tcare
boring
don’tcare
and
don’tcare
long
+/−
Sometimes we may not be interestedin producing an output at every stage
Instead we would look at the full se-quence and then produce an output
For example, consider the task of pre-dicting the polarity of a movie review
The prediction clearly does not de-pend only on the last word but alsoon some words which appear before
Here again we could think that thenetwork is performing the same taskat each step (input : word, output :+/−) but it’s just that we don’t careabout intermediate outputs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 37: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/37.jpg)
8/41
The
don’tcare
movie
don’tcare
was
don’tcare
boring
don’tcare
and
don’tcare
long
+/−
Sometimes we may not be interestedin producing an output at every stage
Instead we would look at the full se-quence and then produce an output
For example, consider the task of pre-dicting the polarity of a movie review
The prediction clearly does not de-pend only on the last word but alsoon some words which appear before
Here again we could think that thenetwork is performing the same taskat each step (input : word, output :+/−) but it’s just that we don’t careabout intermediate outputs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 38: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/38.jpg)
8/41
The
don’tcare
movie
don’tcare
was
don’tcare
boring
don’tcare
and
don’tcare
long
+/−
Sometimes we may not be interestedin producing an output at every stage
Instead we would look at the full se-quence and then produce an output
For example, consider the task of pre-dicting the polarity of a movie review
The prediction clearly does not de-pend only on the last word but alsoon some words which appear before
Here again we could think that thenetwork is performing the same taskat each step (input : word, output :+/−) but it’s just that we don’t careabout intermediate outputs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
9/41

Figure: a sequence of video frames showing Surya Namaskar.

Sequences could be composed of anything (not just words)
For example, a video could be treated as a sequence of images
We may want to look at the entire sequence and detect the activity being performed
10/41
Module 14.2: Recurrent Neural Networks
11/41
How do we model such tasks involving sequences?
12/41
Wishlist
Account for dependence between inputs
Account for variable number of inputs
Make sure that the function executed at each time step is the same
We will focus on each of these to arrive at a model for dealing with sequences
13/41

Figure: two copies of the network, one per timestep. Each input xi is mapped to a state si through U, and each state si to an output yi through V.

What is the function being executed at each time step?

si = σ(Uxi + b)
yi = O(V si + c)
i = timestep

Since we want the same function to be executed at each timestep we should share the same network (i.e., same parameters at each timestep)
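The per-timestep function above can be sketched in pure Python. This is a minimal illustration, not the lecture's code: the dimensions and weight values are made-up, and the output nonlinearity O is taken to be softmax for concreteness.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigma(a):
    """Logistic nonlinearity, applied elementwise."""
    return [1.0 / (1.0 + math.exp(-x)) for x in a]

def softmax(a):
    m = max(a)
    e = [math.exp(x - m) for x in a]
    z = sum(e)
    return [x / z for x in e]

# Shared parameters (illustrative sizes: 2-dim input, 3-dim state, 2 outputs).
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]   # input -> state
b = [0.0, 0.0, 0.0]
V = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]    # state -> output
c = [0.0, 0.0]

def step(x_i):
    """One timestep: s_i = sigma(U x_i + b), y_i = O(V s_i + c), with O = softmax here."""
    s_i = sigma([h + bi for h, bi in zip(matvec(U, x_i), b)])
    y_i = softmax([h + ci for h, ci in zip(matvec(V, s_i), c)])
    return s_i, y_i

s1, y1 = step([1.0, -1.0])
print(sum(y1))  # the outputs form a distribution, so they sum to ~1
```

Because U, V, b, c live outside `step`, calling it at every timestep automatically computes the same function with the same parameters.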
14/41

Figure: the network unrolled over inputs x1, x2, x3, x4, . . . , xn, with the same parameters U and V producing states s1, . . . , sn and outputs y1, . . . , yn.

This parameter sharing also ensures that the network becomes agnostic to the length (size) of the input
Since we are simply going to compute the same function (with the same parameters) at each timestep, the number of timesteps doesn't matter
We just create multiple copies of the network and execute them at each timestep
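To see why parameter sharing makes the network length-agnostic, here is a deliberately tiny sketch (scalar input, state, and output, with made-up weights; no recurrence yet, since at this point each timestep is still independent):

```python
import math

# Shared parameters, defined once and reused at every timestep.
U, V, b, c = 0.8, 1.5, 0.0, 0.0

def step(x_i):
    """The same function at every timestep: s_i = sigma(U*x_i + b), y_i = V*s_i + c."""
    s_i = 1.0 / (1.0 + math.exp(-(U * x_i + b)))
    return s_i, V * s_i + c

def run(xs):
    """Apply the identical step to each input; works for any sequence length."""
    return [step(x)[1] for x in xs]

short = run([0.5, -1.0, 2.0])             # length 3
long = run([0.1 * i for i in range(7)])   # length 7: same parameters, nothing changes
print(len(short), len(long))
```

The sequence length only changes how many times `step` is called, never the parameters or the code.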
15/41

Figure: an infeasible construction in which the network at timestep i is fed all inputs x1, . . . , xi (e.g., y4 is computed from x1, x2, x3, x4).

How do we account for dependence between inputs?
Let us first see an infeasible way of doing this
At each timestep we will feed all the previous inputs to the network
Is this okay?
No, it violates the other two items on our wishlist
How? Let us see
16/41

Figure: the same infeasible construction; each timestep needs its own network because it takes a different number of inputs.

First, the function being computed at each time-step now is different:

y1 = f1(x1)
y2 = f2(x1, x2)
y3 = f3(x1, x2, x3)

The network is now sensitive to the length of the sequence
For example, a sequence of length 10 will require f1, . . . , f10 whereas a sequence of length 100 will require f1, . . . , f100
17/41

Figure: the unrolled recurrent network. Each state si receives the input xi through U and the previous state si−1 through the recurrent connection W, and produces the output yi through V.

The solution is to add a recurrent connection in the network:

si = σ(Uxi + Wsi−1 + b)
yi = O(V si + c)

or

yi = f(xi, si−1, W, U, V, b, c)

si is the state of the network at timestep i
The parameters are W, U, V, c, b, which are shared across timesteps
The same network (and parameters) can be used to compute y1, y2, . . . , y10 or y100
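The full recurrence can be sketched as a forward pass in pure Python. This is an illustrative sketch, not the lecture's code: the dimensions and weight values are made-up, O is taken to be the identity, and the initial state s0 is assumed to be a zero vector.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(*vs):
    """Elementwise sum of vectors."""
    return [sum(t) for t in zip(*vs)]

def sigma(a):
    """Logistic nonlinearity, applied elementwise."""
    return [1.0 / (1.0 + math.exp(-x)) for x in a]

# Shared parameters (illustrative sizes: 2-dim input, 3-dim state, 2-dim output).
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]                 # input -> state
W = [[0.2, 0.0, 0.1], [0.1, 0.2, 0.0], [0.0, 0.1, 0.2]]  # state -> state (recurrent)
V = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]                  # state -> output
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]

def rnn_forward(xs, s0=None):
    """s_i = sigma(U x_i + W s_{i-1} + b); y_i = V s_i + c (O taken as identity)."""
    s = s0 if s0 is not None else [0.0, 0.0, 0.0]  # assumed initial state
    ys = []
    for x in xs:
        s = sigma(vadd(matvec(U, x), matvec(W, s), b))
        ys.append(vadd(matvec(V, s), c))
    return ys, s

ys, s_final = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(ys))  # one output per timestep, for any sequence length
```

Note that the loop body is identical at every timestep; only the state s carries information about previous inputs, which is exactly how the recurrence accounts for dependence between inputs without growing the network.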
[Figure: compact form — a single unit with input xi, state si, output yi, weights U and V, and a recurrent self-loop W on the state]
This can be represented more compactly
[Examples: POS tagging — man/noun, is/verb, a/article, social/adjective, animal/noun, . . . ; video (Surya Namaskar); character prediction — d, e, e, p → e, e, p, ⟨stop⟩; sentiment — The, movie, was, boring, and (don't care at intermediate steps), long → +/−]
Let us revisit the sequence learning problems that we saw earlier
We now have recurrent connections between time-steps, which account for the dependence between inputs
Module 14.3: Backpropagation through time
[Figure: RNN unrolled over four time-steps, with inputs x1, . . . , x4, outputs y1, . . . , y4, and shared weights U, W, V]
Before proceeding, let us look at the dimensions of the parameters carefully
xi ∈ R^n (n-dimensional input)
si ∈ R^d (d-dimensional state)
yi ∈ R^k (say, k classes)
U ∈ R^{d×n}, W ∈ R^{d×d}, V ∈ R^{k×d}
(so that U xi and W si−1 lie in R^d and V si lies in R^k, consistent with si = σ(U xi + W si−1 + b) and yi = O(V si + c))
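These shapes can be checked mechanically. A tiny sketch with hypothetical sizes n = 3, d = 5, k = 4, assuming the left-multiplication convention si = σ(U xi + W si−1 + b):

```python
import numpy as np

n, d, k = 3, 5, 4                 # input, state, and output dimensions
x = np.ones(n)                    # x_i in R^n
s_prev = np.zeros(d)              # s_{i-1} in R^d
U = np.zeros((d, n))              # maps R^n -> R^d
W = np.zeros((d, d))              # maps R^d -> R^d
V = np.zeros((k, d))              # maps R^d -> R^k
b, c = np.zeros(d), np.zeros(k)

s = np.tanh(U @ x + W @ s_prev + b)   # state: shape (d,)
y = V @ s + c                         # pre-output: shape (k,)
```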
[Figure: RNN unrolled over four time-steps, with inputs x1, . . . , x4, outputs y1, . . . , y4, and shared weights U, W, V]
How do we train this network? (Ans: using backpropagation)
Let us understand this with a concrete example
[Figure: the unrolled network predicting the next character at each time-step — inputs d, e, e, p; target outputs e, e, p, ⟨stop⟩; shared weights U, W, V]
Suppose we consider our task of auto-completion (predicting the next character)
For simplicity, we assume that there are only 4 characters in our vocabulary (d, e, p, <stop>)
At each time-step we want to predict one of these 4 characters
What is a suitable output function for this task? (softmax)
What is a suitable loss function for this task? (cross entropy)
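The two answers can be written down directly. A small sketch (the function names are ours; the softmax turns the 4 output scores into a distribution over {d, e, p, <stop>}, and cross entropy is −log of the probability given to the true character):

```python
import numpy as np

def softmax(z):
    # exponentiate and normalise; max-subtraction is for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_pred, true_idx):
    # -log of the probability the model assigns to the true character
    return -np.log(y_pred[true_idx])

scores = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical V s_t + c for 4 characters
probs = softmax(scores)                    # a valid distribution over 4 characters
loss = cross_entropy(probs, 0)             # suppose the true next character is index 0
```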
[Figure: at each time-step the network outputs a predicted distribution over (d, e, p, ⟨stop⟩), compared with the true one-hot target, giving losses L1(θ), L2(θ), L3(θ), L4(θ):

t = 1: input d, true e → (0, 1, 0, 0), predicted (0.2, 0.7, 0.1, 0.1)
t = 2: input e, true e → (0, 1, 0, 0), predicted (0.2, 0.7, 0.1, 0.1)
t = 3: input e, true p → (0, 0, 1, 0), predicted (0.2, 0.1, 0.7, 0.1)
t = 4: input p, true ⟨stop⟩ → (0, 0, 0, 1), predicted (0.2, 0.1, 0.7, 0.1)]
Suppose we initialize U, V, W randomly and the network predicts the probabilities as shown
And the true probabilities are as shown
We need to answer two questions
What is the total loss made by the model?
How do we backpropagate this loss and update the parameters (θ = {U, V, W, b, c}) of the network?
[Figure: the same unrolled network with predicted and true distributions at each time-step and the per-step losses L1(θ), L2(θ), L3(θ), L4(θ)]
The total loss is simply the sum of the losses over all time-steps:

L(θ) = Σ_{t=1}^{T} L_t(θ)

L_t(θ) = −log(y_{tc})

where y_{tc} is the predicted probability of the true character at time-step t, and T is the number of time-steps
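Plugging in the illustrative numbers from the example above (the model puts 0.7 on the true character at the first three steps and only 0.1 on ⟨stop⟩ at the last), the total loss is just a sum of −log probabilities:

```python
import numpy as np

# Predicted probability assigned to the TRUE next character at each time-step,
# taken from the slide's example: 0.7, 0.7, 0.7, and 0.1 at t = 4.
y_tc = [0.7, 0.7, 0.7, 0.1]

losses = [-np.log(p) for p in y_tc]   # L_t(theta) = -log(y_tc)
total_loss = float(sum(losses))       # L(theta) = sum over t of L_t(theta)
# total_loss is approximately 3.3726; the t = 4 step alone contributes
# -log(0.1) ~ 2.30, i.e. most of the loss comes from the worst prediction.
```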
For backpropagation we need to compute the gradients w.r.t. W, U, V, b, c
Let us see how to do that
[Figure: the same unrolled character-prediction network with per-step losses L1(θ), L2(θ), L3(θ), L4(θ)]
Let us consider ∂L(θ)/∂V (V is a matrix, so ideally we should write ∇_V L(θ)):

∂L(θ)/∂V = Σ_{t=1}^{T} ∂L_t(θ)/∂V
Each term in the summation is simply the derivative of the loss w.r.t. the weights in the output layer
We have already seen how to do this when we studied backpropagation
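With the softmax output and cross-entropy loss chosen above, each term ∂L_t(θ)/∂V is the standard output-layer gradient (p_t − e_true) s_tᵀ, an outer product of the prediction error with the state. A sketch with a finite-difference check (the k×d orientation of V and names like `true_idx` are our assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, k = 5, 4
s_t = rng.normal(size=d)        # state at time-step t (held fixed here)
V = rng.normal(size=(k, d))
c = np.zeros(k)
true_idx = 2                    # index of the true character at this step

def loss(V):
    # L_t = -log of the probability given to the true character
    return -np.log(softmax(V @ s_t + c)[true_idx])

# Analytic gradient: dL_t/dV = (p - e_true) s_t^T
p = softmax(V @ s_t + c)
e_true = np.eye(k)[true_idx]
grad_V = np.outer(p - e_true, s_t)

# Numerical check via central finite differences
num = np.zeros_like(V)
eps = 1e-6
for i in range(k):
    for j in range(d):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        num[i, j] = (loss(Vp) - loss(Vm)) / (2 * eps)
```

The agreement between `grad_V` and `num` confirms that the output-layer gradient does not involve W or U at all, which is why this part of BPTT is no harder than ordinary backpropagation.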
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
27/41

Let us consider the derivative $\frac{\partial \mathscr{L}(\theta)}{\partial W}$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t(\theta)}{\partial W}$$

By the chain rule of derivatives we know that $\frac{\partial L_t(\theta)}{\partial W}$ is obtained by summing gradients along all the paths from $L_t(\theta)$ to $W$

What are the paths connecting $L_t(\theta)$ to $W$?

Let us see this by considering $L_4(\theta)$
28/41

[Figure: the RNN unrolled over inputs x1, ..., x4, states s1, ..., s4, and losses L1(θ), ..., L4(θ), with shared weights U, V, W; below it, the dependency chain s0 → s1 → s2 → s3 → s4 → L4(θ), with W feeding every state]

$L_4(\theta)$ depends on $s_4$

$s_4$ in turn depends on $s_3$ and $W$

$s_3$ in turn depends on $s_2$ and $W$

$s_2$ in turn depends on $s_1$ and $W$

$s_1$ in turn depends on $s_0$ and $W$

where $s_0$ is a constant starting state.
29/41

What we have here is an ordered network

In an ordered network each state variable is computed one at a time in a specified order (first $s_1$, then $s_2$, and so on)

Now we have

$$\frac{\partial L_4(\theta)}{\partial W} = \frac{\partial L_4(\theta)}{\partial s_4}\frac{\partial s_4}{\partial W}$$

We have already seen how to compute $\frac{\partial L_4(\theta)}{\partial s_4}$ when we studied backprop

But how do we compute $\frac{\partial s_4}{\partial W}$?
30/41

Recall that

$$s_4 = \sigma(W s_3 + b)$$

In such an ordered network, we can't compute $\frac{\partial s_4}{\partial W}$ by simply treating $s_3$ as a constant (because it also depends on $W$)

In such networks the total derivative $\frac{\partial s_4}{\partial W}$ has two parts

Explicit: $\frac{\partial^+ s_4}{\partial W}$, treating all other inputs as constant

Implicit: summing over all indirect paths from $s_4$ to $W$

Let us see how to do this
31/41

$$\begin{aligned}
\frac{\partial s_4}{\partial W}
&= \underbrace{\frac{\partial^+ s_4}{\partial W}}_{\text{explicit}} + \underbrace{\frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial W}}_{\text{implicit}} \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\Big[\underbrace{\frac{\partial^+ s_3}{\partial W}}_{\text{explicit}} + \underbrace{\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}}_{\text{implicit}}\Big] \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\Big[\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}\Big] \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\Big[\frac{\partial^+ s_1}{\partial W}\Big]
\end{aligned}$$

For simplicity we will short-circuit some of the paths

$$\frac{\partial s_4}{\partial W} = \frac{\partial s_4}{\partial s_4}\frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_2}\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_4}{\partial s_1}\frac{\partial^+ s_1}{\partial W} = \sum_{k=1}^{4}\frac{\partial s_4}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$
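The sum formula can be checked numerically. A scalar sketch (assumed: a one-dimensional state $s_j = \sigma(w s_{j-1} + x_j + b)$ with made-up inputs), comparing $\sum_{k=1}^{4}\frac{\partial s_4}{\partial s_k}\frac{\partial^+ s_k}{\partial w}$ against a finite-difference estimate of the total derivative $\frac{\partial s_4}{\partial w}$:

```python
# Scalar-RNN check of  ds4/dw = sum_k (ds4/dsk)(d+sk/dw).
# All numbers here (w, b, inputs) are made up for illustration.
import math

w, b = 0.9, 0.1
x = [0.5, -0.3, 0.8, 0.2]                      # four scalar inputs (assumed)
sig = lambda a: 1.0 / (1.0 + math.exp(-a))

def states(w):
    s, out = 0.0, []                           # s0 = 0, a constant starting state
    for xt in x:
        s = sig(w * s + xt + b)
        out.append(s)
    return out

s = states(w)                                  # s[0..3] = s1..s4
# Local Jacobians ds_{j+1}/ds_j = sigma'(a_{j+1}) * w, using sigma' = s(1 - s)
jac = [s[j + 1] * (1 - s[j + 1]) * w for j in range(3)]

def ds4_dsk(k):                                # chain the local Jacobians from sk up to s4
    p = 1.0
    for j in range(k - 1, 3):
        p *= jac[j]
    return p

# Explicit parts d+sk/dw = sigma'(a_k) * s_{k-1}, treating s_{k-1} as a constant
s_prev = [0.0] + s[:-1]
explicit = [s[k] * (1 - s[k]) * s_prev[k] for k in range(4)]

total = sum(ds4_dsk(k + 1) * explicit[k] for k in range(4))

eps = 1e-6
numeric = (states(w + eps)[-1] - states(w - eps)[-1]) / (2 * eps)
assert abs(total - numeric) < 1e-8
```

The explicit part at each step keeps only the direct appearance of $w$ (with the previous state frozen); multiplying by $\frac{\partial s_4}{\partial s_k}$ routes that contribution through every remaining step, exactly as the short-circuited sum above prescribes.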
32/41

Finally we have

$$\frac{\partial L_4(\theta)}{\partial W} = \frac{\partial L_4(\theta)}{\partial s_4}\frac{\partial s_4}{\partial W}, \qquad \frac{\partial s_4}{\partial W} = \sum_{k=1}^{4}\frac{\partial s_4}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$

$$\therefore \frac{\partial L_t(\theta)}{\partial W} = \frac{\partial L_t(\theta)}{\partial s_t}\sum_{k=1}^{t}\frac{\partial s_t}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$

This algorithm is called backpropagation through time (BPTT), as we backpropagate over all previous time steps
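A sketch of the resulting BPTT computation for a vector-valued state (assumptions: sigmoid states, a squared-error loss on the final state only, toy dimensions, random data). Walking backwards while carrying $\delta_k = \frac{\partial L_t}{\partial s_k}$ accumulates exactly the sum $\frac{\partial L_t}{\partial s_t}\sum_k \frac{\partial s_t}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$:

```python
# BPTT for dLt/dW in a toy sigmoid RNN, checked against finite differences.
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 4
W = 0.5 * rng.normal(size=(d, d))
b = np.zeros(d)
xs = rng.normal(size=(T, d))                   # assumed inputs
target = rng.normal(size=d)                    # assumed regression target
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

def run(W):
    s, out = np.zeros(d), []
    for t in range(T):
        s = sig(W @ s + xs[t] + b)
        out.append(s)
    return out

s = run(W)
loss = lambda s_last: 0.5 * np.sum((s_last - target) ** 2)

# dLt/dst for t = T (squared-error loss)
delta = s[-1] - target

# Walk backwards: at step k add the explicit part (d+sk/dW), then push
# delta = dLt/dsk one step further back through the recurrence.
grad = np.zeros_like(W)
s_prev = [np.zeros(d)] + s[:-1]
for k in reversed(range(T)):
    dk = delta * s[k] * (1 - s[k])             # through the sigmoid at step k
    grad += np.outer(dk, s_prev[k])            # explicit part, s_{k-1} held fixed
    delta = W.T @ dk                           # dLt/ds_{k-1}

# Finite-difference check
num, eps = np.zeros_like(W), 1e-6
for i in range(d):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(run(Wp)[-1]) - loss(run(Wm)[-1])) / (2 * eps)

assert np.allclose(grad, num, atol=1e-6)
```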
33/41

Module 14.4: The Problem of Exploding and Vanishing Gradients
34/41

We will now focus on $\frac{\partial s_t}{\partial s_k}$ and highlight an important problem in training RNNs using BPTT

$$\frac{\partial s_t}{\partial s_k} = \frac{\partial s_t}{\partial s_{t-1}}\frac{\partial s_{t-1}}{\partial s_{t-2}}\cdots\frac{\partial s_{k+1}}{\partial s_k} = \prod_{j=k}^{t-1}\frac{\partial s_{j+1}}{\partial s_j}$$

Let us look at one such term in the product (i.e., $\frac{\partial s_{j+1}}{\partial s_j}$)
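This factorization can be verified numerically. A toy sketch (assumed: a sigmoid RNN with random $W$, $b$, and inputs), comparing the product of per-step Jacobians against a finite-difference Jacobian of $s_t$ with respect to $s_k$:

```python
# Check that dst/dsk equals the product of the per-step Jacobians ds_{j+1}/ds_j.
import numpy as np

rng = np.random.default_rng(3)
d = 3
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
xs = rng.normal(size=(4, d))                   # assumed inputs for steps k+1..t
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

def roll(sk):
    # Run s_j = sigma(W s_{j-1} + x_j + b) forward from s_k, collecting
    # each per-step Jacobian (diagonal of sigma' times W, element-wise sigmoid).
    s, jacs = sk, []
    for x in xs:
        s = sig(W @ s + x + b)
        jacs.append(np.diag(s * (1 - s)) @ W)  # ds_{j+1}/ds_j
    return s, jacs

sk = rng.uniform(size=d)
st, jacs = roll(sk)

product = np.eye(d)
for J in jacs:
    product = J @ product                      # later Jacobians multiply on the left

# Finite-difference Jacobian dst/dsk, one column per perturbed coordinate
num, eps = np.zeros((d, d)), 1e-6
for i in range(d):
    e = np.zeros(d); e[i] = eps
    num[:, i] = (roll(sk + e)[0] - roll(sk - e)[0]) / (2 * eps)

assert np.allclose(product, num, atol=1e-6)
```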
35/41

$$a_j = [a_{j1}, a_{j2}, \ldots, a_{jd}], \qquad s_j = [\sigma(a_{j1}), \sigma(a_{j2}), \ldots, \sigma(a_{jd})]$$

Since $\sigma$ is applied element-wise, the Jacobian of $s_j$ w.r.t. $a_j$ is diagonal:

$$\frac{\partial s_j}{\partial a_j} = \begin{bmatrix} \sigma'(a_{j1}) & 0 & \cdots & 0 \\ 0 & \sigma'(a_{j2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma'(a_{jd}) \end{bmatrix} = \mathrm{diag}(\sigma'(a_j))$$

We are interested in $\frac{\partial s_j}{\partial s_{j-1}}$

$$a_j = W s_{j-1} + b, \qquad s_j = \sigma(a_j)$$

$$\frac{\partial s_j}{\partial s_{j-1}} = \frac{\partial s_j}{\partial a_j}\frac{\partial a_j}{\partial s_{j-1}} = \mathrm{diag}(\sigma'(a_j))\,W$$

We are interested in the magnitude of $\frac{\partial s_j}{\partial s_{j-1}}$: if it is small (large), then $\frac{\partial s_t}{\partial s_k}$ and hence $\frac{\partial L_t}{\partial W}$ will vanish (explode)
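This can be made concrete with an idealized numerical sketch (assumptions: zero bias, $W$ a scaled orthogonal matrix, and inputs contrived so every pre-activation is exactly zero, where $\sigma'(0) = 1/4$). Each factor $\mathrm{diag}(\sigma'(a_j))\,W$ then has norm $\|W\|/4$: with $\|W\| = 2$ the 50-step product vanishes, while with $\|W\| = 8$ it explodes.

```python
# Idealized demo of vanishing/exploding products of diag(sigma'(a_j)) W.
# Assumptions: b = 0, W = scale * Q with Q orthogonal, and inputs chosen
# so that a_j = 0 at every step (hence sigma'(a_j) = 1/4 exactly).
import numpy as np

rng = np.random.default_rng(2)
d, steps = 10, 50
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

def product_norm(scale):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal Q
    W = scale * Q                                  # ||W|| (spectral) = scale
    s, P = np.full(d, 0.5), np.eye(d)
    for _ in range(steps):
        x = -W @ s                                 # contrived input keeping a_j = 0
        s = sig(W @ s + x)                         # state stays at 0.5
        P = np.diag(s * (1 - s)) @ W @ P           # one more Jacobian factor
    return np.linalg.norm(P)

vanish, explode = product_norm(2.0), product_norm(8.0)
assert vanish < 1e-12     # each factor has norm 1/2: gradient vanishes
assert explode > 1e12     # each factor has norm 2: gradient explodes
```

In realistic settings the pre-activations do not stay at zero (saturation shrinks $\sigma'$ further), but the same multiplicative effect governs whether $\frac{\partial s_t}{\partial s_k}$ decays or blows up with $t - k$.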
![Page 178: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/178.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 179: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/179.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 180: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/180.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 181: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/181.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 182: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/182.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 183: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/183.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 184: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/184.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 185: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/185.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 186: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/186.jpg)
35/41
aj = [aj1, aj2, aj3, . . . ajd, ]
sj = [σ(aj1), σ(aj2), . . . σ(ajd)]
∂sj∂aj
=
∂sj1∂aj1
∂sj2∂aj1
∂sj3∂aj1
. . .
∂sj1∂aj2
∂sj2∂aj2
. . .
......
...∂sjd∂ajd
=
σ′(aj1) 0 0 0
0 σ′(aj2) 0 0
0 0. . .
0 0 . . . σ′(ajd)
= diag(σ
′(aj))
We are interested in∂sj∂sj−1
aj = Wsj + b
sj = σ(aj)
∂sj∂sj−1
=∂sj∂aj
∂aj∂sj−1
= diag(σ′(aj))W
We are interested in the magnitudeof
∂sj∂sj−1
← if it is small (large) ∂st∂sk
and hence ∂Lt∂W will vanish (explode)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 187: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/187.jpg)
36/41

$$\left\lVert \frac{\partial s_j}{\partial s_{j-1}} \right\rVert = \left\lVert \operatorname{diag}(\sigma'(a_j))\,W \right\rVert \le \left\lVert \operatorname{diag}(\sigma'(a_j)) \right\rVert \, \lVert W \rVert$$

Since $\sigma$ is a bounded function (sigmoid, tanh), its derivative $\sigma'(a_j)$ is bounded:

$$\sigma'(a_j) \le \tfrac{1}{4} = \gamma \quad [\text{if } \sigma \text{ is the logistic function}]$$
$$\sigma'(a_j) \le 1 = \gamma \quad [\text{if } \sigma \text{ is } \tanh]$$

Hence, with $\lambda$ a bound on $\lVert W \rVert$ (e.g. its largest singular value),

$$\left\lVert \frac{\partial s_j}{\partial s_{j-1}} \right\rVert \le \gamma \lVert W \rVert \le \gamma\lambda$$

$$\left\lVert \frac{\partial s_t}{\partial s_k} \right\rVert = \left\lVert \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right\rVert \le \prod_{j=k+1}^{t} \gamma\lambda = (\gamma\lambda)^{t-k}$$

If $\gamma\lambda < 1$ the gradient will vanish.

If $\gamma\lambda > 1$ the gradient could explode.

This is known as the problem of vanishing/exploding gradients.
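The geometric decay predicted by the bound $(\gamma\lambda)^{t-k}$ is easy to observe in a toy simulation. A sketch assuming a logistic $\sigma$, $b = 0$, and a randomly chosen $W$ rescaled so that $\gamma\lambda = 0.5$ (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))       # logistic sigmoid
dsigma = lambda a: sigma(a) * (1.0 - sigma(a))   # bounded by gamma = 1/4

# Scale W so its largest singular value is lambda = 2,
# giving gamma * lambda = 0.5 < 1 (the vanishing regime)
W = rng.normal(size=(d, d))
W *= 2.0 / np.linalg.norm(W, 2)
gamma = 0.25

# Run the recurrence s_j = sigma(W s_{j-1}) and accumulate
# the product of Jacobians ds_t/ds_k = prod_j diag(sigma'(a_j)) W
s = rng.normal(size=d)
P = np.eye(d)
norms = []
for j in range(30):
    a = W @ s                       # b = 0 for simplicity (an assumption)
    P = np.diag(dsigma(a)) @ W @ P  # multiply in the next Jacobian factor
    s = sigma(a)
    norms.append(np.linalg.norm(P, 2))

# Each norm obeys the bound (gamma * lambda)^(t-k) = 0.5^(t-k),
# so the product, and with it the gradient, shrinks geometrically
print(norms[0], norms[-1])
```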
37/41

[Figure: an RNN unrolled over time steps with inputs $x_1, x_2, x_3, x_4, \ldots, x_n$, outputs $y_1, \ldots, y_n$, the parameters $u, w, v$ shared across all time steps, and the loss $L_t$ computed at time step $t$.]

One simple way of avoiding this is to use truncated backpropagation, where we restrict the product to $\tau\,(< t-k)$ terms.
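Truncated BPTT amounts to cutting the backward loop after $\tau$ steps. A hypothetical sketch (the toy squared-error loss at the last step, the sizes `d`, `T`, `tau`, and the input-weight matrix `U` are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, tau = 5, 50, 10                    # hidden size, sequence length, truncation window
W = rng.normal(size=(d, d)) * 0.5        # recurrent weights (assumed)
U = rng.normal(size=(d, d)) * 0.5        # input weights; inputs also of size d here
sigma = np.tanh
dsigma = lambda a: 1.0 - np.tanh(a) ** 2

# Forward pass, storing pre-activations for the backward pass
x = rng.normal(size=(T, d))
s = np.zeros((T + 1, d))                 # s[0] is the initial state
A = np.zeros((T, d))                     # A[j] produced s[j+1]
for j in range(T):
    A[j] = W @ s[j] + U @ x[j]
    s[j + 1] = sigma(A[j])

# Backward pass for a toy loss at the last step, L_t = 0.5 * ||s_t||^2,
# truncated: the sum over k runs over only the last tau time steps
t = T
dW = np.zeros_like(W)
delta = s[t].copy()                      # dL/ds_t for this toy loss
for k in range(t, max(t - tau, 0), -1):  # only tau terms instead of t
    delta_a = delta * dsigma(A[k - 1])   # push through sigma at step k
    dW += np.outer(delta_a, s[k - 1])    # explicit (local) derivative d+ s_k / dW
    delta = W.T @ delta_a                # propagate via ds_k/ds_{k-1} = diag(.) W

print(dW.shape)
```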
38/41
Module 14.5: Some Gory Details
39/41

$$\underbrace{\frac{\partial L_t(\theta)}{\partial W}}_{\in\, \mathbb{R}^{d\times d}} = \underbrace{\frac{\partial L_t(\theta)}{\partial s_t}}_{\in\, \mathbb{R}^{1\times d}} \sum_{k=1}^{t} \underbrace{\frac{\partial s_t}{\partial s_k}}_{\in\, \mathbb{R}^{d\times d}} \underbrace{\frac{\partial^{+} s_k}{\partial W}}_{\in\, \mathbb{R}^{d\times d\times d}}$$

We know how to compute $\frac{\partial L_t(\theta)}{\partial s_t}$ (the derivative of the scalar $L_t(\theta)$ w.r.t. the last hidden layer, a vector) using backpropagation.

We just saw a formula for $\frac{\partial s_t}{\partial s_k}$ (the derivative of a vector w.r.t. a vector).

$\frac{\partial^{+} s_k}{\partial W}$ is a tensor $\in \mathbb{R}^{d\times d\times d}$: the derivative of a vector $\in \mathbb{R}^{d}$ w.r.t. a matrix $\in \mathbb{R}^{d\times d}$.

How do we compute $\frac{\partial^{+} s_k}{\partial W}$? Let us see.
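The shapes in this sum can be sanity-checked with a small `einsum` sketch (the value of `d`, the random placeholder tensors, and the names `dL_dst`, `dst_dsk`, `dsk_dW` are all illustrative, not the lecture's notation):

```python
import numpy as np

d, t = 3, 4
rng = np.random.default_rng(3)

dL_dst = rng.normal(size=(1, d))                           # dL_t/ds_t   : 1 x d
dst_dsk = [rng.normal(size=(d, d)) for _ in range(t)]      # ds_t/ds_k   : d x d
dsk_dW = [rng.normal(size=(d, d, d)) for _ in range(t)]    # d+ s_k / dW : d x d x d

# (1 x d) . (d x d) . (d x d x d), summed over k, gives a d x d gradient
dL_dW = sum(
    np.einsum('a,ab,bqr->qr', dL_dst[0], dst_dsk[k], dsk_dW[k])
    for k in range(t)
)
print(dL_dW.shape)   # (3, 3), matching dL_t/dW in R^{d x d}
```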
40/41

We just look at one element of this $\frac{\partial^{+} s_k}{\partial W}$ tensor: $\frac{\partial^{+} s_{kp}}{\partial W_{qr}}$ is the $(p, q, r)$-th element of the 3d tensor.

$$a_k = W s_{k-1} + b \qquad s_k = \sigma(a_k)$$
41/41

$$a_k = W s_{k-1}$$

$$\begin{bmatrix} a_{k1} \\ a_{k2} \\ \vdots \\ a_{kp} \\ \vdots \\ a_{kd} \end{bmatrix} =
\begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1d} \\
\vdots & \vdots & & \vdots \\
W_{p1} & W_{p2} & \cdots & W_{pd} \\
\vdots & \vdots & & \vdots
\end{bmatrix}
\begin{bmatrix} s_{k-1,1} \\ s_{k-1,2} \\ \vdots \\ s_{k-1,p} \\ \vdots \\ s_{k-1,d} \end{bmatrix}$$

$$a_{kp} = \sum_{i=1}^{d} W_{pi}\, s_{k-1,i} \qquad s_{kp} = \sigma(a_{kp})$$

$$\frac{\partial s_{kp}}{\partial W_{qr}} = \frac{\partial s_{kp}}{\partial a_{kp}}\,\frac{\partial a_{kp}}{\partial W_{qr}} = \sigma'(a_{kp})\,\frac{\partial a_{kp}}{\partial W_{qr}}$$

$$\frac{\partial a_{kp}}{\partial W_{qr}} = \frac{\partial \sum_{i=1}^{d} W_{pi}\, s_{k-1,i}}{\partial W_{qr}} =
\begin{cases} s_{k-1,r} & \text{if } p = q \text{ (only the } i = r \text{ term survives)} \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial s_{kp}}{\partial W_{qr}} =
\begin{cases} \sigma'(a_{kp})\, s_{k-1,r} & \text{if } p = q \\ 0 & \text{otherwise} \end{cases}$$
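This elementwise formula can be verified against finite differences. A minimal sketch assuming a logistic $\sigma$ and a small random `W` (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
W = rng.normal(size=(d, d))
s_prev = rng.normal(size=d)                      # s_{k-1}
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))       # logistic sigmoid
dsigma = lambda a: sigma(a) * (1.0 - sigma(a))
a = W @ s_prev                                   # a_k = W s_{k-1} (no bias, as on the slide)

# Analytical formula: ds_kp/dW_qr = sigma'(a_kp) * s_{k-1,r} if p == q, else 0
G = np.zeros((d, d, d))
for p in range(d):
    G[p, p, :] = dsigma(a[p]) * s_prev

# Finite-difference check of every (p, q, r) entry of the tensor
eps = 1e-6
for q in range(d):
    for r in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[q, r] += eps
        Wm[q, r] -= eps
        fd = (sigma(Wp @ s_prev) - sigma(Wm @ s_prev)) / (2 * eps)  # ds_k/dW_qr
        assert np.allclose(G[:, q, r], fd, atol=1e-6)
```

Only the `p == q` slices of `G` are nonzero, which is exactly the sparsity the case analysis above predicts.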
41/41
ak = Wsk−1
ak1ak2
...akp
...akd
=
W11 W12 . . . W1d
......
......
Wp1 Wp2 . . . Wpd
......
......
sk−1,1sk−1,2
...sk−1,p
...sk−1,d
akp =
d∑i=1
Wpisk−1,i
skp = σ(akp)
∂skp∂Wqr
=∂skp∂akp
∂akp∂Wqr
= σ′(akp)∂akp∂Wqr
∂akp∂Wqr
=∂∑d
i=1Wpisk−1,i∂Wqr
= sk−1,i if p = q and i = r
= 0 otherwise
∂skp∂Wqr
= σ′(akp)sk−1,r if p = q and i = r
= 0 otherwise
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 223: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/223.jpg)
41/41
ak = Wsk−1
ak1ak2
...akp
...akd
=
W11 W12 . . . W1d
......
......
Wp1 Wp2 . . . Wpd
......
......
sk−1,1sk−1,2
...sk−1,p
...sk−1,d
akp =
d∑i=1
Wpisk−1,i
skp = σ(akp)
∂skp∂Wqr
=∂skp∂akp
∂akp∂Wqr
= σ′(akp)∂akp∂Wqr
∂akp∂Wqr
=∂∑d
i=1Wpisk−1,i∂Wqr
= sk−1,i if p = q and i = r
= 0 otherwise
∂skp∂Wqr
= σ′(akp)sk−1,r if p = q and i = r
= 0 otherwise
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
![Page 224: CS7015 (Deep Learning) : Lecture 14 · CS7015 (Deep Learning) : Lecture 14 Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and](https://reader033.vdocument.in/reader033/viewer/2022042909/5f3a2f1f4697801060706397/html5/thumbnails/224.jpg)
41/41
ak = Wsk−1
ak1ak2
...akp
...akd
=
W11 W12 . . . W1d
......
......
Wp1 Wp2 . . . Wpd
......
......
sk−1,1sk−1,2
...sk−1,p
...sk−1,d
akp =
d∑i=1
Wpisk−1,i
skp = σ(akp)
∂skp∂Wqr
=∂skp∂akp
∂akp∂Wqr
= σ′(akp)∂akp∂Wqr
∂akp∂Wqr
=∂∑d
i=1Wpisk−1,i∂Wqr
= sk−1,i if p = q and i = r
= 0 otherwise
∂skp∂Wqr
= σ′(akp)sk−1,r if p = q and i = r
= 0 otherwise
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14