Lecture 8: Recurrent Neural Networks
Shuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/VE445/index.html
Outline
• Motivation
• Basics of RNN
• Different RNNs
• Application examples
• Training
• LSTM
Motivation
• In a traditional NN
  • All inputs and outputs are assumed to be independent of each other
  • Input and output lengths are fixed
• But this might be bad for many tasks
  • Predict the next word in a sentence
    • "Context": you better know which words came before it
    • Hard/impossible to choose a fixed context window
  • Output YES if the number of 1s is even, else NO!
    • 1000010101 – YES, 100011 – NO, …
    • There can always be a new sample longer than anything seen
Basics
Recurrent neural networks (RNNs)
• Recurrent
  • Perform the same task for every element of a sequence, with the output being dependent on the previous computations
  • They have a "memory" which captures information about what has been calculated so far
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Feed-forward Network
RNN
RNN (cont.)
• 𝑥ₜ is the input at time 𝑡
• 𝑠ₜ is the hidden state at time 𝑡
  • It is the "memory" of the network
  • It is calculated from the previous hidden state and the input at the current step: 𝑠ₜ = 𝑓(𝑈𝑥ₜ + 𝑊𝑠ₜ₋₁), where 𝑓 is a nonlinear activation function such as tanh or ReLU (see the sketch below)
• 𝑜ₜ is the output at time 𝑡
  • E.g., if we want to predict the next word in a sentence, 𝑜ₜ is a vector of probabilities over the vocabulary
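A minimal NumPy sketch of this recurrence may help make the shapes concrete; the output matrix 𝑉 (which reappears in the BPTT slides), the layer sizes, and the softmax output are illustrative assumptions rather than the lecture's exact setup:

```python
import numpy as np

# Sketch of the recurrence s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t).
vocab_size, hidden_size = 4, 8
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input-to-hidden
W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden (shared across time)
V = rng.normal(0, 0.1, (vocab_size, hidden_size))   # hidden-to-output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs):
    """xs: list of one-hot input vectors, one per time step."""
    s = np.zeros(hidden_size)       # initial hidden state s_0
    states, outputs = [], []
    for x in xs:                    # the same U, W, V are reused at every step
        s = np.tanh(U @ x + W @ s)  # s_t = tanh(U x_t + W s_{t-1})
        o = softmax(V @ s)          # o_t: probabilities over the vocabulary
        states.append(s)
        outputs.append(o)
    return states, outputs
```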
RNN (cont.)
• 𝑠ₜ is the "memory" at time 𝑡
• 𝑜ₜ is based on the memory at time 𝑡
• RNNs share the weights 𝑈 and 𝑊 across time steps
  • Unlike a traditional NN
  • This greatly reduces the total number of parameters we need to learn
• The output at each time step might be unnecessary
  • E.g., when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word
• We may not need inputs at each time step
• Main feature: the hidden state, which captures some information about the sequence
Recall RNN
Computational graph of RNN
• Re-use the same weight matrix at every time-step
Different RNNs
Different RNNs: one to many
• E.g. Image captioning. Image -> sequence of words
Different RNNs: many to one
• E.g. Sentiment Classification. Sequence of words -> sentiment
Different RNNs: many to many
• E.g. Machine translation. Sequence of words -> sequence of words
Application Examples
Example 1: Character-level language model
• Given previous words/characters, predict the next
• Vocabulary: [h, e, l, o]
• Example of training sequence:
“Hello”
• At test time, sample characters one at a time, feed back to model
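To make this sampling loop concrete, here is a hedged NumPy sketch; it reuses the U, W, V naming from the earlier RNN sketch and the [h, e, l, o] vocabulary, but the decoding details are assumptions rather than the lecture's code:

```python
import numpy as np

chars = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(chars)}

def one_hot(i, n=4):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sample(seed_char, length, U, W, V):
    """Sample characters one at a time, feeding each sample back as the next input."""
    s = np.zeros(U.shape[0])
    x = one_hot(char_to_ix[seed_char])
    out = [seed_char]
    for _ in range(length):
        s = np.tanh(U @ x + W @ s)            # advance the hidden state
        logits = V @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over the vocabulary
        ix = np.random.choice(len(chars), p=p)  # sample the next character
        out.append(chars[ix])
        x = one_hot(ix)                       # feed the sample back into the model
    return ''.join(out)
```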
Example 2: Text generation
• Can we generate sonnets like a poet, using an RNN?
Example 2: Text generation results
Example 3: Image captioning
• Using an RNN to describe the content of an image
Remove the last two layers of the CNN
Example 3: Image captioning (cont.)
Preparing the initial input for the RNN
Example 3: Image captioning (cont.)
Add a new term to connect the CNN to the RNN (see the sketch below)
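As a rough sketch of what such a connecting term might look like: the CNN feature vector enters the first recurrence through an extra weight matrix. The name W_ih, the feature size, and the exact placement in the recurrence are assumptions for illustration, not taken from the slides:

```python
import numpy as np

feat_size, hidden_size, vocab_size = 512, 256, 10000
rng = np.random.default_rng(0)
U    = rng.normal(0, 0.01, (hidden_size, vocab_size))   # word input -> hidden
W    = rng.normal(0, 0.01, (hidden_size, hidden_size))  # hidden -> hidden
W_ih = rng.normal(0, 0.01, (hidden_size, feat_size))    # CNN feature -> hidden

def caption_step(x_word, s_prev, v_image):
    """One recurrence step with the image feature added as an extra input term."""
    return np.tanh(U @ x_word + W @ s_prev + W_ih @ v_image)
```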
Example 3: More results
Example 3: Failure cases
Training
• Still use backpropagation
• But now the weights are shared by all time steps
• Backpropagation Through Time (BPTT)
  • E.g., in order to calculate the gradient at 𝑡 = 4, we would need to backpropagate 3 steps and sum up the gradients
• Loss, by cross entropy: 𝐸ₜ = −𝑦ₜ log ŷₜ, summed over time as 𝐸 = Σₜ 𝐸ₜ (a small sketch follows)
  • 𝑦ₜ is the correct word at time 𝑡
  • ŷₜ is the prediction
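A small sketch of this loss in NumPy, under the assumption that targets are given as word indices and predictions as per-step probability vectors:

```python
import numpy as np

# Sequence cross-entropy: E = -sum_t log(probability assigned to the correct word at step t).
def sequence_cross_entropy(probs, targets):
    """probs: (T, vocab) predicted distributions y_hat_t; targets: (T,) correct word indices."""
    T = len(targets)
    picked = probs[np.arange(T), targets]   # predicted probability of the correct word
    return -np.sum(np.log(picked))
```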
Backpropagation Through Time (BPTT)
• Recall that our goal is to calculate the gradients of the error with respect to the parameters 𝑈, 𝑉, 𝑊
• The gradients also sum over time: ∂𝐸/∂𝑊 = Σₜ ∂𝐸ₜ/∂𝑊
• E.g. ∂𝐸₃/∂𝑊 = (∂𝐸₃/∂ŷ₃)(∂ŷ₃/∂𝑠₃)(∂𝑠₃/∂𝑊) + (∂𝐸₃/∂ŷ₃)(∂ŷ₃/∂𝑠₃)(∂𝑠₃/∂𝑠₂)(∂𝑠₂/∂𝑊) + ⋯
• 𝑠₃ = tanh(𝑈𝑥₃ + 𝑊𝑠₂)
• So to get the gradients at time 𝑡 = 3, we need to backpropagate gradients all the way back to 𝑡 = 0 (see the sketch below)
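A minimal NumPy sketch of this backward pass for 𝑊, assuming the gradient of the loss with respect to the last hidden state is given; the shapes and the tanh recurrence match the earlier sketch, everything else is illustrative:

```python
import numpy as np

def bptt_dW(xs, s0, U, W, dE_ds_t):
    """xs: inputs x_1..x_t; s0: initial state; dE_ds_t: gradient of E_t w.r.t. s_t."""
    # forward pass, storing states s_0..s_t
    states = [s0]
    for x in xs:
        states.append(np.tanh(U @ x + W @ states[-1]))
    dW = np.zeros_like(W)
    ds = dE_ds_t                              # gradient flowing into s_k, starting at k = t
    for k in range(len(xs), 0, -1):           # walk back from t to 1
        dpre = ds * (1.0 - states[k] ** 2)    # multiply by tanh' at step k
        dW += np.outer(dpre, states[k - 1])   # contribution of ds_k/dW at this step
        ds = W.T @ dpre                       # propagate through ds_k/ds_{k-1} to step k-1
    return dW                                 # sum of contributions over all time steps
```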
Summary of BPTT
• It is just backpropagation on the unrolled RNN by summing over all gradients for 𝑊 and 𝑈 at each time step
• The vanishing gradient problem
• ∂𝐸₃/∂𝑊 = (∂𝐸₃/∂ŷ₃)(∂ŷ₃/∂𝑠₃)(∂𝑠₃/∂𝑊) + (∂𝐸₃/∂ŷ₃)(∂ŷ₃/∂𝑠₃)(∂𝑠₃/∂𝑠₂)(∂𝑠₂/∂𝑊) + ⋯
• The Jacobian matrix ∂𝑠ⱼ/∂𝑠ⱼ₋₁ (derivative of a vector w.r.t. a vector) has 2-norm less than 1
  • As an intuition, the derivative of tanh is bounded by 1
• Thus, with these small values in repeated matrix multiplications, the gradient values shrink exponentially fast, eventually vanishing completely after a few time steps
• Gradient contributions from "far away" steps become zero
  • You end up not learning long-range dependencies
  • This is not exclusive to RNNs; it is just that RNNs tend to be very deep (as deep as the sentence length), as the toy demo below illustrates
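A toy numerical illustration of this shrinkage; the weight scale and sizes are arbitrary assumptions chosen for the demo:

```python
import numpy as np

# The 2-norm of the product of Jacobians tanh'(.)*W shrinks exponentially with the
# number of time steps when the weight matrix is small.
rng = np.random.default_rng(0)
n = 10
W = rng.normal(0, 0.3 / np.sqrt(n), (n, n))   # small weights (arbitrary scale)
s = rng.normal(0, 1, n)

grad = np.eye(n)
for step in range(1, 31):
    s = np.tanh(W @ s)
    J = np.diag(1.0 - s ** 2) @ W   # Jacobian ds_j/ds_{j-1}
    grad = J @ grad                 # product accumulated over time steps
    if step % 10 == 0:
        print(step, np.linalg.norm(grad, 2))   # 2-norm keeps shrinking toward 0
```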
Demo of RNN vs LSTM: Vanishing gradients
• RNN vs LSTM gradients on the input weight matrix https://imgur.com/gallery/vaNahKE
• The error is generated at the 128th step and propagated back; no error comes from the other steps
• Initial weights are sampled from a normal distribution in (−0.1, 0.1)
The problem of long-term dependencies
• One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame
• RNNs usually work well for recent information
  • E.g. "The clouds are in the sky."
• But they might work worse when the relevant context is further back
  • E.g. "I grew up in France… I speak fluent French."
Long Short-Term Memory
Long short-term memory (LSTM)
• Hochreiter & Schmidhuber [1997]
• A variant of RNN, capable of learning long-term dependencies
• Can remember information for long periods of time
Recall the architecture of the RNN cell; we will make many changes to it
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM at a glance
LSTM
• 𝐶ₜ is the cell state
  • The horizontal line running through the top of the diagram
  • Like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions
• The LSTM uses gates to carefully remove or add information to the cell state
LSTM (cont.)
• The three gates are governed by sigmoid units
• Each describes how much of each component should be let through
• "Let nothing through" (value 0) / "let everything through" (value 1)
LSTM (cont.)
• The first gate is the forget gate layer
  • Decides what information we're going to throw away from the cell state
  • Outputs a number in [0, 1] for each number in the cell state
  • 1/0: completely keep/remove this
LSTM (cont.)
• Next is to decide what new information we're going to store in the cell state
  • The sigmoid layer is the input gate layer, deciding which values we'll update
  • The tanh layer creates a vector of new candidate values C̃ₜ that could be added to the state
LSTM (cont.)
• Multiply the old state by 𝑓ₜ
  • Forgetting the things we decided to drop from the previous state
• Add the new candidate values, scaled by the input gate
• This updates the cell state: 𝐶ₜ = 𝑓ₜ ∗ 𝐶ₜ₋₁ + 𝑖ₜ ∗ C̃ₜ
LSTM (cont.)
• Last, decide the output ℎₜ
  • A filtered (tanh) version of the cell state
  • The sigmoid layer decides what parts of the cell state we're going to output
  • ℎₜ₋₁, 𝑥ₜ act like the context (see the sketch below)
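Putting the four steps together, a minimal sketch of one LSTM step in NumPy, following the colah.github.io formulation cited earlier; the weight names and the concatenated [hₜ₋₁, xₜ] form are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x_t])   # context [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)            # forget gate: what to throw away from C_{t-1}
    i = sigmoid(Wi @ z + bi)            # input gate: which values to update
    C_tilde = np.tanh(Wc @ z + bc)      # new candidate values
    C = f * C_prev + i * C_tilde        # update the cell state
    o = sigmoid(Wo @ z + bo)            # output gate: what parts of the state to expose
    h = o * np.tanh(C)                  # filtered (tanh) version of the cell state
    return h, C
```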
Gated recurrent unit (GRU)
• The GRU is a variant of the LSTM
  • It combines the forget and input gates into a single "update gate"
  • It also merges the cell state and hidden state (see the sketch below)
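For comparison, a minimal sketch of one GRU step under the standard formulation; the weight names and shapes are assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    u = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ u + bz)   # update gate (merged forget/input gate)
    r = sigmoid(Wr @ u + br)   # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde   # single merged hidden/cell state
```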
How LSTM avoids gradient vanishing
• RNN
  • 𝑠ⱼ = tanh(𝑈𝑥ⱼ + 𝑊𝑠ⱼ₋₁)
  • ∂𝑠ⱼ/∂𝑠ⱼ₋₁ = tanh′(∙) ∙ 𝑊
  • Vanishing, since ∏_{𝑗=𝑘+1}^{𝑘′} ∂𝑠ⱼ/∂𝑠ⱼ₋₁ = (tanh′(∙) ∙ 𝑊)^(𝑘′−𝑘) becomes tiny as 𝑘′ − 𝑘 grows large
• LSTM
  • ∂𝐶ₜ/∂𝐶ₜ₋₁ = 𝑓ₜ = 𝜎(∙), whose value can be trained to stay around 1 (a large part of 𝜎 is near 1)
  • The sum operation in the cell-state update also helps ease the problem
  • The gradient will still vanish, but much more slowly than in a vanilla RNN
LSTM hyperparameter tuning
• Watch out for overfitting
• Regularization helps: regularization methods include L1, L2, and dropout, among others
• The larger the network, the more powerful it is, but also the easier it is to overfit
  • Don't try to learn a million parameters from 10,000 examples: parameters > examples = trouble
• More data is usually better
• Train over multiple epochs (complete passes through the dataset)
• The learning rate is the single most important hyperparameter (a minimal training-setup sketch follows)
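A minimal training-setup sketch, assuming PyTorch, that exposes the knobs mentioned above (dropout, L2 via weight decay, an explicit learning rate, multiple epochs); all sizes and values are illustrative assumptions, not recommendations from the lecture:

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(0.5)            # dropout regularization
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len) word indices
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return self.fc(self.drop(h_n[-1]))     # many-to-one: use only the final hidden state

model = SentimentLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # L2 via weight decay
criterion = nn.CrossEntropyLoss()
# train for multiple epochs (complete passes through the dataset), monitoring validation loss
```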