Backpropagation and Convolutional Neural Networks
Natalie Parde, Ph.D.
Department of Computer Science
University of Illinois at Chicago
CS 521: Statistical Natural Language Processing
Spring 2020
Many slides adapted from Jurafsky and Martin (https://web.stanford.edu/~jurafsky/slp3/).

How do we train neural networks?
✓ Loss function: cross-entropy loss
✓ Optimization algorithm: gradient descent
☐ Some way to compute the gradient across all of the network's intermediate layers: ???

Backpropagation
• A method for propagating loss values all the way back to the beginning of a deep neural network, even though the loss is only computed at the end of the network

Recall…
• For a "neural network" with just one weight layer containing a single computation unit + sigmoid activation (i.e., a logistic regression classifier), we can compute the gradient of our loss function by just taking its derivative:
• ∂L_CE(w, b)/∂w_j = (ŷ − y)x_j = (σ(w · x + b) − y)x_j
• Here, (ŷ − y) is the difference between the true and estimated y, and x_j is the corresponding input observation

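As a sanity check, here is a minimal NumPy sketch of this gradient computation; the example values for x, w, b, and y are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up example values, purely for illustration.
x = np.array([0.5, -1.2, 3.0])   # one input observation
w = np.array([0.1, 0.4, -0.3])   # current weights
b = 0.2                          # bias
y = 1.0                          # true label

y_hat = sigmoid(np.dot(w, x) + b)  # estimated y

# Gradient of the cross-entropy loss for each weight w_j:
# dL/dw_j = (y_hat - y) * x_j
grad_w = (y_hat - y) * x
grad_b = y_hat - y
print(grad_w, grad_b)
```
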
However, we can't do that with a neural network that has multiple weight layers ("hidden" layers).
• Why? Simply taking the derivative like we did for logistic regression only provides the gradient for the most recent (i.e., last) weight layer
• What we need is a way to:
  • Compute the derivative with respect to weight parameters occurring earlier in the network as well
  • Even though we can only compute loss at a single point (the end of the network)

We do this using backward differentiation.
• Usually referred to as backpropagation ("backprop" for short) in the context of neural networks
• Frames neural networks as computation graphs

What are computation graphs?
• Representations of interconnected mathematical operations
• Nodes = operations
• Directed edges = connections between the outputs and inputs of nodes

There are two different ways that we can pass information through our neural network computation graphs.
• Forward pass
  • Apply operations in the direction of the arrows
  • Pass the output of one computation as the input to the next
• Backward pass
  • Compute partial derivatives in the opposite direction of the arrows
  • Multiply them by the partial derivatives passed down from the previous step

Example: Forward Pass
Goal: Represent L(a, b, c) = c(a + 2b)
(Graph: input nodes a, b, and c feed intermediate nodes d = 2*b and e = d + a, which feed the output node L = c*e.)
• With inputs a = 3, b = 1, c = −2, the forward pass computes, in order:
  • d = 2*b = 2
  • e = d + a = 5
  • L = c*e = −10

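Here is a minimal Python sketch of this forward pass; the function and variable names simply mirror the graph nodes above.

```python
def forward(a, b, c):
    """Forward pass through the computation graph for L(a, b, c) = c(a + 2b)."""
    d = 2 * b    # first intermediate node
    e = d + a    # second intermediate node
    L = c * e    # output node
    return d, e, L

print(forward(a=3, b=1, c=-2))  # (2, 5, -10)
```
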
Example: Backward Pass
(Same graph, with the forward-pass values: a = 3, b = 1, c = −2; d = 2*b = 2; e = d + a = 5; L = c*e = −10.)
Goal: Compute the derivative of L with respect to a, b, and c

How do we get from L all the way back to a, b, and c?
• Chain rule! Given a function f(x) = u(v(x)):
  • Find the derivative of u(x) with respect to v(x)
  • Find the derivative of v(x) with respect to x
  • Multiply the two together: df/dx = du/dv ∗ dv/dx

Example: Backward Pass
Working backward through the graph:
• L = c * e, so ∂L/∂c = e
• L = c * e = c * (d + a), so ∂L/∂a = (∂L/∂e)(∂e/∂a) = c ∗ 1 = c
• L = c * e = c * ((2*b) + a), so ∂L/∂b = (∂L/∂e)(∂e/∂d)(∂d/∂b) = c ∗ 1 ∗ 2 = 2c
Substituting the forward-pass values (c = −2, e = 5):
• ∂L/∂c = e = 5
• ∂L/∂a = c = −2
• ∂L/∂b = 2c = 2 ∗ −2 = −4

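Continuing the sketch from the forward pass, the backward pass walks the graph in reverse, multiplying each node's local derivative by the derivative passed down from the node above it:

```python
# Forward pass (values from the example above).
a, b, c = 3, 1, -2
d = 2 * b     # d = 2
e = d + a     # e = 5
L = c * e     # L = -10

# Backward pass, applying the chain rule node by node.
dL_dc = e            # L = c*e  ->  dL/dc = e
dL_de = c            # L = c*e  ->  dL/de = c
dL_da = dL_de * 1    # e = d+a  ->  de/da = 1, so dL/da = c
dL_dd = dL_de * 1    # e = d+a  ->  de/dd = 1
dL_db = dL_dd * 2    # d = 2*b  ->  dd/db = 2, so dL/db = 2c

print(dL_da, dL_db, dL_dc)  # -2 -4 5
```
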
Computation graphs for neural networks are a bit more complex than the previous example.
• More operations:
  • Products (input ∗ weight)
  • Summations (of weighted inputs)
  • Activation functions

What would a computation graph look like for a simple neural network?
(Figure: input values flow through two weight layers; in each layer, every input is multiplied by a weight w (∗ nodes), the weighted products are summed (Σ nodes), and each sum passes through a ReLU activation; the outputs of the final layer produce the network output and the loss L.)
All of these weights need to be updated using backpropagation!

What are the derivatives of some other common activation functions?
• tanh: d tanh(z)/dz = 1 − tanh²(z)
• ReLU: d ReLU(z)/dz = 0 for z < 0, 1 for z ≥ 0

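A short NumPy sketch of these two derivatives (the function names are just illustrative):

```python
import numpy as np

def dtanh(z):
    # d tanh(z)/dz = 1 - tanh^2(z)
    return 1 - np.tanh(z) ** 2

def drelu(z):
    # d ReLU(z)/dz = 0 for z < 0, 1 for z >= 0
    return np.where(z >= 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 2.0])
print(dtanh(z))  # [~0.07, 1.0, ~0.07]
print(drelu(z))  # [0. 1. 1.]
```
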
General Tips for Improving Neural Network Performance
• Normalize input values to have a mean of 0
• Initialize weights with small random numbers
• Randomly drop some units and their connections from the network during training (dropout)
• Tune hyperparameters:
  • Learning rate
  • Number of layers
  • Number of units per layer
  • Type of activation function
  • Type of optimization function

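As a concrete illustration, here is a minimal PyTorch sketch showing where several of these choices appear in practice; all of the specific sizes and values are arbitrary placeholders, not recommendations:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),  # number of units per layer (hyperparameter)
    nn.ReLU(),           # type of activation function (hyperparameter)
    nn.Dropout(p=0.5),   # randomly drop units during training (dropout)
    nn.Linear(64, 2),
)

# Initialize weights with small random numbers.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)
        nn.init.zeros_(layer.bias)

# Learning rate and choice of optimizer are also hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
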
Fortunately, you shouldn't need to build your computation graphs from scratch!
• TensorFlow: https://www.tensorflow.org/
• Keras: https://keras.io/
• PyTorch: https://pytorch.org/
• DL4J: https://deeplearning4j.org/

Neural Language Models
• Popular application of neural networks
• Advantages over n-gram language models:
  • Can handle longer histories
  • Can generalize over contexts of similar words
• Disadvantage: slower to train
• Neural language models have higher predictive accuracy than n-gram language models trained on datasets of similar sizes

Neural Language Models
• Neural language models are used to boost performance for many modern NLP tasks:
  • Machine translation
  • Dialogue systems
  • Language generation

Sample Generated by a Neural Language Model (GPT-2)
• Link to article: https://openai.com/blog/better-language-models/
System Prompt (Human-Written): In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Sample Generated by a Neural Language Model (GPT-2)
Model Completion (Machine-Written, 10 Tries): The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. "By the time we reached the top of one peak, the water looked blue, with some crystals on top," said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, "We can see, for example, that they have a common 'language,' something like a dialect or dialectic."

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, "In South America, such incidents seem to be quite common."

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. "But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization," said the scientist.

Feedforward Neural Language Model
• Input: Representation of some number of previous words (w_{t−1}, w_{t−2}, etc.)
• Output: Probability distribution over possible next words
• Goal: Approximate the probability of a word given the entire prior context, P(w_t | w_1^{t−1}), based on the n previous words:
  • P(w_t | w_1^{t−1}) ≈ P(w_t | w_{t−n+1}^{t−1})

Neural language models represent prior context using embeddings of the previous words.
• Allows them to generalize to unseen data better than n-gram models
• Embeddings can come from various sources
  • E.g., pretrained Word2Vec embeddings

Neural Language Model
Example: "Natalie (w_{t−4}) sat (w_{t−3}) down (w_{t−2}) to (w_{t−1}) write (w_t) the (w_{t+1}) exam (w_{t+2})"
Goal: P(w_t = "write" | w_{t−1} = "to", w_{t−2} = "down", w_{t−3} = "sat")
(Figure: the embeddings of the three previous words feed into hidden layers h1 and h2, which feed an output layer y_1, …, y_{|V|}; a softmax over that layer yields a distribution over all words in the vocabulary, in which "write" should receive high probability.)

What if we don't already have dense word embeddings?
• When we use another algorithm to learn the embeddings for our input words, this is called pretraining
• However, sometimes it's preferable to learn embeddings while training the network, rather than using pretrained embeddings
  • E.g., if the desired application places strong constraints on what makes a good representation

Learning New Embeddings
• Start with a one-hot vector for each word in the vocabulary
  • Element for a given word is set to 1
  • All other elements are set to 0
• Randomly initialize the hidden (weight/embedding) layer
  • Maintain a separate vector of weights for that layer, for each vocabulary word

Formal Definition: Learning New Embeddings
• Letting E be an embedding matrix of dimensionality d, with one row for each word in the vocabulary:
  • e = (E_{x_1}, E_{x_2}, …, E_{x_n}), the concatenated embeddings of the n context words
  • h = σ(We + b)
  • z = Uh
  • ŷ = softmax(z)
• Optimizing this network using the same techniques discussed for other neural networks will result in both:
  • A model that predicts words
  • A new set of word embeddings that can be used for other tasks

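Here is a minimal PyTorch sketch of this definition with n = 3 context words; all dimensions and names are illustrative assumptions, not values from the slides:

```python
import torch
import torch.nn as nn

vocab_size, d, hidden = 10000, 50, 128
n = 3  # number of previous words used as context

E = nn.Embedding(vocab_size, d)    # embedding matrix E, randomly initialized
W = nn.Linear(n * d, hidden)       # h = sigmoid(We + b); the bias b lives inside nn.Linear
U = nn.Linear(hidden, vocab_size)  # z = Uh

def next_word_distribution(context_ids):
    e = E(context_ids).view(-1)     # concatenated embeddings of the context words
    h = torch.sigmoid(W(e))         # hidden layer
    z = U(h)                        # output scores
    return torch.softmax(z, dim=0)  # y_hat: distribution over the vocabulary

y_hat = next_word_distribution(torch.tensor([11, 42, 7]))  # arbitrary word indices
print(y_hat.sum())  # ~1.0
```

Training this network with cross-entropy loss updates E along with W and U, which is exactly what yields the new set of word embeddings.
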
Neural Language Model
(Figure: the same network, now with each context word ("sat," "down," "to") shown as a one-hot vector over the vocabulary; multiplying a one-hot vector by the embedding layer selects that word's row of the embedding matrix, which then feeds h1, h2, and the softmax output layer.)

Convolutional Neural Networks
• Neural networks that incorporate one or more convolutional layers
• Designed to reflect the inner workings of the visual cortex
• CNNs require that fewer parameters be learned relative to standard feedforward networks for equivalent input data

What are convolutional layers?
• Sliding windows that perform matrix operations on subsets of the input
• Compute products between those subsets of input and a corresponding weight matrix

Convolutional Layers
• First layer(s) extract low-level features:
  • Color, gradient orientation (images)
  • N-grams (text)
• Higher layer(s) extract high-level features:
  • Objects (images)
  • Phrases (text)

In NLP, convolutions are typically performed on entire rows of an input matrix, where each row corresponds to a word.
(Figure: the sentence "I love waking up early for CS 521," one word per row; a window spanning full rows slides down the matrix. With stride size = 1, the window advances one word at a time; with stride size = 2, it advances two words at a time.)

After applying a convolution with specific region (kernel) and stride sizes to an input matrix, we end up with a feature map.
(Figure: with kernel size = 2x5 and stride size = 2, a window covering two word rows at a time slides down the input matrix for "I love waking up early for CS 521"; each placement of the window contributes one value to the feature map.)

It's common to extract multiple different feature maps from the same input.
(Figure: two different kernels, each with kernel size = 2x5 and stride size = 2, are applied to the same input matrix, producing two different feature maps.)

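A NumPy sketch of this kind of convolution; the 8x5 input stands in for 5-dimensional embeddings of the eight words, and the two random kernels stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))           # 8 words x 5 embedding dimensions
kernels = rng.normal(size=(2, 2, 5))  # 2 kernels, each spanning 2 words x 5 dimensions
stride = 2

feature_maps = []
for K in kernels:
    # Slide the 2-word window down the input, 2 words at a time.
    fm = [np.sum(X[i:i + 2] * K) for i in range(0, X.shape[0] - 1, stride)]
    feature_maps.append(fm)

print(np.array(feature_maps).shape)  # (2, 4): two feature maps, four values each
```
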
After extracting feature maps from the input, convolutional neural networks ("CNNs" or "convnets") utilize pooling layers.
• Pooling layers: layers that reduce the dimensionality of input feature maps by pooling all of the values in a given region
• Why use pooling layers?
  • Further increase efficiency
  • Improve the model's ability to be invariant to small changes

Pooling Layers
(Figure: the two feature maps extracted above are each passed through a pooling layer, which condenses each map's values.)

Common Techniques for Pooling
• Max pooling: take the maximum of all values computed in a given window
  • Example: the window (1, 4, 2, 3) pools to 4
• Average pooling: take the average of all values computed in a given window
  • Example: the window (1, 4, 2, 3) pools to 2.5

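In code, using the window values from the example above:

```python
import numpy as np

window = np.array([1, 4, 2, 3])
print(window.max())   # max pooling     -> 4
print(window.mean())  # average pooling -> 2.5
```
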
The output from pooling layers is typically then passed along as input to one or more feedforward layers.
(Figure: the pooled values from each feature map are passed as input to a feedforward network, which produces the final output.)

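Putting the pieces together, here is a minimal PyTorch text-CNN sketch under assumed, illustrative sizes: an embedding layer, a 1D convolution that produces feature maps, max pooling over each map, and a feedforward output layer.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, d=50, n_filters=2, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # Each filter spans 2 words across all d embedding dimensions.
        self.conv = nn.Conv1d(in_channels=d, out_channels=n_filters,
                              kernel_size=2, stride=2)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, word_ids):
        e = self.embed(word_ids)       # (batch, seq_len, d)
        e = e.transpose(1, 2)          # (batch, d, seq_len), as Conv1d expects
        fm = torch.relu(self.conv(e))  # feature maps: (batch, n_filters, positions)
        pooled = fm.max(dim=2).values  # max pooling over each feature map
        return self.out(pooled)        # feedforward layer -> class scores

model = TextCNN()
word_ids = torch.randint(0, 10000, (1, 8))  # one 8-word input, e.g., "I love waking up early for CS 521"
print(model(word_ids).shape)                # torch.Size([1, 2])
```
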
Convolutional neural network architectures can vary greatly!
• Additional hyperparameters:
  • Kernel size
  • Padding
  • Stride size
  • Number of channels
  • Pooling technique

Padding?
• Add empty vectors to the beginning and end of your text input
• Why do this? Allows you to apply a filter to every element of the input matrix
(Figure: the input matrix for "I love waking up early for CS 521," shown with empty rows added before the first word and after the last word so that the sliding window can cover every word.)

Channels?
• For images, generally corresponds to color channels: red, green, blue
• For text, can mean:
  • Different types of word embeddings (Word2Vec, GloVe, etc.)
  • Other feature types (POS tags, word length, etc.)

The big question… why use CNNs at all?
• Traditionally used for image classification!
• However, they offer unique advantages for NLP tasks:
  • CNNs inherently extract meaningful local structures from input
  • In NLP → implicitly-learned, useful n-grams!

Summary: Backpropagation and Convolutional Neural Networks
• Weights are optimized in deep neural networks using backpropagation
• Backpropagation works by propagating loss values backward from the end of a neural network to the beginning, using the chain rule to eventually compute the gradient at the first layer
• This is done by framing networks as complex computation graphs
• Convolutional neural networks incorporate one or more convolutional layers
• After feature maps are produced using those convolutional layers, pooling techniques are used to reduce their dimensionality
• CNN architectures can vary greatly, and require additional hyperparameters:
  • Kernel size
  • Padding
  • Stride size
  • # channels
  • Pooling technique
• A key advantage of CNNs is that they allow a model to automatically learn useful n-grams from the training data