Lecture 11: Attention and Transformers
Fei-Fei Li, Ranjay Krishna, Danfei Xu — May 06, 2021
cs231n.stanford.edu/slides/2021/lecture_11.pdf
Administrative: Midterm
- Midterm was this Tuesday.
- We will be grading this week; you should have grades by next week.
Administrative: Assignment 3
- A3 is due Friday May 25th, 11:59pm
  - Lots of applications of ConvNets
  - Also contains an extra credit notebook, which is worth an additional 5% of the A3 grade.
  - Extra credit will not be used when curving the class grades.
Last Time: Recurrent Neural Networks
Last Time: Variable length computation graph with shared weights
[Diagram: an unrolled RNN — the shared weights W (via fW) map inputs x1, x2, x3, ... and hidden states h0 → h1 → h2 → h3 → ... → hT to outputs y1, y2, y3, ..., yT, each with a per-step loss L1, L2, L3, ..., LT summed into a total loss L.]
Let's jump to lecture 10 - slide 43
Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers
Image Captioning using spatial features

Input: Image I. Output: Sequence y = y1, y2, ..., yT.

Extract spatial features from a pretrained CNN: the CNN produces a grid of features z (shape H x W x D), i.e. one feature vector zi,j per spatial location (z0,0, z0,1, ..., z2,2 for a 3x3 grid).

Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Encoder: h0 = fW(z), where z is the grid of spatial CNN features (H x W x D) and fW(.) is an MLP.
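As a concrete illustration, here is a minimal NumPy sketch of the encoder h0 = fW(z) with fW as a one-layer MLP. All weights are random placeholders, and the mean-pooling over the H x W grid is an assumption — the slide does not specify how the grid is reduced:

```python
import numpy as np

H, W, D = 3, 3, 512           # spatial grid of CNN features
z = np.random.randn(H, W, D)  # features from a pretrained CNN (placeholder)

# fW: a single-layer MLP (weights are random placeholders, not pretrained)
W_enc = np.random.randn(D, D) * 0.01
b_enc = np.zeros(D)

def f_W(z):
    # Pool the spatial grid into one vector, then project.
    pooled = z.reshape(-1, D).mean(axis=0)   # (D,)
    return np.tanh(pooled @ W_enc + b_enc)   # h0: (D,)

h0 = f_W(z)   # initial decoder hidden state, shape (512,)
```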
The decoder generates the caption one token at a time:

Decoder: yt = gV(yt-1, ht-1, c), where the context vector c is often c = h0. Starting from [START] as y0, each step predicts the next word (y1 = "person", ...).
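The decoder recurrence yt = gV(yt-1, ht-1, c) can be sketched as a greedy decoding loop. Everything here (vocabulary size, weight shapes, the form of gV, the START/END token ids) is illustrative, not the model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 100                 # hidden size, vocab size (illustrative)
# Placeholder decoder weights for gV; a real model would learn these.
W_h = rng.standard_normal((D, D)) * 0.1
W_y = rng.standard_normal((D, D)) * 0.1
W_c = rng.standard_normal((D, D)) * 0.1
W_out = rng.standard_normal((D, V)) * 0.1
embed = rng.standard_normal((V, D)) * 0.1
START, END = 0, 1              # hypothetical special-token ids

def g_V(y_prev, h_prev, c):
    # Combine previous word embedding, hidden state, and context vector.
    h = np.tanh(embed[y_prev] @ W_y + h_prev @ W_h + c @ W_c)
    logits = h @ W_out
    return int(logits.argmax()), h   # greedy: pick the most likely word

h = np.zeros(D)   # h0 from the encoder (placeholder)
c = h.copy()      # context vector; here c = h0 (the bottleneck setup)
y, caption = START, []
for _ in range(20):            # cap the unrolled length
    y, h = g_V(y, h, c)
    if y == END:               # stop at the [END] token
        break
    caption.append(y)
```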
Unrolling the decoder: at every step, the previous output yt-1, the hidden state ht-1, and the same context vector c are fed into gV, producing the caption word by word — "person", "wearing", "hat" — until the [END] token is emitted.
Problem: the input is "bottlenecked" through c — the model needs to encode everything it wants to say within the single context vector c. This is a problem if we want to generate really long descriptions, 100s of words long.
Image Captioning with RNNs & Attention

Attention idea: compute a new context vector at every time step; each context vector will attend to different image regions. (Cf. attention saccades in human vision.)
Compute alignment scores (scalars) between the decoder's initial state and every spatial feature: e1,i,j = fatt(h0, zi,j), where fatt(.) is an MLP. This yields an H x W grid of alignment scores.
Normalize the scores to get attention weights: a1 = softmax(e1) over the H x W grid, so 0 < at,i,j < 1 and the attention values sum to 1.
Compute the context vector as the attention-weighted sum of the features: c1 = ∑i,j a1,i,j zi,j.
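The three operations — alignment via the fatt MLP, softmax normalization, and the weighted sum — can be sketched end to end. The MLP architecture and all weights are placeholder assumptions:

```python
import numpy as np

H, W, D = 3, 3, 512
z = np.random.randn(H, W, D)     # spatial CNN features (placeholder)
h = np.random.randn(D)           # decoder hidden state (placeholder)

# f_att: a tiny MLP on the concatenation [h; z_ij] -> scalar score
W1 = np.random.randn(2 * D, 64) * 0.01
w2 = np.random.randn(64) * 0.01

def f_att(h, z_ij):
    return np.tanh(np.concatenate([h, z_ij]) @ W1) @ w2

# Alignment scores e: H x W grid of scalars
e = np.array([[f_att(h, z[i, j]) for j in range(W)] for i in range(H)])

# Attention weights a: softmax over all H*W positions
a = np.exp(e - e.max())
a /= a.sum()                     # 0 < a_ij < 1, sums to 1

# Context vector c = sum_ij a_ij * z_ij
c = (a[..., None] * z).sum(axis=(0, 1))   # shape (D,)
```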
Each timestep of the decoder uses a different context vector that looks at different parts of the input image:

Decoder: yt = gV(yt-1, ht-1, ct) — a new context vector at every time step.
Subsequent steps repeat the attention computation: the new hidden state produces new alignment scores and attention weights, and hence a new context vector (c2, c3, c4, ...), which conditions the next word — "person", "wearing", "hat", [END].
This entire process is differentiable: the model chooses its own attention weights, and no attention supervision is required.

Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with Attention

Soft attention vs. hard attention (hard attention requires reinforcement learning).

Xu et al., “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Attention can detect gender bias.

Burns et al., “Women also Snowboard: Overcoming Bias in Captioning Models”, ECCV 2018. Figures from Burns et al., copyright 2018. Reproduced with permission. (All images are CC0 public domain.)
Similar tasks in NLP — language translation example

Input: Sequence x = x1, x2, ..., xT
Output: Sequence y = y1, y2, ..., yT

Example input (x0 ... x3): "personne portant un chapeau" (French for "person wearing hat").
Encoder: h0 = fW(z), where zt = RNN(xt, ut-1), fW(.) is an MLP, and u is the hidden RNN state. The encoder RNN reads x0 ... x3 ("personne portant un chapeau") to produce states z0 ... z3 and the initial decoder state h0.
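A minimal sketch of this encoder side, with a vanilla RNN cell for zt = RNN(xt, ut-1) and an MLP fW that reads the final state; the cell form, dimensions, and all weights are illustrative assumptions:

```python
import numpy as np

T, D_in, D = 4, 32, 64           # sequence length, input dim, hidden dim
x = np.random.randn(T, D_in)     # embedded input words x_0 ... x_3 (placeholder)

# Vanilla RNN cell (placeholder weights)
W_xu = np.random.randn(D_in, D) * 0.1
W_uu = np.random.randn(D, D) * 0.1

u = np.zeros(D)      # hidden RNN state
z = []               # encoder outputs z_0 ... z_{T-1}
for t in range(T):
    u = np.tanh(x[t] @ W_xu + u @ W_uu)   # z_t = RNN(x_t, u_{t-1})
    z.append(u)
z = np.stack(z)      # (T, D)

# fW: an MLP mapping the encoder states to the initial decoder state h0
# (here it reads only the final state — one common choice, an assumption)
W_f = np.random.randn(D, D) * 0.1
h0 = np.tanh(z[-1] @ W_f)        # (D,)
```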
Decoder: yt = gV(yt-1, ht-1, c), where the context vector c is often c = h0. Decoding from [START] produces "person wearing hat [END]".
Attention in NLP — language translation example

Compute alignment scores (scalars): et,i = fatt(ht-1, zi), where fatt(.) is an MLP.

Bahdanau et al., “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Normalize to get attention weights: a = softmax(e), so 0 < at,i < 1 and the attention values sum to 1.
Compute the context vector as the attention-weighted sum of the encoder states: c1 = ∑i a1,i zi.
Decoder: yt = gV(yt-1, ht-1, ct), with a new context vector (c1, c2, c3, c4) computed at every time step — producing "person wearing hat [END]".
Similar visualization of attention weights (Bahdanau et al., ICLR 2015).

English to French translation example:
Input: "The agreement on the European Economic Area was signed in August 1992."
Output: "L'accord sur la zone économique européenne a été signé en août 1992."

Without any attention supervision, the model learns different word orderings for different languages.
Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers
Attention we just saw in image captioning

Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)
Operations:
- Alignment: ei,j = fatt(h, zi,j), producing an H x W grid of alignment scores.
- Attention: a = softmax(e), an H x W grid of attention weights.
- Output: c = ∑i,j ai,j zi,j (multiply and add), giving a context vector c of shape D.
General attention layer

The attention operation is permutation invariant: it doesn't care about the ordering of the features, so we can stretch the H x W grid into N = H·W input vectors.

Inputs: input vectors x (shape: N x D); query h (shape: D)
Operations:
- Alignment: ei = fatt(h, xi)
- Attention: a = softmax(e)
- Output: c = ∑i ai xi
Outputs: context vector c (shape: D)
Change fatt(.) to a simple dot product — this only works well with the key & value transformation trick (mentioned in a few slides):
- Alignment: ei = h ᐧ xi
- Attention: a = softmax(e)
- Output: c = ∑i ai xi
Change fatt(.) to a scaled simple dot product- Larger dimensions means more terms in
the dot product sum.- So, the variance of the logits is higher.
Large magnitude vectors will produce much higher logits.
- So, the post-softmax distribution has lower-entropy, assuming logits are IID.
- Ultimately, these large magnitude vectors will cause softmax to peak and assign very little weight to all others
- Divide by √D to reduce effect of large magnitude vectors
General attention layer
Alig
nmen
t
44
h
Atte
ntio
n
Inputs:Input vectors: x (shape: N x D)Query: h (shape: D)
softmax
c
mul + add
Outputs:context vector: c (shape: D)
Operations:Alignment: ei = h ᐧ xi / √DAttention: a = softmax(e)Output: c = ∑i ai xi
x2
x1
x0
e2
e1
e0
a2
a1
a0
Inpu
t vec
tors
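The scaled dot-product operations above can be sketched in a few lines of NumPy. This is a sketch of mine, not code from the lecture; the function and variable names are my own.

```python
import numpy as np

def attention(h, x):
    """h: (D,) query, x: (N, D) inputs -> context vector c: (D,)."""
    D = h.shape[0]
    e = x @ h / np.sqrt(D)           # alignment scores e_i = h . x_i / sqrt(D)
    a = np.exp(e - e.max())          # subtract max for numerical stability
    a = a / a.sum()                  # attention weights a = softmax(e)
    return a @ x                     # context vector c = sum_i a_i x_i

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # N = 5 input vectors, D = 8
h = rng.normal(size=8)
c = attention(h, x)
print(c.shape)                       # (8,)
```

The context vector is a convex combination of the inputs, since the softmax weights are non-negative and sum to 1.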
Slide 45: General attention layer

Multiple query vectors: each query creates a new output context vector.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x D)

Operations:
- Alignment: ei,j = qj · xi / √D
- Attention: a = softmax(e) (softmax over the inputs i, separately for each query j)
- Output: yj = ∑i ai,j xi

Outputs:
- Context vectors: y (shape: M x D)

[Diagram: each query qj is aligned against every input xi, giving a grid of scores ei,j; a softmax (↑) gives weights ai,j; multiply (→) and add (↑) produce one output yj per query.]
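With multiple queries, all the alignment scores can be computed in one matrix product. The sketch below is mine (not from the slides); it stacks the per-query operations into matrix form.

```python
import numpy as np

def multi_query_attention(q, x):
    """q: (M, D) queries, x: (N, D) inputs -> y: (M, D) context vectors."""
    D = x.shape[1]
    e = q @ x.T / np.sqrt(D)                 # e[j, i] = q_j . x_i / sqrt(D)
    e = e - e.max(axis=1, keepdims=True)     # stabilize each row
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)     # softmax over inputs i, per query j
    return a @ x                             # y_j = sum_i a_{i,j} x_i

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                  # N = 5, D = 8
q = rng.normal(size=(3, 8))                  # M = 3 queries
y = multi_query_attention(q, x)
print(y.shape)                               # (3, 8)
```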
Slide 46: General attention layer

Notice that the input vectors are used for both the alignment and the attention calculations. We can add more expressivity to the layer by adding a different FC layer before each of the two steps.

(Inputs, operations, and outputs are the same as on the previous slide.)
Slide 47: General attention layer

Adding a different FC layer before each of the two steps gives key and value vectors:

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x Dk)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
Slide 48: General attention layer

The input and output dimensions can now change depending on the key and value FC layers.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x Dk)

Operations:
- Key vectors: k = xWk (shape: N x Dk)
- Value vectors: v = xWv (shape: N x Dv)
- Alignment: ei,j = qj · ki / √Dk
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (shape: M x Dv)
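A sketch of the full general attention layer with key/value projections (my own code, not the lecture's; Wk and Wv are random stand-ins for trained FC layers):

```python
import numpy as np

def general_attention(q, x, Wk, Wv):
    """q: (M, Dk), x: (N, D), Wk: (D, Dk), Wv: (D, Dv) -> y: (M, Dv)."""
    k = x @ Wk                                  # key vectors,   (N, Dk)
    v = x @ Wv                                  # value vectors, (N, Dv)
    e = q @ k.T / np.sqrt(Wk.shape[1])          # alignment, (M, N)
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)        # softmax over inputs, per query
    return a @ v                                # context vectors, (M, Dv)

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 8))                     # N = 5, D = 8
Wk = rng.normal(size=(8, 4))                    # Dk = 4
Wv = rng.normal(size=(8, 6))                    # Dv = 6
q = rng.normal(size=(3, 4))                     # M = 3 queries
y = general_attention(q, x, Wk, Wv)
print(y.shape)                                  # (3, 6)
```

Note how the output dimension Dv is decoupled from both the input dimension D and the query/key dimension Dk.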
Slide 49: General attention layer

Recall that the query vector was a function of the input vectors: in the image-captioning example, the initial state was h0 = fW(z), where z is the grid of spatial CNN features and fW(.) is an MLP.

(Inputs, operations, and outputs are the same as on the previous slide.)
Slide 50: Self-attention layer

No input query vectors anymore. Instead, the query vectors are calculated from the input vectors using an FC layer; we can therefore define a "self-attention" layer.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj · ki / √Dk
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
Slide 51: Self-attention layer

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj · ki / √Dk
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (shape: N x Dv)
Slide 52: Self-attention layer - attends over sets of inputs

One self-attention layer maps a set of input vectors x0, x1, x2 to a set of context vectors y0, y1, y2, with every output attending over every input; the operations are the same as on the previous slide.
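The self-attention layer can be sketched as follows (my code, not the lecture's; the weight matrices are random stand-ins for learned FC layers):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (N, D) -> y: (N, Dv); queries, keys, values all come from x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = q @ k.T / np.sqrt(Wk.shape[1])          # scaled dot-product alignment
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)        # attention weights
    return a @ v                                # one context vector per input

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
Wq, Wk = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
Wv = rng.normal(size=(8, 6))
y = self_attention(x, Wq, Wk, Wv)
print(y.shape)                                  # (4, 6)
```

Permuting the inputs permutes the outputs in the same way, which is exactly the permutation property discussed on the next slide.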
Slide 53: Self-attention layer - attends over sets of inputs

Self-attention is permutation invariant: feeding the inputs in a different order (e.g. x2, x0, x1) simply produces the outputs in the correspondingly permuted order (y2, y0, y1).

Problem: how can we encode ordered sequences like language, or spatially ordered image features?
Slide 54: Positional encoding

Concatenate a special positional encoding pj to each input vector xj. We use a function pos: N → R^d to map the position j of the vector to a d-dimensional vector, so pj = pos(j).

Desiderata for pos(.):
1. It should output a unique encoding for each time-step (the word's position in a sentence).
2. The distance between any two time-steps should be consistent across sentences with different lengths.
3. The model should generalize to longer sentences without any effort; the encoding's values should be bounded.
4. It must be deterministic.
Slide 55: Positional encoding

Options for pos(.):
1. Learn a lookup table:
   - Learn parameters to use for pos(t) for t ∈ [0, T).
   - The lookup table contains T x d parameters.

Vaswani et al, "Attention is all you need", NeurIPS 2017
Slide 56: Positional encoding

Options for pos(.):
1. Learn a lookup table (as above).
2. Design a fixed function with the desiderata:

   p(t) = [sin(ω1·t), cos(ω1·t), sin(ω2·t), cos(ω2·t), ..., sin(ω_{d/2}·t), cos(ω_{d/2}·t)]

   where ωk = 1 / 10000^(2k/d)

Vaswani et al, "Attention is all you need", NeurIPS 2017
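The fixed sinusoidal pos(.) can be sketched directly from the formula above (my code; interleaving sin into even dimensions and cos into odd dimensions, one common convention):

```python
import numpy as np

def pos(t, d):
    """Position t -> d-dimensional sinusoidal encoding (d assumed even)."""
    k = np.arange(d // 2)
    omega = 1.0 / 10000 ** (2 * k / d)   # frequencies omega_k = 1/10000^(2k/d)
    p = np.empty(d)
    p[0::2] = np.sin(omega * t)          # even dims: sin(omega_k t)
    p[1::2] = np.cos(omega * t)          # odd dims:  cos(omega_k t)
    return p

P = np.stack([pos(t, 8) for t in range(4)])   # encodings for positions 0..3
print(P.shape)                                # (4, 8)
```

Note how this meets the desiderata: each position gets a unique pattern, the values are bounded in [-1, 1], the function extends to any position t, and it is deterministic.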
Slide 57: Positional encoding

[Figure (from an external image source): intuition for the fixed sinusoidal encoding - each position t gets a unique, bounded pattern of sin/cos values, with frequencies ωk decreasing across the dimensions.]

Vaswani et al, "Attention is all you need", NeurIPS 2017
Slide 58: Masked self-attention layer

- Prevent vectors from looking at future vectors.
- Manually set the corresponding alignment scores to -∞, so the attention weights on future inputs become 0 after the softmax.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj · ki / √Dk for i ≤ j, and -∞ for i > j
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
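The masking step can be sketched by filling the "future" entries of the alignment matrix with -inf before the softmax (my code; weight matrices are random stand-ins):

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    """x: (N, D) -> (y, a); output j attends only to inputs i <= j."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = q @ k.T / np.sqrt(Wk.shape[1])       # e[j, i] = q_j . k_i / sqrt(Dk)
    N = x.shape[0]
    e[np.triu_indices(N, k=1)] = -np.inf     # mask future inputs (i > j)
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)                            # exp(-inf) = 0: no weight on future
    a = a / a.sum(axis=1, keepdims=True)
    return a @ v, a

rng = np.random.default_rng(4)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
y, a = masked_self_attention(x, Wq, Wk, Wv)
print(np.triu(a, k=1).max())                 # 0.0: attention never looks ahead
```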
Slide 59: Multi-head self-attention layer

Run multiple self-attention heads in parallel:
- Split the inputs x0, x1, x2 into H chunks, one per head (head0, head1, ..., headH-1).
- Run self-attention on each head independently.
- Add or concatenate the heads' outputs to form y0, y1, y2.
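The split / attend / concatenate pattern can be sketched as below (my code; for brevity each head uses identity query/key/value projections, where a real layer would use learned ones):

```python
import numpy as np

def softmax_rows(e):
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)
    return a / a.sum(axis=1, keepdims=True)

def multi_head_self_attention(x, H):
    """x: (N, D) with D divisible by H -> y: (N, D)."""
    outs = []
    for h in np.split(x, H, axis=1):          # split: each chunk is (N, D/H)
        e = h @ h.T / np.sqrt(h.shape[1])     # self-attention within the head
        outs.append(softmax_rows(e) @ h)
    return np.concatenate(outs, axis=1)       # concatenate the head outputs

rng = np.random.default_rng(5)
x = rng.normal(size=(4, 8))
y = multi_head_self_attention(x, H=2)
print(y.shape)                                # (4, 8)
```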
Slide 60: General attention versus self-attention

- General attention: the keys k0..k2, values v0..v2, and queries q0..q2 arrive as separate inputs to the attention layer.
- Self-attention: a single set of input vectors x0..x2 is projected into keys, values, and queries inside the layer.
Slide 61: Comparing RNNs to Transformers

RNNs:
(+) LSTMs work reasonably well for long sequences.
(-) Expect an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.

Transformers:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Require a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).
Today's Agenda:
- Attention with RNNs
  - In Computer Vision
  - In NLP
- General Attention Layer
  - Self-attention
  - Positional encoding
  - Masked attention
  - Multi-head attention
- Transformers
Slide 63: Image Captioning using transformers

Input: Image I. Output: Sequence y = y1, y2, ..., yT.

Extract spatial features from a pretrained CNN: a grid of features z0,0 ... z2,2 of shape H x W x D.
Slide 64: Image Captioning using transformers

Encoder: c = TW(z), where z is the grid of spatial CNN features and TW(.) is the transformer encoder. The flattened features z0,0, z0,1, ..., z2,2 go into the transformer encoder, which outputs context vectors c0,0, c0,1, ..., c2,2.
Slide 65: Image Captioning using transformers

Decoder: yt = TD(y0:t-1, c), where TD(.) is the transformer decoder. Starting from [START], the decoder autoregressively emits the caption ("person wearing hat [END]"), conditioning at each step on the previous outputs and the encoder's context vectors c.
Slide 66: The Transformer encoder block

The transformer encoder is made up of N encoder blocks. In Vaswani et al, N = 6 and Dq = 512.

Vaswani et al, "Attention is all you need", NeurIPS 2017
Slides 67-73: The Transformer encoder block, one component at a time

Let's dive into one encoder block. Its inputs are the vectors x0, x1, x2, x3; stacking the components from bottom to top:
1. Add positional encoding to the inputs.
2. Multi-head self-attention: attention attends over all the vectors.
3. Residual connection around the attention.
4. LayerNorm over each vector individually.
5. MLP over each vector individually.
6. Residual connection around the MLP.
Slide 74: The Transformer encoder block

A final LayerNorm produces the outputs y0, y1, y2, y3.

Transformer Encoder Block:
- Inputs: a set of vectors x. Outputs: a set of vectors y.
- Self-attention is the only interaction between vectors.
- Layer norm and MLP operate independently per vector.
- Highly scalable and highly parallelizable, but high memory usage.

Vaswani et al, "Attention is all you need", NeurIPS 2017
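One encoder block can be sketched as below (my code, following the post-norm stack the slides build up: self-attention, residual add, LayerNorm, per-vector MLP, residual add, LayerNorm; all weights are random stand-ins and positional encoding is omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector (row) individually."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = q @ k.T / np.sqrt(Wk.shape[1])
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)
    return (a / a.sum(axis=1, keepdims=True)) @ v

def encoder_block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))  # attention + residual + LN
    x = layer_norm(x + np.maximum(x @ W1, 0) @ W2)     # per-vector MLP + residual + LN
    return x

rng = np.random.default_rng(6)
D = 8
params = tuple(rng.normal(size=s) * 0.1
               for s in [(D, D)] * 3 + [(D, 4 * D), (4 * D, D)])
x = rng.normal(size=(4, D))
y = encoder_block(x, params)
print(y.shape)                                         # (4, 8)
```

Only the self-attention step mixes information between vectors; LayerNorm and the MLP act on each row independently, which is what makes the block so parallelizable.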
Slide 75: The Transformer Decoder block

The transformer decoder is made up of N decoder blocks. In Vaswani et al, N = 6 and Dq = 512. It takes the shifted caption y0..y3 ([START] person wearing hat) and the encoder context vectors c, and predicts the caption shifted by one (person wearing hat [END]).

Vaswani et al, "Attention is all you need", NeurIPS 2017
Slide 76: The Transformer Decoder block

Let's dive into the transformer decoder block: it maps the input vectors x0..x3 and the encoder context vectors c0,0..c2,2 through the block and a final FC layer to the outputs y0..y3.
Slide 77: The Transformer Decoder block

Most of the network is the same as the transformer encoder: positional encoding, residual connections, LayerNorms, and a per-vector MLP. The first difference is that the self-attention is masked (masked multi-head self-attention), and the block ends with an FC layer.
Slide 78: The Transformer Decoder block

The second difference is a multi-head attention block that attends over the transformer encoder outputs: its keys k and values v come from the encoder context vectors c, while its queries q come from the decoder. For image captioning, this is how we inject image features into the decoder.
Slide 79: The Transformer Decoder block

Transformer Decoder Block:
- Inputs: a set of vectors x and a set of context vectors c. Outputs: a set of vectors y.
- Masked self-attention only interacts with past inputs.
- The multi-head attention block is NOT self-attention: it attends over the encoder outputs.
- Highly scalable and highly parallelizable, but high memory usage.
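The decoder's cross-attention step can be sketched as follows (my code; the projection matrices are random stand-ins for learned FC layers). The key point is where the queries, keys, and values come from:

```python
import numpy as np

def cross_attention(x_dec, c_enc, Wq, Wk, Wv):
    """x_dec: (T, D) decoder vectors, c_enc: (N, D) encoder outputs -> (T, D)."""
    q = x_dec @ Wq                        # queries come from the decoder
    k, v = c_enc @ Wk, c_enc @ Wv         # keys and values come from the encoder
    e = q @ k.T / np.sqrt(Wk.shape[1])
    e = e - e.max(axis=1, keepdims=True)
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)  # each token attends over image features
    return a @ v

rng = np.random.default_rng(7)
x_dec = rng.normal(size=(4, 8))           # 4 caption tokens
c_enc = rng.normal(size=(9, 8))           # a 3x3 grid of encoded image features
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
y = cross_attention(x_dec, c_enc, Wq, Wk, Wv)
print(y.shape)                            # (4, 8)
```

Because keys and values come from the encoder, no causal mask is needed here; every caption token may look at every image feature.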
Slide 80: Image Captioning using transformers

No recurrence at all: CNN features → transformer encoder → transformer decoder, exactly as in the pipeline above.
Slide 81: Image Captioning using transformers

Perhaps we don't need convolutions at all? The pipeline still begins with spatial features from a pretrained CNN; can the CNN be removed too?
Slide 82: Image Captioning using ONLY transformers

Transformers from pixels to language: replace the CNN with a transformer encoder that operates directly on the image, feeding its context vectors to the same transformer decoder.

Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020. (Colab link to an implementation of vision transformers.)
Slide 84: New large-scale transformer models

(link to more examples)
Slide 85: Summary

- Adding attention to RNNs allows them to "attend" to different parts of the input at every time step.
- The general attention layer is a new type of layer that can be used to design new neural network architectures.
- Transformers are built from self-attention and layer norm:
  - Highly scalable and highly parallelizable.
  - Faster training, larger models, better performance across vision and language tasks.
  - They are quickly replacing RNNs and LSTMs, and may even replace convolutions.
Next time: Unsupervised learning (VAEs and GANs)